  • A Gripe About Seek and Read on File Streams


    To read a chunk of data starting at a given position in a file, we would typically write something like this:

    Read File Stream Content
    private static string ReadContent(string fileName, int position, int length)
    {
        if (!File.Exists(fileName))
        {
            throw new FileNotFoundException("The specified file is not found : " + fileName);
        }

        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            // position is a byte offset into the file; length is a character count.
            reader.BaseStream.Seek(position, SeekOrigin.Begin);
            char[] buffer = new char[length];

            // Read may return fewer characters than requested (for example at end of file).
            int read = reader.Read(buffer, 0, length);
            return new string(buffer, 0, read);
        }
    }

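    This is straightforward and easy to follow. To read several such segments from the same file, we would usually write the following (it takes several positions and the matching lengths to read; the parameter list is for illustration only):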

    Read Content With Seeking
    private static string[] ReadContents(string fileName, int[] positions, int[] lengths)
    {
        if (!File.Exists(fileName))
        {
            throw new FileNotFoundException("The specified file is not found : " + fileName);
        }

        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            string[] contents = new string[positions.Length];

            for (int i = 0; i < positions.Length; i++)
            {
                // Reposition the underlying stream, then read through the same reader.
                reader.BaseStream.Seek(positions[i], SeekOrigin.Begin);
                char[] buffer = new char[lengths[i]];
                int read = reader.Read(buffer, 0, lengths[i]);
                contents[i] = new string(buffer, 0, read);
            }

            return contents;
        }
    }

    Nothing looks wrong here either. But a small test program produces a surprising result:

    Test App
    static void Main(string[] args)
    {
        string fileName = @"text.txt";

        using (FileStream stream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
        using (StreamWriter writer = new StreamWriter(stream))
        {
            writer.Write("ABCDEFGHIJKLMNOPQ");
        }

        Console.WriteLine(ReadContent(fileName, 4, 2));
        Console.WriteLine(ReadContent(fileName, 10, 2));
        Console.WriteLine(ReadContent(fileName, 7, 2));
        Console.WriteLine();

        string[] contents = ReadContents(fileName, new int[] { 4, 10, 7 }, new int[] { 2, 2, 2 });
        foreach (var item in contents)
        {
            Console.WriteLine(item);
        }

        Console.ReadKey();
    }



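    The three separate ReadContent calls print EF, KL and HI, as expected. ReadContents, however, returns EF, GH and IJ. So when we try to reposition within the same stream, the class library API does not fetch the content we expect. It looks as if, after the first Seek on a file stream object, every later Seek has no effect! Why is that?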


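    In fact, for performance reasons StreamReader keeps and maintains its own internal byte buffer. If no buffer size is specified when the StreamReader is constructed, this buffer defaults to 1K. When StreamReader opens the file itself from a path, the FileStream it creates gets a default buffer of 4K. Every Read call is turned, directly or indirectly, into an operation on the reader's buffer.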

    Buffer Size
    // Using a 1K byte buffer and a 4K FileStream buffer works out pretty well
    // perf-wise.  On even a 40 MB text file, any perf loss by using a 4K
    // buffer is negated by the win of allocating a smaller byte[], which
    // saves construction time.  This does break adaptive buffering,
    // but this is slightly faster.
    internal const int DefaultBufferSize = 1024;  // Byte buffer size
    private const int DefaultFileStreamBufferSize = 4096;
    private const int MinBufferSize = 128;
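
    To make the effect of these constants visible, here is a minimal sketch that passes an explicit buffer size through the StreamReader(Stream, Encoding, bool, int) constructor and then checks how far the underlying stream has advanced after a two-character read. The class name, method name and file name are made up for illustration, and the exact position reported is an assumption about the implementation; the point is only that it is far more than two bytes.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; not part of the original article.
static class BufferSizeDemo
{
    // Writes a 1000-byte ASCII file, reads two characters through a
    // StreamReader with the given buffer size, and returns the position
    // of the underlying stream afterwards.
    public static long PositionAfterSmallRead(string path, int bufferSize)
    {
        File.WriteAllText(path, new string('x', 1000), new UTF8Encoding(false));

        using (FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8, true, bufferSize))
        {
            char[] two = new char[2];
            reader.Read(two, 0, 2);

            // The reader has pulled a whole buffer's worth of bytes,
            // so the stream is well past offset 2.
            return stream.Position;
        }
    }

    static void Main()
    {
        Console.WriteLine(PositionAfterSmallRead("buffer-size-demo.txt", 128));
    }
}
```

    The stream position tracks how much the reader has buffered, not how much the caller has consumed, which is why seeking the stream alone cannot tell the reader where to continue.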

    Read Buffer
    // This version has a perf optimization to decode data DIRECTLY into the
    // user's buffer, bypassing StreamWriter's own buffer.
    // This gives a > 20% perf improvement for our encodings across the board,
    // but only when asking for at least the number of characters that one
    // buffer's worth of bytes could produce.
    // This optimization, if run, will break SwitchEncoding, so we must not do
    // this on the first call to ReadBuffer.
    private int ReadBuffer(char[] userBuffer, int userOffset, int desiredChars, out bool readToUserBuffer) {
        charLen = 0;
        charPos = 0;
        if (!_checkPreamble)
            byteLen = 0;
        int charsRead = 0;
        // As a perf optimization, we can decode characters DIRECTLY into a
        // user's char[].  We absolutely must not write more characters
        // into the user's buffer than they asked for.  Calculating
        // encoding.GetMaxCharCount(byteLen) each time is potentially very
        // expensive - instead, cache the number of chars a full buffer's
        // worth of data may produce.  Yes, this makes the perf optimization
        // less aggressive, in that all reads that asked for fewer than AND
        // returned fewer than _maxCharsPerBuffer chars won't get the user
        // buffer optimization.  This affects reads where the end of the
        // Stream comes in the middle somewhere, and when you ask for
        // fewer chars than than your buffer could produce.
        readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
        do {
            if (_checkPreamble) {
                BCLDebug.Assert(bytePos <= _preamble.Length, "possible bug in _compressPreamble.  Are two threads using this StreamReader at the same time?");
                int len = stream.Read(byteBuffer, bytePos, byteBuffer.Length - bytePos);
                BCLDebug.Assert(len >= 0, "Stream.Read returned a negative number!  This is a bug in your stream class.");

                if (len == 0) {
                    // EOF but we might have buffered bytes from previous
                    // attempts to detecting preamble that needs to decoded now
                    if (byteLen > 0) {
                        if (readToUserBuffer) {
                            charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                            charLen = 0;  // StreamReader's buffer is empty.
                        }
                        else {
                            charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                            charLen += charsRead;  // Number of chars in StreamReader's buffer.
                        }
                    }
                    return charsRead;
                }

                byteLen += len;
            }
            else {
                BCLDebug.Assert(bytePos == 0, "bytePos can be non zero only when we are trying to _checkPreamble.  Are two threads using this StreamReader at the same time?");
                byteLen = stream.Read(byteBuffer, 0, byteBuffer.Length);
                BCLDebug.Assert(byteLen >= 0, "Stream.Read returned a negative number!  This is a bug in your stream class.");
                if (byteLen == 0)  // EOF
                    return charsRead;
            }

            // _isBlocked == whether we read fewer bytes than we asked for.
            // Note we must check it here because CompressBuffer or
            // DetectEncoding will ---- with byteLen.
            _isBlocked = (byteLen < byteBuffer.Length);
            // Check for preamble before detect encoding. This is not to override the
            // user suppplied Encoding for the one we implicitly detect. The user could
            // customize the encoding which we will loose, such as ThrowOnError on UTF8
            // Note: we don't need to recompute readToUserBuffer optimization as IsPreamble
            // doesn't change the encoding or affect _maxCharsPerBuffer
            if (IsPreamble())
                continue;

            // On the first call to ReadBuffer, if we're supposed to detect the encoding, do it.
            if (_detectEncoding && byteLen >= 2) {
                DetectEncoding();
                // DetectEncoding changes some buffer state.  Recompute this.
                readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
            }

            charPos = 0;
            if (readToUserBuffer) {
                charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                charLen = 0;  // StreamReader's buffer is empty.
            }
            else {
                charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                charLen += charsRead;  // Number of chars in StreamReader's buffer.
            }
        } while (charsRead == 0);

        _isBlocked &= charsRead < desiredChars;
        //Console.WriteLine("ReadBuffer: charsRead: "+charsRead+"  readToUserBuffer: "+readToUserBuffer);
        return charsRead;
    }

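    So the problem comes down to this: when BaseStream.Seek is called the second time, the reader's buffer is not refilled! The second read therefore returns data that was buffered by the first read, that is, up to one buffer's worth of content following the first Seek position (1K with the default constructor used here). Where that cached data starts has nothing to do with the new position, and the new position may not be in the cache at all.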
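
    The stale buffer can be observed directly. The following is a minimal sketch, assuming the same 17-character ASCII test content used above; the class name, method name and file name are hypothetical. It seeks twice on the underlying stream and shows that the second read is still served from the data buffered by the first one.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class; not part of the original article.
static class StaleBufferDemo
{
    // Returns what the reader yields after a second Seek on the underlying
    // stream when the reader's buffered data is NOT discarded.
    public static string ReadAfterSecondSeek(string path)
    {
        File.WriteAllText(path, "ABCDEFGHIJKLMNOPQ", new UTF8Encoding(false));

        using (FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            char[] buffer = new char[2];

            stream.Seek(4, SeekOrigin.Begin);
            reader.Read(buffer, 0, 2);   // buffers everything from offset 4 to the end of this small file

            // The reader has already consumed the rest of the file.
            Console.WriteLine("Stream position after first read: " + stream.Position);

            stream.Seek(10, SeekOrigin.Begin);
            int read = reader.Read(buffer, 0, 2);   // served from the stale buffer, not from offset 10
            return new string(buffer, 0, read);
        }
    }

    static void Main()
    {
        Console.WriteLine(ReadAfterSecondSeek("stale-demo.txt"));   // prints "GH", not "KL"
    }
}
```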

    To refresh the cache before the second Seek, you have to call DiscardBufferedData() explicitly:

    Code Snippet
    // DiscardBufferedData tells StreamReader to throw away its internal
    // buffer contents.  This is useful if the user needs to seek on the
    // underlying stream to a known location then wants the StreamReader
    // to start reading from this new point.  This method should be called
    // very sparingly, if ever, since it can lead to very poor performance.
    // However, it may be the only way of handling some scenarios where
    // users need to re-read the contents of a StreamReader a second time.
    public void DiscardBufferedData() {
        byteLen = 0;
        charLen = 0;
        charPos = 0;
        decoder = encoding.GetDecoder();
        _isBlocked = false;
    }
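
    Applied to the multi-segment reader from earlier, the fix might look like the sketch below. It assumes single-byte (ASCII) content so that byte offsets and character counts line up, and the class name SeekingReader is made up for this example. ReadBlock is used instead of Read so that a short read only happens at end of file.

```csharp
using System;
using System.IO;

// Hypothetical helper class; a sketch of the fix, not the author's code.
static class SeekingReader
{
    public static string[] ReadContents(string fileName, int[] positions, int[] lengths)
    {
        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            string[] contents = new string[positions.Length];

            for (int i = 0; i < positions.Length; i++)
            {
                // Move the underlying stream, then drop whatever the reader
                // buffered from the previous position.
                stream.Seek(positions[i], SeekOrigin.Begin);
                reader.DiscardBufferedData();

                char[] buffer = new char[lengths[i]];
                int read = reader.ReadBlock(buffer, 0, lengths[i]);
                contents[i] = new string(buffer, 0, read);
            }

            return contents;
        }
    }
}
```

    With the test file above, ReadContents("text.txt", new[] { 4, 10, 7 }, new[] { 2, 2, 2 }) now yields EF, KL and HI, matching the three separate ReadContent calls. The price is that every segment triggers a fresh read into the reader's buffer, which is the performance cost the source comment warns about.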


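    I remember the book Framework Design discussing some regrets in the design of the .NET class library, and I do not know whether this one counts. I consider myself at least an experienced hand, yet my first reaction on hitting this problem was that it was very odd. When I read the source, it struck me as tricky and smelly. The library designers evidently made uncharitable assumptions about programmers' intentions and abilities. They believed they had found a clever balance between performance and usability, but in practice the result is an ambiguous API that plainly leads to wrong results.

    Granted, by the principle of locality, reads mostly happen close to one another; put differently, cached content is likely to be read again. But performance always has to be built on top of correctness. What is regrettable about this API is that it ignores the need to Seek more than once.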


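    If the goal is a full-featured design, the cache can certainly stay. But then positioning clearly cannot rely on BaseStream.Seek alone. StreamReader, or its base class TextReader, should offer a Seek API that wraps positioning of the BaseStream together with positioning of the cached data. Would such an API be friendlier to programmers? I think so; at the very least it would not invite misunderstanding.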
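
    As a rough sketch of what such an API could look like, the wrapper below couples the stream Seek with DiscardBufferedData so callers cannot do one without the other. SeekableTextReader and its members are hypothetical names invented for this example; nothing like it exists in the class library, and this simple version discards the cache on every Seek rather than repositioning within it.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical wrapper, not part of the .NET class library.
sealed class SeekableTextReader : IDisposable
{
    private readonly StreamReader _reader;

    public SeekableTextReader(Stream stream, Encoding encoding)
    {
        if (stream == null) throw new ArgumentNullException(nameof(stream));
        if (!stream.CanSeek) throw new ArgumentException("Stream must support seeking.", nameof(stream));
        _reader = new StreamReader(stream, encoding);
    }

    // Moves the underlying stream and drops the reader's stale buffer in one step.
    // byteOffset is a byte position, so it must fall on a character boundary.
    public long Seek(long byteOffset, SeekOrigin origin)
    {
        long position = _reader.BaseStream.Seek(byteOffset, origin);
        _reader.DiscardBufferedData();
        return position;
    }

    // Reads up to charCount characters from the current position.
    public string ReadString(int charCount)
    {
        char[] buffer = new char[charCount];
        int read = _reader.ReadBlock(buffer, 0, charCount);
        return new string(buffer, 0, read);
    }

    // Disposing the reader also closes the underlying stream.
    public void Dispose()
    {
        _reader.Dispose();
    }
}
```

    A caller would write reader.Seek(10, SeekOrigin.Begin) followed by reader.ReadString(2) and never touch BaseStream directly, which removes the ambiguity described above.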

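    If the goal is a small, focused design, the cache could be dropped altogether and replaced by a buffer supplied by the programmer, leaving it entirely to the user of the library to decide whether to implement caching. A language or class library built that way is just as robust, and programmers can accept it just as well.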
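
    A minimal sketch of that alternative, under the assumption that the caller knows the byte range and the encoding: read the bytes straight from the FileStream into a caller-supplied buffer and decode them, with no StreamReader involved. RawRangeReader and ReadRange are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; the caller owns the buffer and any caching policy.
static class RawRangeReader
{
    public static string ReadRange(FileStream stream, long position, int byteCount,
                                   Encoding encoding, byte[] buffer)
    {
        if (buffer.Length < byteCount)
            throw new ArgumentException("Buffer is smaller than the requested range.", nameof(buffer));

        stream.Seek(position, SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop until
        // the range is filled or the end of the file is reached.
        int total = 0;
        while (total < byteCount)
        {
            int n = stream.Read(buffer, total, byteCount - total);
            if (n == 0) break;   // end of file
            total += n;
        }

        // The range must start and end on character boundaries for
        // multi-byte encodings; that is the caller's responsibility here.
        return encoding.GetString(buffer, 0, total);
    }
}
```

    Because there is no hidden text buffer, every Seek takes effect immediately; the trade-off is that the caller has to think in bytes rather than characters.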


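    There has been a deep and lively discussion in the community recently about the C# language itself and the .NET class library. Privately, I believe debate is what drives every language forward. I wanted to add something, and the small example above came to mind. As programmers, we may not really care whether a pattern is supported by the language or by the class library. I only hope that class library design will have a little less of the cleverness seen in this example and a little more plain solidity.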

    Author: Jeffrey Sun

