  • A Complaint About Seek and Read on File Streams

    The Problem

    To read a chunk of data starting at a given position in a file, we would typically write something like this:

    Read File Stream Content
    private static string ReadContent(string fileName, int position, int length)
    {
        if (!File.Exists(fileName))
        {
            throw new FileNotFoundException("The specified file is not found : " + fileName);
        }

        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            reader.BaseStream.Seek(position, SeekOrigin.Begin);
            char[] buffer = new char[length];
            reader.Read(buffer, 0, length);

            return new string(buffer, 0, length);
        }
    }

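    This is straightforward and easy to follow. To read several such chunks from the same file, we would typically write the following (it takes multiple positions and the corresponding lengths to read; the parameter list is for illustration only):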

    Read Content With Seeking
    private static string[] ReadContents(string fileName, int[] positions, int[] lengths)
    {
        if (!File.Exists(fileName))
        {
            throw new FileNotFoundException("The specified file is not found : " + fileName);
        }

        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            string[] contents = new string[positions.Length];

            for (int i = 0; i < positions.Length; i++)
            {
                reader.BaseStream.Seek(positions[i], SeekOrigin.Begin);
                char[] buffer = new char[lengths[i]];
                reader.Read(buffer, 0, lengths[i]);
                contents[i] = new string(buffer, 0, lengths[i]);
            }

            return contents;
        }
    }

    This looks fine too. But a small test program produces a surprising result:

    Test App
    static void Main(string[] args)
    {
        string fileName = @"text.txt";

        using (FileStream stream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
        using (StreamWriter writer = new StreamWriter(stream))
        {
            writer.Write("ABCDEFGHIJKLMNOPQ");
        }

        Console.WriteLine(ReadContent(fileName, 4, 2));
        Console.WriteLine(ReadContent(fileName, 10, 2));
        Console.WriteLine(ReadContent(fileName, 7, 2));
        Console.WriteLine();

        string[] contents = ReadContents(fileName, new int[] { 4, 10, 7 }, new int[] { 2, 2, 2 });
        foreach (var item in contents)
        {
            Console.WriteLine(item);
        }

        Console.ReadKey();
    }

    The output is:

    [Screenshot of the console output; image not preserved]

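    So when we reposition within the same stream, the library API does not return the content we expect. It looks as though, once the first Seek has happened on a file stream object, every later Seek is ignored! Why?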

    Analysis

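    In fact, for performance reasons StreamReader keeps and maintains its own internal byte buffer. If no size is specified when the StreamReader is constructed, this buffer defaults to 1 KB. For file streams there is a second default of 4 KB, which the source comments quoted below describe as the FileStream buffer size. Every Read call is served, directly or indirectly, from this buffer.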
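    The effect of this buffer can be observed directly. The following is a minimal sketch, not part of the original post: the class and method names are hypothetical, and it wraps a MemoryStream instead of a FileStream purely to stay self-contained. It asks the reader for a few characters and then reports how far the underlying stream has actually advanced.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class BufferProbe
{
    // Reads `charsToRead` characters through a StreamReader and returns the
    // position of the underlying stream afterwards.
    public static long PositionAfterRead(string text, int charsToRead)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream))
        {
            char[] buffer = new char[charsToRead];
            reader.Read(buffer, 0, charsToRead);

            // The reader has filled its own byte buffer, so the stream is
            // further along than the number of characters handed back.
            return stream.Position;
        }
    }
}
```

    For the 17-character string used in the test program, `BufferProbe.PositionAfterRead("ABCDEFGHIJKLMNOPQ", 2)` returns 17 rather than 2: the whole content fits in one buffer fill, so the reader has already consumed the stream to its end.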

    Buffer Size
    // Using a 1K byte buffer and a 4K FileStream buffer works out pretty well
    // perf-wise.  On even a 40 MB text file, any perf loss by using a 4K
    // buffer is negated by the win of allocating a smaller byte[], which
    // saves construction time.  This does break adaptive buffering,
    // but this is slightly faster.
    internal const int DefaultBufferSize = 1024;  // Byte buffer size
    private const int DefaultFileStreamBufferSize = 4096;
    private const int MinBufferSize = 128;

    Read Buffer
    // This version has a perf optimization to decode data DIRECTLY into the
    // user's buffer, bypassing StreamWriter's own buffer.
    // This gives a > 20% perf improvement for our encodings across the board,
    // but only when asking for at least the number of characters that one
    // buffer's worth of bytes could produce.
    // This optimization, if run, will break SwitchEncoding, so we must not do
    // this on the first call to ReadBuffer.
    private int ReadBuffer(char[] userBuffer, int userOffset, int desiredChars, out bool readToUserBuffer) {
        charLen = 0;
        charPos = 0;
        if (!_checkPreamble)
            byteLen = 0;
        int charsRead = 0;
        // As a perf optimization, we can decode characters DIRECTLY into a
        // user's char[].  We absolutely must not write more characters
        // into the user's buffer than they asked for.  Calculating
        // encoding.GetMaxCharCount(byteLen) each time is potentially very
        // expensive - instead, cache the number of chars a full buffer's
        // worth of data may produce.  Yes, this makes the perf optimization
        // less aggressive, in that all reads that asked for fewer than AND
        // returned fewer than _maxCharsPerBuffer chars won't get the user
        // buffer optimization.  This affects reads where the end of the
        // Stream comes in the middle somewhere, and when you ask for
        // fewer chars than than your buffer could produce.
        readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
        do {
            if (_checkPreamble) {
                BCLDebug.Assert(bytePos <= _preamble.Length, "possible bug in _compressPreamble.  Are two threads using this StreamReader at the same time?");
                int len = stream.Read(byteBuffer, bytePos, byteBuffer.Length - bytePos);
                BCLDebug.Assert(len >= 0, "Stream.Read returned a negative number!  This is a bug in your stream class.");

                if (len == 0) {
                    // EOF but we might have buffered bytes from previous
                    // attempts to detecting preamble that needs to decoded now
                    if (byteLen > 0) {
                        if (readToUserBuffer) {
                            charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                            charLen = 0;  // StreamReader's buffer is empty.
                        }
                        else {
                            charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                            charLen += charsRead;  // Number of chars in StreamReader's buffer.
                        }
                    }
                    return charsRead;
                }

                byteLen += len;
            }
            else {
                BCLDebug.Assert(bytePos == 0, "bytePos can be non zero only when we are trying to _checkPreamble.  Are two threads using this StreamReader at the same time?");
                byteLen = stream.Read(byteBuffer, 0, byteBuffer.Length);
                BCLDebug.Assert(byteLen >= 0, "Stream.Read returned a negative number!  This is a bug in your stream class.");
                if (byteLen == 0)  // EOF
                    return charsRead;
            }

            // _isBlocked == whether we read fewer bytes than we asked for.
            // Note we must check it here because CompressBuffer or
            // DetectEncoding will ---- with byteLen.
            _isBlocked = (byteLen < byteBuffer.Length);
            // Check for preamble before detect encoding. This is not to override the
            // user suppplied Encoding for the one we implicitly detect. The user could
            // customize the encoding which we will loose, such as ThrowOnError on UTF8
            // Note: we don't need to recompute readToUserBuffer optimization as IsPreamble
            // doesn't change the encoding or affect _maxCharsPerBuffer
            if (IsPreamble())
                continue;

            // On the first call to ReadBuffer, if we're supposed to detect the encoding, do it.
            if (_detectEncoding && byteLen >= 2) {
                DetectEncoding();
                // DetectEncoding changes some buffer state.  Recompute this.
                readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
            }

            charPos = 0;
            if (readToUserBuffer) {
                charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                charLen = 0;  // StreamReader's buffer is empty.
            }
            else {
                charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                charLen += charsRead;  // Number of chars in StreamReader's buffer.
            }
        } while (charsRead == 0);

        _isBlocked &= charsRead < desiredChars;
        //Console.WriteLine("ReadBuffer: charsRead: "+charsRead+"  readToUserBuffer: "+readToUserBuffer);
        return charsRead;
    }



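    So the problem comes down to this: when BaseStream.Seek is called the second time, the reader's buffer is not refilled! The second read is therefore served from what was buffered after the first Seek, that is, up to one buffer's length of content following the first Seek position. The position we asked for no longer matches the start of the cached data, and may not be in the cache at all.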

    To have the buffer refreshed for the second Seek, DiscardBufferedData() must be called explicitly:

    Code Snippet
    // DiscardBufferedData tells StreamReader to throw away its internal
    // buffer contents.  This is useful if the user needs to seek on the
    // underlying stream to a known location then wants the StreamReader
    // to start reading from this new point.  This method should be called
    // very sparingly, if ever, since it can lead to very poor performance.
    // However, it may be the only way of handling some scenarios where
    // users need to re-read the contents of a StreamReader a second time.
    public void DiscardBufferedData() {
        byteLen = 0;
        charLen = 0;
        charPos = 0;
        decoder = encoding.GetDecoder();
        _isBlocked = false;
    }
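
    Applied to the multi-segment reader from the beginning of the post, the fix might look like the sketch below. This is one possible arrangement rather than the only one: the class name is hypothetical, ReadBlock is swapped in for Read so that a short read is retried, and the offsets are treated as byte positions, which assumes one byte per character as in the test file.

```csharp
using System;
using System.IO;

// Hypothetical wrapper class for illustration only.
public static class SeekingReader
{
    // Reads lengths[i] characters starting at byte offset positions[i].
    // Assumes a one-byte-per-character file, such as the ASCII test file.
    public static string[] ReadContents(string fileName, int[] positions, int[] lengths)
    {
        using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var reader = new StreamReader(stream))
        {
            var contents = new string[positions.Length];

            for (int i = 0; i < positions.Length; i++)
            {
                stream.Seek(positions[i], SeekOrigin.Begin);
                // Drop whatever the reader buffered before this seek, so the
                // next read is refilled from the new stream position.
                reader.DiscardBufferedData();

                char[] buffer = new char[lengths[i]];
                int read = reader.ReadBlock(buffer, 0, lengths[i]);
                contents[i] = new string(buffer, 0, read);
            }

            return contents;
        }
    }
}
```

    With the test file from above, calling this with positions { 4, 10, 7 } and lengths { 2, 2, 2 } yields "EF", "KL" and "HI", matching the three separate ReadContent calls.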

    A Small Complaint

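    I recall that "Framework Design" discusses some regrets in the design of the .NET class library; I don't know whether this counts as one. I consider myself at least a seasoned developer, yet my first reaction to this problem was bewilderment. When I read the source, it struck me as tricky and smelly. The library designers clearly presumed the worst about programmers' intentions and abilities. They believed they had found a clever balance between performance and usability, but in practice the result is an ambiguous API that plainly leads to wrong results. Granted, statistically, reads tend to cluster around nearby positions; put differently, cached content is quite likely to be read next. But performance must always rest on correctness. What is regrettable about this API is that it overlooks the need for repeated Seeks.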

    Let us consider how it might have been designed.

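    For a full-featured design, the cache could stay, but positioning clearly cannot rely on BaseStream.Seek alone. StreamReader, or its base class TextReader, should expose a Seek API that wraps positioning of the BaseStream together with positioning within the cached data. Would such an API be friendlier to programmers? I think so; at the very least it would not mislead.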
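
    A rough sketch of what such an API could look like, built on top of the existing class, follows. SeekableStreamReader is a hypothetical type, not something the framework provides, and it takes the simplest route: it discards the cache on every seek instead of repositioning within cached data as suggested above.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical type for illustration; not part of the .NET class library.
public sealed class SeekableStreamReader : StreamReader
{
    public SeekableStreamReader(Stream stream, Encoding encoding)
        : base(stream, encoding)
    {
    }

    // Moves the underlying stream and invalidates the reader's buffer in one
    // step, so callers cannot forget the second half.
    // Note: SeekOrigin.Current is relative to the underlying stream, which may
    // be ahead of the characters already returned to the caller.
    public long Seek(long offset, SeekOrigin origin)
    {
        long position = BaseStream.Seek(offset, origin);
        DiscardBufferedData();
        return position;
    }
}
```

    Callers would then write `reader.Seek(10, SeekOrigin.Begin)` and never touch BaseStream directly. A more ambitious version could check whether the target offset already lies inside the buffered range and skip the refill, which is closer to the design described above but requires tracking byte offsets per character.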

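    For a small and focused design, the cache could be removed altogether and replaced by a buffer supplied by the programmer. Whether to implement caching at all would be left entirely to the user of the language. Such a language or library would be just as robust, and just as acceptable to programmers.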
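
    Something close to this is already possible today by skipping StreamReader and reading bytes from the stream directly. The sketch below is an illustration under stated assumptions: RawSegmentReader and ReadSegment are hypothetical names, and the bytes are decoded as ASCII, which only suits single-byte text like the test file.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class RawSegmentReader
{
    // Reads up to `length` bytes at byte offset `position` and decodes them
    // as ASCII. No StreamReader is involved, so there is no text-level
    // buffer that could go stale after a seek.
    public static string ReadSegment(Stream stream, long position, int length)
    {
        byte[] buffer = new byte[length];
        stream.Seek(position, SeekOrigin.Begin);

        int total = 0;
        while (total < length)
        {
            // Stream.Read may return fewer bytes than requested.
            int read = stream.Read(buffer, total, length - total);
            if (read == 0)
            {
                break; // end of stream
            }
            total += read;
        }

        return Encoding.ASCII.GetString(buffer, 0, total);
    }
}
```

    The trade-off is that decoding becomes the caller's responsibility: with a variable-width encoding such as UTF-8, a byte offset can land in the middle of a character, which is a problem StreamReader otherwise hides.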

    Summary

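    Recently the cnblogs community has seen deep and heated discussion about the C# language itself and the .NET class library. Privately, I think debate is what drives every language forward. I wanted to say something, and this small example came to mind. As programmers, we may care neither whether a pattern is supported by the language nor whether it is supported by the library. We only hope that class library design will have a little less of the cleverness shown in this example, and a little more plain solidity.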

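    Author: Jeffrey Sun
    Source: http://sun.cnblogs.com/
    This article is provided "as is", without warranty of any kind, and confers no rights. Copyright belongs to the author. Reposting is welcome, but unless the author agrees otherwise this notice must be retained and a link to the original must appear in a prominent place on the page; otherwise the author reserves the right to pursue legal liability.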

  • Original post: https://www.cnblogs.com/sun/p/1775311.html