The Problem
To read a chunk of data starting at a given position in a file, we would typically write something like this:
    private static string ReadContent(string fileName, int position, int length)
    {
        if (!File.Exists(fileName))
        {
            throw new FileNotFoundException("The specified file is not found : " + fileName);
        }
        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            reader.BaseStream.Seek(position, SeekOrigin.Begin);
            char[] buffer = new char[length];
            reader.Read(buffer, 0, length);
            return new string(buffer, 0, length);
        }
    }
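This code is straightforward and easy to follow. To read several such segments from the same file, we would typically write the following (it takes multiple positions and the corresponding lengths to read; the parameter list is only illustrative):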
    private static string[] ReadContents(string fileName, int[] positions, int[] lengths)
    {
        if (!File.Exists(fileName))
        {
            throw new FileNotFoundException("The specified file is not found : " + fileName);
        }
        using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader reader = new StreamReader(stream))
        {
            string[] contents = new string[positions.Length];
            for (int i = 0; i < positions.Length; i++)
            {
                reader.BaseStream.Seek(positions[i], SeekOrigin.Begin);
                char[] buffer = new char[lengths[i]];
                reader.Read(buffer, 0, lengths[i]);
                contents[i] = new string(buffer, 0, lengths[i]);
            }
            return contents;
        }
    }
This looks fine too. But a small test program produces a surprising result:
    static void Main(string[] args)
    {
        string fileName = @"text.txt";
        using (FileStream stream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
        using (StreamWriter writer = new StreamWriter(stream))
        {
            writer.Write("ABCDEFGHIJKLMNOPQ");
        }
        Console.WriteLine(ReadContent(fileName, 4, 2));
        Console.WriteLine(ReadContent(fileName, 10, 2));
        Console.WriteLine(ReadContent(fileName, 7, 2));
        Console.WriteLine();
        string[] contents = ReadContents(fileName, new int[] { 4, 10, 7 }, new int[] { 2, 2, 2 });
        foreach (var item in contents)
        {
            Console.WriteLine(item);
        }
        Console.ReadKey();
    }
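The output is (the three separate ReadContent calls first, then the single ReadContents call):
    EF
    KL
    HI

    EF
    GH
    IJ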
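So when we reposition within the same stream, the class library API does not return the content we expect. Instead, it looks as if every Seek after the first one on a file stream object is simply ignored! Why?
Analysis
In fact, for performance reasons StreamReader keeps and maintains its own internal byte buffer. If no buffer size is specified when the StreamReader is constructed, this buffer defaults to 1 KB. The 4 KB figure is the default buffer size of the underlying FileStream (DefaultFileStreamBufferSize in the source below). Every Read call is translated, directly or indirectly, into an operation on the StreamReader's buffer, as the StreamReader source shows: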
    // Using a 1K byte buffer and a 4K FileStream buffer works out pretty well
    // perf-wise. On even a 40 MB text file, any perf loss by using a 4K
    // buffer is negated by the win of allocating a smaller byte[], which
    // saves construction time. This does break adaptive buffering,
    // but this is slightly faster.
    internal const int DefaultBufferSize = 1024; // Byte buffer size
    private const int DefaultFileStreamBufferSize = 4096;
    private const int MinBufferSize = 128;

    // This version has a perf optimization to decode data DIRECTLY into the
    // user's buffer, bypassing StreamWriter's own buffer.
    // This gives a > 20% perf improvement for our encodings across the board,
    // but only when asking for at least the number of characters that one
    // buffer's worth of bytes could produce.
    // This optimization, if run, will break SwitchEncoding, so we must not do
    // this on the first call to ReadBuffer.
    private int ReadBuffer(char[] userBuffer, int userOffset, int desiredChars, out bool readToUserBuffer) {
        charLen = 0;
        charPos = 0;
        if (!_checkPreamble)
            byteLen = 0;
        int charsRead = 0;
        // As a perf optimization, we can decode characters DIRECTLY into a
        // user's char[]. We absolutely must not write more characters
        // into the user's buffer than they asked for. Calculating
        // encoding.GetMaxCharCount(byteLen) each time is potentially very
        // expensive - instead, cache the number of chars a full buffer's
        // worth of data may produce. Yes, this makes the perf optimization
        // less aggressive, in that all reads that asked for fewer than AND
        // returned fewer than _maxCharsPerBuffer chars won't get the user
        // buffer optimization. This affects reads where the end of the
        // Stream comes in the middle somewhere, and when you ask for
        // fewer chars than than your buffer could produce.
        readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
        do {
            if (_checkPreamble) {
                BCLDebug.Assert(bytePos <= _preamble.Length, "possible bug in _compressPreamble. Are two threads using this StreamReader at the same time?");
                int len = stream.Read(byteBuffer, bytePos, byteBuffer.Length - bytePos);
                BCLDebug.Assert(len >= 0, "Stream.Read returned a negative number! This is a bug in your stream class.");
                if (len == 0) {
                    // EOF but we might have buffered bytes from previous
                    // attempts to detecting preamble that needs to decoded now
                    if (byteLen > 0) {
                        if (readToUserBuffer) {
                            charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                            charLen = 0; // StreamReader's buffer is empty.
                        }
                        else {
                            charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                            charLen += charsRead; // Number of chars in StreamReader's buffer.
                        }
                    }
                    return charsRead;
                }
                byteLen += len;
            }
            else {
                BCLDebug.Assert(bytePos == 0, "bytePos can be non zero only when we are trying to _checkPreamble. Are two threads using this StreamReader at the same time?");
                byteLen = stream.Read(byteBuffer, 0, byteBuffer.Length);
                BCLDebug.Assert(byteLen >= 0, "Stream.Read returned a negative number! This is a bug in your stream class.");
                if (byteLen == 0) // EOF
                    return charsRead;
            }
            // _isBlocked == whether we read fewer bytes than we asked for.
            // Note we must check it here because CompressBuffer or
            // DetectEncoding will ---- with byteLen.
            _isBlocked = (byteLen < byteBuffer.Length);
            // Check for preamble before detect encoding. This is not to override the
            // user suppplied Encoding for the one we implicitly detect. The user could
            // customize the encoding which we will loose, such as ThrowOnError on UTF8
            // Note: we don't need to recompute readToUserBuffer optimization as IsPreamble
            // doesn't change the encoding or affect _maxCharsPerBuffer
            if (IsPreamble())
                continue;
            // On the first call to ReadBuffer, if we're supposed to detect the encoding, do it.
            if (_detectEncoding && byteLen >= 2) {
                DetectEncoding();
                // DetectEncoding changes some buffer state. Recompute this.
                readToUserBuffer = desiredChars >= _maxCharsPerBuffer;
            }
            charPos = 0;
            if (readToUserBuffer) {
                charsRead += decoder.GetChars(byteBuffer, 0, byteLen, userBuffer, userOffset + charsRead);
                charLen = 0; // StreamReader's buffer is empty.
            }
            else {
                charsRead = decoder.GetChars(byteBuffer, 0, byteLen, charBuffer, charsRead);
                charLen += charsRead; // Number of chars in StreamReader's buffer.
            }
        } while (charsRead == 0);
        _isBlocked &= charsRead < desiredChars;
        //Console.WriteLine("ReadBuffer: charsRead: "+charsRead+" readToUserBuffer: "+readToUserBuffer);
        return charsRead;
    }
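So the problem boils down to this: when BaseStream.Seek is called a second time, the reader's buffer is not refilled! The second read therefore returns data that was buffered during the first read, that is, content taken from the buffer's worth of bytes (1 KB by default for a StreamReader constructed this way) that follows the first Seek position. The position requested by the second Seek corresponds to a completely different offset within that cache (or is not in the cache at all).
To refresh the cache before reading after the second Seek, DiscardBufferedData() must be called explicitly: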
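The stale buffer can be observed directly. The following is a minimal sketch, not code from the original post: it uses a MemoryStream instead of a file (an assumption made to keep it self-contained), and the class name StaleBufferDemo is purely illustrative. After a two-character read, the underlying stream has already been consumed to its end, and a later Seek on the base stream does not affect what the reader returns.

```csharp
using System;
using System.IO;
using System.Text;

static class StaleBufferDemo
{
    // Returns the base stream position observed right after the first
    // 2-char read, and the text returned by a second read that follows
    // another Seek on the base stream.
    public static (long positionAfterRead, string second) Run()
    {
        byte[] data = Encoding.ASCII.GetBytes("ABCDEFGHIJKLMNOPQ"); // 17 bytes
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream))
        {
            var buffer = new char[2];

            stream.Seek(4, SeekOrigin.Begin);
            reader.Read(buffer, 0, 2);              // "EF"; the reader buffers the rest of the stream

            long positionAfterRead = stream.Position; // 17 (end of stream), not 6

            stream.Seek(10, SeekOrigin.Begin);      // moves the stream, not the reader's buffer
            reader.Read(buffer, 0, 2);              // served from the stale buffer: "GH"

            return (positionAfterRead, new string(buffer));
        }
    }
}
```

The reader asked the stream for a full buffer on the first read, so everything after offset 4 was already cached before the second Seek happened.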
    // DiscardBufferedData tells StreamReader to throw away its internal
    // buffer contents. This is useful if the user needs to seek on the
    // underlying stream to a known location then wants the StreamReader
    // to start reading from this new point. This method should be called
    // very sparingly, if ever, since it can lead to very poor performance.
    // However, it may be the only way of handling some scenarios where
    // users need to re-read the contents of a StreamReader a second time.
    public void DiscardBufferedData() {
        byteLen = 0;
        charLen = 0;
        charPos = 0;
        decoder = encoding.GetDecoder();
        _isBlocked = false;
    }
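Applied to the multi-segment reader, the fix might look like the sketch below. It is an illustration rather than a definitive implementation: the class name SegmentReader is made up, and it assumes a single-byte encoding such as ASCII with no byte order mark, so that the byte offsets passed to Seek coincide with character positions. ReadBlock is used instead of Read so that a short read near the end of the file is reflected in the returned string.

```csharp
using System;
using System.IO;

static class SegmentReader
{
    public static string[] ReadContents(string fileName, int[] positions, int[] lengths)
    {
        if (positions.Length != lengths.Length)
        {
            throw new ArgumentException("positions and lengths must have the same number of elements.");
        }
        using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (var reader = new StreamReader(stream))
        {
            var contents = new string[positions.Length];
            for (int i = 0; i < positions.Length; i++)
            {
                // Move the underlying stream, then drop whatever the reader
                // buffered from the previous position.
                stream.Seek(positions[i], SeekOrigin.Begin);
                reader.DiscardBufferedData();

                var buffer = new char[lengths[i]];
                int read = reader.ReadBlock(buffer, 0, lengths[i]);
                contents[i] = new string(buffer, 0, read);
            }
            return contents;
        }
    }
}
```

With the test file containing "ABCDEFGHIJKLMNOPQ", calling this with positions { 4, 10, 7 } and lengths { 2, 2, 2 } yields "EF", "KL" and "HI". The price is that every segment triggers a fresh read from the stream, which is exactly the performance cost the source comment warns about.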
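A Small Complaint
I remember Framework Design discussing some regrets in the design of the .NET class library; I don't know whether this counts as one. I consider myself at least a seasoned developer, yet my first reaction on hitting this problem was bewilderment. When I read the source, it reeked of tricky, smelly code. The library designers evidently made uncharitable assumptions about programmers' intent and ability. They believed they had found a clever balance between performance and usability, but in practice the result is an ambiguous API that plainly leads to wrong results. Granted, statistically, reads tend to cluster in nearby locations; put differently, cached content has a good chance of being read next. But performance must always be built on correctness. The regrettable thing about this API is that it overlooks the need to Seek more than once.
Let us speculate about how it could have been designed.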
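If the goal is to be comprehensive, the cache can certainly stay, but then relying on BaseStream's Seek alone is clearly not enough. StreamReader, or its base class TextReader, should offer a Seek API that wraps positioning of the BaseStream together with positioning within the cached data. Would such an API be friendlier to programmers? I think so; at the very least it would not be misleading.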
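As a rough illustration of this idea, here is a hypothetical wrapper; SeekableTextReader and its members are invented for this sketch and are not part of the .NET class library. It is also simpler than the design described above: instead of repositioning within the cached data, it just discards the cache on every Seek. Offsets are byte offsets, so character positions only line up for single-byte encodings.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical wrapper: exposes Seek on the reader itself so that the
// stream position and the reader's buffer can never get out of sync.
sealed class SeekableTextReader : IDisposable
{
    private readonly StreamReader _reader;

    public SeekableTextReader(Stream stream, Encoding encoding)
    {
        _reader = new StreamReader(stream, encoding);
    }

    // Positions the underlying stream and invalidates buffered data.
    public void Seek(long byteOffset, SeekOrigin origin)
    {
        _reader.BaseStream.Seek(byteOffset, origin);
        _reader.DiscardBufferedData();
    }

    // Reads up to charCount characters from the current position.
    public string ReadString(int charCount)
    {
        var buffer = new char[charCount];
        int read = _reader.ReadBlock(buffer, 0, charCount);
        return new string(buffer, 0, read);
    }

    public void Dispose()
    {
        _reader.Dispose();
    }
}
```

Because callers never touch BaseStream directly, there is only one way to reposition, and it always keeps the buffer consistent.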
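If the goal is to be small and focused, the cache could be removed altogether and replaced by a buffer supplied by the programmer, leaving it entirely to the user of the library to decide whether to implement caching. A language or class library built this way is just as robust, and just as acceptable to programmers.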
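A minimal sketch of this cache-free style, under stated assumptions: RawSegmentReader and ReadSegment are illustrative names, the length is counted in bytes rather than characters, and the caller owns both the buffer and the choice of encoding. Nothing is retained between calls, so every Seek takes effect.

```csharp
using System;
using System.IO;
using System.Text;

static class RawSegmentReader
{
    // Reads up to `length` bytes starting at `position` into the caller's
    // buffer and decodes them with the given encoding.
    public static string ReadSegment(Stream stream, long position, int length,
                                     byte[] buffer, Encoding encoding)
    {
        if (buffer.Length < length)
        {
            throw new ArgumentException("buffer is smaller than the requested length.");
        }
        stream.Seek(position, SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop until
        // the segment is complete or the end of the stream is reached.
        int total = 0;
        while (total < length)
        {
            int n = stream.Read(buffer, total, length - total);
            if (n == 0)
            {
                break; // end of stream
            }
            total += n;
        }
        return encoding.GetString(buffer, 0, total);
    }
}
```

A caller who wants caching can still add it on top, for example by reusing one large buffer across nearby reads, but that decision is now explicit.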
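Summary
Discussion of the C# language and the .NET class library has been deep and lively in this community lately. Personally, I believe debate is what drives every language forward. I wanted to say something, and this little example came to mind. As programmers, we may care neither about whether the language supports a pattern nor about whether the class library does. We only hope that class library design will have a little less of the cleverness shown in this example and a little more solid substance.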