  • UTF-8 BOM adventures in C#

    The StreamWriter source code is where this happens: Flush writes the encoding's preamble to the stream before the first chunk of data.

    private void Flush(bool flushStream, bool flushEncoder)
    {
        if (this.stream == null)
        {
            __Error.WriterClosed();
        }
        if (this.charPos == 0 && ((!flushStream && !flushEncoder) || CompatibilitySwitches.IsAppEarlierThanWindowsPhone8))
        {
            return;
        }
        if (!this.haveWrittenPreamble)
        {
            this.haveWrittenPreamble = true;
            byte[] preamble = this.encoding.GetPreamble();
            if (preamble.Length != 0)
            {
                this.stream.Write(preamble, 0, preamble.Length);
            }
        }
        int bytes = this.encoder.GetBytes(this.charBuffer, 0, this.charPos, this.byteBuffer, 0, flushEncoder);
        this.charPos = 0;
        if (bytes > 0)
        {
            this.stream.Write(this.byteBuffer, 0, bytes);
        }
        if (flushStream)
        {
            this.stream.Flush();
        }
    }
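
    To see the haveWrittenPreamble flag at work from the outside, here is a minimal sketch. It is my own illustration rather than part of the StreamWriter source, written as a small console program with top-level statements (so it assumes C# 9 or later). It flushes twice and dumps the resulting bytes.

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    using var ms = new MemoryStream();
    using var writer = new StreamWriter(ms, Encoding.UTF8);

    writer.Write("a");
    writer.Flush(); // first flush: preamble (EF BB BF) followed by 'a'

    writer.Write("b");
    writer.Flush(); // second flush: only 'b', the preamble is not repeated

    Console.WriteLine(BitConverter.ToString(ms.ToArray()));
    // prints "EF-BB-BF-61-62"
    ```

    The preamble shows up exactly once, in front of the first flushed data.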

    Whether a BOM is added also depends on whether the file is newly created. If the stream is seekable and its position is already past 0, the writer marks the preamble as already written:

    if (this.stream.CanSeek && this.stream.Position > 0L)
    {
        this.haveWrittenPreamble = true;
    }

    That check is part of Init:

    [SecuritySafeCritical]
    private void Init(Stream streamArg, Encoding encodingArg, int bufferSize, bool shouldLeaveOpen)
    {
        this.stream = streamArg;
        this.encoding = encodingArg;
        this.encoder = this.encoding.GetEncoder();
        if (bufferSize < 128)
        {
            bufferSize = 128;
        }
        this.charBuffer = new char[bufferSize];
        this.byteBuffer = new byte[this.encoding.GetMaxByteCount(bufferSize)];
        this.charLen = bufferSize;
        if (this.stream.CanSeek && this.stream.Position > 0L)
        {
            this.haveWrittenPreamble = true;
        }
        this.closable = !shouldLeaveOpen;
        if (Mda.StreamWriterBufferedDataLost.Enabled)
        {
            string cs = null;
            if (Mda.StreamWriterBufferedDataLost.CaptureAllocatedCallStack)
            {
                cs = Environment.GetStackTrace(null, false);
            }
            this.mdaHelper = new StreamWriter.MdaHelper(this, cs);
        }
    }
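
    The position check can be observed without touching the file system. The sketch below is my own illustration (a console program with top-level statements, assuming C# 9 or later). It writes one arbitrary byte to a MemoryStream before creating the writer, so Position is already 1 when the StreamWriter is initialized.

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    using var ms = new MemoryStream();
    ms.WriteByte(0x2A); // arbitrary byte, only here to move Position past 0

    // Encoding.UTF8 has a preamble, but the stream is seekable and not at 0,
    // so the writer treats the preamble as already written.
    using var writer = new StreamWriter(ms, Encoding.UTF8);
    writer.Write("Hi");
    writer.Flush();

    Console.WriteLine(BitConverter.ToString(ms.ToArray()));
    // prints "2A-48-69" (no EF-BB-BF)
    ```

    Appending to an existing, non-empty file should hit the same branch, since the stream position is then past 0, which is what keeps a BOM from ending up in the middle of the file.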

    Time for a quick look at UTF-8 encoding and the byte order mark (BOM). Let's jump right into some code. You are probably going to nail this, as you will most likely be alert now given the title and all, but would you have expected this test to pass?

    [Fact]
    public void Utf8Strings()
    {
        var initial = "Hello world!";
    
        using var ms = new MemoryStream();
        using var writer = new StreamWriter(ms, Encoding.UTF8);
    
        writer.Write(initial);
        writer.Flush();
    
        Assert.Equal(
            initial,
            Encoding.UTF8.GetString(ms.ToArray()));
    }

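    It does not pass: the decoded string is not equal to the original. So, what is happening here? Comparing the raw bytes makes it a bit clearer: the stream contains three extra bytes in front of the encoded text.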
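
    A byte-level comparison could be sketched like this. It is my own reconstruction, written as a console program with top-level statements (assuming C# 9 or later) instead of an xUnit test. The names written and encoded are just illustrative.

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    var initial = "Hello world!";

    using var ms = new MemoryStream();
    using var writer = new StreamWriter(ms, Encoding.UTF8);

    writer.Write(initial);
    writer.Flush();

    byte[] written = ms.ToArray();                    // what the StreamWriter produced
    byte[] encoded = Encoding.UTF8.GetBytes(initial); // what the encoding alone produces

    Console.WriteLine(encoded.Length);                       // prints "12"
    Console.WriteLine(written.Length);                       // prints "15"
    Console.WriteLine(BitConverter.ToString(written, 0, 3)); // prints "EF-BB-BF"
    ```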

    What are those extra bytes?

    It's the byte order mark (BOM). When it comes to UTF-8, it essentially just signals that the stream consists of UTF-8 encoded bytes. For UTF-16 and UTF-32 it also tells whether the bytes are in little- or big-endian order. Here's a good place to read about it in a somewhat understandable way: https://www.unicode.org/faq/utf_bom.html#bom1

    Here are some extracted parts from Unicode.Org's FAQ:

    Q: What does ‘endian’ mean?

    A: Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian...

    (https://www.unicode.org/faq/utf_bom.html#bom3)

    Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?

    A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order...

    (https://www.unicode.org/faq/utf_bom.html#bom5)
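
    To make the endianness part concrete: the BOM is the character U+FEFF, and its byte sequence depends on the encoding scheme. The sketch below is my own illustration, not taken from the FAQ (a console program with top-level statements, assuming C# 9 or later). It encodes that single character with three of the built-in encodings.

    ```csharp
    using System;
    using System.Text;

    var bom = "\uFEFF"; // the BOM character itself

    // UTF-16 stores the 16-bit code unit 0xFEFF in either byte order.
    string utf16Le = BitConverter.ToString(Encoding.Unicode.GetBytes(bom));
    string utf16Be = BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(bom));

    // UTF-8 has a single, fixed byte order.
    string utf8 = BitConverter.ToString(Encoding.UTF8.GetBytes(bom));

    Console.WriteLine(utf16Le); // prints "FF-FE"
    Console.WriteLine(utf16Be); // prints "FE-FF"
    Console.WriteLine(utf8);    // prints "EF-BB-BF"
    ```

    A reader that sees FF FE or FE FF at the start of a UTF-16 stream can tell the byte order from it. The three UTF-8 bytes carry no such information.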

    Can we find the BOM for UTF-8 in .NET?

    Yes. It's exposed through the Encoding.Preamble property and the Encoding.GetPreamble() method:

    [Fact]
    public void ItIsTheBom()
    {
        Assert.Equal(
            new[] { 0xEF, 0xBB, 0xBF },
            new[] { 239, 187, 191 });
    
        Assert.Equal(
            new byte[] { 239, 187, 191 },
            Encoding.UTF8.GetPreamble());
    }

    The docs (https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding.getpreamble?view=netcore-3.1) say:

    When overridden in a derived class, returns a sequence of bytes that specifies the encoding used.

    Looking at the Unicode specification for UTF-8 in particular, the BOM is actually not required (see D95 under 3.10 Unicode Encoding Schemes).

    Can we get rid of it?

    Yes. Just don't use Encoding.UTF8. Instead, create your own UTF8Encoding instance and pass false for encoderShouldEmitUTF8Identifier so that it does not emit the indicator: new UTF8Encoding(false)

    [Fact]
    public void Utf8StringsWithoutBom()
    {
        var initial = "Hello world!";
    
        using var ms = new MemoryStream();
        using var writer = new StreamWriter(ms, new UTF8Encoding(false));
    
        writer.Write(initial);
        writer.Flush();
    
        Assert.Equal(
            initial,
            Encoding.UTF8.GetString(ms.ToArray()));
    }

    Great! But then I don't really need a Stream and a StreamWriter? I can just use an encoding instance that excludes the preamble. Right?

    [Fact]
    public void Outsmarted()
    {
        var initial = "Hello world!";
        var encWithBom = new UTF8Encoding(true);
        var encWithoutBom = new UTF8Encoding(false);
    
        var rWithBom = encWithBom.GetBytes(initial);
        var rWithoutBom = encWithoutBom.GetBytes(initial);
    
        Assert.NotEqual(
            rWithBom,
            rWithoutBom);
    }

    No. This test fails as well: GetBytes never emits the preamble, so both arrays are identical. It's the StreamWriter that makes use of the encoding's Preamble. Creating the encoding instance with false just makes the Preamble an empty array of bytes.
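
    If you do want BOM-prefixed bytes without a StreamWriter, you have to put the preamble in front yourself. Here is a minimal sketch of that, my own illustration (a console program with top-level statements, assuming C# 9 or later). The variable names are just for the example.

    ```csharp
    using System;
    using System.Text;

    var enc = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);

    byte[] preamble = enc.GetPreamble();        // EF BB BF for this instance
    byte[] body = enc.GetBytes("Hello world!"); // never contains the preamble

    // Concatenate preamble + body by hand.
    byte[] withBom = new byte[preamble.Length + body.Length];
    preamble.CopyTo(withBom, 0);
    body.CopyTo(withBom, preamble.Length);

    Console.WriteLine(body.Length);                          // prints "12"
    Console.WriteLine(withBom.Length);                       // prints "15"
    Console.WriteLine(BitConverter.ToString(withBom, 0, 3)); // prints "EF-BB-BF"
    ```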

    That's all for this post. Hope I clarified something.

    Cheers,

    //Daniel

  • Original article: https://www.cnblogs.com/chucklu/p/14648153.html