Is it possible to force Excel recognize UTF-8 CSV files automatically?
The UTF-8 Byte-order marker will clue Excel 2007+ in to the fact that you're using UTF-8. (See this SO post).
In case anybody is having the same issue I was: .NET's UTF8 encoding class does not output a byte-order mark from a GetBytes()
call. You need to write through a stream (or use a workaround) to get the BOM into the output.
How to GetBytes() in C# with UTF8 encoding with BOM?
Answer 1
Try like this:
public ActionResult Download()
{
var data = Encoding.UTF8.GetBytes("some data");
var result = Encoding.UTF8.GetPreamble().Concat(data).ToArray();
return File(result, "application/csv", "foo.csv");
}
The reason is that the UTF8Encoding constructor that takes a boolean parameter doesn't do what you would expect:
byte[] bytes = new UTF8Encoding(true).GetBytes("a");
The resulting array would contain a single byte with the value 97. There's no BOM because GetBytes() never emits the preamble; the boolean flag only affects GetPreamble() and stream writers, since UTF-8 doesn't require a BOM.
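A minimal sketch of that behavior (requires System.Text and System.Linq; the byte values are the documented UTF-8 preamble):
var withBom = new UTF8Encoding(true);
byte[] data = withBom.GetBytes("a");             // { 0x61 } - the flag does not affect GetBytes()
byte[] preamble = withBom.GetPreamble();         // { 0xEF, 0xBB, 0xBF }
byte[] result = preamble.Concat(data).ToArray(); // BOM-prefixed bytes, as in the action above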
I created a simple extension to convert any string, in any encoding, to the byte-array representation it would have when written to a file or stream:
public static class StreamExtensions
{
public static byte[] ToBytes(this string value, Encoding encoding)
{
using (var stream = new MemoryStream())
using (var sw = new StreamWriter(stream, encoding))
{
sw.Write(value);
sw.Flush();
return stream.ToArray();
}
}
}
Usage:
stringValue.ToBytes(Encoding.UTF8)
This also works for other encodings, such as UTF-16, which requires a BOM.
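For example, with the default StreamWriter preamble behavior the extension yields BOM-prefixed bytes (byte values shown as a sanity check):
byte[] utf8 = "a".ToBytes(Encoding.UTF8);     // 4 bytes: EF BB BF 61
byte[] utf16 = "a".ToBytes(Encoding.Unicode); // 4 bytes: FF FE 61 00 (UTF-16 LE)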
Microsoft Excel mangles Diacritics in .csv files?
diacritics: accent marks such as é and ü
A correctly formatted UTF8 file can have a Byte Order Mark as its first three octets. These are the hex values 0xEF, 0xBB, 0xBF. These octets serve to mark the file as UTF8 (since they are not relevant as "byte order" information). If this BOM does not exist, the consumer/reader is left to infer the encoding type of the text. Readers that are not UTF8 capable will read the bytes as some other encoding, such as Windows-1252, and display the characters  at the start of the file.
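A small sketch that reproduces both facts (on .NET Core/5+, code page 1252 additionally requires registering the System.Text.Encoding.CodePages package):
byte[] bom = Encoding.UTF8.GetPreamble();                   // { 0xEF, 0xBB, 0xBF }
string misread = Encoding.GetEncoding(1252).GetString(bom); // decode as Windows-1252
Console.WriteLine(misread);                                 // "" - what a non-UTF-8 reader shows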
There is a known bug where Excel, upon opening UTF8 CSV files via file association, assumes that they are in a single-byte encoding, disregarding the presence of the UTF8 BOM. This cannot be fixed by any system default codepage or language setting. The BOM will not clue in Excel - it just won't work. (A minority report claims that the BOM sometimes triggers the "Import Text" wizard.) This bug appears to exist in Excel 2003 and earlier. Most reports (amidst the answers here) say that this is fixed in Excel 2007 and newer.
Note that you can always* correctly open UTF8 CSV files in Excel using the "Import Text" wizard, which allows you to specify the encoding of the file you're opening. Of course this is much less convenient.
Readers of this answer are most likely in a situation where they don't particularly support Excel < 2007, but are sending raw UTF8 text to Excel, which is misinterpreting it and sprinkling your text with Ã and other similar Windows-1252 characters. Adding the UTF8 BOM is probably your best and quickest fix.
If you are stuck with users on older Excels, and Excel is the only consumer of your CSVs, you can work around this by exporting UTF16 instead of UTF8. Excel 2000 and 2003 will double-click-open these correctly. (Some other text editors can have issues with UTF16, so you may have to weigh your options carefully.)
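A minimal sketch of the UTF-16 workaround (the file name is illustrative; Excel is commonly reported to split UTF-16 text files on tabs rather than commas, so this writes tab-delimited output):
// Encoding.Unicode is UTF-16 LE; StreamWriter writes its BOM (FF FE)
// automatically when the underlying stream starts at position 0.
using (var writer = new StreamWriter("report.txt", false, Encoding.Unicode))
{
    writer.WriteLine("Id\tName"); // tab-separated so old Excels parse the columns
    writer.WriteLine("1\taîn");
}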
* Except when you can't, (at least) Excel 2011 for Mac's Import Wizard does not actually always work with all encodings, regardless of what you tell it. </anecdotal-evidence> :)
Writing a CSV with a BOM using CsvHelper
public class Foo
{
    public int Id { get; set; }
    public string Name { get; set; }
}

class CsvHelperTest
{
    [Test]
    public void Test20210410()
    {
        var records = new List<Foo>
        {
            new Foo { Id = 1, Name = "one" },
            new Foo { Id = 2, Name = "aîn" },
        };
        // Verbatim string (@) so the backslashes in the path are not treated as escape sequences
        using (var writer = new StreamWriter(@"C:\workspace\Edenred\LISA\Troubleshooting\Daily Sales Report_2021_04_10.csv", false, Encoding.UTF8))
        using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
        {
            csv.WriteHeader<Foo>();
            csv.NextRecord();
            foreach (var record in records)
            {
                csv.WriteRecord(record);
                csv.NextRecord();
            }
        }
    }
}
Strangely, even without passing true or false for UTF-8, the following code also writes a BOM. (The static Encoding.UTF8 property returns a UTF8Encoding whose "emit UTF-8 identifier" flag is true, so StreamWriter writes the preamble whenever the output stream starts at position 0.)
[Test]
public void Test20210412003()
{
    // Verbatim string (@) so the backslashes are not treated as escape sequences
    var path = @"C:\workspace\Edenred\LISA\Troubleshooting\Daily Sales Report_2021_04_12-003.csv";
    using (var fs = File.Create(path))
    using (var streamWriter = new StreamWriter(fs, Encoding.UTF8)) // Encoding.UTF8 emits a BOM by default
    {
        streamWriter.Write("aîn");
    }
}
StreamWriter and UTF-8 Byte Order Marks
Here, passing true writes a BOM and passing false does not:
[Test]
public void Test20210412003()
{
    var path = @"C:\workspace\Edenred\LISA\Troubleshooting\Daily Sales Report_2021_04_12-003.csv";
    using (var fs = File.Create(path))
    using (var streamWriter = new StreamWriter(fs, new UTF8Encoding(true))) // true = emit BOM, false = no BOM
    {
        streamWriter.Write("aîn");
    }
}
Create Text File Without BOM
Well, it writes the BOM because you are instructing it to in the line
Encoding utf8WithoutBom = new UTF8Encoding(true);
Despite the variable name, true means that the BOM should be emitted; using
Encoding utf8WithoutBom = new UTF8Encoding(false);
writes no BOM.
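A quick way to see the difference (the paths are illustrative):
File.WriteAllText(@"C:\temp\with-bom.txt", "a", new UTF8Encoding(true));      // 4 bytes: EF BB BF 61
File.WriteAllText(@"C:\temp\without-bom.txt", "a", new UTF8Encoding(false));  // 1 byte: 61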
My objective is to create a file using UTF-8 as Encoding and 8859-1 as CharSet.
Sadly, this is not possible; either you write UTF-8 or you don't. That is, as long as the characters you are writing are plain ASCII (which UTF-8 and ISO 8859-1 encode identically), the output will look like an ISO 8859-1 file; however, as soon as you output a character outside ASCII (e.g. ä, ö, ü), UTF-8 writes it as a multibyte sequence that no longer matches its single-byte ISO 8859-1 encoding.
To write true ISO-8859-1 use:
Encoding isoLatin1Encoding = Encoding.GetEncoding("ISO-8859-1");
Edit: After balexandre's comment
I used the following code for testing ...
var filePath = @"c: emp est.txt";
var sb = new StringBuilder();
sb.Append("dsfaskd jlsadfj laskjdflasjdf asdkfjalksjdf lkjdsfljas dddd jflasjdflkjasdlfkjasldfl asääääjdflkaslj d f");
Encoding isoLatin1Encoding = Encoding.GetEncoding("ISO-8859-1");
TextWriter tw = new StreamWriter(filePath, false, isoLatin1Encoding);
tw.WriteLine(sb.ToString());
tw.Close();
And the file looks perfectly well. Obviously, you should use the same encoding when reading the file.
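For the read side, the matching call would be (a sketch reusing filePath and the encoding from the test above):
string text = File.ReadAllText(filePath, Encoding.GetEncoding("ISO-8859-1"));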
The solution I ended up using
Reference: https://www.cyotek.com/blog/manually-writing-the-byte-order-mark-bom-for-an-encoding-into-a-stream
// Create the file, write the UTF-8 BOM manually for CSVs, then copy the payload stream in.
var fs = File.Create(newFilePath);
if (exportFileName.EndsWith(".csv", StringComparison.InvariantCultureIgnoreCase))
{
    var bytes = Encoding.UTF8.GetPreamble();
    fs.Write(bytes, 0, bytes.Length);
}
exportStream.Position = 0; // copy from the start of the payload
exportStream.CopyTo(fs);   // safer than a single Read(), which may return fewer bytes than requested
fs.Close();
// Append variant: write the BOM only when the file is still empty;
// appending it to an existing file would put the BOM mid-file.
FileStream fs = File.Open(filePath, FileMode.Append);
BinaryWriter writer = new BinaryWriter(fs);
if (fs.Length == 0 && filePath.EndsWith(".csv", StringComparison.InvariantCultureIgnoreCase))
{
    writer.Write(Encoding.UTF8.GetPreamble());
}
writer.Write(data);
writer.Close();
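For reuse, the same idea fits in a small helper (a sketch; the method name is mine, not from the original code):
public static void CopyWithUtf8Bom(Stream source, string destinationPath)
{
    using (var fs = File.Create(destinationPath))
    {
        var bom = Encoding.UTF8.GetPreamble(); // EF BB BF
        fs.Write(bom, 0, bom.Length);
        source.CopyTo(fs);
    }
}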