C#: A Short Talk on Encoding (Part 2)

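Using Encoding

Encoding is fairly simple to use. If all you need is to convert between bytes and characters, GetBytes() and GetChars(), together with their overloads, will cover almost everything.

GetByteCount() and its overloads return the exact number of bytes a string produces when it is encoded.

GetCharCount() and its overloads return the exact number of characters a byte array produces when it is decoded.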
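As a small illustration of the four methods just mentioned, here is a minimal sketch. The class name EncodingBasics and the helper RoundTrip are made up for this example; only the Encoding calls themselves come from the framework.

```csharp
using System;
using System.Text;

public static class EncodingBasics
{
    // Hypothetical helper: encodes text to bytes and decodes it back,
    // reporting the exact byte count and char count along the way.
    public static string RoundTrip(Encoding encoding, string text,
                                   out int byteCount, out int charCount)
    {
        byteCount = encoding.GetByteCount(text);   // exact number of bytes for this string
        byte[] bytes = encoding.GetBytes(text);    // chars -> bytes
        charCount = encoding.GetCharCount(bytes);  // exact number of chars for these bytes
        char[] chars = encoding.GetChars(bytes);   // bytes -> chars
        return new string(chars);
    }

    public static void Main()
    {
        string text = "test我";
        foreach (Encoding e in new Encoding[] { Encoding.UTF8, Encoding.Unicode, Encoding.UTF32 })
        {
            int b, c;
            string back = RoundTrip(e, text, out b, out c);
            Console.WriteLine("{0}: {1} bytes, {2} chars, round trip ok: {3}",
                              e.WebName, b, c, back == text);
        }
    }
}
```

The same five characters take a different number of bytes in each encoding, while the decoded character count stays the same.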

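Pay attention to these two methods: int GetMaxByteCount(int charCount); int GetMaxCharCount(int byteCount);

GetMaxByteCount() does not behave the way you might expect. It does not return charCount for a single-byte encoding and charCount * 2 for a double-byte encoding; it returns charCount + 1 and (charCount + 1) * 2 respectively.
Console.WriteLine("The max byte count is {0}.", Encoding.Unicode.GetMaxByteCount(10));
Console.WriteLine("The max byte count is {0}.", Encoding.ASCII.GetMaxByteCount(10));
The results above are 22 and 11, not 20 and 10. I found the reason in an English blog post. My English is not great, and I did not really work out what a high surrogate and a low surrogate are (for reference, they are the two 16-bit code units that UTF-16 uses together to represent one character outside the Basic Multilingual Plane): http://blogs.msdn.com/b/shawnste/archive/2005/03/02/383903.aspx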

For example, Encoding.GetEncoding(1252).GetMaxByteCount(1) returns 2. 1252 is a single byte code page (encoding), so generally one would expect that GetMaxByteCount(n) would return n, but it doesn't, it usually returns n+1.

One reason for this oddity is that an Encoder could store a high surrogate on one call to GetBytes(), hoping that the next call is a low surrogate. This allows the fallback mechanism to provide a fallback for a complete surrogate pair, even if that pair is split between calls to GetBytes(). If the fallback returns a ? for each surrogate half, or if the next call doesn't have a surrogate, then 2 characters could be output for that surrogate pair. So in this case, calling Encoder.GetBytes() with a high surrogate would return 0 bytes and then following that with another call with only the low surrogate would return 2 bytes.
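In practice GetMaxByteCount() is useful as a worst-case buffer size rather than as an exact count. The sketch below assumes that use case; MaxByteCountDemo and EncodeIntoWorstCaseBuffer are hypothetical names, not part of the framework.

```csharp
using System;
using System.Text;

public static class MaxByteCountDemo
{
    // Hypothetical helper: allocates a buffer using the worst-case estimate,
    // encodes into it, and returns how many bytes were actually written.
    public static int EncodeIntoWorstCaseBuffer(Encoding encoding, string text, out int bufferSize)
    {
        bufferSize = encoding.GetMaxByteCount(text.Length); // upper bound, not the exact size
        byte[] buffer = new byte[bufferSize];
        return encoding.GetBytes(text, 0, text.Length, buffer, 0);
    }

    public static void Main()
    {
        string text = "test我";
        foreach (Encoding e in new Encoding[] { Encoding.ASCII, Encoding.UTF8, Encoding.Unicode })
        {
            int size;
            int written = EncodeIntoWorstCaseBuffer(e, text, out size);
            Console.WriteLine("{0}: buffer {1} bytes, used {2} bytes", e.WebName, size, written);
        }
    }
}
```

When the exact size matters, GetByteCount() on the actual text is the method to use; GetMaxByteCount() only guarantees that the buffer is never too small.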

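The code below is a simple application of Encoding. Print the results and compare them with what the previous part covered; you should learn something from it.
// Prints the encoding's type name, the encoded bytes, and the decoded string.
static void Output(Encoding encoding, string t)
{
    Console.WriteLine(encoding.ToString());
    byte[] buffer = encoding.GetBytes(t);
    foreach (byte b in buffer)
    {
        Console.Write(b + "-");
    }
    string s = encoding.GetString(buffer);
    Console.WriteLine(s);
}

string strTest = "test我镕a有κ";
Console.WriteLine(strTest);
// On .NET Core and later, "gb18030" is only available after registering CodePagesEncodingProvider.
Output(Encoding.GetEncoding("gb18030"), strTest);
Output(Encoding.Default, strTest);
Output(Encoding.UTF32, strTest);
Output(Encoding.UTF8, strTest);
Output(Encoding.Unicode, strTest);
Output(Encoding.ASCII, strTest);
// Encoding.UTF7 is marked obsolete in .NET 5 and later.
Output(Encoding.UTF7, strTest);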
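About the BOM

BOM stands for Byte Order Mark. It is a short byte sequence at the start of a text that identifies which encoding the text uses. For example, when Notepad opens a text file that contains a BOM, it can tell which encoding was used, decode the file accordingly, and display the text without garbled characters. Without a BOM, Notepad opens the file as ANSI by default, which may produce garbled characters. You can call the Encoding method GetPreamble() to find out whether an encoding has a BOM. At present only the following 5 encodings in the CLR have one.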
UTF-8: EF BB BF

UTF-16 big endian: FE FF

UTF-16 little endian: FF FE

UTF-32 big endian: 00 00 FE FF

UTF-32 little endian: FF FE 00 00
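The list above can be checked directly with GetPreamble(). This is a minimal sketch; PreambleDemo and PreambleHex are names invented for the example.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Hypothetical helper: formats an encoding's BOM as hex, or "(none)" if it has no BOM.
    public static string PreambleHex(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0
            ? "(none)"
            : BitConverter.ToString(preamble).Replace('-', ' ');
    }

    public static void Main()
    {
        Encoding[] encodings =
        {
            Encoding.UTF8,
            Encoding.BigEndianUnicode,       // UTF-16 big endian
            Encoding.Unicode,                // UTF-16 little endian
            new UTF32Encoding(true, true),   // UTF-32 big endian, with BOM
            Encoding.UTF32,                  // UTF-32 little endian
            Encoding.ASCII                   // has no BOM
        };
        foreach (Encoding e in encodings)
        {
            Console.WriteLine("{0}: {1}", e.WebName, PreambleHex(e));
        }
    }
}
```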

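Encodings obtained through the static properties Unicode, UTF8 and UTF32 of Encoding all carry a BOM by default. If you want to write text without a BOM (an XML file, for example, where a BOM can show up as garbled characters), you have to construct the instances yourself:

Encoding encodingUTF16 = new UnicodeEncoding(false, false); // the second argument (byteOrderMark) must be false
Encoding encodingUTF8 = new UTF8Encoding(false);
Encoding encodingUTF32 = new UTF32Encoding(false, false); // the second argument (byteOrderMark) must be false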
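To see the difference in the bytes that actually get written, here is a small sketch that writes the same text through a StreamWriter twice, once with Encoding.UTF8 and once with new UTF8Encoding(false). It writes to a MemoryStream instead of a file to stay self-contained; BomWriteDemo and WriteText are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomWriteDemo
{
    // Hypothetical helper: writes text with the given encoding and returns the raw bytes produced.
    public static byte[] WriteText(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            // StreamWriter emits the encoding's preamble (if any) at the start of the stream.
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            return stream.ToArray(); // ToArray still works after the stream is closed
        }
    }

    public static void Main()
    {
        byte[] withBom = WriteText("a", Encoding.UTF8);
        byte[] withoutBom = WriteText("a", new UTF8Encoding(false));
        Console.WriteLine("Encoding.UTF8:           {0}", BitConverter.ToString(withBom));
        Console.WriteLine("new UTF8Encoding(false): {0}", BitConverter.ToString(withoutBom));
    }
}
```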

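For how reading and writing text relates to the BOM, see this post on cnblogs, which covers the topic in detail, so I will not repeat it here: .NET(C#)：字符编码(Encoding)和字节顺序标记(BOM)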

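Determining the encoding of a text file

Given a text file whose encoding we do not know, how do we choose an Encoding to decode it? The answer is to use the BOM to work out which Unicode encoding it is. Without a BOM it is hard to say, and the answer depends on where the file came from. The usual choice is Encoding.Default, which returns a different value depending on the current settings of your computer (on .NET Core and later it is always UTF-8). If the file comes from a friend abroad, you had better decode it as UTF-8. When the file has no BOM, the code below cannot guarantee a correct result, so be very careful about that if you use it in your project.

using System;
using System.IO;
using System.Text;

/// <summary>
/// Returns the Encoding of a text file. Returns Encoding.Default if no Unicode
/// BOM (byte order mark) is found.
/// </summary>
/// <param name="FileName">Path of the file to inspect.</param>
/// <returns>The encoding whose BOM matches the start of the file, or Encoding.Default.</returns>
public static Encoding GetFileEncoding(String FileName)
{
    Encoding Result = null;
    FileInfo FI = new FileInfo(FileName);
    FileStream FS = null;
    try
    {
        FS = FI.OpenRead();
        // UTF-32 is tested before UTF-16 because the UTF-32 little endian BOM (FF FE 00 00)
        // starts with the UTF-16 little endian BOM (FF FE).
        Encoding[] UnicodeEncodings = { Encoding.UTF32, new UTF32Encoding(true, true), Encoding.UTF8, Encoding.BigEndianUnicode, Encoding.Unicode };
        for (int i = 0; Result == null && i < UnicodeEncodings.Length; i++)
        {
            FS.Position = 0;
            byte[] Preamble = UnicodeEncodings[i].GetPreamble();
            bool PreamblesAreEqual = true;
            // ReadByte() returns -1 at end of file, which never equals a BOM byte.
            for (int j = 0; PreamblesAreEqual && j < Preamble.Length; j++)
            {
                PreamblesAreEqual = Preamble[j] == FS.ReadByte();
            }
            if (PreamblesAreEqual)
            {
                Result = UnicodeEncodings[i];
            }
        }
    }
    finally
    {
        // No catch block: an IOException propagates with its original stack trace.
        if (FS != null)
        {
            FS.Close();
        }
    }
    if (Result == null)
    {
        Result = Encoding.Default;
    }
    return Result;
}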
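As an alternative to comparing preambles by hand, StreamReader can do the BOM check itself when its detectEncodingFromByteOrderMarks argument is true. The sketch below is only an illustration of that option, not the method used above; BomDetectDemo and DetectEncoding are hypothetical names, and like the code above it cannot identify a file that has no BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectDemo
{
    // Hypothetical helper: lets StreamReader inspect the BOM and reports the encoding it settled on.
    // If no BOM is found, the reader keeps the fallback encoding passed in.
    public static Encoding DetectEncoding(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, true))
        {
            reader.Read(); // detection only happens on the first read
            return reader.CurrentEncoding;
        }
    }

    public static void Main()
    {
        byte[] utf8WithBom = { 0xEF, 0xBB, 0xBF, 0x61 };
        byte[] utf16LeWithBom = { 0xFF, 0xFE, 0x61, 0x00 };
        byte[] noBom = { 0x61 };

        Console.WriteLine(DetectEncoding(new MemoryStream(utf8WithBom), Encoding.ASCII).WebName);
        Console.WriteLine(DetectEncoding(new MemoryStream(utf16LeWithBom), Encoding.ASCII).WebName);
        Console.WriteLine(DetectEncoding(new MemoryStream(noBom), Encoding.ASCII).WebName);
    }
}
```

The fallback argument plays the role that Encoding.Default plays in GetFileEncoding: it is what you get back when the file carries no BOM.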

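To be continued...

The next part mainly covers Encoder and Decoder.

By the way, the post looked quite nice while I was editing it, so why did so much of the formatting disappear in the preview? It looks awful.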
Original post: https://www.cnblogs.com/bdqczhl/p/12445206.html
