  • Detecting whether a byte stream is UTF-8 encoded

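    A few days ago I happened to see a forum post asking "how can I automatically tell whether the Chinese parameters in a URL are encoded in GB2312 or UTF-8?"

    I also read wcwtitxu's algorithm, which detects UTF-8 with a very impressive regular expression.

    However, a regular expression built from a long chain of alternations does not perform well.

    I happened to have a similar requirement in a past project, so here are the approach and the cleaned-up source code for reference.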

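    First, the principle:

    The UTF-8 encoding rules are shown in the table below.

    [Table image: UTF8 Encoding Rule]

    It looks complicated, but it can be summarized as follows:

    ASCII characters (U+0000 - U+007F) are stored as-is in a single byte.

    All other characters are encoded by these rules:

    • The first byte, in binary, consists of n ones followed by a single 0 (n >= 2; the current standard, RFC 3629, allows at most n = 4). The bits after the 0 store part of the actual code point, and n is the total number of bytes in the multi-byte sequence (including the first byte).
    • It is followed by n - 1 bytes that each start with 10; the low 6 bits of each store the rest of the code point.
    Analyzing the whole byte stream against these rules therefore tells us whether it is UTF-8 encoded.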
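
    As a small illustration of the rules above, the following sketch prints the UTF-8 bytes of a string in binary so the lead-byte and continuation-byte prefixes are visible. The class and method names (Utf8LayoutDemo, ToBinary) are made up for this example and are not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8LayoutDemo
{
    // Encodes the text as UTF-8 and returns each byte as an
    // 8-digit binary group, separated by single spaces.
    public static string ToBinary(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return string.Join(" ", bytes.Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
    }
}
```

    For example, Utf8LayoutDemo.ToBinary("\u4E2D") (the character '中', U+4E2D) returns "11100100 10111000 10101101": the lead byte starts with 1110 (n = 3), and the two following bytes start with 10.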

    Based on these rules, my C# code is as follows:

            /// <summary>
            ///   Determines whether the given <paramref name="inputStream"/> contains UTF-8 encoded bytes.
            /// </summary>
            /// <param name="inputStream">
            ///    The input stream.
            /// </param>
            /// <returns>
            ///   <see langword="true"/> if the given byte stream is UTF-8 encoded; otherwise, <see langword="false"/>.
            /// </returns>
            /// <remarks>
            ///   A stream that contains only ASCII chars is regarded as not UTF-8 encoded.
            /// </remarks>
            public static bool IsTextUTF8(ref byte[] inputStream)
            {
                int encodingBytesCount = 0;
                bool allTextsAreASCIIChars = true;
    
                for (int i = 0; i < inputStream.Length; i++)
                {
                    byte current = inputStream[i];
    
                    if ((current & 0x80) == 0x80)
                    {                    
                        allTextsAreASCIIChars = false;
                    }
                    // First byte
                    if (encodingBytesCount == 0)
                    {
                        if ((current & 0x80) == 0)
                        {
                            // ASCII chars, from 0x00-0x7F
                            continue;
                        }
    
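                        if ((current & 0xC0) == 0xC0)
                        {
                            encodingBytesCount = 1;
                            current <<= 2;
    
                            // More than two bytes are used to encode one Unicode char.
                            // Count the remaining leading 1 bits to get the real length.
                            while ((current & 0x80) == 0x80)
                            {
                                current <<= 1;
                                encodingBytesCount++;
                            }
    
                            // A UTF-8 sequence is at most four bytes long (RFC 3629),
                            // so lead bytes 0xF8-0xFF are invalid.
                            if (encodingBytesCount > 3)
                            {
                                return false;
                            }
                        }                    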
                        else
                        {
                            // Invalid bits structure for UTF8 encoding rule.
                            return false;
                        }
                    }                
                    else
                    {
                        // Following bytes, must start with 10.
                        if ((current & 0xC0) == 0x80)
                        {                        
                            encodingBytesCount--;
                        }
                        else
                        {
                            // Invalid bits structure for UTF8 encoding rule.
                            return false;
                        }
                    }
                }
    
                if (encodingBytesCount != 0)
                {
                    // The stream ended in the middle of a multi-byte sequence:
                    // fewer following bytes than the first byte announced.
                    return false;
                }
    
                // UTF-8 encodes ASCII chars unchanged, but an input stream whose contents are all ASCII is treated as the default encoding, not as UTF-8.
                return !allTextsAreASCIIChars;
            }
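
    As a cross-check for the hand-written scan above, one option is to let the framework's own decoder validate the bytes. The sketch below is an alternative technique, not the method from this article: it uses UTF8Encoding with throwOnInvalidBytes set to true and treats a DecoderFallbackException as "not UTF-8". The names StrictUtf8Check and IsWellFormedUtf8 are hypothetical. Note one deliberate difference: unlike IsTextUTF8, it returns true for pure ASCII input, because ASCII is valid UTF-8.

```csharp
using System;
using System.Text;

public static class StrictUtf8Check
{
    // encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true.
    // With throwOnInvalidBytes the decoder throws instead of substituting U+FFFD.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Returns true if the framework decoder accepts the whole array as UTF-8.
    public static bool IsWellFormedUtf8(byte[] bytes)
    {
        try
        {
            // Counting chars runs the decoder without building a string.
            Strict.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

    This trades speed for strictness: the exception path is slow for invalid input, but the framework decoder should also reject ill-formed input that a simple prefix scan lets through, such as overlong forms.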

    The unit test code is attached as well:

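        using System;
        using System.Collections.Generic;
        using System.Globalization;
        using System.Text;
        using Microsoft.VisualStudio.TestTools.UnitTesting;
    
        /// <summary>
        /// This is a test class for EncodingHelper and is intended
        /// to contain all EncodingHelper unit tests.
        /// </summary>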
        [TestClass()]
        public class EncodingHelperTest
        {
            /// <summary>
            ///  Normal test for this method.
            ///</summary>
            [TestMethod()]
            public void IsTextUTF8Test()
            {
                for (int i = 0; i < 1000; i++)
                {
                    List<Char> chars = new List<char>();
                    chars.Add('中');
    
                    List<UnicodeCategory> temp = new List<UnicodeCategory>();
                    Random rd = new Random((int)(DateTime.Now.Ticks & 0x7FFFFFFF));
    
                    for (int j = 0; j < 255; j++)
                    {
                        char ch = (char)rd.Next(0xFFFF);
                        UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(ch);
                        if (uc == UnicodeCategory.Surrogate || // A lone surrogate cannot be encoded correctly.
                            uc == UnicodeCategory.PrivateUse || // Private-use blocks should be excluded.
                            uc == UnicodeCategory.OtherNotAssigned
                            )
                        {
                            j--;
                        }
                        else
                        {
                            chars.Add(ch);
                            temp.Add(uc);
                        }
                    }
    
                    string str = new string(chars.ToArray());
    
                    byte[] inputStream = Encoding.UTF8.GetBytes(str);
                    bool expected = true; 
                    bool actual;
                    actual = EncodingHelper.IsTextUTF8(ref inputStream);
                    Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));
    
                    inputStream = Encoding.GetEncoding(932).GetBytes(str);
                    expected = false;
    
                    actual = EncodingHelper.IsTextUTF8(ref inputStream);
                    Assert.AreEqual(expected, actual, string.Format("ShiftJIS_Assert Fails at:{0}", str));
                }
            }
    
            /// <summary>
            ///   Check with All ASCII chars
            /// </summary>
            [TestMethod]
            public void IsTextUTF8Test_AllASCII()
            {
                string str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";
    
                byte[] inputStream = Encoding.UTF8.GetBytes(str);
                bool expected = false;
                bool actual;
                actual = EncodingHelper.IsTextUTF8(ref inputStream);
                Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));
    
    
            }
        }

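    Note:

    If you only need to determine whether a file is saved as UTF-8, this method is not always necessary: a file saved as UTF-8 often begins with a BOM, the three bytes EF BB BF, which marks the file as UTF-8. The BOM is optional, though, so its absence does not rule out UTF-8.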
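
    A minimal sketch of such a BOM check, assuming the leading bytes of the file have already been read into an array. The UTF-8 BOM is three bytes long (EF BB BF). The names Utf8BomCheck and StartsWithUtf8Bom are hypothetical; Encoding.UTF8.GetPreamble() is the framework call that supplies the BOM bytes.

```csharp
using System;
using System.Text;

public static class Utf8BomCheck
{
    // Returns true if the buffer begins with the UTF-8 BOM (EF BB BF).
    public static bool StartsWithUtf8Bom(byte[] bytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble();
        if (bytes.Length < bom.Length)
        {
            return false;
        }

        for (int i = 0; i < bom.Length; i++)
        {
            if (bytes[i] != bom[i])
            {
                return false;
            }
        }

        return true;
    }
}
```

    A false result only means that no BOM is present; the content may still be UTF-8, in which case scanning the bytes is the fallback.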

    References:

    Wikipedia: http://en.wikipedia.org/wiki/UTF-8



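    This article was published by the 葡萄城 (GrapeCity) technical development team. Please credit the source when reposting: 葡萄城 official website.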


  • Original article: https://www.cnblogs.com/powertoolsteam/p/1831638.html