zoukankan      html  css  js  c++  java
  • 检测字节流是否是UTF8编码

    几天前偶尔看到有人发帖子问“如何自动识别判断url中的中文参数是GB2312还是Utf-8编码”

    也拜读了wcwtitxu使用巨牛的正则表达式检测UTF8编码的算法。

    使用无数或条件的正则表达式用起来却是性能不高。

    刚好曾经在项目中有类似的需求,这里把处理思路和整理后的源代码贴出来供大家参考

    先聊聊原理:

    UTF8的编码规则如下表

    UTF8 Encoding Rule

    看起来很复杂,总结起来如下:

    ASCII码(U+0000 - U+007F),不编码

    其余编码规则为

    •第一个Byte二进制以形式为n个1紧跟个0 (n >= 2), 0后面的位数用来存储真正的字符编码,n的个数说明了这个多Byte字节组字节数(包括第一个Byte)
    •结下来会有n个以10开头的Byte,后6个bit存储真正的字符编码。
    因此对整个编码byte流进行分析可以得出是否是UTF8编码的判断。

    根据这个规则,我给出的C#代码如下:

            /// <summary>
            ///   Determines whether the given <paramref name="inputStream"/>is UTF8 encoding bytes.
            /// </summary>
            /// <param name="inputStream">
            ///    The input stream.
            ///  </param>
            /// <returns>
            ///   <see langword="true"/> if given bystes stream is in UTF8 encoding; otherwise, <see langword="false"/>.
            /// </returns>
            /// <remarks>
            ///   All ASCII chars will regards not UTF8 encoding.
            /// </remarks>
            public static bool IsTextUTF8(ref byte[] inputStream)
            {
                int encodingBytesCount = 0;
                bool allTextsAreASCIIChars = true;
    
                for (int i = 0; i < inputStream.Length; i++)
                {
                    byte current = inputStream[i];
    
                    if ((current & 0x80) == 0x80)
                    {                    
                        allTextsAreASCIIChars = false;
                    }
                    // First byte
                    if (encodingBytesCount == 0)
                    {
                        if ((current & 0x80) == 0)
                        {
                            // ASCII chars, from 0x00-0x7F
                            continue;
                        }
    
                        if ((current & 0xC0) == 0xC0)
                        {
                            encodingBytesCount = 1;
                            current <<= 2;
    
                            // More than two bytes used to encoding a unicode char.
                            // Calculate the real length.
                            while ((current & 0x80) == 0x80)
                            {
                                current <<= 1;
                                encodingBytesCount++;
                            }
                        }                    
                        else
                        {
                            // Invalid bits structure for UTF8 encoding rule.
                            return false;
                        }
                    }                
                    else
                    {
                        // Following bytes, must start with 10.
                        if ((current & 0xC0) == 0x80)
                        {                        
                            encodingBytesCount--;
                        }
                        else
                        {
                            // Invalid bits structure for UTF8 encoding rule.
                            return false;
                        }
                    }
                }
    
                if (encodingBytesCount != 0)
                {
                    // Invalid bits structure for UTF8 encoding rule.
                    // Wrong following bytes count.
                    return false;
                }
    
                // Although UTF8 supports encoding for ASCII chars, we regard as a input stream, whose contents are all ASCII as default encoding.
                return !allTextsAreASCIIChars;
            }

    再附上单元测试代码:

        /// <summary>
        ///This is a test class for EncodingHelperTest and is intended
        ///to contain all EncodingHelperTest Unit Tests
        ///</summary>
        [TestClass()]
        public class EncodingHelperTest
        {
            /// <summary>
            ///  Normal test for this method.
            ///</summary>
            [TestMethod()]
            public void IsTextUTF8Test()
            {
                for (int i = 0; i < 1000; i++)
                {
                    List<Char> chars = new List<char>();
                    chars.Add('中');
    
                    List<UnicodeCategory> temp = new List<UnicodeCategory>();
                    Random rd = new Random((int)(DateTime.Now.Ticks & 0x7FFFFFFF));
    
                    for (int j = 0; j < 255; j++)
                    {
                        char ch = (char)rd.Next(0xFFFF);
                        UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(ch);
                        if (uc == UnicodeCategory.Surrogate || // Single surrogate could not be encoding correctly.
                            uc == UnicodeCategory.PrivateUse || // Private use blocks should be excluded.
                            uc == UnicodeCategory.OtherNotAssigned
                            )
                        {
                            j--;
                        }
                        else
                        {
                            chars.Add(ch);
                            temp.Add(uc);
                        }
                    }
    
                    string str = new string(chars.ToArray());
    
                    byte[] inputStream = Encoding.UTF8.GetBytes(str);
                    bool expected = true; 
                    bool actual;
                    actual = EncodingHelper.IsTextUTF8(ref inputStream);
                    Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));
    
                    inputStream = Encoding.GetEncoding(932).GetBytes(str);
                    expected = false;
    
                    actual = EncodingHelper.IsTextUTF8(ref inputStream);
                    Assert.AreEqual(expected, actual, string.Format("ShiftJIS_Assert Fails at:{0}", str));
                }
            }
    
            /// <summary>
            ///   Check with All ASCII chars
            /// </summary>
            [TestMethod]
            public void IsTextUTF8Test_AllASCII()
            {
                string str = "ABCDEFGHKLHSJKLDFHJKLHAJKLSHJKLHAJKLSHDJKLAHSDJKLHAJKLSDHJKLASHDJKLHASJKLDHJKLASD";
    
                byte[] inputStream = Encoding.UTF8.GetBytes(str);
                bool expected = false;
                bool actual;
                actual = EncodingHelper.IsTextUTF8(ref inputStream);
                Assert.AreEqual(expected, actual, string.Format("UTF8_Assert Fails at:{0}", str));
    
    
            }
        }

    另:

    如果是判断一个文件是否使用了UTF8编码,不一定非用这种方法,因为通常以UTF8格式保存的文件最初两个字符是BOM头,标示该文件使用了UTF8编码。

    参考:

    维基百科:http://en.wikipedia.org/wiki/UTF-8



    本文是由葡萄城技术开发团队发布,转载请注明出处:葡萄城官网


  • 相关阅读:
    127. Word Ladder(单词变换 广度优先)
    150. Evaluate Reverse Polish Notation(逆波兰表达式)
    32. Longest Valid Parentheses(最长括号匹配,hard)
    20. Valid Parentheses(括号匹配,用桟)
    递归桟相关
    python编写计算器
    python打印9宫格,25宫格等奇数格,且横竖斜相加和相等
    基于百度人工智能图片识别接口开发的自动录题系统
    自动收集有效IP代理
    python数据储存
  • 原文地址:https://www.cnblogs.com/powertoolsteam/p/1831638.html
Copyright © 2011-2022 走看看