  • Filtering garbled Chinese text in Java

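    While cleaning log data recently, I ran into garbled Chinese text. Discarding every string that contains any non-Chinese character is simple but not acceptable, because strings such as Xperia™主題天天四川麻将Ⅱ would be discarded as well.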

    1. Unicode encoding

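    Unicode is a character encoding that covers the characters and punctuation of all the world's languages; put simply, it is a universal character set. Its code space runs from U+0000 to U+10FFFF. Unicode is partitioned by fixed code point ranges into blocks (Unicode block); each assigned code point belongs to exactly one block, and blocks do not overlap. Separately, code points are grouped by the properties of the characters themselves into scripts (Unicode script). For example, the Han script, which covers Chinese characters and related punctuation, draws on the following blocks: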

    • CJK Radicals Supplement
    • Kangxi Radicals
    • 15 characters from CJK Symbols and Punctuation
    • CJK Unified Ideographs Extension A
    • CJK Unified Ideographs
    • CJK Compatibility Ideographs
    • CJK Unified Ideographs Extension B
    • CJK Unified Ideographs Extension C
    • CJK Unified Ideographs Extension D
    • CJK Unified Ideographs Extension E
    • CJK Compatibility Ideographs Supplement

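    Of these, the commonly used Chinese characters live in the CJK Unified Ideographs block; to cover traditional and rarely used characters, the list above also includes the five CJK extensions A, B, C, D, and E. The Basic Latin block covers exactly the ASCII range: control characters, digits, punctuation, and English letters.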

    For the mapping between Unicode code points, blocks, and scripts, see here

    2. Character encoding in Java

    The JDK implements Unicode blocks and scripts through Character.UnicodeBlock and Character.UnicodeScript:

    char c = '☎';
    Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);    // MISCELLANEOUS_SYMBOLS
    Character.UnicodeScript uc = Character.UnicodeScript.of(c);  // COMMON
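The difference between a block and a script is easiest to see on a few sample code points. The following is a minimal sketch; the class name BlockVsScript and the helper describe are illustrative, not part of the original article. It looks up both properties by code point (int) rather than by char.

```java
public class BlockVsScript {

    // Describe a code point as "BLOCK / SCRIPT" using the JDK lookup tables.
    static String describe(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                + " / " + Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {
            0x4E2D, // common ideograph: block CJK Unified Ideographs, script HAN
            0x3400, // first code point of Extension A: different block, same script HAN
            0x3002, // ideographic full stop: CJK Symbols and Punctuation, script COMMON
            0x3007, // ideographic number zero: same block, but script HAN
            0x260E, // black telephone: Miscellaneous Symbols, script COMMON
        };
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, describe(cp));
        }
    }
}
```

U+3002 and U+3007 sit in the same block yet belong to different scripts, which is why only some characters of CJK Symbols and Punctuation count as Han.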
    

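    A Java char is a UTF-16 code unit. Casting a char to int yields its Unicode code point value (for characters in the Basic Multilingual Plane; characters beyond it occupy two chars). UTF-8 only comes into play when the string is converted to bytes: getBytes() uses the platform default charset, which is often but not always UTF-8, so pass the charset explicitly when UTF-8 bytes are required: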

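    import java.nio.charset.StandardCharsets;
    import org.apache.commons.codec.binary.Hex;

    String s = "\u00a0";                                  // NO-BREAK SPACE
    String.format("\\u%04x", (int) s.charAt(0));          // --> the text "\" + "u00a0"
    Hex.encodeHex(s.getBytes(StandardCharsets.UTF_8));    // --> c2a0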
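The same comparison can be made with the standard library alone. The sketch below is illustrative (the class EncodingDemo and the helper hex are assumed names, not from the original article). It prints the UTF-8 and UTF-16 bytes of three strings so the variable length of UTF-8 is visible, and shows that a character outside the Basic Multilingual Plane occupies two chars.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Render bytes as lowercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nbsp = "\u00a0";                               // U+00A0, 2 bytes in UTF-8
        String han = "\u4e2d";                                // U+4E2D, 3 bytes in UTF-8
        String extB = new String(Character.toChars(0x20000)); // U+20000, 4 bytes in UTF-8

        for (String s : new String[] {nbsp, han, extB}) {
            System.out.printf("chars=%d codePoints=%d utf8=%s utf16be=%s%n",
                    s.length(),
                    s.codePointCount(0, s.length()),
                    hex(s.getBytes(StandardCharsets.UTF_8)),
                    hex(s.getBytes(StandardCharsets.UTF_16BE)));
        }
    }
}
```

For U+20000 the string length is 2 while the code point count is 1, which matters for any method that inspects a string one char at a time.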
    

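    UTF-8 is one variable-length prefix encoding of Unicode characters; the correspondence between the two is described here. Returning to the opening problem of filtering garbled Chinese text, a basic approach is: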

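    • Remove the various punctuation and control characters.
    • Compute the proportion of non-Chinese characters among those that remain; if it exceeds a threshold, treat the string as garbled.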

    The complete code is as follows:

    public class ChineseUtill {
    	 
        // True if c belongs to the Han script. This works per UTF-16 code unit,
        // so characters outside the BMP (e.g. Extension B) are not recognized.
        private static boolean isChinese(char c) {
            return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
        }
        
        public static boolean isPunctuation(char c) {
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
            if (    // punctuation, spacing, and formatting characters
            		ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
            		// symbols and punctuation in the unified Chinese, Japanese and Korean script
                    || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                    // fullwidth character or a halfwidth character
                    || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                    // vertical glyph variants for east Asian compatibility
                    || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                    // vertical punctuation for compatibility characters with the Chinese Standard GB 18030
                    || ub == Character.UnicodeBlock.VERTICAL_FORMS
                    // ascii
                    || ub == Character.UnicodeBlock.BASIC_LATIN
                    ) {
                return true;
            } else {
                return false;
            }
        }
        
        // Extra characters treated as acceptable noise; tune this to your own logs.
        private static boolean isUserDefined(char c) {
            Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
            return ub == Character.UnicodeBlock.NUMBER_FORMS
                    || ub == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS
                    || ub == Character.UnicodeBlock.LETTERLIKE_SYMBOLS
                    || c == '\ufeff'   // ZERO WIDTH NO-BREAK SPACE (BOM)
                    || c == '\u00a0';  // NO-BREAK SPACE
        }
        
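        // Returns true when, among the characters left after dropping punctuation
        // and whitelisted symbols, the share of non-Han characters exceeds 0.3.
        public static boolean isMessy(String str) {
            float chlength = 0; // characters left after filtering
            float count = 0;    // non-Han characters among them
            for (int i = 0; i < str.length(); i++) {
                char c = str.charAt(i);
                if (isPunctuation(c) || isUserDefined(c)) {
                    continue;
                }
                if (!isChinese(c)) {
                    count = count + 1;
                }
                chlength++;
            }
            if (chlength == 0) {
                return false; // nothing left to judge, so do not flag the string
            }
            float result = count / chlength;
            return result > 0.3;
        }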
        
    }
    

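    To obtain a more complete table of acceptable characters, the isUserDefined method is defined (the exact table depends on the characters that appear in your logs). It adds the three blocks Number Forms, Enclosed Alphanumerics, and Letterlike Symbols, plus the characters U+00A0 (NO-BREAK SPACE) and U+FEFF (ZERO WIDTH NO-BREAK SPACE).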
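One limitation of the char-based loop is that characters outside the Basic Multilingual Plane, such as those in CJK Unified Ideographs Extension B, are stored as two surrogate chars and are therefore never classified as Han. The following is a sketch of a code point based variant, not the author's implementation: the class MessyTextCheck, the helper isNeutral, and the threshold parameter are assumptions made for illustration, while the accepted blocks mirror the ones listed above.

```java
import java.nio.charset.StandardCharsets;

public class MessyTextCheck {

    // Code points that count neither for nor against a string: the blocks and
    // the two individual characters accepted by isPunctuation/isUserDefined.
    static boolean isNeutral(int cp) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(cp);
        return ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                || ub == Character.UnicodeBlock.VERTICAL_FORMS
                || ub == Character.UnicodeBlock.BASIC_LATIN
                || ub == Character.UnicodeBlock.NUMBER_FORMS
                || ub == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS
                || ub == Character.UnicodeBlock.LETTERLIKE_SYMBOLS
                || cp == 0xFEFF   // ZERO WIDTH NO-BREAK SPACE
                || cp == 0x00A0;  // NO-BREAK SPACE
    }

    // True when the share of non-Han code points among the non-neutral ones
    // exceeds the threshold. Iterates by code point, not by char.
    static boolean isMessy(String str, double threshold) {
        int counted = 0;
        int nonHan = 0;
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            i += Character.charCount(cp);
            if (isNeutral(cp)) {
                continue;
            }
            counted++;
            if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.HAN) {
                nonHan++;
            }
        }
        return counted > 0 && (double) nonHan / counted > threshold;
    }

    public static void main(String[] args) {
        // The sample string from the introduction, written with escapes.
        String ok = "Xperia\u2122\u4e3b\u984c\u5929\u5929\u56db\u5ddd\u9ebb\u5c06\u2161";
        // UTF-8 bytes of two Han characters misread as ISO-8859-1.
        String garbled = new String("\u4e2d\u6587".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        // A single Extension B ideograph, U+20000.
        String extB = new String(Character.toChars(0x20000));

        System.out.println(isMessy(ok, 0.3));      // false
        System.out.println(isMessy(garbled, 0.3)); // true
        System.out.println(isMessy(extB, 0.3));    // false
    }
}
```

Passing the threshold as a parameter makes it easier to tune against real log samples; 0.3 is the value used in the code above.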

    3. References

    [1] Wikipedia, Unicode block.
    [2] Tong Zeng, Java 中文字符判断 中文标点符号判断 (Detecting Chinese characters and Chinese punctuation in Java).

  • Original article: https://www.cnblogs.com/en-heng/p/5320024.html