python正则的中文处理 - 走看看

zoukankan html css js c++ java

python正则的中文处理
因工作需要，要查找中文汉字分词，因为python正则表达式W+表示的是所有的中文字就连标点符号都包括。所以要想办法过滤掉。

参考博客：http://log.medcl.net/item/2011/03/the-chinese-deal-is-the-python/

1.匹配中文时，正则表达式规则和目标字串的编码格式必须相同
print sys.getdefaultencoding() text =u"#who#helloworld#a中文x#" print isinstance(text,unicode) print text
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 18: ordinal not in range(128)

print text报错
解释：控制台信息输出窗口是按照ascii编码输出的（英文系统的默认编码是ascii），而上面代码中的字符串是Unicode编码的，所以输出时产生了错误。
改成 print(word.encode('utf8'))即可

2.//确定系统默认编码
import sys
print sys.getdefaultencoding()

3.//判断字符类型是否unicode
print isinstance(text,unicode)

4.unicodepython字符互转
# -*- coding: utf-8 -*- unistr= u'a'; pystr=unistr.encode('utf8') unistr2=unicode(pystr,'utf8')

#需要unicode的环境 if not isinstance(input,unicode): temp=unicode(input,'utf8') else: temp=input #需要pythonstr的环境 if isinstance(input,unicode): temp2=input.encode('utf8') else: temp2=input

经实验如果脚本的# -*- coding: utf-8 -*- 设置为GBK用unicode转换的时候会报错。
查看全文

相关阅读:
Kafka学习笔记之kafka高版本Client连接0.9Server引发的血案排查
 机器学习笔记之python实现朴素贝叶斯算法样例
 机器学习笔记之python实现支持向量机SVM算法样例
 机器学习笔记之AdaBoost算法详解以及代码实现
 机器学习笔记之python实现关联规则算法Apriori样例
 机器学习笔记之python实现AdaBoost算法
 F-47(copy 邓大顾)
js 设置标题空白
 微信授权验证
 iphone web 时间问题

原文地址：https://www.cnblogs.com/Alex-Zeng/p/3386104.html

Copyright © 2011-2022 走看看