
unicode and unicode-escape in Python

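In Python, Unicode is the in-memory representation of text. Before storing data in a file, we generally need to encode it into some other encoding, such as UTF-8 or GBK.

When reading the data back, decode it with the same encoding.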

#python3
>>> s = '中国'
>>> a = s.encode()
>>> a
b'\xe4\xb8\xad\xe5\x9b\xbd'
>>> b = a.decode()
>>> b
'中国'
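The same round trip applies when the data goes through a file. The following is a minimal sketch, assuming GBK as the on-disk encoding and a temporary file named `demo.txt` (both chosen only for illustration):

```python
import os
import tempfile

text = '中国'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')  # hypothetical file name

    # Writing in text mode encodes the str with the given encoding.
    with open(path, 'w', encoding='gbk') as f:
        f.write(text)

    # Reading in binary mode shows the encoded bytes stored on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Reading in text mode with the same encoding restores the str.
    with open(path, 'r', encoding='gbk') as f:
        restored = f.read()

print(raw)
print(restored)
```

Passing `encoding` explicitly on both sides avoids depending on the platform default, which differs between systems.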

There is also a codec called unicode-escape. It stores the Unicode code point values themselves, written out as ASCII escape sequences:

#python3
>>> s = '中国'
>>> b = s.encode('unicode-escape')
>>> b
b'\\u4e2d\\u56fd'
>>> c = b.decode('unicode-escape')
>>> c
'中国'
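A common use for this codec is turning text that contains literal `\uXXXX` sequences back into real characters. The sketch below is illustrative only; the variable names are not from the original post. It also shows one pitfall: unicode-escape treats bytes that are not part of an escape as Latin-1, so it should not be used to decode UTF-8 data.

```python
# ASCII characters pass through unchanged; non-ASCII ones become escapes.
escaped = '中国 abc'.encode('unicode-escape')
print(escaped)

# A str holding a literal backslash, 'u', and four hex digits per character.
literal = r'\u4e2d\u56fd'
decoded = literal.encode('ascii').decode('unicode-escape')
print(len(literal), decoded)

# Pitfall: UTF-8 bytes decoded with unicode-escape are read as Latin-1,
# so the result is mojibake rather than the original text.
wrong = '中国'.encode('utf-8').decode('unicode-escape')
print(wrong == '中国')
```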

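Going further: there is also a string-escape codec, which in Python 2 can be applied to byte strings. In the session below, s holds GBK bytes (the console encoding), which is why decoding it as UTF-8 fails: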

#python2
>>> s = '中国'
>>> a = s.decode('gbk')
>>> print a
中国
>>> b = s.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\python\python2.7\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte
>>> c = s.decode('string-escape')
>>> print c
中国
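The string-escape codec is not available in Python 3. One possible workaround, sketched here as an assumption rather than anything from the original post, is to decode with unicode_escape and re-encode as Latin-1. The helper name `unescape_bytes` is hypothetical, and the approach only works when every escape names a value below 256.

```python
def unescape_bytes(escaped: bytes) -> bytes:
    """Convert literal backslash-x escapes in an ASCII byte string to raw bytes."""
    # unicode_escape maps each \xNN escape to the code point U+00NN;
    # encoding as latin-1 then maps U+00NN back to the single byte NN.
    return escaped.decode('unicode_escape').encode('latin-1')


# 16 ASCII bytes spelling out four \xNN escapes (the GBK bytes of '中国').
escaped = b'\\xd6\\xd0\\xb9\\xfa'
raw = unescape_bytes(escaped)

print(len(escaped), len(raw))
print(raw.decode('gbk'))
```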

chardet.detect()

Encoding detection with chardet.detect() is often inaccurate. For example, when the input contains too little Chinese text, it is identified as IBM855:

#python3
>>> import chardet
>>> s = '中国'
>>> c = s.encode('gbk')
>>> chardet.detect(c)
{'encoding': 'IBM855', 'confidence': 0.7679697235616183, 'language': 'Russian'}

Note: IBM855 is OEM code page 855 (Cyrillic).

With more Chinese text, the detection is accurate (chardet reports GB2312, of which GBK is a superset):

>>> s = '中国范文芳威风威风'
>>> c = s.encode('gbk')
>>> chardet.detect(c)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
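When the input is too short for statistical detection, an alternative is to try a short list of likely encodings in order. This is a different technique from chardet, shown here as a stdlib-only sketch; the function name `decode_with_fallback` and the default candidate list are assumptions for illustration.

```python
def decode_with_fallback(data: bytes, candidates=('utf-8', 'gbk')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError('none of the candidate encodings could decode the data')


short_gbk = '中国'.encode('gbk')
text, encoding = decode_with_fallback(short_gbk)
print(text, encoding)
```

The order matters: strict encodings such as UTF-8 should come first, because permissive ones can accept bytes that were never written in them and silently return wrong text.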


Original article: https://www.cnblogs.com/leomei91/p/7685797.html