References:
https://blog.csdn.net/csdn_yi_e/article/details/71037288
https://blog.csdn.net/qq_42739440/article/details/89887451
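1. Detecting the encoding with chardet

    import chardet

    # Open in binary mode so chardet can inspect the raw bytes
    with open('a.txt', 'rb') as f:
        text = f.read()
    info = chardet.detect(text)
    print(info)

Output:

    {'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}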
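A confidence of 1.0 for UTF-16 would be consistent with the file starting with a byte order mark (BOM), although that is an assumption about a.txt rather than something the output proves. The following is a minimal standard-library sketch of such a BOM check. It is not chardet's implementation, and `sniff_utf16_bom` is a hypothetical helper name.

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return a UTF-16 codec name if data starts with a UTF-16 BOM, else None."""
    # The UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE),
    # so rule out UTF-32 first.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return None
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


# The "utf-16" codec prepends a BOM when encoding; the byte order
# depends on the platform, so the exact result may differ.
sample = "1.新出".encode("utf-16")
print(sniff_utf16_bom(sample))

# UTF-8 bytes without a BOM are not recognised by this check.
print(sniff_utf16_bom("1.新出".encode("utf-8")))
```

A check like this only covers files that carry a BOM; for anything else a statistical detector such as chardet is still needed.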
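2. Reading the text by encoding and then decoding

    with open('a.txt', encoding='UTF-16') as f:
        text = f.read()
    # Encode the str to bytes, then decode those bytes with unicode_escape
    print(text.encode("utf-8").decode("unicode_escape"))

Output:

    1.新出吐鲁番文书及其研究

Encoding first and then decoding yields the Chinese text. This works because the text read from the file evidently holds literal \uXXXX escape sequences rather than the characters themselves: encode("utf-8") turns the str into bytes, and decode("unicode_escape") then interprets the escapes.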
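The round trip can be tried without a file. This is a small sketch, assuming the text consists of literal \uXXXX escapes as described above; the variable names are illustrative only. It also shows what happens when the same steps are applied to text that already contains the real characters.

```python
original = "1.新出吐鲁番文书及其研究"

# Build the escaped form: each non-ASCII character becomes a literal
# backslash-u sequence, so the result is plain ASCII text.
escaped = original.encode("unicode_escape").decode("ascii")

# Same steps as in the file example: str -> bytes -> unicode_escape decode.
restored = escaped.encode("utf-8").decode("unicode_escape")

# Pitfall: if the str already holds the real characters, unicode_escape
# reads each UTF-8 byte as a Latin-1 character and the text is garbled.
garbled = original.encode("utf-8").decode("unicode_escape")

print(escaped)
print(restored == original)
print(garbled == original)
```

So the encode-then-decode trick is only appropriate for escaped text; a file that stores the characters directly should simply be read with the right `encoding` argument.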
3. Unicode conversion in BERT

    import six

    def convert_to_unicode(text):
        """
        Converts `text` to Unicode (if it's not already).
        Note: bytes are decoded with "unicode_escape" here, not UTF-8.
        """
        # six_ensure_text is copied from https://github.com/benjaminp/six
        def six_ensure_text(s, encoding="unicode_escape", errors="strict"):
            if isinstance(s, six.binary_type):
                print('true')
                return s.decode(encoding, errors)  # bytes: decode with the given encoding
            elif isinstance(s, six.text_type):  # already text: return it unchanged
                return s
            else:
                raise TypeError("not expecting type '%s'" % type(s))
        return six_ensure_text(text, encoding="unicode_escape", errors="ignore")

    with open('a.txt', encoding='UTF-16') as f:
        text = f.read()
    print(convert_to_unicode(text.encode("utf-8")))

Output:

    true
    1.新出吐鲁番文书及其研究
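The docstring of the helper mentions UTF-8 while the code passes "unicode_escape", and the two give different results on escaped input. Below is a Python 3 only sketch without six that makes the difference visible. `to_text` is a hypothetical name and the default encoding is an assumption chosen for illustration.

```python
def to_text(value, encoding="utf-8", errors="ignore"):
    """Return value as str, decoding bytes with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    if isinstance(value, str):
        return value
    raise TypeError("not expecting type '%s'" % type(value))


# ASCII bytes holding literal backslash-u escapes for two characters.
raw = b"1.\\u65b0\\u51fa"

# Decoded as UTF-8, the escapes stay as literal text.
print(to_text(raw))

# Decoded as unicode_escape, the escapes become characters.
print(to_text(raw, encoding="unicode_escape"))

# A str is returned unchanged, whatever the encoding argument is.
print(to_text("abc"))
```

Keeping the encoding as a parameter lets the caller decide whether the input is ordinary UTF-8 bytes or escaped text, instead of fixing that choice inside the helper.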
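Note:

    >>> type(text.encode("utf-8"))  # after encode(), the result is a bytes object
    <class 'bytes'>
    >>> type(text)  # encoding= in open() is the file's encoding; text itself is a str
    <class 'str'>

The binary type used above (six.binary_type) is simply the bytes type in Python 3.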
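The same distinction shows up when one file is read in both modes. This sketch writes a temporary stand-in file (the path and its content are assumptions, and unlike a.txt it stores the real characters rather than escapes) and compares what binary mode and text mode return.

```python
import os
import tempfile

content = "1.新出吐鲁番文书及其研究"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.txt")  # hypothetical stand-in for a.txt

    # Write the text as UTF-16; the codec adds a BOM at the start.
    with open(path, "w", encoding="utf-16") as f:
        f.write(content)

    # Binary mode returns the raw bytes exactly as stored on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode decodes the bytes with the given encoding and returns a str.
    with open(path, encoding="utf-16") as f:
        text = f.read()

print(type(raw), type(text))

# Decoding the raw bytes by hand gives the same str as text mode.
print(raw.decode("utf-16") == text)
```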