Detecting encodings in Python: unicode, gbk/gb2312, utf8 (with functions)

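In Python, the three encodings we deal with most often are gbk/gb2312, utf8, and unicode, yet Python has no built-in function for telling them apart. This post discusses the three and gives functions for distinguishing them.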

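As we know, for a single Chinese character:

- a unicode object has length 1 (one code point)
- gbk / gb2312 takes 2 bytes
- utf-8 takes 3 bytes

So if the input is just one Chinese character, its length is enough to identify the encoding:

    len(u'啊') == 1                  # True
    len(u'啊'.encode("gbk")) == 2    # True
    len(u'啊'.encode("utf-8")) == 3  # True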
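The length check above can be sketched on Python 3 as well, where the `unicode` type of Python 2 is simply `str` and encoded text is `bytes`. This is only an illustration under that assumption; the variable names are made up for the example. It also shows why length stops being useful once the input is longer than one character.

```python
# Python 3 sketch: str plays the role of Python 2's unicode type.
ch = "啊"
lengths = {
    "unicode": len(ch),                 # code points
    "gbk": len(ch.encode("gbk")),       # bytes
    "utf8": len(ch.encode("utf-8")),    # bytes
}
print(lengths)

# With several characters the byte count alone is ambiguous:
# 6 bytes could be three GBK characters or two UTF-8 characters.
three_gbk = "你好啊".encode("gbk")
two_utf8 = "你好".encode("utf-8")
print(len(three_gbk), len(two_utf8))
```

Both byte strings in the second half have the same length, so a length test cannot say which encoding produced them.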
In practice, though, the input is usually a whole sentence containing many Chinese characters. So we run the following experiments:
- 1. u'啊'.encode("gbk")[0].decode("gbk") raises UnicodeDecodeError: 'gbk' codec can't decode byte 0xb0 in position 0: incomplete multibyte sequence
- 2. u'啊'.encode('utf8')[0].decode("utf8") raises UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 0: unexpected end of data
- 3. u'啊'.encode('gbk')[0].decode('utf8') raises UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 0: invalid start byte
- 4. u'啊'.encode('utf8')[0].decode('gbk') raises UnicodeDecodeError: 'gbk' codec can't decode byte 0xe5 in position 0: incomplete multibyte sequence
- 5. u'啊'.decode('utf8') raises UnicodeEncodeError: 'ascii' codec can't encode character u'\u554a' in position 0: ordinal not in range(128)
- 6. u'啊'.decode('gbk') raises UnicodeEncodeError: 'ascii' codec can't encode character u'\u554a' in position 0: ordinal not in range(128)
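Experiments 1 to 4 can be reproduced on Python 3 with a small helper. This is a sketch, not the post's own code: `decode_reason` is a hypothetical name, and it reads the `reason` attribute of `UnicodeDecodeError` instead of the full message. Indexing `bytes` in Python 3 yields an integer, so the first byte is taken with the slice `[:1]`. Experiments 5 and 6 have no Python 3 counterpart, because `str` has no `decode` method there.

```python
def decode_reason(data, codec):
    """Return the decoder's failure reason, or None if decoding succeeds."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


gbk_first = "啊".encode("gbk")[:1]     # b'\xb0'
utf8_first = "啊".encode("utf8")[:1]   # b'\xe5'

experiments = [
    ("1: gbk byte as gbk", gbk_first, "gbk"),
    ("2: utf8 byte as utf8", utf8_first, "utf8"),
    ("3: gbk byte as utf8", gbk_first, "utf8"),
    ("4: utf8 byte as gbk", utf8_first, "gbk"),
]
for label, data, codec in experiments:
    print(label, "->", decode_reason(data, codec))
```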
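From the above: if the error message mentions ascii, the input is a unicode object rather than an encoded byte string, because Python 2 first tries to encode it as ASCII before decoding. Experiments 2 and 3 show that .decode("utf8") tells the other encodings apart: unexpected end of data means the text is utf8-encoded, while invalid start byte means it is gbk or gb2312.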

Putting this together, we can write the following function to detect the encoding (Python 2.7):

    # -*- coding: utf8 -*-

    def whichEncode(text):
        # Only the first byte (or first character) is examined.
        text0 = text[0]
        try:
            text0.decode('utf8')
        except Exception as e:
            if "unexpected end of data" in str(e):
                return "utf8"
            elif "invalid start byte" in str(e):
                return "gbk_gb2312"
            elif "ascii" in str(e):
                return "Unicode"
        return "utf8"

    if __name__ == "__main__":
        print(whichEncode(u"啊".encode("gbk")))
        print(whichEncode(u"啊".encode("utf8")))
        print(whichEncode(u"啊"))
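For readers on Python 3, the same first-byte idea might look like the sketch below. The name `which_encoding_py3` is hypothetical, the `isinstance` check replaces the "ascii" error branch (Python 3 `str` has no `decode`), and the decoder's `reason` attribute is compared instead of searching the message text. The last call is included to show a limit of looking at one byte only: the GBK encoding of '中' starts with 0xd6, which is also a legal UTF-8 lead byte.

```python
def which_encoding_py3(text):
    """First-byte heuristic from the post, adapted to Python 3 (a sketch)."""
    if isinstance(text, str):
        return "unicode"
    try:
        text[:1].decode("utf8")          # slice keeps a bytes object
    except UnicodeDecodeError as exc:
        if exc.reason == "unexpected end of data":
            return "utf8"
        if exc.reason == "invalid start byte":
            return "gbk_gb2312"
    return "utf8"


samples = [
    "啊".encode("gbk"),
    "啊".encode("utf8"),
    "啊",
    "中".encode("gbk"),   # first byte 0xd6 looks like a UTF-8 lead byte
]
for sample in samples:
    print(repr(sample), "->", which_encoding_py3(sample))
```

The fourth sample is reported as utf8 even though it is GBK, so the heuristic is best treated as a quick guess rather than a guarantee.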
I also came across another approach online that looks good, from https://my.oschina.net/sanpeterguo/blog/209134 (which in turn cites http://my.oschina.net/u/993130/blog/199214):

    def getCoding(strInput):
        '''
        Detect the encoding.
        '''
        if isinstance(strInput, unicode):
            return "unicode"
        try:
            strInput.decode("utf8")
            return 'utf8'
        except UnicodeDecodeError:
            pass
        try:
            strInput.decode("gbk")
            return 'gbk'
        except UnicodeDecodeError:
            pass

    def tran2UTF8(strInput):
        '''
        Convert to utf8.
        '''
        strCodingFmt = getCoding(strInput)
        if strCodingFmt == "utf8":
            return strInput
        elif strCodingFmt == "unicode":
            return strInput.encode("utf8")
        elif strCodingFmt == "gbk":
            return strInput.decode("gbk").encode("utf8")

    def tran2GBK(strInput):
        '''
        Convert to gbk.
        '''
        strCodingFmt = getCoding(strInput)
        if strCodingFmt == "gbk":
            return strInput
        elif strCodingFmt == "unicode":
            return strInput.encode("gbk")
        elif strCodingFmt == "utf8":
            return strInput.decode("utf8").encode("gbk")
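A Python 3 rendering of this second approach could look like the following sketch. The names `get_coding` and `to_utf8` are hypothetical, `str` stands in for Python 2's `unicode`, and raising `ValueError` for undetected input is a choice made for this example (the Python 2 functions above return None in that case). Trying utf8 before gbk follows the order used above.

```python
def get_coding(data):
    """Return 'unicode', 'utf8', 'gbk', or None by trial decoding."""
    if isinstance(data, str):
        return "unicode"
    for name in ("utf8", "gbk"):        # same order as the code above
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


def to_utf8(data):
    """Convert str, UTF-8 bytes, or GBK bytes to UTF-8 bytes."""
    coding = get_coding(data)
    if coding == "utf8":
        return data
    if coding == "unicode":
        return data.encode("utf8")
    if coding == "gbk":
        return data.decode("gbk").encode("utf8")
    raise ValueError("could not detect encoding")


print(get_coding("你好".encode("gbk")))
print(to_utf8("你好".encode("gbk")) == "你好".encode("utf8"))
```

Because the whole byte string is decoded rather than just its first byte, this variant checks every character of a sentence before settling on an answer.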

Original article: https://www.cnblogs.com/lc-D-a/p/6074878.html