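When Python processes text, encoding problems come up often.
Chinese text is commonly encoded as either utf8 or gbk, so parsing a txt file or a byte string frequently fails when the wrong encoding is assumed.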
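For a line of text, we try decoding it with utf8 and then with gbk; if neither decodes it completely, we pick whichever one decodes more of the content before failing.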
def force_decode(string: bytes) -> str:
    """
    Sometimes neither utf8 nor gbk can decode the input successfully.
    In that case, select the longer decode result from utf8 or gbk.
    """
    if not isinstance(string, bytes):
        raise ValueError('expected bytes array')

    decode_chars_count = []
    for i in ['utf8', 'gbk']:
        try:
            # return immediately if this encoding decodes the whole input
            return string.decode(i)
        except UnicodeDecodeError as ex:
            # ex.start is the byte offset of the first decoding error
            decode_chars_count.append(ex.start)

    # neither utf8 nor gbk decoded successfully:
    # select the encoding that got further before failing
    utf8_len, gbk_len = decode_chars_count
    selected_encoding = 'utf8' if utf8_len > gbk_len else 'gbk'
    return string.decode(selected_encoding, errors='ignore')
Code link: https://gist.github.com/albertofwb/b53bf32adca5c245c6dee6642ca5463d
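The "decodes more" comparison rests on the `start` attribute of `UnicodeDecodeError`, which gives the byte offset where decoding first failed. Here is a minimal sketch of that measurement on its own. The helper name `decodable_prefix_length` and the sample bytes are made up for illustration and are not part of the original gist.

```python
def decodable_prefix_length(data: bytes, encoding: str) -> int:
    """Return how many leading bytes of data decode cleanly with encoding.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as ex:
        # ex.start is the byte offset of the first undecodable sequence
        return ex.start
    # no error: the whole input decoded
    return len(data)


# "汉字" encoded as gbk, followed by a stray byte that neither codec accepts
sample = "汉字".encode("gbk") + b"\xff"

for name in ("utf8", "gbk"):
    print(name, decodable_prefix_length(sample, name))
```

For this sample, utf8 fails on the very first byte, while gbk gets through all four bytes of the two characters before reaching the stray byte, so gbk would be the encoding selected.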