Python Tips 31 [unicode and bytes]


I. String types in Python 3
bytearray([source[, encoding[, errors]]])

Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256.
bytes([source[, encoding[, errors]]])

Return a new “bytes” object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray.
str([object[, encoding[, errors]]])

Return a string version of an object. In Python 3, str is a Unicode string by default.

The basestring type from 2.x no longer exists in Python 3.
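The differences between the three types can be checked directly. The following is a minimal sketch, not taken from the original post; the variable names (data, buf, text) are illustrative only.

```python
import builtins

data = bytes('abc', 'ascii')   # immutable sequence of integers in range(256)
buf = bytearray(data)          # mutable copy of the same bytes
buf[0] = ord('A')              # in-place modification works on a bytearray

try:
    data[0] = ord('A')         # bytes does not support item assignment
    bytes_is_mutable = True
except TypeError:
    bytes_is_mutable = False

text = str(buf, 'ascii')       # decode the bytes-like object back to a str
has_basestring = hasattr(builtins, 'basestring')

print(list(data))              # the integer values behind the bytes object
print(buf)
print(text)
print(bytes_is_mutable, has_basestring)
```

Only bytearray accepts the item assignment; bytes raises TypeError, and the builtins module has no basestring attribute in Python 3.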
II. Examples
# -*- coding: gbk -*-
# The declaration above must match the encoding the file is actually saved in.

import codecs


def TestIsStrOrUnicodeOrString():
    bs = b'Hello'
    ustr = 'abc'
    print(isinstance(bs, str))           # False
    print(isinstance(bs, bytes))         # True
    print(isinstance(ustr, str))         # True
    print(isinstance(ustr, bytes))       # False
    print(isinstance(bs, (bytes, str)))  # True


def TestChinese():
    us = '中国'
    bs = b'AAA'
    bs2 = bytes('中国', 'gbk')

    print(us + ':' + str(type(us)))  # 中国:<class 'str'>
    print(bs)                        # b'AAA'
    print(bs2)                       # b'\xd6\xd0\xb9\xfa'
    print(':' + str(type(bs2)))      # :<class 'bytes'>
    print(bs2.decode('gbk'))         # 中国

    # TypeError: str and bytes cannot be concatenated implicitly
    # newstr = us + bs2

    print('us == bs2' + ':' + str(us == bs2))  # us == bs2:False

    s3 = 'AAA中国'
    print(s3)  # AAA中国

    s4 = bytes('AAA中国', 'gbk')
    print(s4)  # b'AAA\xd6\xd0\xb9\xfa'


def TestPrint():
    print('AAA' + '中国')  # AAA中国
    # SyntaxError: bytes can only contain ASCII literal characters.
    # print(b'AAA' + b'中国')
    # TypeError: str and bytes cannot be concatenated implicitly
    # print('AAA' + bytes('中国', 'gbk'))


def TestCodecs():
    look = codecs.lookup("gbk")

    a = bytes("北京", 'gbk')
    print(len(a), a, type(a))  # 4 b'\xb1\xb1\xbe\xa9' <class 'bytes'>

    # look.decode returns a tuple: (decoded str, number of bytes consumed)
    b = look.decode(a)
    print(b[1], b[0], type(b[0]))  # 4 北京 <class 'str'>


if __name__ == '__main__':
    TestIsStrOrUnicodeOrString()
    TestChinese()
    TestPrint()
    TestCodecs()
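The signatures in section I all take an optional errors argument, which the examples above do not exercise. The sketch below is an illustration added here, assuming a source file saved as UTF-8; it shows a round trip through str.encode and bytes.decode, and what happens when bytes are decoded with the wrong codec. The names gbk_bytes, utf8_bytes and lossy are hypothetical.

```python
text = '中国'

gbk_bytes = text.encode('gbk')      # equivalent to bytes(text, 'gbk')
utf8_bytes = text.encode('utf-8')   # same text, different byte sequence
print(gbk_bytes)                    # b'\xd6\xd0\xb9\xfa'
print(utf8_bytes)                   # b'\xe4\xb8\xad\xe5\x9b\xbd'

# Decoding with the codec used for encoding restores the original text.
round_trip_ok = gbk_bytes.decode('gbk') == text

# Decoding with the wrong codec fails under the default errors='strict'.
try:
    gbk_bytes.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
lossy = gbk_bytes.decode('utf-8', errors='replace')
print(round_trip_ok, strict_failed, '\ufffd' in lossy)  # True True True
```

The replace handler avoids the exception but loses data, so it is probably best kept for diagnostics rather than for normal conversion.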
III. Summary
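1) Python 3 assumes that source files (.py files) are encoded in UTF-8, whereas Python 2 defaulted to ASCII. A declaration such as # -*- coding: windows-1252 -*- changes the source encoding. The declaration must match the encoding the file is actually saved in: a file with Chinese string literals saved as GBK needs # -*- coding: gbk -*-, while a file saved as UTF-8 needs no declaration.
2) In Python 3, str is a Unicode string. Use str.encode to convert it to bytes and bytes.decode to convert back.
3) The print function converts its arguments with str(). For bytes this yields the b'...' representation and no decoding takes place, so to display bytes as text they must be decoded explicitly first.
4) str and bytes cannot be concatenated (TypeError). Comparing them with == always returns False, and ordering comparisons such as < raise TypeError.
5) codecs can still be used to convert between str and bytes.
6) A bytes literal may contain only ASCII characters. To build bytes from non-ASCII text, encode it, for example bytes('中国','gbk') or '中国'.encode('gbk').
7) The gbk codec ships with Python, so encoding and decoding do not require a Chinese-language system or language pack. Whether the decoded text displays correctly depends on the encoding and fonts of the terminal.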
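Several of the points above can be verified in a few lines. This is a minimal sketch added for illustration, assuming a source file saved as UTF-8 and an interpreter run without the -b flag; the variable names are hypothetical.

```python
import codecs

us = '中国'
bs = us.encode('gbk')               # point 2: str -> bytes

shown = repr(bs)                    # point 3: the text print(bs) displays

equal = (us == bs)                  # point 4: == across the types is False

try:
    us + bs                         # point 4: concatenation is rejected
    concat_raised = False
except TypeError:
    concat_raised = True

try:
    us < bs                         # point 4: ordering is rejected too
    order_raised = False
except TypeError:
    order_raised = True

decoded = codecs.decode(bs, 'gbk')  # point 5: codecs still converts
literal = b'\xd6\xd0\xb9\xfa'       # point 6: escapes keep the literal ASCII-only

print(shown)
print(equal, concat_raised, order_raised)
print(decoded == us, literal == bs)
```

Escape sequences such as \xd6 are an alternative to calling encode when the byte values are already known.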
For string and encoding conversion in Python 2.6, see: http://www.cnblogs.com/itech/archive/2011/03/27/1996883.html

End.

Author: iTech
WeChat official account: cicdops
Source: http://itech.cnblogs.com/
github：https://github.com/cicdops/cicdops


Original article: https://www.cnblogs.com/itech/p/1997878.html