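rune and byte in Go
Overview
- byte is an alias for uint8. It is commonly used for ASCII characters and for raw binary data.
- rune is an alias for int32. It holds a single Unicode code point and is commonly used for Unicode text, which Go stores as UTF-8.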
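Unicode is a superset of ASCII. ASCII can only represent a small set of symbols, and as more countries connected to the internet it could no longer hold the scripts and symbols of every country and region, so Unicode was created. UTF-8 is one encoding of Unicode: it encodes every Unicode code point in 1 to 4 bytes, using 1 byte for ASCII characters and 2 to 4 bytes for everything else.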
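To make the 1 to 4 byte range concrete, here is a minimal sketch that reports how many bytes UTF-8 needs for each character of a string. The helper name encodedWidths is made up for this illustration; utf8.RuneLen comes from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedWidths (hypothetical helper) returns the number of bytes
// UTF-8 uses for each code point in s.
func encodedWidths(s string) []int {
	widths := make([]int, 0, utf8.RuneCountInString(s))
	for _, r := range s {
		widths = append(widths, utf8.RuneLen(r))
	}
	return widths
}

func main() {
	// 'a' is ASCII, U+00E9 is "e with acute accent",
	// '测' is a CJK character, U+1F600 is an emoji outside the BMP.
	fmt.Println(encodedWidths("a\u00e9测\U0001F600")) // [1 2 3 4]
}
```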
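For plain characters (such as English letters and digits), []byte and []rune hold the same values, because each ASCII character is exactly one byte in UTF-8.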
s := "abc123"
b := []byte(s) // one element per byte
fmt.Printf("abc123 convert to []byte is %v\n", b)
r := []rune(s) // one element per Unicode code point
fmt.Printf("abc123 convert to []rune is %v\n", r)
Output:
abc123 convert to []byte is [97 98 99 49 50 51]
abc123 convert to []rune is [97 98 99 49 50 51]
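For non-ASCII characters such as Chinese, the two differ: a []byte needs three elements to store one Chinese character, while a []rune needs only one.
A common Chinese character takes 3 bytes in UTF-8: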
s := "测试"
b := []byte(s) // 2 characters, 3 bytes each: 6 elements
fmt.Printf("测试 convert to []byte is %v\n", b)
r := []rune(s) // 2 code points: 2 elements
fmt.Printf("测试 convert to []rune is %v\n", r)
Output:
测试 convert to []byte is [230 181 139 232 175 149]
测试 convert to []rune is [27979 35797]
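The same difference shows up when measuring a string. As a small sketch: len counts bytes, utf8.RuneCountInString counts code points, and a for range loop over a string decodes UTF-8 and yields the byte offset of each character. The helper name lengths is made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lengths (hypothetical helper) reports the byte length and the
// code point count of s.
func lengths(s string) (byteLen, runeCount int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "测试"
	b, r := lengths(s)
	fmt.Println(b, r) // 6 2

	// Ranging over a string decodes UTF-8:
	// i is the byte offset of the character, c is the rune itself.
	for i, c := range s {
		fmt.Printf("offset %d: %c (%d)\n", i, c, c)
	}
}
```

The loop prints offsets 0 and 3, which matches the three-byte groups in the []byte output above.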
Why
First, look at the source:
// byte is an alias for uint8 and is equivalent to uint8 in all ways. It is
// used, by convention, to distinguish byte values from 8-bit unsigned
// integer values.
type byte = uint8
// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32
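A byte is one byte wide, while a rune is four bytes wide, which is enough to hold any Unicode code point. This explains the difference in how the two store Chinese characters: a []byte holds the UTF-8 encoded bytes (three per character here), and a []rune holds one code point per element.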
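Because both names are plain type aliases, this can be checked at run time. The sketch below prints the type name that fmt reports and the size from unsafe.Sizeof; the helper name typeName is made up for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// typeName (hypothetical helper) returns the type name fmt reports for v.
func typeName(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	var b byte = 255 // fits because byte is unsigned (0..255)
	var r rune = '测'

	fmt.Println(typeName(b), unsafe.Sizeof(b)) // uint8 1
	fmt.Println(typeName(r), unsafe.Sizeof(r)) // int32 4
}
```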
string, []byte, and []rune also behave differently when slicing Chinese text:
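c := "一二三四五"
c1 := c[:2] // first 2 bytes: cuts the 3-byte character "一" in the middle
fmt.Printf("c1 is %v\n", c1)
bc1 := []byte(c)[:6] // first 6 bytes: 2 characters of 3 bytes each
fmt.Printf("c1 with []byte is %v\n", string(bc1))
rc1 := []rune(c)[:2] // first 2 code points
fmt.Printf("c1 with []rune is %v\n", string(rc1))
Output: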
c1 is ��
c1 with []byte is 一二
c1 with []rune is 一二
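When taking a substring of Chinese text, do not slice the string directly: string indices count bytes, so a slice can cut a character in the middle. It is safer to convert to a []rune first.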
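Converting to []rune copies the whole string. As an alternative sketch (not something the examples above use), a for range loop can find the byte offset where the cut should happen and then slice the original string at that offset. The helper name firstN is made up for this example.

```go
package main

import "fmt"

// firstN (hypothetical helper) returns the first n characters
// (code points) of s without ever splitting a character.
func firstN(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	for i := range s {
		if count == n {
			// i is the byte offset where character n+1 starts.
			return s[:i]
		}
		count++
	}
	return s // s has n or fewer characters
}

func main() {
	fmt.Println(firstN("一二三四五", 2)) // 一二
	fmt.Println(firstN("abc", 10))  // abc
}
```

For short strings the []rune conversion is simpler and perfectly fine; the loop version mainly helps when the string is long and only a short prefix is needed.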