On emoji and UTF-16 encoding

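Yesterday a colleague on the iOS team ran into a tricky problem: when a text field contains emoji, how do you get the number of characters in it, with each emoji counting as one character?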

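Let me start with Java, which I have been working with recently. For ordinary Chinese and English characters, String's length method works fine. But when a character's Unicode code point is above 0xFFFF, length no longer returns the character count: it returns the number of UTF-16 code units, so such a character is counted as 2. Java already provides a method for this: codePointCount.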
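To make the difference concrete, here is a minimal sketch comparing the two methods. The class name EmojiLength and the helper codePoints are made up for illustration; length and codePointCount are the standard String methods.

```java
public class EmojiLength {
    // Counts Unicode code points instead of UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 (grinning face) written as its surrogate pair,
        // so the result does not depend on the source file encoding.
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());     // 4: 'a', two surrogates, 'b'
        System.out.println(codePoints(s));  // 3: 'a', U+1F600, 'b'
    }
}
```

Note that this counts code points, not what a user perceives as one symbol: an emoji composed of several code points (for example, one carrying a skin-tone modifier) still counts more than once.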

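Unfortunately, after searching for quite a while, I could not find an equivalent in Objective-C. (It seems that after splitting the string into substrings, the length of the resulting array is the accurate character count, but this still needs to be verified.)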

I am not an iOS developer, so I cannot offer an Objective-C solution for now. Still, yesterday's digging turned up a few findings worth sharing.

1. Most emoji have Unicode code points above 0xFFFF, so they take 4 bytes (a surrogate pair) in UTF-16. Only a small number have code points at or below 0xFFFF, and those take 2 bytes.
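As a quick check of the 2-byte versus 4-byte split, the sketch below uses two sample code points chosen for illustration: U+2600 (black sun with rays, inside the 16-bit range) and U+1F600 (grinning face, above 0xFFFF). The class name EmojiWidth and the helper utf16Bytes are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class EmojiWidth {
    // Bytes needed for one code point in UTF-16:
    // Character.charCount gives 1 or 2 code units, each 2 bytes wide.
    static int utf16Bytes(int codePoint) {
        return Character.charCount(codePoint) * 2;
    }

    public static void main(String[] args) {
        int sun = 0x2600;    // at or below 0xFFFF
        int grin = 0x1F600;  // above 0xFFFF

        System.out.println(utf16Bytes(sun));   // 2
        System.out.println(utf16Bytes(grin));  // 4

        // Cross-check by actually encoding the string (UTF_16BE writes no BOM).
        String s = new String(Character.toChars(grin));
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);  // 4
    }
}
```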

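2. On both Android and iOS, a string read from a text field is, by default, handled as a sequence of UTF-16 code units. Byte order (big-endian or little-endian) only becomes relevant when those code units are serialized to bytes.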
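Byte order shows up once a string is turned into bytes. The sketch below serializes the same surrogate pair both ways; the class name ByteOrderSketch and the helper hex are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderSketch {
    // Renders a byte array as uppercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE00";  // U+1F600 as a surrogate pair

        // Same two code units, different byte order on the wire.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));  // D83DDE00
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE)));  // 3DD800DE
    }
}
```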

3. For reference, here are the UTF-16 encoding rules (once they are clear, implementing a code point count yourself on iOS becomes straightforward):
1) If U < 0x10000, encode U as a 16-bit unsigned integer and terminate.
2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF, U' must be less than or equal to 0xFFFFF. That is, U' can be represented in 20 bits.
3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and 0xDC00, respectively. These integers each have 10 bits free to encode the character value, for a total of 20 bits.
4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order bits of W1 and the 10 low-order bits of U' to the 10 low-order bits of W2. Terminate.
Graphically, steps 2 through 4 look like:
U' = yyyyyyyyyyxxxxxxxxxx
W1 = 110110yyyyyyyyyy
W2 = 110111xxxxxxxxxx
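The rules above can be sketched in code, together with a code point count that follows from them: a code unit in 0xD800..0xDBFF followed by one in 0xDC00..0xDFFF forms a single character. This is written in Java for consistency with the earlier discussion, and the same scan could be ported to an array of 16-bit units on iOS. The class name Utf16Sketch and both method names are hypothetical, and the encoder does not reject values that fall in the surrogate range itself.

```java
public class Utf16Sketch {
    // Encodes one code point U into one or two 16-bit code units (steps 1-4).
    static char[] encode(int u) {
        if (u < 0 || u > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + u);
        }
        if (u < 0x10000) {
            return new char[] { (char) u };             // step 1
        }
        int uPrime = u - 0x10000;                       // step 2: fits in 20 bits
        char w1 = (char) (0xD800 | (uPrime >> 10));     // steps 3-4: high 10 bits
        char w2 = (char) (0xDC00 | (uPrime & 0x3FF));   // steps 3-4: low 10 bits
        return new char[] { w1, w2 };
    }

    // Counts code points by scanning code units; a valid surrogate pair counts once.
    // An unpaired surrogate is counted as one, like any other code unit.
    static int countCodePoints(CharSequence s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean isHigh = c >= 0xD800 && c <= 0xDBFF;
            if (isHigh && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i++;  // skip the low half of the pair
                }
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);  // D83D DE00
        System.out.println(countCodePoints("a\uD83D\uDE00b"));           // 3
    }
}
```

Counting an unpaired surrogate as one character is a design choice that keeps the scan consistent with Java's codePointCount.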

Original article: https://www.cnblogs.com/shenzhigang/p/5015113.html