zoukankan      html  css  js  c++  java
  • Lua Unicode

    From:http://lua-users.org/wiki/LuaUnicode

    Pattern Matching

    Lua's pattern matching facilities work character by character. In general, this will not work for Unicode pattern matching, although some things will work as you want. For example, "%u" will not match all Unicode upper case letters. You can match individual Unicode characters in a normalized Unicode string, but you might want to worry about combining character sequences. If there are no following combining characters, "a" will match only the letter a in a UTF-8 string. In UTF-16LE you could match "a%z". (Remember that you cannot use \0 in a Lua pattern.)

    Length and string indexing

    If you want to know the length of a Unicode string there are different answers you might want according to the circumstances.

    If you just want to know how many bytes the string occupies, so that you can make space for copying it into a buffer for example, then the existing Lua function string.len will work.

    You might want to know how many Unicode characters are in a string. Depending on the encoding used, a single Unicode character may occupy up to four bytes. Only UTF-32LE and UTF-32BE are constant length encodings (four bytes per character); UTF-32 is mostly a constant length encoding but the first element in a UTF-32 sequence should be a "Byte Order Mark", which does not count as a character. (UTF-32 and variants are part of Unicode with the latest version, Unicode 4.0.)

    Some implementations of UTF-16 assume that all characters are two bytes long, but this has not been true since Unicode version 3.0.

    Happily UTF-8 is designed so that it is relatively easy to count the number of unicode symbols in a string: simply count the number of octets that are in the ranges 0x00 to 0x7f (inclusive) or 0xC2 to 0xF4 (inclusive). (In decimal, 0-127 and 194-244.) These are the codes which can start a UTF-8 character code. Octets 0xC0, 0xC1 and 0xF5 to 0xFF (192, 193 and 245-255) cannot appear in a conforming UTF-8 sequence; octets in the range 0x80 to 0xBF (128-191) can only appear in the second and subsequent octets of a multi-octet encoding. Remember that you cannot use \0 in a Lua pattern.

    For example, you could use the following code snippet to count UTF-8 characters in a string you knew to be conforming (it will incorrectly count some invalid characters):

            local _, count = string.gsub(unicode_string, "[^\128-\193]", "")
    

    If you want to know how many printing columns a Unicode string will occupy when you print it out using a fixed-width font (imagine you are writing something like the Unix ls program that formats its output into several columns), then that is a different answer again. That's because some Unicode characters do not have a printing width, while others are double-width characters. Combining characters are used to add accents to other letters, and generally they do not take up any extra space when printed.

    So that's at least 3 different notions of length that you might want at different times. Lua provides one of them (string.len) the others you'll need to write functions for.

    There's a similar issue with indexing the characters of a string by position. string.sub(s, -3) will return the last 3 bytes of the string which is not necessarily the same as the last three characters of the string, and may or may not be a complete code.

    You could use the following code snippet to iterate over UTF-8 sequences (this will simply skip over most invalid codes):

            for uchar in string.gfind(ustring, "([%z\1-\127\194-\244][\128-\191]*)") do
              -- something
            end
    
  • 相关阅读:
    Java简单游戏开发之碰撞检测
    Android应用程序在新的进程中启动新的Activity的方法和过程分析
    Android应用程序组件Content Provider的共享数据更新通知机制分析
    Android应用程序组件Content Provider在应用程序之间共享数据的原理分析
    VS2005检查内存泄露的简单方法
    iOS如何让xcode自动检查内存泄露
    wf
    VC++ListBox控件的使用
    textoverflow:ellipsis溢出文本显示省略号的详细方法
    VC编写代码生成GUID并转换为CString字符串
  • 原文地址:https://www.cnblogs.com/superchao8/p/2185484.html
Copyright © 2011-2022 走看看