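  • Notes on Character Encoding Conversion

    What is a character encoding?

    A character encoding defines how text is stored as bytes. English letters, for example, are stored in ASCII, one byte per character. Other encodings include UTF-8 (a universal encoding) and regional encodings such as the ISO-8859 family (ISO-8859-1 covers Western Europe), windows-1251 (Cyrillic, used for Russian), and the Chinese GB encodings.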
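To make this concrete, the sketch below dumps the bytes of one character, the Cyrillic letter Ж (U+0416), as it is stored in two encodings. The helper name hex_bytes is made up for this illustration; the byte values are the standard ones for windows-1251 (0xC6) and UTF-8 (0xD0 0x96).

```lua
-- Return the bytes of a string as space-separated hex, e.g. "D0 96".
local function hex_bytes(s)
  local out = {}
  for i = 1, #s do
    out[#out + 1] = string.format("%02X", string.byte(s, i))
  end
  return table.concat(out, " ")
end

-- The same character (Cyrillic "Ж", U+0416) stored in two encodings.
local zhe_cp1251 = "\198"      -- windows-1251: one byte
local zhe_utf8   = "\208\150"  -- UTF-8: two bytes

print("windows-1251:", hex_bytes(zhe_cp1251))
print("UTF-8:       ", hex_bytes(zhe_utf8))
print("ASCII 'A':   ", hex_bytes("A"))
```

The character is the same in both cases; only the stored bytes differ, which is why a program must know the encoding before it can interpret them.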

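    Why is conversion needed?

    Because different regions use different encodings, exchanging information requires converting the same characters from one encoding to another.

    The common interchange encoding is UTF-8, which can represent every Unicode character. For portability, files are therefore usually stored as UTF-8 (for example, web pages that must display several languages), and text in other encodings is generally converted to UTF-8.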
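As an illustration of how UTF-8 covers the whole Unicode range with one to four bytes per character, here is a minimal sketch of an encoder. The function name utf8_encode is hypothetical (it is not part of lua-iconv), and the sketch assumes the input is a valid code point; it does no range or surrogate checking.

```lua
-- Encode one Unicode code point (0 .. 0x10FFFF) as a UTF-8 byte string.
-- Uses only arithmetic so that it also runs on Lua 5.1 (no bit library).
local function utf8_encode(cp)
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    -- 1 byte: 0xxxxxxx (plain ASCII)
    return char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return char(0xC0 + floor(cp / 0x40),
                0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return char(0xE0 + floor(cp / 0x1000),
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return char(0xF0 + floor(cp / 0x40000),
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
end

print(#utf8_encode(0x41))    -- Latin "A"
print(#utf8_encode(0x416))   -- Cyrillic "Ж"
print(#utf8_encode(0x20AC))  -- Euro sign
```

ASCII text is unchanged by this scheme, which is one reason UTF-8 is a convenient target for conversions from older single-byte encodings.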

    The conversion library: libiconv

    http://www.gnu.org/software/libiconv/#introduction

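    The GNU project provides an open-source conversion library that converts between many encodings and Unicode. It can be used on systems that do not provide encoding conversion themselves.

    Project page: http://savannah.gnu.org/projects/libiconv/

    Recent versions of the Linux C library (glibc) already provide iconv, so libiconv does not need to be installed: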

    http://davidgao.github.io/LFSCN/chapter06/glibc.html

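    Some packages outside of LFS suggest installing GNU libiconv in order to translate data from one encoding to another. The project's home page (http://www.gnu.org/software/libiconv/) says "This library provides an iconv() implementation, for use on systems which don't have one, or whose implementation cannot convert from/to Unicode." Glibc provides an iconv() implementation and can convert from/to Unicode, therefore libiconv is not required on an LFS system.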

    lua-iconv

    For Lua, the iconv functionality has been wrapped in a dedicated library, lua-iconv, for use from Lua scripts.

    Official introduction:

    http://ittner.github.io/lua-iconv/#download-and-installation

     local iconv = require("iconv")
    
      cd = iconv.new(to, from)
      cd = iconv.open(to, from)

      nstr, err = cd:iconv(str)
    
        Converts the 'str' string to the desired charset. This method always
        returns two arguments: the converted string and an error code, which
        may have any of the following values:
    
        nil
            No error. Conversion was successful.
    
        iconv.ERROR_NO_MEMORY
            Failed to allocate enough memory in the conversion process.
    
        iconv.ERROR_INVALID
            An invalid character was found in the input sequence.
    
        iconv.ERROR_INCOMPLETE
            An incomplete character was found in the input sequence.
    
        iconv.ERROR_FINALIZED
            Trying to use an already-finalized converter. This usually means
            that the user was tweaking the garbage collector private methods.
    
        iconv.ERROR_UNKNOWN
            There was an unknown error.
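The API quoted above can be wrapped as follows. This is only a sketch: the helper names describe_error and convert are made up for this example, and it assumes lua-iconv is installed under the module name "iconv"; if it is not, the wrapper returns nil and a message instead of raising an error.

```lua
-- Map a lua-iconv error code to a readable message (nil means success).
local function describe_error(iconv, err)
  if err == nil then return nil end
  if err == iconv.ERROR_NO_MEMORY then return "out of memory" end
  if err == iconv.ERROR_INVALID then return "invalid character in input" end
  if err == iconv.ERROR_INCOMPLETE then return "incomplete character in input" end
  if err == iconv.ERROR_FINALIZED then return "converter already finalized" end
  return "unknown error"
end

-- Convert `text` from encoding `from` to encoding `to`.
-- Returns the converted string, or nil plus a message on failure.
local function convert(text, to, from)
  local ok, iconv = pcall(require, "iconv")
  if not ok then
    return nil, "lua-iconv is not installed"
  end
  local cd = iconv.new(to, from)
  if not cd then
    return nil, "unsupported conversion: " .. from .. " -> " .. to
  end
  local out, err = cd:iconv(text)
  if err ~= nil then
    return nil, describe_error(iconv, err)
  end
  return out
end

-- "\207\240\232\226\229\242" is the word "Привет" in windows-1251.
local out, msg = convert("\207\240\232\226\229\242", "utf-8", "windows-1251")
print(out or ("conversion failed: " .. msg))
```

Returning nil plus a message keeps the caller in control: it can fall back to another strategy, such as a precomputed table, when the module is missing.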

    For Lua 5.1, the lua-iconv-5 release is recommended; the latest release, lua-iconv-7, is compatible with Lua 5.2.

    https://github.com/ittner/lua-iconv/releases/tag/lua-iconv-5

    Running the test script after installation produced an error:

    :~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ lua test_iconv.lua
    lua: error loading module 'iconv' from file './iconv.so':
        ./iconv.so: undefined symbol: libiconv_open
    stack traceback:
        [C]: ?
        [C]: in function 'require'
        test_iconv.lua:1: in main chunk
        [C]: ?

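    After some investigation (inspired by this article: http://tonybai.com/2013/04/25/a-libiconv-linkage-problem/),

    the cause turned out to be that libiconv had been installed first, which copied its own iconv.h to /usr/local/include/iconv.h.

    When lua-iconv was then built, gcc found /usr/local/include/iconv.h first while compiling luaiconv.c and used the declarations in that file to produce iconv.so.

    The build should instead have used the iconv.h provided by the system, which is /usr/include/iconv.h.

    The output below is the linker's library search order, not the header search order. The header search order can be printed with: cpp -v /dev/null. By default gcc searches /usr/local/include before /usr/include.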

    :~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ ld -verbose | grep SEARCH
    SEARCH_DIR("=/usr/i686-linux-gnu/lib32"); SEARCH_DIR("=/usr/local/lib32"); SEARCH_DIR("=/lib32"); SEARCH_DIR("=/usr/lib32"); SEARCH_DIR("=/usr/i686-linux-gnu/lib"); SEARCH_DIR("=/usr/local/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/local/lib"); SEARCH_DIR("=/lib/i386-linux-gnu"); SEARCH_DIR("=/lib"); SEARCH_DIR("=/usr/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/lib");

    libiconv: /usr/local/include/iconv.h

    #ifndef LIBICONV_PLUG
    #define iconv_open libiconv_open
    #endif
    extern LIBICONV_DLL_EXPORTED iconv_t iconv_open (const char* tocode, const char* fromcode);

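    Because this header renames iconv_open to libiconv_open, iconv.so ends up referencing libiconv_open, a symbol that glibc does not export. In libiconv's own iconv.c, the definitions around libiconv_open are controlled by a macro that is presumably not enabled here, or lua-iconv was built without linking against libiconv, so the symbol stayed undefined: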

    #if defined __FreeBSD__ && !defined __gnu_freebsd__
    /* GNU libiconv is the native FreeBSD iconv implementation since 2002.
       It wants to define the symbols 'iconv_open', 'iconv', 'iconv_close'.  */
    #define strong_alias(name, aliasname) _strong_alias(name, aliasname)
    #define _strong_alias(name, aliasname) \
      extern __typeof (name) aliasname __attribute__ ((alias (#name)));
    #undef iconv_open
    #undef iconv
    #undef iconv_close
    strong_alias (libiconv_open, iconv_open)
    strong_alias (libiconv, iconv)
    strong_alias (libiconv_close, iconv_close)
    #endif

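    Fix: in the implementation file, change how iconv.h is included, from the standard angle-bracket form to a quoted include with the full path /usr/include/iconv.h.

    Then run make && make install again; the test now runs fine.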

    vim luaiconv.c


    #include <lua.h>
    #include <lauxlib.h>
    #include <stdlib.h>

    #include "/usr/include/iconv.h"
    #include <errno.h>
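After rebuilding, one way to check that the module now loads is to ask Lua to open the shared object directly and print the dynamic linker's message. The sketch below uses the standard package.loadlib function; the helper name check_module is made up for this example, and the path ./iconv.so assumes the script is run from the lua-iconv build directory.

```lua
-- Try to open a C module and look up its entry point without calling it.
-- `path` is the shared object, `entry` the name of its luaopen_* function.
local function check_module(path, entry)
  local fn, err = package.loadlib(path, entry)
  if fn then
    return true, "loaded " .. path
  end
  -- On failure `err` carries the loader's message, for example an
  -- "undefined symbol" report like the one in the traceback above.
  return false, err
end

local ok, msg = check_module("./iconv.so", "luaopen_iconv")
print(ok and "OK:" or "FAILED:", msg)
```

If the message still mentions libiconv_open, the module was compiled against the libiconv header again.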

    For other installation and runtime errors, see:

    https://github.com/ittner/lua-iconv/issues/3

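    Experiment: generating a conversion table

    On some embedded systems, libiconv is not installed and the C library does not implement iconv either, yet character conversion is still needed.

    In that case, install lua-iconv on a build server, use the system's iconv to generate a mapping table from one encoding to another, and then use that table to perform the conversion on the target.

    For example, converting windows-1251 to UTF-8.

    windows-1251 character code reference: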

    http://www.science.co.il/language/Character-code.asp?s=1251

    Lua code that generates the table:

    -- Serialize a Lua value as Lua source text (tables are handled recursively).
    function serializeTable(val, name, skipnewlines, depth)
        skipnewlines = skipnewlines or false
        depth = depth or 0
        local newline = skipnewlines and "" or "\n"
        local tmp = string.rep(" ", depth)
        if type(name) == "number" then
            -- numeric keys need brackets to be valid Lua source
            tmp = tmp .. "[" .. name .. "] = "
        elseif name then
            tmp = tmp .. name .. " = "
        end
        if type(val) == "table" then
            tmp = tmp .. "{" .. newline
            for k, v in pairs(val) do
                tmp = tmp .. serializeTable(v, k, skipnewlines, depth + 1) .. "," .. newline
            end
            tmp = tmp .. string.rep(" ", depth) .. "}"
        elseif type(val) == "number" then
            tmp = tmp .. tostring(val)
        elseif type(val) == "string" then
            tmp = tmp .. string.format("%q", val)
        elseif type(val) == "boolean" then
            tmp = tmp .. (val and "true" or "false")
        else
            tmp = tmp .. "\"[inserializeable datatype:" .. type(val) .. "]\""
        end
        return tmp
    end

    local iconv = require("iconv")
    -- Set your terminal encoding here
    -- local termcs = "iso-8859-1"
    local termcs = "utf-8"

    -- Convert `text` from encoding `from` to encoding `to`, reporting any error.
    function check_one(to, from, text)
      print("\n-- Testing conversion from " .. from .. " to " .. to)
      local cd = iconv.new(to .. "//TRANSLIT", from)
      assert(cd, "Failed to create a converter object.")
      local ostr, err = cd:iconv(text)
      if err == iconv.ERROR_INCOMPLETE then
        print("ERROR: Incomplete input.")
      elseif err == iconv.ERROR_INVALID then
        print("ERROR: Invalid input.")
      elseif err == iconv.ERROR_NO_MEMORY then
        print("ERROR: Failed to allocate memory.")
      elseif err == iconv.ERROR_UNKNOWN then
        print("ERROR: There was an unknown error.")
      end

      print(ostr)
      return ostr
    end

    -- Convert every single-byte value 0..255 and record the bytes of each
    -- result as decimal escapes such as \208\144.
    local result = {}
    local num = 255
    for i = 0, num do
      print("----------------------------------- i="..i)
      local char = string.char(i)
      local ostr = check_one(termcs, "windows-1251", char)
      print(string.len(ostr))
      local byteStr = ""
      for j = 1, string.len(ostr) do
          local byteVal = string.byte(ostr,j)
          print("byte j=" ..j .. " byteVal=".. byteVal)
          byteStr = byteStr .. "\\" .. byteVal
      end
      print("char i=" ..i .. " byteStr=".. byteStr)
      table.insert(result, byteStr)  -- entry i + 1 describes byte i
    end

    print("-----------------------------------!!")
    local s = serializeTable(result)
    print(s)
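To carry the result from the build server to the target, the table can be written out as a Lua chunk and loaded there with dofile. The sketch below is a variation on the script above, not the author's method: it stores the raw converted bytes rather than escape text, and the names dump_array and save_table are made up for this example.

```lua
-- Serialize an array of strings as a Lua chunk that returns the array.
-- %q quotes each string so that it can be read back by the Lua loader.
local function dump_array(list)
  local lines = { "return {" }
  for i = 1, #list do
    lines[#lines + 1] = string.format("  [%d] = %q,", i, list[i])
  end
  lines[#lines + 1] = "}"
  return table.concat(lines, "\n")
end

-- Write the chunk to `path`; load it back later with dofile(path).
local function save_table(list, path)
  local f = assert(io.open(path, "wb"))
  f:write(dump_array(list), "\n")
  f:close()
end

-- Demo with three real UTF-8 byte sequences: "A", Cyrillic "А", Euro sign.
local sample = { "A", "\208\144", "\226\130\172" }
local path = os.tmpname()
save_table(sample, path)
local loaded = dofile(path)
os.remove(path)
print(#loaded, loaded[2] == sample[2])
```

On the target, only the generated file and a few lines of lookup code are needed; neither iconv nor lua-iconv has to be present.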

    The tidied-up windows-1251 to UTF-8 table (entry n holds the UTF-8 bytes for windows-1251 byte n - 1):

    local transTbl_1251toutf8 = {
     [1] = "",  -- byte 0x00 (empty in the original output)
     [2] = "\1",
     [3] = "\2",
     [4] = "\3",
     [5] = "\4",
     [6] = "\5",
     [7] = "\6",
     [8] = "\7",
     [9] = "\8",
     [10] = "\9",
     [11] = "\10",
     [12] = "\11",
     [13] = "\12",
     [14] = "\13",
     [15] = "\14",
     [16] = "\15",
     [17] = "\16",
     [18] = "\17",
     [19] = "\18",
     [20] = "\19",
     [21] = "\20",
     [22] = "\21",
     [23] = "\22",
     [24] = "\23",
     [25] = "\24",
     [26] = "\25",
     [27] = "\26",
     [28] = "\27",
     [29] = "\28",
     [30] = "\29",
     [31] = "\30",
     [32] = "\31",
     [33] = "\32",
     [34] = "\33",
     [35] = "\34",
     [36] = "\35",
     [37] = "\36",
     [38] = "\37",
     [39] = "\38",
     [40] = "\39",
     [41] = "\40",
     [42] = "\41",
     [43] = "\42",
     [44] = "\43",
     [45] = "\44",
     [46] = "\45",
     [47] = "\46",
     [48] = "\47",
     [49] = "\48",
     [50] = "\49",
     [51] = "\50",
     [52] = "\51",
     [53] = "\52",
     [54] = "\53",
     [55] = "\54",
     [56] = "\55",
     [57] = "\56",
     [58] = "\57",
     [59] = "\58",
     [60] = "\59",
     [61] = "\60",
     [62] = "\61",
     [63] = "\62",
     [64] = "\63",
     [65] = "\64",
     [66] = "\65",
     [67] = "\66",
     [68] = "\67",
     [69] = "\68",
     [70] = "\69",
     [71] = "\70",
     [72] = "\71",
     [73] = "\72",
     [74] = "\73",
     [75] = "\74",
     [76] = "\75",
     [77] = "\76",
     [78] = "\77",
     [79] = "\78",
     [80] = "\79",
     [81] = "\80",
     [82] = "\81",
     [83] = "\82",
     [84] = "\83",
     [85] = "\84",
     [86] = "\85",
     [87] = "\86",
     [88] = "\87",
     [89] = "\88",
     [90] = "\89",
     [91] = "\90",
     [92] = "\91",
     [93] = "\92",
     [94] = "\93",
     [95] = "\94",
     [96] = "\95",
     [97] = "\96",
     [98] = "\97",
     [99] = "\98",
     [100] = "\99",
     [101] = "\100",
     [102] = "\101",
     [103] = "\102",
     [104] = "\103",
     [105] = "\104",
     [106] = "\105",
     [107] = "\106",
     [108] = "\107",
     [109] = "\108",
     [110] = "\109",
     [111] = "\110",
     [112] = "\111",
     [113] = "\112",
     [114] = "\113",
     [115] = "\114",
     [116] = "\115",
     [117] = "\116",
     [118] = "\117",
     [119] = "\118",
     [120] = "\119",
     [121] = "\120",
     [122] = "\121",
     [123] = "\122",
     [124] = "\123",
     [125] = "\124",
     [126] = "\125",
     [127] = "\126",
     [128] = "\127",
     [129] = "\208\130",
     [130] = "\208\131",
     [131] = "\226\128\154",
     [132] = "\209\147",
     [133] = "\226\128\158",
     [134] = "\226\128\166",
     [135] = "\226\128\160",
     [136] = "\226\128\161",
     [137] = "\226\130\172",
     [138] = "\226\128\176",
     [139] = "\208\137",
     [140] = "\226\128\185",
     [141] = "\208\138",
     [142] = "\208\140",
     [143] = "\208\139",
     [144] = "\208\143",
     [145] = "\209\146",
     [146] = "\226\128\152",
     [147] = "\226\128\153",
     [148] = "\226\128\156",
     [149] = "\226\128\157",
     [150] = "\226\128\162",
     [151] = "\226\128\147",
     [152] = "\226\128\148",
     [153] = "",  -- byte 0x98 is unassigned in windows-1251
     [154] = "\226\132\162",
     [155] = "\209\153",
     [156] = "\226\128\186",
     [157] = "\209\154",
     [158] = "\209\156",
     [159] = "\209\155",
     [160] = "\209\159",
     [161] = "\194\160",
     [162] = "\208\142",
     [163] = "\209\158",
     [164] = "\208\136",
     [165] = "\194\164",
     [166] = "\210\144",
     [167] = "\194\166",
     [168] = "\194\167",
     [169] = "\208\129",
     [170] = "\194\169",
     [171] = "\208\132",
     [172] = "\194\171",
     [173] = "\194\172",
     [174] = "\194\173",
     [175] = "\194\174",
     [176] = "\208\135",
     [177] = "\194\176",
     [178] = "\194\177",
     [179] = "\208\134",
     [180] = "\209\150",
     [181] = "\210\145",
     [182] = "\194\181",
     [183] = "\194\182",
     [184] = "\194\183",
     [185] = "\209\145",
     [186] = "\226\132\150",
     [187] = "\209\148",
     [188] = "\194\187",
     [189] = "\209\152",
     [190] = "\208\133",
     [191] = "\209\149",
     [192] = "\209\151",
     [193] = "\208\144",
     [194] = "\208\145",
     [195] = "\208\146",
     [196] = "\208\147",
     [197] = "\208\148",
     [198] = "\208\149",
     [199] = "\208\150",
     [200] = "\208\151",
     [201] = "\208\152",
     [202] = "\208\153",
     [203] = "\208\154",
     [204] = "\208\155",
     [205] = "\208\156",
     [206] = "\208\157",
     [207] = "\208\158",
     [208] = "\208\159",
     [209] = "\208\160",
     [210] = "\208\161",
     [211] = "\208\162",
     [212] = "\208\163",
     [213] = "\208\164",
     [214] = "\208\165",
     [215] = "\208\166",
     [216] = "\208\167",
     [217] = "\208\168",
     [218] = "\208\169",
     [219] = "\208\170",
     [220] = "\208\171",
     [221] = "\208\172",
     [222] = "\208\173",
     [223] = "\208\174",
     [224] = "\208\175",
     [225] = "\208\176",
     [226] = "\208\177",
     [227] = "\208\178",
     [228] = "\208\179",
     [229] = "\208\180",
     [230] = "\208\181",
     [231] = "\208\182",
     [232] = "\208\183",
     [233] = "\208\184",
     [234] = "\208\185",
     [235] = "\208\186",
     [236] = "\208\187",
     [237] = "\208\188",
     [238] = "\208\189",
     [239] = "\208\190",
     [240] = "\208\191",
     [241] = "\209\128",
     [242] = "\209\129",
     [243] = "\209\130",
     [244] = "\209\131",
     [245] = "\209\132",
     [246] = "\209\133",
     [247] = "\209\134",
     [248] = "\209\135",
     [249] = "\209\136",
     [250] = "\209\137",
     [251] = "\209\138",
     [252] = "\209\139",
     [253] = "\209\140",
     [254] = "\209\141",
     [255] = "\209\142",
     [256] = "\209\143",
    }
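With such a table on the target, the conversion itself is a byte-by-byte lookup. The sketch below is one possible way to apply it, assuming the entries are stored as raw UTF-8 byte strings indexed by byte value + 1. The names cp1251_to_utf8 and convert_with_table are made up for this example, and only the six letters needed for the demo are copied from the table above.

```lua
-- Excerpt of the table above as raw bytes: index = byte value + 1.
local cp1251_to_utf8 = {
  [208] = "\208\159",  -- byte 0xCF, "П"
  [227] = "\208\178",  -- byte 0xE2, "в"
  [230] = "\208\181",  -- byte 0xE5, "е"
  [233] = "\208\184",  -- byte 0xE8, "и"
  [241] = "\209\128",  -- byte 0xF0, "р"
  [243] = "\209\130",  -- byte 0xF2, "т"
}

-- Convert a windows-1251 string to UTF-8 with a lookup table.
-- ASCII bytes pass through unchanged; bytes without a mapping become "?".
local function convert_with_table(s, tbl)
  return (s:gsub("[\128-\255]", function(c)
    local mapped = tbl[string.byte(c) + 1]
    if mapped == nil or mapped == "" then
      return "?"
    end
    return mapped
  end))
end

local input = "\207\240\232\226\229\242"  -- "Привет" in windows-1251
print(convert_with_table(input, cp1251_to_utf8))
```

Replacing unmapped bytes with "?" is a design choice for this sketch; a caller that must not lose data could instead return an error when a byte such as 0x98 has no mapping.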
  • Original article: https://www.cnblogs.com/lightsong/p/4634642.html