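What is a character encoding?
A character encoding is the format in which a computer stores text. For example, English letters are stored in ASCII, one byte per character. Other encodings include UTF-8 (a universal encoding that covers Unicode) and regional encodings such as ISO-8859-1 (Western European), windows-1251 (Cyrillic, e.g. Russian), and the GB encodings for Chinese.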
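To make the difference between encodings concrete, here is a minimal sketch that prints the bytes of the same character in different encodings. The helper name bytes_of is hypothetical (it does not come from any library), and the strings are written as decimal byte escapes so the example does not depend on the encoding of the source file.

```lua
-- Return the bytes of a string as space-separated hex pairs.
-- bytes_of is a hypothetical helper written for this illustration.
local function bytes_of(s)
  local out = {}
  for i = 1, #s do
    out[i] = string.format("%02X", string.byte(s, i))
  end
  return table.concat(out, " ")
end

local ascii_a  = "A"         -- ASCII letter A: one byte
local zhe_1251 = "\198"      -- Cyrillic letter Zhe in windows-1251: one byte (0xC6)
local zhe_utf8 = "\208\150"  -- the same letter in UTF-8: two bytes (0xD0 0x96)

print(bytes_of(ascii_a))   -- 41
print(bytes_of(zhe_1251))  -- C6
print(bytes_of(zhe_utf8))  -- D0 96
```

The same letter occupies one byte in windows-1251 and two bytes in UTF-8, which is why text cannot simply be copied between systems that assume different encodings.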
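Why is conversion needed?
Because different regions use different encodings, exchanging information requires converting the same characters from one encoding to another.
The universal encoding is UTF-8, which can represent every character in Unicode and therefore the characters of essentially all of the world's writing systems. For the sake of portability, files are generally encoded in UTF-8 (for example, web pages that must display several languages), and text in other encodings is usually converted to UTF-8.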
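How UTF-8 manages to cover the whole Unicode range can be sketched in a few lines: a code point is split into 6-bit groups and spread over one to four bytes. The function name utf8_encode is hypothetical, the sketch uses only arithmetic so that it also runs on Lua 5.1, and it does no validation (surrogates and values above 0x10FFFF are not rejected).

```lua
-- Encode one Unicode code point as a UTF-8 byte string.
-- utf8_encode is a hypothetical helper; it performs no range validation.
local function utf8_encode(cp)
  local floor = math.floor
  if cp < 0x80 then
    -- 1 byte: 0xxxxxxx
    return string.char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return string.char(0xC0 + floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(0xE0 + floor(cp / 0x1000),
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(0xF0 + floor(cp / 0x40000),
                       0x80 + floor(cp / 0x1000) % 0x40,
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

print(#utf8_encode(0x41))    -- 1 (Latin capital A)
print(#utf8_encode(0x416))   -- 2 (Cyrillic capital Zhe)
print(#utf8_encode(0x20AC))  -- 3 (euro sign)
```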
The conversion library: libiconv
http://www.gnu.org/software/libiconv/#introduction
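The GNU project provides an open-source conversion library that supports conversion between many encodings and Unicode. It can be used on systems that do not provide encoding conversion themselves.
Project page: http://savannah.gnu.org/projects/libiconv/
Recent Linux C libraries (glibc) already provide iconv conversion, so libiconv does not need to be installed: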
http://davidgao.github.io/LFSCN/chapter06/glibc.html
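Some packages outside LFS recommend installing GNU libiconv for converting text encodings. The project's home page (http://www.gnu.org/software/libiconv/) says: "This library provides an iconv() implementation, for use on systems which don't have one, or whose implementation cannot convert from/to Unicode." Glibc provides an iconv() implementation and can convert from/to Unicode, so libiconv is not needed on an LFS system.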
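lua-iconv
In the mature Lua ecosystem, the iconv functionality has been wrapped in a dedicated library, lua-iconv, for use by Lua application scripts.
Official introduction: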
http://ittner.github.io/lua-iconv/#download-and-installation
local iconv = require("iconv")
cd = iconv.new(to, from)
cd = iconv.open(to, from)
nstr, err = cd:iconv(str)
Converts the 'str' string to the desired charset. This method always returns two arguments: the converted string and an error code, which may have any of the following values:
nil: No error. Conversion was successful.
iconv.ERROR_NO_MEMORY: Failed to allocate enough memory in the conversion process.
iconv.ERROR_INVALID: An invalid character was found in the input sequence.
iconv.ERROR_INCOMPLETE: An incomplete character was found in the input sequence.
iconv.ERROR_FINALIZED: Trying to use an already-finalized converter. This usually means that the user was tweaking the garbage collector private methods.
iconv.ERROR_UNKNOWN: There was an unknown error.
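A minimal usage sketch of the API quoted above, assuming lua-iconv is installed and the underlying iconv knows the charset names "utf-8" and "windows-1251". The wrapper name convert and its error strings are hypothetical; only iconv.new, cd:iconv and the iconv.ERROR_* constants come from the library documentation. The module is loaded with pcall so that the script reports a message instead of aborting when lua-iconv is missing.

```lua
-- Load lua-iconv defensively; `available` is false if the module cannot be loaded.
local available, iconv = pcall(require, "iconv")

-- convert is a hypothetical wrapper: it returns the converted string,
-- or nil plus a readable message on failure.
local function convert(to, from, text)
  if not available then
    return nil, "lua-iconv is not installed"
  end
  local cd = iconv.new(to, from)
  if not cd then
    return nil, "unsupported conversion: " .. from .. " -> " .. to
  end
  local out, err = cd:iconv(text)
  if err == nil then
    return out
  elseif err == iconv.ERROR_INVALID then
    return nil, "invalid character in input"
  elseif err == iconv.ERROR_INCOMPLETE then
    return nil, "incomplete character in input"
  elseif err == iconv.ERROR_NO_MEMORY then
    return nil, "out of memory"
  else
    return nil, "unknown error"
  end
end

-- "\198" is the Cyrillic letter Zhe in windows-1251.
local out, err = convert("utf-8", "windows-1251", "\198")
print(out or ("conversion failed: " .. err))
```

Mapping the error codes to messages in one place keeps the calling code free of comparisons against the library constants.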
For Lua 5.1, the lua-iconv-5 release is recommended; the latest release, -7, is compatible with Lua 5.2.
https://github.com/ittner/lua-iconv/releases/tag/lua-iconv-5
Running it after installation fails with an error:
:~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ lua test_iconv.lua
lua: error loading module 'iconv' from file './iconv.so':
./iconv.so: undefined symbol: libiconv_open
stack traceback:
[C]: ?
[C]: in function 'require'
test_iconv.lua:1: in main chunk
[C]: ?
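After some investigation (inspired by this post: http://tonybai.com/2013/04/25/a-libiconv-linkage-problem/), the cause turned out to be as follows.
libiconv had been installed earlier, which copied its iconv.h to /usr/local/include/iconv.h.
When the lua-iconv project was then built, gcc found /usr/local/include/iconv.h first while compiling luaiconv.c, so iconv.so was built against the function declarations in that file.
It should actually have been built against the system's iconv.h, which is /usr/include/iconv.h.
Search order (the command below actually prints the linker's library search directories; gcc's header search order, in which /usr/local/include comes before /usr/include, can be printed with: echo | gcc -E -v -x c -):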
:~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ ld -verbose | grep SEARCH
SEARCH_DIR("=/usr/i686-linux-gnu/lib32"); SEARCH_DIR("=/usr/local/lib32"); SEARCH_DIR("=/lib32"); SEARCH_DIR("=/usr/lib32"); SEARCH_DIR("=/usr/i686-linux-gnu/lib"); SEARCH_DIR("=/usr/local/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/local/lib"); SEARCH_DIR("=/lib/i386-linux-gnu"); SEARCH_DIR("=/lib"); SEARCH_DIR("=/usr/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/lib");
libiconv: /usr/local/include/iconv.h
#ifndef LIBICONV_PLUG
#define iconv_open libiconv_open
#endif
extern LIBICONV_DLL_EXPORTED iconv_t iconv_open (const char* tocode, const char* fromcode);
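libiconv: in iconv.c, the alias for libiconv_open is guarded by a macro, which is presumably not enabled here, or lua-iconv was built without linking against the libiconv library. Either way, the header above renames iconv_open to libiconv_open, a symbol that glibc does not export, so iconv.so fails to load.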
#if defined __FreeBSD__ && !defined __gnu_freebsd__
/* GNU libiconv is the native FreeBSD iconv implementation since 2002.
It wants to define the symbols 'iconv_open', 'iconv', 'iconv_close'. */
#define strong_alias(name, aliasname) _strong_alias(name, aliasname)
#define _strong_alias(name, aliasname) \
  extern __typeof (name) aliasname __attribute__ ((alias (#name)));
#undef iconv_open
#undef iconv
#undef iconv_close
strong_alias (libiconv_open, iconv_open)
strong_alias (libiconv, iconv)
strong_alias (libiconv_close, iconv_close)
#endif
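Fix: in the implementation file, change how iconv.h is included, from the standard angle-bracket form to a quoted include with the full path /usr/include/iconv.h.
Then run make && make install again; it now runs fine.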
vim luaiconv.c
#include <lua.h>
#include <lauxlib.h>
#include <stdlib.h>
#include "/usr/include/iconv.h"
#include <errno.h>
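After rebuilding, a quick load check can confirm that the module resolves its symbols. The sketch below is only an illustration: check_module is a hypothetical helper, and matching the text "undefined symbol" assumes the dynamic loader words its message as in the error shown earlier, which may differ on other platforms.

```lua
-- Try to load a module and classify the outcome.
-- check_module is a hypothetical helper written for this illustration.
local function check_module(name)
  local ok, result = pcall(require, name)
  if ok then
    return "loaded"
  end
  local msg = tostring(result)
  -- Plain-text search (4th argument true) for the loader's wording.
  if msg:find("undefined symbol", 1, true) then
    return "undefined symbol", msg
  end
  return "not loadable", msg
end

print(check_module("iconv"))
```

A result of "undefined symbol" points at a header or link mismatch like the one described here, while "not loadable" usually means the module is simply not on package.cpath.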
For other installation and runtime errors, see:
https://github.com/ittner/lua-iconv/issues/3
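Experiment: generating a conversion table
Some embedded systems have neither libiconv installed nor an iconv implementation in their libc, yet still need character conversion.
In that case, lua-iconv can be installed on the build server and the system's iconv used to generate a mapping table from one encoding to another; the conversion is then implemented on the target with that table.
For example, converting windows-1251 to UTF-8.
Reference for the windows-1251 character encoding: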
http://www.science.co.il/language/Character-code.asp?s=1251
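How such a table would be used on the target can be sketched as a plain lookup per input byte, with no iconv involved. Everything named here is hypothetical: build_partial_table fills in only the ASCII range and the Cyrillic letters at 0xC0..0xFF (which windows-1251 maps to U+0410..U+044F), not the full table, and entry b + 1 holds the UTF-8 string for byte b, matching a table built with table.insert starting from byte 0.

```lua
-- Build a partial windows-1251 -> UTF-8 table (ASCII and the main Cyrillic block only).
-- Index b + 1 holds the UTF-8 string for input byte b.
local function build_partial_table()
  local tbl = {}
  for b = 0, 0x7F do
    tbl[b + 1] = string.char(b)                        -- ASCII is unchanged
  end
  for b = 0xC0, 0xEF do
    tbl[b + 1] = "\208" .. string.char(0x90 + b - 0xC0)  -- U+0410..U+043F
  end
  for b = 0xF0, 0xFF do
    tbl[b + 1] = "\209" .. string.char(0x80 + b - 0xF0)  -- U+0440..U+044F
  end
  return tbl
end

-- Convert a byte string by table lookup; unmapped bytes become "?".
local function convert_with_table(tbl, input)
  local out = {}
  for i = 1, #input do
    local b = string.byte(input, i)
    out[#out + 1] = tbl[b + 1] or "?"
  end
  return table.concat(out)
end

local tbl = build_partial_table()
-- A six-letter Russian word given as windows-1251 bytes.
local utf8_text = convert_with_table(tbl, "\207\240\232\226\229\242")
print(#utf8_text)  -- 12 (each Cyrillic letter takes two bytes in UTF-8)
```

Since windows-1251 is a single-byte encoding, the lookup needs no state between bytes, which is what makes a flat 256-entry table sufficient.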
Lua code that generates the table:
-- Serialize a Lua value as text; tables are serialized recursively.
function serializeTable(val, name, skipnewlines, depth)
    skipnewlines = skipnewlines or false
    depth = depth or 0
    local tmp = string.rep(" ", depth)
    if name then tmp = tmp .. name .. " = " end
    if type(val) == "table" then
        tmp = tmp .. "{" .. (not skipnewlines and "\n" or "")
        for k, v in pairs(val) do
            tmp = tmp .. serializeTable(v, k, skipnewlines, depth + 1) .. "," .. (not skipnewlines and "\n" or "")
        end
        tmp = tmp .. string.rep(" ", depth) .. "}"
    elseif type(val) == "number" then
        tmp = tmp .. tostring(val)
    elseif type(val) == "string" then
        tmp = tmp .. string.format("%q", val)
    elseif type(val) == "boolean" then
        tmp = tmp .. (val and "true" or "false")
    else
        tmp = tmp .. "\"[inserializeable datatype:" .. type(val) .. "]\""
    end
    return tmp
end

local iconv = require("iconv")

-- Set your terminal encoding here
-- local termcs = "iso-8859-1"
local termcs = "utf-8"

-- Convert text from one charset to another, printing any error.
function check_one(to, from, text)
    print("\n-- Testing conversion from " .. from .. " to " .. to)
    local cd = iconv.new(to .. "//TRANSLIT", from)
    assert(cd, "Failed to create a converter object.")
    local ostr, err = cd:iconv(text)
    if err == iconv.ERROR_INCOMPLETE then
        print("ERROR: Incomplete input.")
    elseif err == iconv.ERROR_INVALID then
        print("ERROR: Invalid input.")
    elseif err == iconv.ERROR_NO_MEMORY then
        print("ERROR: Failed to allocate memory.")
    elseif err == iconv.ERROR_UNKNOWN then
        print("ERROR: There was an unknown error.")
    end
    print(ostr)
    return ostr
end

-- Convert every byte value 0..255 and record its UTF-8 bytes.
local result = {}
local num = 255
for i = 0, num do
    print("----------------------------------- i=" .. i)
    local char = string.char(i)
    local ostr = check_one(termcs, "windows-1251", char)
    print(string.len(ostr))
    local byteStr = ""
    for j = 1, string.len(ostr) do
        local byteVal = string.byte(ostr, j)
        print("byte j=" .. j .. " byteVal=" .. byteVal)
        -- Append a decimal escape such as \208 for each output byte.
        byteStr = byteStr .. "\\" .. byteVal
    end
    print("char i=" .. i .. " byteStr=" .. byteStr)
    table.insert(result, byteStr)
end
print("-----------------------------------!!")
s = serializeTable(result)
print(s)
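The output of the script above still needs manual tidying before it can be loaded as Lua, because numeric keys are written as 1 = ... rather than [1] = .... A possible alternative, sketched here under the assumption that the table holds raw UTF-8 byte strings, is to emit bracketed keys and three-digit decimal escapes directly. The names escape_bytes and serialize_array are hypothetical, and the two-entry sample stands in for the full 256-entry result.

```lua
-- Turn a byte string into Lua decimal escapes, e.g. "\208\150" -> \208\150 as text.
local function escape_bytes(s)
  local parts = {}
  for i = 1, #s do
    parts[i] = string.format("\\%03d", string.byte(s, i))
  end
  return table.concat(parts)
end

-- Emit an array of byte strings as loadable Lua source that returns the table.
local function serialize_array(name, arr)
  local lines = { "local " .. name .. " = {" }
  for i, v in ipairs(arr) do
    lines[#lines + 1] = string.format("  [%d] = \"%s\",", i, escape_bytes(v))
  end
  lines[#lines + 1] = "}"
  lines[#lines + 1] = "return " .. name
  return table.concat(lines, "\n")
end

-- Two sample entries only; a real table would have one entry per input byte.
local sample = { "A", "\208\150" }
local source = serialize_array("transTbl", sample)
print(source)

-- Round trip: loadstring on Lua 5.1, load on 5.2 and later.
local loader = loadstring or load
local reloaded = assert(loader(source))()
print(reloaded[2] == sample[2])  -- true
```

Writing every byte as a three-digit escape keeps the generated file pure ASCII, so it survives editors and transfers that might otherwise alter non-ASCII bytes.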
The tidied-up windows-1251 to UTF-8 table:
local transTbl_1251toutf8 = { 1 = "