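The iconv command is a file-encoding conversion tool for Linux/Unix platforms. When you view a text file in a Linux/Unix shell, its Chinese text often shows up garbled. This happens because the file's encoding differs from the encoding configured for the current operating system. In that case iconv can convert the file to the right encoding, which fixes the garbled text.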
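Fixing a garbled text file takes four steps: 1. determine the file's encoding; 2. check that iconv supports converting that encoding; 3. determine the encoding used by the Linux/Unix system; 4. convert the file to the system's encoding. The steps below use the file test.txt as an example.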
1. Use the file command to determine the file's encoding:
- $ file -bi test.txt | sed -e 's/.*[ ]charset=//' | tr '[a-z]' '[A-Z]'
- ISO-8859-1
So file reports that test.txt is encoded as ISO-8859-1.
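The pipeline in step 1 can be wrapped in a small helper so it is easier to reuse. The following is only a sketch: the function name `mime_charset` is made up for illustration, and it assumes the input looks like the `type; charset=name` line that `file -bi` prints.

```shell
# mime_charset: read a MIME description such as
#   "text/plain; charset=iso-8859-1"
# on standard input and print only the charset, in upper case.
# (Hypothetical helper name; not part of file or iconv.)
mime_charset() {
    # Drop everything up to and including "charset=", then upper-case the rest.
    # The POSIX character classes avoid the bracket typos that are easy to
    # make with hand-written ranges such as '[a-z]'.
    sed -e 's/.*charset=//' | tr '[:lower:]' '[:upper:]'
}
```

Typical use would be `file -bi test.txt | mime_charset`. The result is still only the guess made by file, so it can be wrong for short files or for files in a Chinese encoding.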
2. Use iconv -l to check whether iconv supports converting this encoding:
- $ iconv -l | grep ISO-8859-1
- ISO-8859-1//
- ISO-8859-10//
- ISO-8859-11//
- ISO-8859-13//
- ISO-8859-14//
- ISO-8859-15//
- ISO-8859-16//
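Because the layout of the `iconv -l` listing differs between iconv implementations, grepping it can give false matches (as above, where ISO-8859-10 to ISO-8859-16 also match). An alternative sketch is to ask iconv to convert empty input and look at its exit status. The function name `iconv_supports` is hypothetical, and the sketch assumes UTF-8 is available as a target encoding.

```shell
# iconv_supports ENCODING: succeed if this iconv accepts ENCODING as an
# input encoding. (Hypothetical helper; assumes UTF-8 is a valid target.)
iconv_supports() {
    # Converting empty input does no real work, but iconv still has to
    # look up the encoding name and exits non-zero if it is unknown.
    iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1
}
```

For example, `iconv_supports GBK && echo "GBK is supported"` prints the message only when the conversion can be set up.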
3. Determine the encoding used by the Linux/Unix system:
- $ echo $LANG
- zh_CN.UTF-8
The current system environment's encoding is "UTF-8".
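The encoding is the part of the locale name after the dot. As a rough sketch, it can be extracted with plain shell parameter expansion. The function name `locale_charset` is invented for this example, and the fallback order (LC_ALL, then LC_CTYPE, then LANG) follows the usual locale precedence.

```shell
# locale_charset [LOCALE]: print the charset part of a locale name such as
# "zh_CN.UTF-8". With no argument, use LC_ALL, LC_CTYPE or LANG.
# (Hypothetical helper name.)
locale_charset() {
    lc_name=${1:-${LC_ALL:-${LC_CTYPE:-${LANG:-}}}}
    case $lc_name in
        *.*)
            lc_codeset=${lc_name#*.}               # drop "zh_CN."
            printf '%s\n' "${lc_codeset%%@*}"      # drop a modifier like "@euro"
            ;;
        *)
            return 1                               # no charset in the name, e.g. "C"
            ;;
    esac
}
```

On systems that provide it, `locale charmap` reports the active charset directly and also covers locale names that carry no explicit charset.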
4. Convert the encoding:
- $ iconv -f ISO-8859-1 -t UTF-8 test.txt
- 测试
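Note: the file command often misidentifies the encoding. If the converted output is still garbled, try replacing the input encoding passed to iconv -f with other common encodings: GBK, BIG5, HZ, GB2312, GB18030, ASCII.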
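Trying the candidate encodings by hand gets tedious, so the loop can be scripted. The sketch below uses a made-up function name, `try_encodings`, and assumes the target encoding is UTF-8. It prints the first converted line for every candidate that iconv accepts without an error.

```shell
# try_encodings FILE [ENCODING...]: for each candidate input encoding,
# print "ENCODING: first converted line" if iconv converts FILE cleanly.
# (Hypothetical helper; the default list is the one from the note above.)
try_encodings() {
    te_file=$1
    shift
    [ "$#" -gt 0 ] || set -- GBK BIG5 HZ GB2312 GB18030 ASCII
    for te_enc in "$@"; do
        # iconv exits non-zero on an unknown encoding or an invalid byte sequence.
        if te_out=$(iconv -f "$te_enc" -t UTF-8 "$te_file" 2>/dev/null); then
            printf '%s: %s\n' "$te_enc" "$(printf '%s\n' "$te_out" | head -n 1)"
        fi
    done
}
```

Running `try_encodings test.txt` lists the candidates side by side so the readable one can be picked by eye. A clean conversion is not proof of the right encoding: single-byte encodings such as ISO-8859-1 accept any byte sequence, so the output still has to be checked by a person.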
Detailed syntax of the iconv command:
iconv [options...] file
Options:
-f input encoding
-t output encoding
-l list all known encodings
-o output file
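With -o the converted text goes to a separate file, for example `iconv -f GBK -t UTF-8 -o out.txt test.txt`. Writing the result back over the input needs a temporary file, because the input must not be truncated while iconv is still reading it. A minimal sketch, with a hypothetical function name `convert_in_place` and shell redirection in place of -o:

```shell
# convert_in_place FROM TO FILE: re-encode FILE from FROM to TO, replacing
# its contents only if the whole conversion succeeded.
# (Hypothetical helper; relies on mktemp being available.)
convert_in_place() {
    cip_from=$1
    cip_to=$2
    cip_target=$3
    cip_tmp=$(mktemp) || return 1
    if iconv -f "$cip_from" -t "$cip_to" "$cip_target" >"$cip_tmp"; then
        # Copy the bytes back instead of using mv, so the original file
        # keeps its permissions and owner.
        cat "$cip_tmp" >"$cip_target"
        rm -f "$cip_tmp"
    else
        # Conversion failed: leave the original file untouched.
        rm -f "$cip_tmp"
        return 1
    fi
}
```

For example, `convert_in_place GBK UTF-8 test.txt` rewrites test.txt as UTF-8. Keeping a backup copy before running it on important files is a sensible precaution.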
Appendix: character encoding tables
Character sets:
| ISO-8859-2 | ISO 8859-2 standard; ISO Latin 2 |
| ISO-8859-4 | ISO 8859-4 standard; Latin 4 |
| ISO-8859-5 | ISO 8859-5 standard; ISO Cyrillic |
| ISO-8859-13 | ISO 8859-13 standard; ISO Baltic; Latin 7 |
| ISO-8859-16 | ISO 8859-16 standard |
| CP1125 | MS-Windows code page 1125 |
| CP1250 | MS-Windows code page 1250 |
| CP1251 | MS-Windows code page 1251 |
| CP1257 | MS-Windows code page 1257; WinBaltRim |
| IBM852 | IBM/MS code page 852; PC (DOS) Latin 2 |
| IBM855 | IBM/MS code page 855 |
| IBM775 | IBM/MS code page 775 |
| IBM866 | IBM/MS code page 866 |
| baltic | ISO-IR-179; Baltic |
| KEYBCS2 | Kamenicky encoding; KEYBCS2 |
| macce | Macintosh Central European |
| maccyr | Macintosh Cyrillic |
| ECMA-113 | Ecma Cyrillic; ECMA-113 |
| KOI-8_CS_2 | KOI8-CS2 code ('T602') |
| KOI8-R | KOI8-R Cyrillic |
| KOI8-U | KOI8-U Cyrillic |
| KOI8-UNI | KOI8-Unified Cyrillic |
| TeX | (La)TeX control sequences |
| UCS-2 | Universal character set 2 bytes; UCS-2; BMP |
| UCS-4 | Universal character set 4 bytes; UCS-4; ISO-10646 |
| UTF-7 | Universal transformation format 7 bits; UTF-7 |
| UTF-8 | Universal transformation format 8 bits; UTF-8 |
| CORK | Cork encoding; T1 |
| GBK | Simplified Chinese National Standard; GB2312 |
| BIG5 | Traditional Chinese Industrial Standard; Big5 |
| HZ | HZ encoded GB2312 |
Line terminators:
| /LF | LF line terminators |
| /CRLF | CRLF line terminators |
| N.A. | Mixed line terminators |
| N.A. | Surrounded by/intermixed with non-text data |
| /21 | Byte order reversed in pairs (1,2 -> 2,1) |
| /4321 | Byte order reversed in quadruples (1,2,3,4 -> 4,3,2,1) |
| N.A. | Both little and big endian chunks, concatenated |
| /qp | Quoted-printable encoded |
Encodings by language:
| Bulgarian | CP1251 ISO-8859-5 IBM855 maccyr ECMA-113 |
| Czech | ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK |
| Estonian | ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic |
| Croatian | CP1250 ISO-8859-2 IBM852 macce CORK |
| Hungarian | ISO-8859-2 CP1250 IBM852 macce CORK |
| Lithuanian | CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic |
| Latvian | CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic |
| Polish | ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK |
| Russian | KOI8-R CP1251 ISO-8859-5 IBM866 maccyr |
| Slovak | CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK |
| Slovene | ISO-8859-2 CP1250 IBM852 macce CORK |
| Ukrainian | CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr |
| Chinese | GBK BIG5 HZ |