  • Converting text encodings on Linux for subtitle processing

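    When processing subtitles, converting text encodings on Linux is a real nuisance.

    Steps: first detect the encoding with Python, then convert it with iconv, then fix up the format with awk.

    Can't file detect the encoding? It can, but its guess is sometimes wrong.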
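To see why a detector can get this wrong, here is a minimal Python sketch. The sample string is an assumption chosen for illustration, not text from the subtitle files. GB2312 bytes are usually not valid UTF-8, but they always decode as iso-8859-1, because that encoding assigns a character to every byte value. A wrong iso-8859-1 guess therefore raises no error and simply produces garbage.

```python
# Sample text (an illustrative assumption) encoded the way the subtitles are
text = "中文"
raw = text.encode("gb2312")

# iso-8859-1 maps every byte to a character, so this never raises
as_latin1 = raw.decode("iso-8859-1")

# The same bytes are rejected by the UTF-8 decoder
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1))       # mojibake, one character per byte
print("valid utf-8:", utf8_ok)
```

The garbled string can still be turned back into the original bytes, which is why re-decoding with the right encoding fixes the text.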

    1. Detecting the encoding with Python

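    $ cat t1.py
    # -*- coding: utf-8 -*-
    import sys

    # Candidate encodings, tried in order. GB2312 was the right one for these
    # subtitles. iso-8859-1 accepts any byte sequence, so it never fails and
    # gives wrong text for Chinese input; it is only a last resort.
    ENCODINGS = ['utf-8', 'GB2312', 'gbk', 'GB18030', 'iso-8859-1']

    # f1 = open(sys.argv[2], 'w')
    with open(sys.argv[1], 'rb') as f:
        for raw in f:
            # Decode line by line, because the encoding is not consistent within the file
            for enc in ENCODINGS:
                try:
                    line = raw.decode(enc)
                except UnicodeDecodeError:
                    continue
                if enc != 'utf-8':
                    print(enc)  # show which fallback encoding was used
                break

            line = line.strip()  # strip leading/trailing spaces, tabs, CR and LF
            print(line)
            # f1.write(line)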

    This, too, was worked out by trial and error.
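The trial-and-error step can be made a little more systematic. The sketch below uses a hypothetical helper, `encodings_that_decode` (the name and the default candidate list are assumptions), which reports every candidate that decodes a byte string without error. A successful decode does not prove the encoding is right, only that it is possible, so the result still needs a human look.

```python
def encodings_that_decode(data, candidates=("utf-8", "gb2312", "gbk", "gb18030", "iso-8859-1")):
    """Return the candidate encodings that decode `data` without raising."""
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


# Illustrative input: a short Chinese string stored as GB2312 bytes
sample = "中文".encode("gb2312")
print(encodings_that_decode(sample))
```

Because iso-8859-1 accepts anything, it always appears in the result. It is only informative when it is the sole survivor.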

    To detect the encoding with file instead:   file -b --mime-encoding  text

    2. Converting with iconv:        iconv -f "GB2312" -t "utf-8" Ep._20:Valar_Morghulis.ass >  Ep._20:Valar_Morghulis.txt

    Reference:  http://kjetilvalle.com/posts/text-file-encodings.html
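For comparison, the same conversion can be sketched in Python. This is not what the post uses; `transcode` and `transcode_file` are hypothetical helper names, and the default encodings simply mirror the iconv command above.

```python
def transcode(data, from_enc="GB2312", to_enc="utf-8"):
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(from_enc).encode(to_enc)


def transcode_file(src, dst, from_enc="GB2312", to_enc="utf-8"):
    """Rough equivalent of: iconv -f from_enc -t to_enc src > dst"""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(transcode(data, from_enc, to_enc))


# Illustrative round trip on an in-memory sample
gb_bytes = "中文".encode("gb2312")
print(transcode(gb_bytes).decode("utf-8"))
```

Like iconv, this fails loudly (here with `UnicodeDecodeError`) when the input is not valid in the source encoding.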

    Putting it together:

    $ cat readme.sh
    #!/bin/sh
    TO='utf-8'
    for i in *.ass
    do
        FROM=$(file -b --mime-encoding "$i")
        p=$(basename "$i" .ass)
        # file reports these GB2312 subtitles as iso-8859-1, so override that guess
        [ "$FROM" = "iso-8859-1" ] && FROM="GB2312"
        iconv -f "$FROM" -t "$TO" "$i" > "${p}.txt"
        # Print the timing, the text before \N and the last tagged segment of each 正文 line
        awk -F',,' '/Dialogue.*正文/{split($0,arr,",正文,,");split($3,brr,"N");split($3,crr,"{");print "\n"arr[1]"\n" brr[1]"\n"crr[length(crr)-1]}' "${p}.txt" | sed -e 's/.*}//g' -e 's/\\$//g' > "${p}.norm"
    done
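The awk and sed step is tightly fitted to these particular files. As a rough illustration of the same idea, here is a Python sketch that pulls the timing and text out of a Dialogue line. It assumes the common ASS field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text), which a real file declares in its own Format line. The function name `parse_dialogue` and the sample line are hypothetical, and the output is not identical to what the script above writes.

```python
import re

# ASS override blocks such as {\fs14} or {\an8}
TAG_RE = re.compile(r"\{[^}]*\}")


def parse_dialogue(line, style="正文"):
    """Return (start, end, [text lines]) for a Dialogue line of the given style, else None."""
    if not line.startswith("Dialogue:"):
        return None
    # The text field may itself contain commas, so split at most 9 times
    fields = line[len("Dialogue:"):].strip().split(",", 9)
    if len(fields) < 10 or fields[3] != style:
        return None
    start, end = fields[1], fields[2]
    text = TAG_RE.sub("", fields[9])
    # \N is the hard line break inside an ASS text field
    parts = [p.strip() for p in text.split("\\N") if p.strip()]
    return start, end, parts


# Hypothetical sample line, written as a raw string so \N stays literal
sample = r"Dialogue: 0,0:00:01.00,0:00:03.00,正文,,0,0,0,,凡人皆有一死\N{\fs14}All men must die"
print(parse_dialogue(sample))
```

Splitting on the declared fields rather than on `,,` makes the sketch less dependent on which fields happen to be empty in a given file.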
  • Original article: https://www.cnblogs.com/dahu-daqing/p/10307757.html