  • Text Preprocessing

    Text preprocessing is an essential part of NLP tasks.

    Conversion from Traditional Chinese to Simplified Chinese

    The code below depends on two Python scripts, langconv.py and zh_wiki.py, which can be found here.

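    # langconv.py and zh_wiki.py must be on the import path
    from langconv import Converter
    
    sentence = "xxxxx"
    # 'zh-hans' selects Simplified Chinese as the target script
    sentence = Converter('zh-hans').convert(sentence)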
    

    Conversion from full-width symbols to half-width symbols

    According to the Unicode Character Table and Baidu Encyclopedia, the fullwidth ASCII variants range from 65281 (U+FF01) to 65374 (U+FF5E), and their halfwidth counterparts range from 33 (U+0021) to 126 (U+007E), so the offset between the two forms is 65248. The one exception is the space character, whose fullwidth and halfwidth forms are 12288 (U+3000) and 32 (U+0020) respectively.
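
    The offset can be checked directly with ord() and chr(). The snippet below is only a small sketch of that arithmetic; the names OFFSET and pairs are illustrative and do not come from the original post.

```python
# Fullwidth ASCII variants sit at a fixed offset from their halfwidth forms.
OFFSET = 0xFF01 - 0x0021  # 65248

# Pair every fullwidth variant (U+FF01..U+FF5E) with its halfwidth counterpart.
pairs = [(chr(code), chr(code - OFFSET)) for code in range(0xFF01, 0xFF5F)]

print(OFFSET)
print(pairs[:3])

# The space character does not follow the same offset: U+3000 vs U+0020.
space_gap = 0x3000 - 0x0020
print(space_gap == OFFSET)  # False, so the space needs its own rule
```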

    There is a workable conversion solution on cnblogs, but it ignores the fact that some Chinese punctuation marks are exactly the fullwidth forms of English characters, so applying it can unexpectedly change Chinese punctuation. Since the procedure is straightforward, my modified version is posted here.

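    def full_width_to_half_width(ustring):
        """Convert fullwidth ASCII variants to halfwidth, keeping Chinese punctuation."""
        # Fullwidth ! ( ) , . : ; ? double as Chinese punctuation, so leave them unchanged.
        # (Named `keep` rather than `filter` to avoid shadowing the built-in.)
        keep = {65281, 65288, 65289, 65292, 65294, 65306, 65307, 65311}
    
        chars = []
        for uchar in ustring:
            inside_code = ord(uchar)
            if inside_code == 12288:  # fullwidth (ideographic) space
                inside_code = 32
            elif inside_code not in keep and 65281 <= inside_code <= 65374:
                inside_code -= 65248  # fixed offset between the two forms
            chars.append(chr(inside_code))
        return "".join(chars)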
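
    As an alternative sketch, the same mapping can be precomputed once as a translation table and applied with str.translate, which avoids the per-character loop in Python code. The names KEEP, TABLE and full_width_to_half_width_fast are hypothetical and not part of the original post; the set of preserved code points is copied from the function above.

```python
# Code points left untouched: fullwidth ! ( ) , . : ; ?
KEEP = {65281, 65288, 65289, 65292, 65294, 65306, 65307, 65311}

# Map each remaining fullwidth ASCII variant to its halfwidth counterpart.
TABLE = {code: code - 65248 for code in range(65281, 65375) if code not in KEEP}
TABLE[12288] = 32  # the fullwidth space is handled separately


def full_width_to_half_width_fast(text):
    """Apply the precomputed table; unmapped characters pass through unchanged."""
    return text.translate(TABLE)


print(full_width_to_half_width_fast("Ｈｅｌｌｏ，　ｗｏｒｌｄ！　１２３"))
```

    Letters, digits and the fullwidth space are converted, while the Chinese comma and exclamation mark are preserved.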
    
  • Original article: https://www.cnblogs.com/YoungF/p/13412981.html