  • Text Preprocessing

    Text preprocessing is an essential part of NLP tasks.

    Conversion from Traditional Chinese to Simplified Chinese

    The code below depends on two Python scripts, langconv.py and zh_wiki.py, which can be found here.

    from langconv import Converter
    
    sentence = "xxxxx"
    # 'zh-hans' selects Simplified Chinese as the target script
    sentence = Converter('zh-hans').convert(sentence)
    

    Conversion from full-width symbols to half-width symbols

    According to the Unicode Character Table and Baidu Encyclopedia, the fullwidth ASCII variants range from 65281 (U+FF01) to 65374 (U+FF5E), and their halfwidth counterparts range from 33 (U+0021) to 126 (U+007E), so the two forms differ by a constant offset of 65248. The only exception is the space character, whose fullwidth and halfwidth forms are 12288 (U+3000) and 32 (U+0020) respectively.
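    The offset can be checked against Python's standard library. The sketch below is only a sanity check, not part of the conversion itself; the helper name offset_agrees_with_nfkc is made up for illustration. It compares the subtraction with Unicode NFKC normalization, which also folds fullwidth forms into their ASCII counterparts.

```python
import unicodedata

# Gap between a fullwidth ASCII variant and its halfwidth counterpart.
OFFSET = 0xFF01 - 0x21  # 65281 - 33 = 65248


def offset_agrees_with_nfkc():
    """Return True if subtracting OFFSET matches NFKC for U+FF01..U+FF5E."""
    return all(
        chr(code - OFFSET) == unicodedata.normalize("NFKC", chr(code))
        for code in range(0xFF01, 0xFF5E + 1)
    )


# The space is the exception: U+3000 maps to U+0020, a gap of 12256.
space_gap = 0x3000 - 0x20
```

    Note that NFKC normalization changes many other characters as well, so it is useful here only as a cross-check of the arithmetic.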

    There is a workable conversion solution on cnblogs, but it ignores the fact that some Chinese punctuation marks are themselves the fullwidth forms of ASCII characters, so converting the whole range alters Chinese punctuation unexpectedly. Since the procedure is straightforward, my modified version, which leaves those punctuation marks untouched, is posted here.

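    def full_width_to_half_width(ustring):
        """Convert fullwidth ASCII variants in ustring to their halfwidth forms."""
        # Fullwidth ! ( ) , . : ; ? double as Chinese punctuation, so keep them.
        keep = {65281, 65288, 65289, 65292, 65294, 65306, 65307, 65311}
    
        chars = []
        for uchar in ustring:
            inside_code = ord(uchar)
            if inside_code == 12288:
                # The fullwidth (ideographic) space maps to the ASCII space.
                inside_code = 32
            elif inside_code not in keep and 65281 <= inside_code <= 65374:
                # Every other fullwidth ASCII variant sits 65248 above its counterpart.
                inside_code -= 65248
            chars.append(chr(inside_code))
        return "".join(chars)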
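    The same mapping can also be written as a translation table for str.translate, which avoids the explicit loop. This is a minimal sketch rather than a drop-in replacement: the names KEEP, TABLE and to_half_width are hypothetical, and the set of preserved punctuation is assumed to be the same as in the function above.

```python
# Fullwidth code points that double as Chinese punctuation and are left as-is.
KEEP = {65281, 65288, 65289, 65292, 65294, 65306, 65307, 65311}

# Map every other fullwidth ASCII variant to its halfwidth counterpart.
TABLE = {code: code - 65248 for code in range(65281, 65375) if code not in KEEP}
TABLE[12288] = 32  # fullwidth (ideographic) space -> ASCII space


def to_half_width(text):
    """Return text with fullwidth letters, digits and symbols made halfwidth."""
    return text.translate(TABLE)


# Letters and digits are converted; the Chinese comma and exclamation mark stay.
print(to_half_width("ABC,123!"))  # prints "ABC,123!"
```

    Building the table once and reusing it is convenient when many sentences are processed, since each call then only does the lookup.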
    
  • Original article: https://www.cnblogs.com/YoungF/p/13412981.html