  • Text Preprocessing

    Text preprocessing is an essential part of NLP tasks.

    Conversion from Traditional Chinese to Simplified Chinese

    The code below depends on two Python scripts, langconv.py and zh_wiki.py, which can be found here. Both files must be importable, for example by placing them in the same directory as your script.

    from langconv import Converter
    
    sentence = "xxxxx"
    # 'zh-hans' selects Simplified Chinese as the target variant
    sentence = Converter('zh-hans').convert(sentence)
    

    Conversion from full-width symbols to half-width symbols

    According to the Unicode Character Table and Baidu Encyclopedia, the fullwidth ASCII variants range from 65281 (U+FF01) to 65374 (U+FF5E), and their halfwidth counterparts range from 33 (U+0021) to 126 (U+007E), so the offset between the two forms is 65248. The only exception is the space character, whose fullwidth and halfwidth forms are 12288 (U+3000) and 32 (U+0020) respectively.
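
    The offset can be checked directly in Python. The snippet below is only a small sketch and not part of the original post; the names OFFSET and to_halfwidth_char are made up for illustration.

```python
# Offset between the fullwidth block (U+FF01..U+FF5E) and ASCII (U+0021..U+007E).
OFFSET = 0xFF01 - 0x21  # 65248


def to_halfwidth_char(ch):
    """Map one fullwidth character to its halfwidth form (hypothetical helper)."""
    code = ord(ch)
    if code == 0x3000:            # fullwidth (ideographic) space
        return " "
    if 0xFF01 <= code <= 0xFF5E:  # fullwidth ASCII variants
        return chr(code - OFFSET)
    return ch                     # everything else is left untouched


print(OFFSET)
print(to_halfwidth_char("\uff21"))  # U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A
```

    Note that this helper converts every character in the fullwidth range, including punctuation such as the fullwidth comma, which is often not what you want for Chinese text.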

    There is a workable conversion routine on cnblogs, but it overlooks the fact that some Chinese punctuation marks are themselves fullwidth forms of ASCII characters, so it may unintentionally alter Chinese punctuation. Since the procedure is straightforward, my modified version is posted here.

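    def full_width_to_half_width(ustring):
        """Convert fullwidth ASCII variants to halfwidth, keeping Chinese punctuation."""
        # Fullwidth marks that also serve as Chinese punctuation: ！（），．：；？
        preserved = {65281, 65288, 65289, 65292, 65294, 65306, 65307, 65311}
    
        chars = []
        for uchar in ustring:
            inside_code = ord(uchar)
            if inside_code == 12288:  # fullwidth space -> ASCII space
                inside_code = 32
            elif inside_code not in preserved and 65281 <= inside_code <= 65374:
                inside_code -= 65248  # shift into the ASCII range
            chars.append(chr(inside_code))
        return "".join(chars)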
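
    As an alternative sketch (not the author's code), the same mapping can be built once as a translation table and applied with the built-in str.translate, which avoids the per-character loop. The names KEEP, TABLE and full_to_half are hypothetical; the excluded code points are the same eight as in the function above, written in hexadecimal.

```python
# Fullwidth marks kept as-is because they double as Chinese punctuation
# (the same eight code points as in the function above).
KEEP = {0xFF01, 0xFF08, 0xFF09, 0xFF0C, 0xFF0E, 0xFF1A, 0xFF1B, 0xFF1F}

# Build the mapping once: fullwidth code point -> halfwidth code point.
TABLE = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F) if code not in KEEP}
TABLE[0x3000] = 0x20  # fullwidth space -> ASCII space


def full_to_half(text):
    """Convert fullwidth characters with str.translate (illustrative name)."""
    return text.translate(TABLE)


# Fullwidth "A", fullwidth comma, fullwidth space, fullwidth "1":
# the letter, space and digit are converted, the comma is preserved.
print(full_to_half("\uff21\uff0c\u3000\uff11"))
```

    Building the table up front is mainly worthwhile when many strings are processed; for a one-off conversion the explicit loop is just as good.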
    
  • Original article: https://www.cnblogs.com/YoungF/p/13412981.html