  • Text Preprocessing

    Text preprocessing is an essential part of NLP tasks.

    Conversion from Traditional Chinese to Simplified Chinese

    The code below depends on two Python scripts, langconv.py and zh_wiki.py, which can be found here.

    from langconv import Converter
    
    sentence = "xxxxx"
    # 'zh-hans' selects Simplified Chinese as the conversion target
    sentence = Converter('zh-hans').convert(sentence)
    

    Conversion from full-width symbols to half-width symbols

    According to the Unicode Character Table and Baidu Encyclopedia, the fullwidth ASCII variants run from 65281 (U+FF01) to 65374 (U+FF5E), and their halfwidth counterparts run from 33 (U+0021) to 126 (U+007E), so each pair differs by a fixed offset of 65248 (0xFEE0). The only exception is the space character, whose fullwidth and halfwidth forms are 12288 (U+3000) and 32 (U+0020) respectively.
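
    The fixed offset can be checked directly with ord. This is only a minimal sketch; the sample characters and variable names are chosen here for illustration and are not part of the original post.

```python
# Fullwidth ASCII variants sit a fixed distance above their ASCII counterparts.
OFFSET = 0xFF01 - 0x0021  # 65248

# (fullwidth, halfwidth) sample pairs, written as escapes to avoid encoding issues.
pairs = [
    ("\uff01", "!"),  # fullwidth exclamation mark
    ("\uff21", "A"),  # fullwidth capital A
    ("\uff19", "9"),  # fullwidth digit nine
    ("\uff5e", "~"),  # fullwidth tilde
]
offsets = [ord(full) - ord(half) for full, half in pairs]

# The space character is the exception: U+3000 versus U+0020.
space_offset = ord("\u3000") - ord(" ")

print(OFFSET, offsets, space_offset)
```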

    There is a workable conversion solution on cnblogs, but it ignores the fact that some Chinese punctuation marks are exactly the fullwidth forms of English characters, so it may modify Chinese punctuation unintentionally. Since the procedure is straightforward, my modified version is posted here.

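    def full_width_to_half_width(ustring):
        # Fullwidth punctuation that also serves as Chinese punctuation
        # (！ （ ） ， ． ： ； ？) and must be left unchanged.
        keep = {65281, 65288, 65289, 65292, 65294, 65306, 65307, 65311}
    
        chars = []
        for uchar in ustring:
            inside_code = ord(uchar)
            if inside_code == 12288:
                # fullwidth (ideographic) space -> ASCII space
                inside_code = 32
            elif inside_code not in keep and 65281 <= inside_code <= 65374:
                # shift into the ASCII range 33..126
                inside_code -= 65248
            chars.append(chr(inside_code))
        return "".join(chars)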
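
    The same mapping could also be expressed as a translation table for str.translate, which avoids the per-character loop. This is a sketch under the same assumptions as the function above; the names KEEP, TABLE and to_half_width, and the sample string, are hypothetical and not taken from the original post.

```python
# Fullwidth punctuation that also serves as Chinese punctuation; left unchanged.
KEEP = {65281, 65288, 65289, 65292, 65294, 65306, 65307, 65311}

# Map every other fullwidth ASCII variant (65281..65374) to its halfwidth form.
TABLE = {code: code - 65248 for code in range(65281, 65375) if code not in KEEP}
TABLE[12288] = 32  # fullwidth (ideographic) space -> ASCII space


def to_half_width(text):
    """Convert fullwidth ASCII variants to halfwidth, keeping Chinese punctuation."""
    return text.translate(TABLE)


# Fullwidth letters and digits are converted; the comma and exclamation mark are kept.
sample = "Ｈｅｌｌｏ，\u3000ｗｏｒｌｄ！\u3000１２３"
result = to_half_width(sample)
print(result)
```

    The table is built once and reused, so repeated calls only pay for the translate step.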
    
  • Original article: https://www.cnblogs.com/YoungF/p/13412981.html