Text Preprocessing
Text preprocessing is an essential part of NLP tasks.

Conversion from Traditional Chinese to Simplified Chinese

The code below depends on two Python scripts, langconv.py and zh_wiki.py, which can be found here.
```
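# langconv.py and zh_wiki.py must be importable (e.g. in the same directory)
from langconv import Converter

sentence = "xxxxx"
# 'zh-hans' selects Simplified Chinese as the target
sentence = Converter('zh-hans').convert(sentence)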
```
Conversion from full-width symbols to half-width symbols

According to the Unicode Character Table and Baidu Encyclopedia, the fullwidth ASCII variants range from 65281 (U+FF01) to 65374 (U+FF5E), and their halfwidth counterparts range from 33 (U+0021) to 126 (U+007E), so the offset between the two forms is 65248. The exception is the space character, whose fullwidth and halfwidth forms are 12288 (U+3000) and 32 (U+0020) respectively.
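
The arithmetic above can be checked directly in Python. This is only a small sanity check; the name `OFFSET` is chosen here for illustration and does not come from the original post.

```python
# Distance between the first fullwidth ASCII variant (U+FF01)
# and the first printable halfwidth ASCII character (U+0021).
OFFSET = 0xFF01 - 0x0021
print(OFFSET)  # 65248

# The same offset holds at the end of the range (U+FF5E vs U+007E).
print(0xFF5E - 0x007E == OFFSET)  # True

# Subtracting the offset maps a fullwidth letter to its halfwidth form.
print(chr(ord("Ａ") - OFFSET))  # A

# The space character does not follow the rule: U+3000 vs U+0020.
print(ord("\u3000") - ord(" "))  # 12256
```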

There is a workable conversion solution on cnblogs, but it ignores the fact that some Chinese punctuation marks are exactly the fullwidth forms of ASCII characters, so it may modify Chinese punctuation unexpectedly. Since the procedure is straightforward, my modified version is posted here.
```
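def full_width_to_half_width(ustring):
    # Fullwidth punctuation that also serves as Chinese punctuation; left as-is:
    # ！ （ ） ， ． ： ； ？
    keep = {65281, 65288, 65289, 65292, 65294, 65306, 65307, 65311}

    chars = []
    for uchar in ustring:
        inside_code = ord(uchar)
        if inside_code == 12288:
            # Ideographic space (U+3000) -> ASCII space (U+0020)
            inside_code = 32
        elif inside_code not in keep and 65281 <= inside_code <= 65374:
            # Other fullwidth ASCII variants sit 65248 above their halfwidth forms
            inside_code -= 65248
        chars.append(chr(inside_code))
    return "".join(chars)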
```
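The same mapping could also be expressed as a translation table for `str.translate`, which avoids the explicit loop. The following is a minimal sketch, assuming the same eight punctuation marks should be preserved; the names `KEEP_FULLWIDTH`, `TABLE` and `to_half_width` are hypothetical and not part of the original post.

```python
# Code points left untouched: fullwidth ！ （ ） ， ． ： ； ？
KEEP_FULLWIDTH = {65281, 65288, 65289, 65292, 65294, 65306, 65307, 65311}

# Map every other fullwidth ASCII variant to its halfwidth counterpart.
TABLE = {
    cp: cp - 65248
    for cp in range(65281, 65375)
    if cp not in KEEP_FULLWIDTH
}
TABLE[12288] = 32  # ideographic space -> ASCII space


def to_half_width(text):
    """Convert fullwidth characters using the precomputed table."""
    return text.translate(TABLE)


print(to_half_width("Ｈｅｌｌｏ，　ｗｏｒｌｄ！　１２３"))
# Hello， world！ 123
```

Letters, digits and the ideographic space are converted, while the fullwidth comma and exclamation mark are preserved because they are in the kept set.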
Original article: https://www.cnblogs.com/YoungF/p/13412981.html
