Text Normalization

2. Text Normalization

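Before carrying out any further analysis or NLP, we first need to normalize the corpus of text documents. To do so, we will reuse the normalization module, and we will also apply some new techniques that are specific to this content.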

After analyzing many corpora, we carefully selected some new words and added them to the stopword list, as the following code shows:

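import nltk

# Start from NLTK's English stopwords (the 'stopwords' corpus must be
# downloaded) and extend the list with generic words found in the corpora.
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list = stopword_list + ['mr', 'mrs', 'come', 'go', 'get',
                                 'tell', 'listen', 'one', 'two', 'three',
                                 'four', 'five', 'six', 'seven', 'eight',
                                 'nine', 'zero', 'join', 'find', 'make',
                                 'say', 'ask', 'see', 'try', 'back',
                                 'also']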
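
To illustrate how an extended stopword list acts on tokens, here is a minimal sketch. It assumes a small hand-written base list in place of NLTK's English stopwords so that it runs without downloading any corpus, and `filter_stopwords` is a hypothetical helper, not the `remove_stopwords` function from the normalization module.

```python
# Stand-in for nltk.corpus.stopwords.words('english'); an assumption made so
# that the sketch runs without the NLTK corpus.
base_stopwords = ['the', 'a', 'an', 'to', 'and', 'of', 'is', 'in']
extra_stopwords = ['mr', 'mrs', 'come', 'go', 'get', 'tell', 'say', 'ask', 'also']

# A set drops accidental duplicates and makes membership tests fast.
stopword_set = set(base_stopwords) | set(extra_stopwords)

def filter_stopwords(tokens, stopwords=stopword_set):
    """Return the tokens whose lowercase form is not in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]

tokens = 'Mr Smith also asked the clerk to get a receipt'.split()
filtered = filter_stopwords(tokens)
print(filtered)
```

Note that 'asked' is kept even though 'ask' is a stopword, because only exact matches are removed. This is one reason to lemmatize the text before removing stopwords.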

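Most of the newly added words are generic verbs or nouns that carry little meaning. Adding them to the stopword list is very useful for feature extraction in text clustering. We also added a new function to the normalization pipeline, which uses a regular expression to extract the text tokens (tokens that contain at least one letter) from the body of the text, as shown below: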

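import re

def keep_text_characters(text):
    # Keep only the tokens that contain at least one ASCII letter.
    # tokenize_text() comes from the normalization module used earlier.
    filtered_tokens = []
    tokens = tokenize_text(text)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text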
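
The effect of this filter can be seen in a small self-contained sketch. It assumes whitespace splitting as a stand-in for `tokenize_text()`, and the names `simple_tokenize` and `keep_text_tokens` are hypothetical.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for tokenize_text(): split on whitespace.
    return text.split()

def keep_text_tokens(text):
    # Keep tokens that contain at least one ASCII letter, drop the rest.
    tokens = simple_tokenize(text)
    return ' '.join(t for t in tokens if re.search('[a-zA-Z]', t))

sample = 'released 2009 rated 8.5 by 3d fans 100 %'
result = keep_text_tokens(sample)
print(result)
```

Purely numeric or symbolic tokens such as '2009', '8.5' and '%' are dropped, while a mixed token such as '3d' is kept because it contains a letter.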

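We add the new function, together with the functions used repeatedly in earlier sections (expanding contractions, decoding HTML, tokenization, removing stopwords and special characters, and lemmatization), to the final normalization function, as follows: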

import html

def normalize_corpus(corpus, lemmatize=True,
                     only_text_chars=False,
                     tokenize=False):
    normalized_corpus = []
    for text in corpus:
        # Decode HTML entities such as &amp; (html_parser.unescape() no
        # longer exists on Python 3.9+, so html.unescape() is used here).
        text = html.unescape(text)
        text = expand_contractions(text, CONTRACTION_MAP)
        if lemmatize:
            text = lemmatize_text(text)
        else:
            text = text.lower()
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        if only_text_chars:
            text = keep_text_characters(text)
        # Optionally return each document as a list of tokens.
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
    return normalized_corpus

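This function is very similar to the normalization functions discussed earlier. The only addition is the keep_text_characters() function, which keeps text tokens and runs when the only_text_chars parameter is set to True.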
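
To show how the only_text_chars and tokenize flags change the output, here is a simplified, self-contained sketch of the same pipeline. It is an approximation under stated assumptions: `normalize_sketch` is a hypothetical name, the stopword set is a tiny stand-in, lowercasing replaces lemmatization, contraction expansion is omitted, and the regular expressions stand in for the helper functions of the normalization module.

```python
import html
import re

STOPWORDS = {'the', 'a', 'is', 'and', 'in', 'of', 'also'}  # tiny stand-in list

def normalize_sketch(corpus, only_text_chars=False, tokenize=False):
    normalized = []
    for text in corpus:
        # Decode HTML entities and lowercase (no lemmatization in this sketch).
        text = html.unescape(text).lower()
        # Replace anything that is not a letter, digit or whitespace.
        text = re.sub(r'[^a-z0-9\s]', ' ', text)
        # Whitespace tokenization followed by stopword removal.
        tokens = [t for t in text.split() if t not in STOPWORDS]
        if only_text_chars:
            # Same idea as keep_text_characters(): require at least one letter.
            tokens = [t for t in tokens if re.search('[a-z]', t)]
        normalized.append(tokens if tokenize else ' '.join(tokens))
    return normalized

docs = ['The film &amp; the sequel grossed 300 million in 2009!']
plain = normalize_sketch(docs)
text_only = normalize_sketch(docs, only_text_chars=True)
tokenized = normalize_sketch(docs, only_text_chars=True, tokenize=True)
print(plain, text_only, tokenized, sep='\n')
```

With only_text_chars left at False the numbers '300' and '2009' survive. Setting it to True removes them, and tokenize=True returns each document as a list of tokens instead of a string.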

Original article: https://www.cnblogs.com/dalton/p/11354009.html