The data packages it uses can be downloaded here: http://www.nltk.org/nltk_data/
1. WordNetLemmatizer for lemmatization
Determine a word's lemma:
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()  # determine the lemma
print(lemmatizer.lemmatize('gathering', 'v'))  # as a verb
print(lemmatizer.lemmatize('gathering', 'n'))  # as a noun
Output:
gather
gathering
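Note that lemmatize defaults to pos='n', so verbs stay unchanged unless you pass the part of speech explicitly. A minimal sketch of driving the lemmatizer from pos_tag output (the helper name penn_to_wordnet is my own, not an NLTK API):

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def penn_to_wordnet(tag):
    # map a Penn Treebank tag to the matching WordNet POS constant; default to noun
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
for word, tag in nltk.pos_tag(nltk.word_tokenize("The cats were gathering around the fallen leaves.")):
    print(word, '->', lemmatizer.lemmatize(word.lower(), penn_to_wordnet(tag)))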
2. word_tokenize for tokenization
https://kite.com/python/docs/nltk.word_tokenize
Tokenization:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."
print(word_tokenize(sentence))
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', ',', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
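For comparison, a plain str.split leaves punctuation and contractions glued to the words, which is exactly what word_tokenize fixes:

sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."
print(sentence.split())
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning,', 'Arthur', "didn't", 'feel', 'very', 'good.']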
Update 2020-09-18 ————————
1. Basic functions
https://www.jianshu.com/p/9d232e4a3c28
- Sentence splitting
- Tokenization
- Part-of-speech tagging
- Named entity recognition
Sentence splitting and tokenization:
import nltk

# sentence splitting; simple sentences like these are handled well
sents = nltk.sent_tokenize("And now for something completely different. I love you.")
word = []
for s in sents:
    print(s)
# tokenize within each sentence
for sent in sents:
    word.append(nltk.word_tokenize(sent))
print(word)
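For reference, with the punkt model installed the snippet above should print something like:

And now for something completely different.
I love you.
[['And', 'now', 'for', 'something', 'completely', 'different', '.'], ['I', 'love', 'you', '.']]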
Part-of-speech tagging:
import nltk
nltk.download('averaged_perceptron_tagger')

s = "And now for something completely different."
text = nltk.word_tokenize(s)
print(text)
# POS tagging; the input must already be tokenized, otherwise each single char is treated as a unit
tagged = nltk.pos_tag(text)
print(tagged)
Output:
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ'), ('.', '.')]
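If a tag is unfamiliar, NLTK can print its definition via nltk.help.upenn_tagset (this needs the 'tagsets' data package):

import nltk
nltk.download('tagsets')

# look up what the Penn Treebank tags mean
nltk.help.upenn_tagset('CC')  # coordinating conjunction
nltk.help.upenn_tagset('RB')  # adverb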
Chunking:
nltk.download('maxent_ne_chunker')
nltk.download('words')

# ne_chunk requires POS-tagged input; it groups the tagged tokens into chunks
entities = nltk.chunk.ne_chunk(tagged)
print(entities)
Output:
Tree('S', [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ'), ('.', '.')])
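The sentence above contains no named entities, so the resulting tree is flat. With a sentence that does contain names, ne_chunk wraps them in labeled subtrees. A minimal sketch (the example sentence is mine; the exact labels depend on the model):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

s = "Arthur moved to London to work for the BBC."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(s)))
print(tree)
# entities such as 'London' and 'BBC' should come back as labeled subtrees,
# e.g. (GPE London/NNP) and (ORGANIZATION BBC/NNP)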
2. FreqDist: the frequency distribution of words in a text
import nltk
from nltk.probability import FreqDist  # FreqDist actually lives in nltk.probability (nltk.book merely re-exports it)

text1 = nltk.word_tokenize("And now for something completely different. I love you. This is my friend. You are my friend.")
# FreqDist gives the frequency distribution of every token appearing in the text
fdist = FreqDist(text1)
print(fdist)
# total number of tokens
print(fdist.N())
# number of distinct tokens
print(fdist.B())
Output:
<FreqDist with 16 samples and 21 outcomes>
21
16
# relative frequency (as a percentage)
print(fdist.freq('friend') * 100)
# raw count
print(fdist['friend'])
# the most frequent token
print(fdist.max())
Output:
9.523809523809524
2
.
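FreqDist has a few more handy inspection methods; a small sketch continuing with the fdist above:

# the n most frequent tokens with their counts
print(fdist.most_common(3))  # e.g. [('.', 4), ('my', 2), ('friend', 2)]
# print a small frequency table to stdout
fdist.tabulate(5)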
Further on, the original article also has code for stemming an entire text; I read through it carefully, but since I don't need it at the moment, I won't paste it here.
3. The Text and TextCollection classes
The former analyzes a single text; the latter is a collection of Text objects and can compute corpus-level statistics such as a term's inverse document frequency (IDF).
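A minimal sketch of the two classes (the example sentences are my own; TextCollection exposes tf, idf, and tf_idf):

import nltk
from nltk.text import Text, TextCollection

nltk.download('punkt')

docs = [
    nltk.word_tokenize("I love my friend."),
    nltk.word_tokenize("You are my friend."),
    nltk.word_tokenize("And now for something completely different."),
]

# Text wraps a single tokenized text for analysis
t = Text(docs[0])
print(t.count('friend'))

# TextCollection wraps a set of texts and adds corpus-level statistics
corpus = TextCollection(docs)
print(corpus.idf('friend'))              # inverse document frequency of 'friend'
print(corpus.tf_idf('friend', docs[0]))  # tf-idf of 'friend' in the first text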