  • Common NLTK functions

    The required data packages can be downloaded here: http://www.nltk.org/nltk_data/

    1. WordNetLemmatizer: lemmatization

    Finding the base form (lemma) of a word:

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download('wordnet')
    lemmatizer = WordNetLemmatizer()  # determines the lemma of a word
    print(lemmatizer.lemmatize('gathering', 'v'))
    print(lemmatizer.lemmatize('gathering', 'n'))

    Output:

    gather
    gathering

    2. word_tokenize: tokenization

    https://kite.com/python/docs/nltk.word_tokenize

    Tokenization:

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download('punkt')

    sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."
    print(word_tokenize(sentence))

    # output:
    ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', ',', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

    Update 2020-09-18

    1. Basic functions

    https://www.jianshu.com/p/9d232e4a3c28

    • Sentence segmentation
    • Word tokenization
    • Part-of-speech tagging
    • Named entity recognition

    Sentence segmentation and tokenization:

    # sentence segmentation
    sents = nltk.sent_tokenize("And now for something completely different. I love you.")  # simple sentences like these are handled well
    word = []
    for s in sents:
        print(s)
    # tokenize within each sentence
    for sent in sents:
        word.append(nltk.word_tokenize(sent))
    print(word)

    Part-of-speech tagging:

    nltk.download('averaged_perceptron_tagger')
    s = "And now for something completely different."
    text = nltk.word_tokenize(s)
    print(text)
    # POS tagging
    tagged = nltk.pos_tag(text)  # the input must already be tokenized; on a raw string, each character would be treated as a token
    
    tagged
    [('And', 'CC'),
     ('now', 'RB'),
     ('for', 'IN'),
     ('something', 'NN'),
     ('completely', 'RB'),
     ('different', 'JJ'),
     ('.', '.')]

    Chunking:

    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    entities = nltk.chunk.ne_chunk(tagged)  # the input must be POS-tagged; ne_chunk then groups the tagged tokens into chunks
    print(entities)

    Output:

    Tree('S', [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ'), ('.', '.')])

    2. FreqDist: word frequency distributions over a text

    from nltk import FreqDist  # FreqDist actually lives in nltk.probability; importing it via nltk.book also works but loads all the book corpora
    
    text1 = nltk.word_tokenize("And now for something completely different. I love you. This is my friend. You are my friend.")
    
    # FreqDist() builds the frequency distribution of every token in the text
    fdist = FreqDist(text1)
    print(fdist)
    # total number of tokens (outcomes)
    print(fdist.N())
    # number of distinct tokens (bins)
    print(fdist.B())
    
    >>><FreqDist with 16 samples and 21 outcomes>
    21
    16
    # relative frequency of a token
    print(fdist.freq('friend') * 100)
    # raw count of a token
    print(fdist['friend'])
    # the most frequent token
    fdist.max()
    
    >>>9.523809523809524
    2
    '.'

    The original source goes on to show code for stemming an entire article; I read through it, but since I don't need it at the moment I won't paste it here.

    3. The Text and TextCollection classes

    The former analyzes a single text; the latter is a collection of Text objects and can compute statistics such as a term's inverse document frequency (IDF).

  • Original post: https://www.cnblogs.com/BlueBlueSea/p/13154590.html