zoukankan      html  css  js  c++  java
  • genism单词检测Phrases

    转自:https://radimrehurek.com/gensim/models/phrases.html

    1.gensim.models.phrases.Phraser 

     //上面链接中的样例代码复制不行,绝了。。。

    >>> from gensim.test.utils import datapath
    >>> from gensim.models.word2vec import Text8Corpus
    >>> from gensim.models.phrases import Phrases, Phraser
    >>>
    >>> sentences = Text8Corpus(datapath('testcorpus.txt'))#首先加载作为分词的训练集
    >>> phrases = Phrases(sentences, min_count=1, threshold=1)#
    >>>
    >>> bigram = Phraser(phrases)#这样训练出一个二元分词器
    >>> sent = [u'trees', u'graph', u'minors']#在这个句子上测试
    >>> print(bigram[sent])
    [u'trees_graph', u'minors']#前两个word组合为了词组

    其中参数min_count/threshold,越高那么词组越少,将单词组合成双字母组的难度就越大。

    下面是三词组:

    https://blog.csdn.net/weixin_33826609/article/details/91472435

    # Build the bigram and trigram models
    bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
    trigram = gensim.models.Phrases(bigram[data_words], threshold=100) #以双词组作为训练数据 
     
    # Faster way to get a sentence clubbed as a trigram/bigram
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)
     
    # See trigram example
    print(trigram_mod[bigram_mod[data_words[0]]])
  • 相关阅读:
    java学习的第三天
    java学习的第二天
    java学习的第一天
    兼容性测试2
    兼容性测试
    安全性测试
    界面测试
    功能性测试
    简历小技巧
    day13
  • 原文地址:https://www.cnblogs.com/BlueBlueSea/p/13154784.html
Copyright © 2011-2022 走看看