  • TF-IDF computation

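    Computation details: see the Zhihu article "sklearn-TfidfVectorizer彻底说清楚" (roughly, "sklearn TfidfVectorizer explained thoroughly").

    1. Compute the TF-IDF values from the training corpus.

    2. Compute the TF-IDF value of each word in the test sentence (a word in the test sentence gets a TF-IDF value only if it appears in the dictionary built from the training corpus).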
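
    Both steps can be traced by hand without gensim. The sketch below is an illustration, not gensim's implementation. It assumes gensim's default weighting (raw term count times log2(N/df), followed by L2 normalisation), and the token lists and token ids are copied from the jieba and Dictionary output shown later in this post. The helper names doc2bow and tfidf are made up for this example.

```python
import math
from collections import Counter

# Tokenized training sentences, as jieba splits them in this post.
train_tokens = [
    ['我爱你'],
    ['我', '喜欢', '他', '和', '他', '喜欢', '我'],
    ['他', '说', '今天', '空气', '很', '清新'],
]
# Token ids copied from the Dictionary printed in this post.
token2id = {'我爱你': 0, '他': 1, '和': 2, '喜欢': 3, '我': 4,
            '今天': 5, '很': 6, '清新': 7, '空气': 8, '说': 9}


def doc2bow(tokens):
    """Count tokens by id; tokens missing from the dictionary are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def tfidf(bow, doc_freq, num_docs):
    """Assumed default weighting: tf * log2(N / df), then L2-normalised."""
    weights = [(i, tf * math.log2(num_docs / doc_freq[i])) for i, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(i, w / norm) for i, w in weights]


# Step 1: document frequencies come from the training corpus only.
train_bows = [doc2bow(tokens) for tokens in train_tokens]
doc_freq = Counter(i for bow in train_bows for i, _ in bow)

# Step 2: the test sentence; '爱', '你们' and ',' are not in the dictionary.
test_tokens = ['我', '爱', '你们', ',', '我', '喜欢', '他']
test_bow = doc2bow(test_tokens)
test_tfidf = tfidf(test_bow, doc_freq, len(train_bows))
print(test_bow)
print(test_tfidf)
```

    Under these assumptions the printed values should agree, up to floating-point rounding, with test_word_freq_count and test_tfidf in the gensim output later in this post.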

    import jieba
    from gensim import corpora, similarities, models
    sentences = ['我爱你', '我喜欢他和他喜欢我', '他说今天空气很清新']
    test_sent = '我爱你们,我喜欢他'
    text = [[word for word in jieba.cut(sentence)] for sentence in sentences]  # 1. Tokenize each sentence
    dictionary = corpora.Dictionary(text)  # 2. Assign an integer id to each token, giving the id-to-token dictionary
    print('dictionary=', dictionary)
    for idx,word in dictionary.items():
        print(idx, word,end="	")
    print()
    corpus = [dictionary.doc2bow(word_list) for word_list in text]  # 3. Count token frequencies in each sentence, giving the bag-of-words corpus
    print("[dictionary.doc2bow(word_list) for word_list in text]")
    for word_list in text:
        print('	',word_list, end="	")
        print(dictionary.doc2bow(word_list))
        
    model = models.TfidfModel(corpus)  # 4. Fit the TF-IDF model on corpus; the model stores the IDF weights learned from it
    tfidf = model[corpus]  # Lazily transformed corpus: the TF-IDF value of each token in each sentence
    print('tfidf=',tfidf)
    for ele in tfidf:  
        print('	',ele)
    
    similarity = similarities.MatrixSimilarity(tfidf)  # Similarity index; its input is the TF-IDF corpus
    print('similarity=', similarity)
    for ele in similarity:
        print('	',ele)
        
    test_word_list = [word for word in jieba.cut(test_sent)]
    print('test_word_list=',test_word_list)
    test_word_freq_count = dictionary.doc2bow(test_word_list)
    print('test_word_freq_count=', test_word_freq_count)  # The dictionary comes from the training data, so only some tokens of the test sentence are counted
    test_tfidf = model[test_word_freq_count]
    print('test_tfidf=', test_tfidf)
    
    sim = similarity[test_tfidf]  # Similarity to every training sentence; there are three, so len(sim) == 3
    print("sim=",sim,sim.dtype)
    max_sim = max(sim)
    print('max_sim=', max_sim, end='	')
    max_index = list(sim).index(max_sim)
    print('max_index=', max_index)
    # Output
    dictionary= Dictionary(10 unique tokens: ['我爱你', '他', '和', '喜欢', '我']...)
    0 我爱你    1 他    2 和    3 喜欢    4 我    5 今天    6 很    7 清新    8 空气    9 说    
    [dictionary.doc2bow(word_list) for word_list in text]
         ['我爱你']    [(0, 1)]
         ['我', '喜欢', '他', '和', '他', '喜欢', '我']    [(1, 2), (2, 1), (3, 2), (4, 2)]
         ['他', '说', '今天', '空气', '很', '清新']    [(1, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
    tfidf= <gensim.interfaces.TransformedCorpus object at 0x000001DD8C472648>
         [(0, 1.0)]
         [(1, 0.23892106670040594), (2, 0.323679663983242), (3, 0.647359327966484), (4, 0.647359327966484)]
         [(1, 0.16284991207632715), (5, 0.44124367556640004), (6, 0.44124367556640004), (7, 0.44124367556640004), (8, 0.44124367556640004), (9, 0.44124367556640004)]
    similarity= MatrixSimilarity<3 docs, 10 features>
         [1. 0. 0.]
         [0.         0.99999994 0.03890828]
         [0.         0.03890828 1.        ]
    test_word_list= ['我', '爱', '你们', ',', '我', '喜欢', '他']
    test_word_freq_count= [(1, 1), (3, 1), (4, 2)]
    test_tfidf= [(1, 0.16284991207632712), (3, 0.44124367556640004), (4, 0.8824873511328001)]
    sim= [0.        0.8958379 0.0265201] float32
    max_sim= 0.8958379    max_index= 1

    As the output shows, the test sentence is most similar to the training sentence at index 1 ('我喜欢他和他喜欢我').
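
    The score 0.8958379 can be checked by hand. MatrixSimilarity computes cosine similarity, and because the TF-IDF vectors here are already unit length, that reduces to a dot product over the token ids the two vectors share. A minimal sketch using the values printed above (sparse_dot is a hypothetical helper, not a gensim function):

```python
# TF-IDF vectors copied from the output above, as (token_id, weight) pairs.
train_tfidf = [
    [(0, 1.0)],
    [(1, 0.23892106670040594), (2, 0.323679663983242),
     (3, 0.647359327966484), (4, 0.647359327966484)],
    [(1, 0.16284991207632715), (5, 0.44124367556640004),
     (6, 0.44124367556640004), (7, 0.44124367556640004),
     (8, 0.44124367556640004), (9, 0.44124367556640004)],
]
test_tfidf = [(1, 0.16284991207632712), (3, 0.44124367556640004),
              (4, 0.8824873511328001)]


def sparse_dot(a, b):
    """Dot product of two sparse vectors given as (id, weight) lists."""
    b_weights = dict(b)
    return sum(w * b_weights.get(i, 0.0) for i, w in a)


# Vectors are unit length, so the dot product is the cosine similarity.
sim = [sparse_dot(test_tfidf, doc) for doc in train_tfidf]
max_index = max(range(len(sim)), key=sim.__getitem__)
print(sim, max_index)
```

    The first training sentence shares no token id with the test sentence, so its score is exactly 0; the third shares only '他', which has a low weight in both vectors.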

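    How TF-IDF represents a sentence:

    Suppose a sentence has n words. Each word gets one TF-IDF value, i.e. each word is represented by a scalar, so the sentence has dimension 1*n. (In gensim, n counts the distinct words that are in the dictionary: test_tfidf above has 3 entries.)

    With an embedding representation, each word is represented by an m-dimensional vector, so the sentence has dimension m*n.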
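
    The difference in shape can be made concrete with plain lists. This is only a sketch: the TF-IDF values are the test_tfidf output from above, while the embedding size m = 4 and the random embedding vectors are made-up placeholders, not the output of any trained model.

```python
import random

# TF-IDF: one scalar per distinct in-dictionary word, so n values in total.
test_tfidf = [(1, 0.16284991207632712), (3, 0.44124367556640004),
              (4, 0.8824873511328001)]
tfidf_sentence = [weight for _, weight in test_tfidf]
n = len(tfidf_sentence)

# Embedding: one m-dimensional vector per word, so n * m values in total.
m = 4  # hypothetical embedding size
rng = random.Random(0)
embedding_table = {token_id: [rng.uniform(-1, 1) for _ in range(m)]
                   for token_id, _ in test_tfidf}  # placeholder vectors
embedding_sentence = [embedding_table[token_id] for token_id, _ in test_tfidf]

print('tfidf shape:', (1, n))
print('embedding shape:', (len(embedding_sentence), len(embedding_sentence[0])))
```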

    Saving and loading the models:

    Save the dictionary:

    dictionary.save(DICT_PATH)

    Save the TF-IDF model:

    model.save(MODEL_PATH)

    Save the similarity index:

    similarity.save(SIMILARITY_PATH)

    Load the dictionary:

    dictionary = corpora.Dictionary.load('require_files/dictionary.dict')

    Load the TF-IDF model:

    tfidf = models.TfidfModel.load("require_files/my_model.tfidf")

    Load the similarity index:

    index = similarities.MatrixSimilarity.load('require_files/similarities.0')
    ————————————————
    Reference: https://blog.csdn.net/qq_33908388/article/details/94554309

  • Original article: https://www.cnblogs.com/sunupo/p/12942540.html