  • Text Similarity and Classification

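    Text similarity

    • Measures how similar two texts are
    • Uses word frequencies as the text features
    • A word's frequency is the number of times it appears in the text
    • NLTK can compute these word counts (FreqDist)

    Text similarity example: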

    import nltk
    from nltk import FreqDist
    
    text1 = 'I like the movie so much '
    text2 = 'That is a good movie '
    text3 = 'This is a great one '
    text4 = 'That is a really bad movie '
    text5 = 'This is a terrible movie'
    
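    # Concatenate the texts (each one ends with a space, so words do not run together)
    text = text1 + text2 + text3 + text4 + text5
    # word_tokenize requires NLTK's Punkt tokenizer data to be downloaded beforehand
    words = nltk.word_tokenize(text)
    freq_dist = FreqDist(words)
    print(freq_dist['is'])
    # Output:
    # 4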
    
    
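    # Take the n = 5 most common words
    n = 5
    # Build the "common word list"
    most_common_words = freq_dist.most_common(n)
    print(most_common_words)
    # Output (on Python 3.7+, words with equal counts keep first-seen order):
    # [('movie', 4), ('is', 4), ('a', 4), ('That', 2), ('This', 2)]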
    
    
    
    def lookup_pos(most_common_words):
        """
            Map each common word to its index in the list
        """
        result = {}
        # Each entry is a (word, count) pair; only the word is needed here
        for pos, (word, _count) in enumerate(most_common_words):
            result[word] = pos
        return result
    
    # Record the positions
    std_pos_dict = lookup_pos(most_common_words)
    print(std_pos_dict)
    # Output:
    # {'movie': 0, 'is': 1, 'a': 2, 'That': 3, 'This': 4}
    
    
    # New text
    new_text = 'That one is a good movie. This is so good!'
    # Initialize the frequency vector
    freq_vec = [0] * n
    # Tokenize
    new_words = nltk.word_tokenize(new_text)
    
    # Count word frequencies over the "common word list"
    for new_word in new_words:
        if new_word in std_pos_dict:
            freq_vec[std_pos_dict[new_word]] += 1
    
    print(freq_vec)
    # Output:
    # [1, 2, 1, 1, 1]
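
    The example above turns a text into a frequency vector but stops short of comparing two texts. One common way to get a similarity score from such vectors is cosine similarity. The sketch below is an illustration rather than part of the original example: the helper names freq_vector and cosine_similarity are made up here, and plain whitespace splitting stands in for nltk.word_tokenize so the snippet runs without NLTK data.

```python
import math
from collections import Counter

def freq_vector(text, vocab):
    """Count how often each vocabulary word occurs in text (whitespace tokens)."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector shares no vocabulary words
    return dot / (norm_u * norm_v)

# Assumed vocabulary: the five most common words from the example
vocab = ['movie', 'is', 'a', 'That', 'This']

vec_a = freq_vector('That is a good movie', vocab)
vec_b = freq_vector('That is a really bad movie', vocab)
vec_c = freq_vector('I like the movie so much', vocab)

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

    With this five-word vocabulary the "good movie" and "bad movie" sentences map to the same vector, because neither "good" nor "bad" is in the vocabulary. Raw counts of the most common words capture shared wording, not meaning, which is one reason to weight words with TF-IDF.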

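    Text classification

    TF-IDF (term frequency-inverse document frequency)

    • TF, Term Frequency: how often a word appears in a given document (in the example below, its count divided by the document's total word count)

    • IDF, Inverse Document Frequency: measures how informative a word is across the whole collection; words that appear in many documents get a low value.

    • TF-IDF = TF * IDF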

    • Worked example:

    A document of 100 words contains the word cat 3 times, so TF = 3/100 = 0.03

    The collection holds 10,000,000 documents, and cat appears in 1,000 of them, so IDF = log10(10,000,000/1,000) = 4

    TF-IDF = TF * IDF = 0.03 * 4 = 0.12
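
    The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation as stated; the variable names are illustrative, and a base-10 logarithm is assumed because that is what makes the IDF come out to 4.

```python
import math

# Numbers taken from the worked example
term_count = 3            # occurrences of "cat" in the document
doc_length = 100          # total words in the document
total_docs = 10_000_000   # documents in the collection
docs_with_term = 1_000    # documents that contain "cat"

tf = term_count / doc_length
idf = math.log10(total_docs / docs_with_term)  # base-10 log, as assumed above
tf_idf = tf * idf

print(tf, idf, round(tf_idf, 2))
```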

    • TF-IDF in NLTK

    TextCollection.tf_idf()

    Example:

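    import nltk
    from nltk.text import TextCollection
    
    text1 = 'I like the movie so much '
    text2 = 'That is a good movie '
    text3 = 'This is a great one '
    text4 = 'That is a really bad movie '
    text5 = 'This is a terrible movie'
    
    # Build the TextCollection from token lists; with raw strings,
    # tf_idf() would count substrings and characters instead of words
    tc = TextCollection([nltk.word_tokenize(t)
                         for t in (text1, text2, text3, text4, text5)])
    new_text = 'That one is a good movie. This is so good!'
    new_words = nltk.word_tokenize(new_text)
    word = 'That'
    # tf = 1/12 tokens; idf = ln(5/2), since 2 of the 5 texts contain 'That'
    tf_idf_val = tc.tf_idf(word, new_words)
    print('TF-IDF of {}: {:.4f}'.format(word, tf_idf_val))

    Output:

    TF-IDF of That: 0.0764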
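
    To make the formula behind tf_idf() concrete, here is a small pure-Python sketch that operates on token lists. It should mirror what TextCollection does as far as I can tell from NLTK's source: tf is the term's count divided by the text length, and idf is the natural logarithm of the number of texts over the number of texts containing the term. Note that this is a natural log, not the base-10 log of the worked example. The functions tokenize, tf, idf and tf_idf are written for this illustration, and the regex tokenizer is only a rough stand-in for nltk.word_tokenize.

```python
import math
import re

def tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def tf(term, tokens):
    """Term frequency: occurrences of term divided by the number of tokens."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency with a natural log; 0.0 if no text has the term."""
    matches = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / matches) if matches else 0.0

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

corpus = [tokenize(t) for t in [
    'I like the movie so much',
    'That is a good movie',
    'This is a great one',
    'That is a really bad movie',
    'This is a terrible movie',
]]
new_tokens = tokenize('That one is a good movie. This is so good!')

score_that = tf_idf('That', new_tokens, corpus)
score_good = tf_idf('good', new_tokens, corpus)
print(round(score_that, 4), round(score_good, 4))
```

    In this toy corpus "good" scores higher than "That": it occurs twice in the new text and in only one of the five corpus texts, so both its tf and its idf are larger.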
  • Original article: https://www.cnblogs.com/alexzhang92/p/9794420.html