  • MIT 6.006 Lec02: Document Distance

    MIT 6.006 is Introduction to Algorithms. Lec02 covers document distance, which is used, for example, when comparing how similar two documents are and in search engines.

    The computation has three steps:

    1. Split each document into words

    2. Count word frequencies

    3. Compute the dot product (and divide)
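
    The three steps above can be sketched compactly in Python 3. This is a minimal illustration rather than the course code: the function names `split_words`, `count_words`, and `angle`, and the use of a regular expression and `collections.Counter`, are assumptions made here for brevity. It handles ASCII letters and digits only.

```python
import math
import re
from collections import Counter


def split_words(text):
    """Step 1: split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_words(words):
    """Step 2: map each word to the number of times it occurs."""
    return Counter(words)


def angle(freq1, freq2):
    """Step 3: dot product divided by the vector lengths, then arccos."""
    # A Counter returns 0 for words it has not seen.
    dot = sum(count * freq2[word] for word, count in freq1.items())
    norm1 = math.sqrt(sum(c * c for c in freq1.values()))
    norm2 = math.sqrt(sum(c * c for c in freq2.values()))
    # Clamp to 1.0 so rounding error cannot push acos outside its domain.
    return math.acos(min(1.0, dot / (norm1 * norm2)))


doc1 = count_words(split_words("The cat sat on the mat."))
doc2 = count_words(split_words("The dog sat on the log."))
print("%0.6f radians" % angle(doc1, doc2))
```

    The two sample sentences share "the", "sat", and "on", so the angle falls between 0 (identical word frequencies) and pi/2 (no words in common).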

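    Notes:

    1. A "word" is a run of alphanumeric characters (letters and digits)

    2. The word frequencies counted for each document can be treated as a vector

    3. The similarity of two documents is the dot product of their frequency vectors divided by the product of the vectors' lengths, which is the cosine of the angle between the two vectors; the distance is that angle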
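
    Written out, the angle formula referred to in note 3 looks as follows. The symbols are notation introduced here for illustration: $D_1(w)$ and $D_2(w)$ denote the frequency of word $w$ in each document.

```latex
\[
\theta(D_1, D_2) = \arccos\left(
  \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}
\right),
\qquad
D_1 \cdot D_2 = \sum_{w} D_1(w)\, D_2(w),
\qquad
\lVert D \rVert = \sqrt{D \cdot D}
\]
```

    For example, if $D_1 = \{a: 1, b: 1\}$ and $D_2 = \{a: 1, c: 1\}$, then $D_1 \cdot D_2 = 1$, both lengths are $\sqrt{2}$, and $\theta = \arccos(1/2) = \pi/3 \approx 1.047$ radians. Identical documents give an angle of 0, and documents with no words in common give $\pi/2$.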

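    The course materials that can be downloaded for MIT 6.006 include eight progressively improved versions of the code, all of which are essentially the same algorithm. Version 8 is short and concise; I have added some comments to it: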

    #coding:utf8
    #description: compute the document distance
    import sys
    import math
    import string
    
    
    ######################################
    # Step 1: read the file
    ######################################
    def read_file(filename):
        try:
            # The with block closes the file even if reading fails
            with open(filename, 'r') as f:
                return f.read()
        except IOError:
            print "Error opening or reading input file: ", filename
            sys.exit()
    
    
    
    #####################################
    # Step 2: split the text into words
    #####################################
    # Map punctuation to spaces and uppercase letters to lowercase
    translation_table=string.maketrans(string.punctuation+string.uppercase,
                                       " "*len(string.punctuation)+string.lowercase)
    
    def get_words_from_line_list(text):
        """Find all the words in the given text and return them as a list"""
        text = text.translate(translation_table)
        word_list = text.split()
        return word_list
    
    
    
    ######################################
    # Step 3: count word frequencies
    ######################################
    def count_frequency(word_list):
        D = {}
        for new_word in word_list:
            if new_word in D:
                D[new_word] = D[new_word] + 1
            else:
                D[new_word] = 1
        return D
    
    
    def word_frequencies_for_file(filename):
        """Return a dict mapping each word in the file to its frequency"""
        text = read_file(filename)
        word_list = get_words_from_line_list(text)
        freq_mapping = count_frequency(word_list)
        return freq_mapping
    
    
    
    def inner_product(D1, D2):
        """Return the dot product of two word-frequency dicts"""
        total = 0.0
        for key in D1:
            if key in D2:
                total += D1[key] * D2[key]
        return total
    
    
    def vector_angle(D1, D2):
        """Return the angle between two vectors, in radians"""
        numerator = inner_product(D1, D2)
        denominator = math.sqrt(inner_product(D1,D1)*inner_product(D2,D2))
        return math.acos(numerator/denominator)
    
    
    def main():
        if len(sys.argv) != 3:
            print "Usage: docdist.py filename_1 filename_2"
        else:
            filename_1 = sys.argv[1]
            filename_2 = sys.argv[2]
            freq_mapping_1 = word_frequencies_for_file(filename_1)
            freq_mapping_2 = word_frequencies_for_file(filename_2)
            distance = vector_angle(freq_mapping_1, freq_mapping_2)
            print "The distance between the documents is: %0.6f (radians)"%distance
    
    
    if __name__ == '__main__':
        main()
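
    The listing above is Python 2 (`print` statements, `string.maketrans`, `string.uppercase`). As a hedged sketch of how the word-splitting step might be ported to Python 3, the translation table can be built with `str.maketrans` and the `string.ascii_uppercase` / `string.ascii_lowercase` constants. The name `get_words` is hypothetical and not part of the course code.

```python
import string

# Map every ASCII punctuation character to a space and every ASCII
# uppercase letter to its lowercase form, as in the listing above.
translation_table = str.maketrans(
    string.punctuation + string.ascii_uppercase,
    " " * len(string.punctuation) + string.ascii_lowercase,
)


def get_words(text):
    """Return the list of lowercase words found in text."""
    return text.translate(translation_table).split()


print(get_words("Hello, World! It's 2013."))
```

    Note that the apostrophe counts as punctuation, so "It's" is split into the two words `it` and `s`.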
    

    The lecture notes for Lec02 are here

  • Original post: https://www.cnblogs.com/zjutzz/p/3270022.html