  • Inverted Index

    Inverted index

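    Also commonly called a reverse index, postings file, or inverted file, it is an indexing method used in full-text search that stores a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.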

    For example:

    Below are the texts to be indexed:

    T0 = "it is what it is"

    T1 = "what is it"

    T2 = "it is a banana"

    The resulting inverted index can be written as follows:

    "a" = {(2,2)}

    "banana" = {(2,3)}

    "is" = {(0,1),(0,4),(1,1),(2,1)}

    "it" = {(0,0),(0,3),(1,2),(2,0)}

    "what" = {(0,2),(1,0)}

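    This is a full inverted index, made up of (document number, position of the word within that document) pairs.

    Both the document number and the word position are counted from zero.

    So "banana": {(2,3)} means that banana is the fourth word of the third document.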
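    The index above can be reproduced with a few lines of Python. This is a minimal sketch, assuming simple whitespace tokenization; the function name `build_index` is hypothetical and does not come from the original post.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_no, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            # Both doc_no and pos are zero-based, as in the example above.
            index[word].append((doc_no, pos))
    return dict(index)


docs = ["it is what it is", "what is it", "it is a banana"]
index = build_index(docs)
for word in sorted(index):
    print(word, index[word])
```

    Looking up `index["banana"]` then returns the single pair for document 2, position 3.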

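    =====Example:

    DATA: stores the forward index (line number -> line text).

    word_index: stores the inverted index. Each space-separated word is a key,

          and the value is a list; list.append adds, in order, each position at which the word occurs in the text file.

          A word position is written as "<index within the line>.<line number>".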
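    The position encoding can be sketched in isolation as follows. The helper names `encode_position` and `decode_position` are hypothetical, introduced only for illustration; the script itself builds and splits these strings inline.

```python
def encode_position(word_pos, line_no):
    """Join the word's index within its line and the line number with a dot."""
    return ".".join([str(word_pos), str(line_no)])


def decode_position(entry):
    """Split an encoded entry back into (word_pos, line_no) integers."""
    word_pos, line_no = entry.split(".")
    return int(word_pos), int(line_no)


entry = encode_position(3, 2)  # fourth word of the third line
print(entry)
print(decode_position(entry))
```

    Storing (word_pos, line_no) tuples directly would avoid the string parsing; the dotted string is simply the form this script uses.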

    #coding:utf-8
    import sys

    DATA = {}        # forward index: line_no -> line text
    word_index = {}  # inverted index: word -> list of "word_pos.line_no" strings

    # Look up each query word in the inverted index and print the matching lines
    def check_index(sentense):
        query = sentense.split(' ')
        for v in query:
            if v in word_index:
                for index_lineno in word_index[v]:  # e.g. ['0.0', '3.0', '2.1']
                    # the part after the dot is the line number
                    print DATA[int(index_lineno.split('.')[1])]


    if __name__ == "__main__":
        # build the inverted index from stdin
        line_num = 0
        for line in sys.stdin:
            line = line.strip(' \n')
            fields = line.split(' ')

            DATA[line_num] = line

            for i, val in enumerate(fields):
                if val not in word_index:
                    word_index[val] = []
                word_index[val].append(".".join(
                    [str(i), str(line_num)]))
            line_num += 1

        print word_index
        print DATA

        print "=====test query"
        queries = "it is example"
        # Python 2 prints this as a tuple; use "..." % queries to interpolate
        print ("####input search sentense:%s",queries)
        print "####search result is :"
        check_index(queries)
        print "done=========="
        sys.exit(0)

    =====

    input.data text file:

    it is what it is
    what is it
    it is a banana
    from your second example
    When I run the algo using some sample
    What am I doing wrong ?

    ======Output:

    {'What': ['0.5'], 'doing': ['3.5'], 'is': ['1.0', '4.0', '1.1', '1.2'], 
    'some': ['6.4'], 'it': ['0.0', '3.0', '2.1', '0.2'], 'sample': ['7.4'],
    'second': ['2.3'], 'your': ['1.3'], 'what': ['2.0', '0.1'], 'from': ['0.3'],
    'banana': ['3.2'], '?': ['5.5'], 'run': ['2.4'], 'I': ['1.4', '2.5'],
    'When': ['0.4'], 'wrong': ['4.5'], 'using': ['5.4'], 'a': ['2.2'],
    'am': ['1.5'], 'algo': ['4.4'], 'the': ['3.4'], 'example': ['3.3']}
    {0: 'it is what it is', 1: 'what is it', 2: 'it is a banana',
      3: 'from your second example', 4: 'When I run the algo using some sample',
      5: 'What am I doing wrong ?'}
    =====test query
    ('####input search sentense:%s', 'it is example')
    ####search result is :
    it is what it is
    it is what it is
    what is it
    it is a banana
    it is what it is
    it is what it is
    what is it
    it is a banana
    from your second example
    done==========
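    Because the query loop prints one line per occurrence, a line that contains a query word more than once, or contains several query words, is printed repeatedly. One possible variation, sketched here in Python 3 with tuples instead of dotted strings, collects the matching line numbers in a set so that each line is reported once. The names `build` and `search` are hypothetical and are not part of the original script.

```python
from collections import defaultdict


def build(lines):
    """Return (forward index, inverted index) for a list of text lines."""
    data = {}
    word_index = defaultdict(list)
    for line_no, line in enumerate(lines):
        data[line_no] = line
        for pos, word in enumerate(line.split(" ")):
            word_index[word].append((pos, line_no))
    return data, word_index


def search(query, data, word_index):
    """Return each line containing at least one query word, in line order."""
    hits = set()
    for word in query.split(" "):
        # .get() avoids creating empty entries for unknown words
        for _pos, line_no in word_index.get(word, []):
            hits.add(line_no)
    return [data[n] for n in sorted(hits)]


lines = [
    "it is what it is",
    "what is it",
    "it is a banana",
    "from your second example",
]
data, word_index = build(lines)
for line in search("it is example", data, word_index):
    print(line)
```

    With the same query, "it is example", each of the four sample lines is listed exactly once.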
  • Original article: https://www.cnblogs.com/li-daphne/p/7223472.html