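  • Lucene in Action notes: term vectors. A term vector is a term-frequency vector space built for a specific field. Not storing it does not affect search; its job is to tell us "how" a search result matched, which supports highlighting, similarity computation, and score computation in the VSM model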

    Excerpted from: http://makble.com/what-is-term-vector-in-lucene

    Given a document, a term vector lists all of its terms together with the position information of those terms. The index tells us which documents matched; the term vector tells us how and where they matched. A classic example is search result highlighting: the term vector contains all the information needed to do it. See "How to do Lucene search highlight example" for highlighting a blog post with Lucene 6.0.0 and a Gradle build.

    Another interesting use of term vectors is finding documents similar to a particular document, for example the "related posts" feature of a blog entry, which is a list of links pointing to other documents similar to the current entry. With term vector information we can calculate how similar two documents are using a simple formula.

    Term vectors also play an important role when scoring matching documents in the vector space model.

    A term vector is like a miniature inverted index covering only one document. It answers this kind of query: for a given search term, how many times does it occur in this document, and where does it show up? Or simply: frequencies and positions.

    The term vector is generated during analysis. When the analyzer generates tokens, it also provides position and offset information. You can specify whether to store this information in term vectors (these are the Field.TermVector options of the older Lucene 2.x/3.x API):

    TermVector.YES: Store only the number of occurrences of each term.

    TermVector.WITH_POSITIONS: Store the number of occurrences and the positions of terms, but no offsets.

    TermVector.WITH_OFFSETS: Store the number of occurrences and the offsets of terms, but no positions.

    TermVector.WITH_POSITIONS_OFFSETS: Store the number of occurrences, positions, and offsets of terms.

    TermVector.NO: Don't store any term vector information.

    If this information is not stored, you can also compute it on the fly when searching.
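    To make "frequencies, positions, and offsets" concrete, here is a minimal sketch that computes all three on the fly for a single field value. It uses only the Java standard library; the class `TermVectorSketch`, its `Entry` type, and the simple letter/digit tokenizer are assumptions made for illustration and are not Lucene's analyzer or its storage format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of what a term vector records; not Lucene code.
class TermVectorSketch {

    /** One term's occurrences inside a single field value. */
    static final class Entry {
        final List<Integer> positions = new ArrayList<>(); // token index: 0, 1, 2, ...
        final List<int[]> offsets = new ArrayList<>();     // {startChar, endChar} per occurrence

        int frequency() {
            return positions.size();
        }
    }

    // Assumed tokenizer: runs of ASCII letters or digits, lowercased.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    /** Builds term -> (frequency, positions, offsets) for one field value. */
    static Map<String, Entry> build(String text) {
        Map<String, Entry> vector = new LinkedHashMap<>();
        Matcher m = WORD.matcher(text);
        int position = 0;
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            Entry entry = vector.computeIfAbsent(term, k -> new Entry());
            entry.positions.add(position++);
            entry.offsets.add(new int[] {m.start(), m.end()});
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Entry> tv = build("Lucene in Action: Lucene term vectors");
        Entry lucene = tv.get("lucene");
        System.out.println(lucene.frequency()); // prints 2
        System.out.println(lucene.positions);   // prints [0, 3]
    }
}
```

    In these terms, TermVector.YES corresponds to keeping only `frequency()`, while the WITH_POSITIONS and WITH_OFFSETS variants additionally keep the `positions` and `offsets` lists.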

    Excerpted from: http://blog.csdn.net/fxjtoday/article/details/5142661

    Leveraging term vectors
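    A term vector is a multi-dimensional term-frequency vector built for one text field of a document, such as title or body. Each term is one dimension, and the value along that dimension is the term's frequency in the field.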

    To use term vectors, you must enable the term vector option for the field at indexing time:

    Field options for term vectors
    TermVector.YES – record the unique terms that occurred, and their counts, in each document, but do not store any positions or offsets information.
    TermVector.WITH_POSITIONS – record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
    TermVector.WITH_OFFSETS – record the unique terms and their counts, with the offsets (start & end character position) of each occurrence of every term, but no positions.
    TermVector.WITH_POSITIONS_OFFSETS – store unique terms and their counts, along with positions and offsets.
    TermVector.NO – do not store any term vector information.
    If Index.NO is specified for a field, then you must also specify TermVector.NO.

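    Once indexing is done, given a document id and a field name, we can read the term vector from the IndexReader (provided term vectors were created at indexing time):
    TermFreqVector termFreqVector = reader.getTermFreqVector(id, "subject");
    You can iterate over this TermFreqVector to retrieve each term and its frequency. If you chose to store offsets and positions at indexing time, you can retrieve those here as well.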

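    With term vectors available, we can build some interesting applications:
    1) Books like this
    To compare whether two books are similar, model each book as a document with author and subject fields, then compare the books through these two fields.
    The author field is multi-valued, meaning a book can have several authors, so the first step is to check whether the authors match: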
    // Collect every value of the multi-valued "author" field.
    String[] authors = doc.getValues("author");
    // OR together one TermQuery per author.
    BooleanQuery authorQuery = new BooleanQuery();
    for (String author : authors) {
        authorQuery.add(new TermQuery(new Term("author", author)), BooleanClause.Occur.SHOULD);
    }
    // Weight author matches twice as heavily as the other clauses.
    authorQuery.setBoost(2.0f);
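    Finally, the boost of this query can be raised to mark the condition as important and give it more weight: if the authors are the same, the books are very similar.
    The second step uses the term vector. The usage here is simple: it only checks whether the terms in the term vector of the subject field are the same: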
    // Read the term vector stored for the "subject" field of document id.
    TermFreqVector vector = reader.getTermFreqVector(id, "subject");
    // OR together one TermQuery per distinct subject term.
    BooleanQuery subjectQuery = new BooleanQuery();
    for (String term : vector.getTerms()) {
        subjectQuery.add(new TermQuery(new Term("subject", term)), BooleanClause.Occur.SHOULD);
    }
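    The effect of OR-ing author clauses (boosted by 2.0) with subject-term clauses can be approximated with a toy overlap score. This is only a rough sketch of the idea in plain Java: the `BooksLikeThisSketch` class, its `Book` type, and the scoring rule "2.0 per shared author plus 1.0 per shared subject term" are assumptions for illustration, and Lucene's real scoring is considerably more involved.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the two BooleanQuery steps; not Lucene's scoring formula.
class BooksLikeThisSketch {

    // Mirrors authorQuery.setBoost(2.0f) from the snippet above.
    static final double AUTHOR_BOOST = 2.0;

    static final class Book {
        final Set<String> authors;
        final Set<String> subjectTerms;

        Book(Set<String> authors, Set<String> subjectTerms) {
            this.authors = authors;
            this.subjectTerms = subjectTerms;
        }
    }

    static Set<String> setOf(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    /** Number of elements the two sets have in common. */
    static int overlap(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return shared.size();
    }

    /** Each shared author counts AUTHOR_BOOST, each shared subject term counts 1. */
    static double score(Book target, Book candidate) {
        return AUTHOR_BOOST * overlap(target.authors, candidate.authors)
                + overlap(target.subjectTerms, candidate.subjectTerms);
    }

    public static void main(String[] args) {
        Book target = new Book(setOf("erik", "otis"), setOf("lucene", "search", "java"));
        Book candidate = new Book(setOf("otis"), setOf("search", "java", "tools"));
        System.out.println(score(target, candidate)); // prints 4.0
    }
}
```

    Ranking candidate books by this score gives the same qualitative behaviour as the combined query: a shared author moves a book up more than a single shared subject term.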

    2) What category?
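    This example is a bit more advanced than the previous one: how to classify documents. Again we work with the subject field of a document, for which we have a term vector.
    For two documents, we can compare the angle between their term vectors in the vector space; the smaller the angle, the more similar the two documents are.
    Since this is classification, there is a training step: we must build a term vector for each category to serve as the reference that other documents are compared against.
    Here the term vector is implemented as a map of (term, frequency) pairs, where n such entries represent n dimensions. We need to generate one term vector per category, and a category can be linked to its term vector with another map. The category's term vector is created as follows:
    Iterate over every document in the category, take the document's term vector, and add it to the category's term vector.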
    // Adds one document's term frequencies onto the category's running totals.
    private void addTermFreqToMap(Map<String, Integer> vectorMap, TermFreqVector termFreqVector) {
        String[] terms = termFreqVector.getTerms();
        int[] freqs = termFreqVector.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            // Store the frequency if the term is new, otherwise add it to the existing count.
            vectorMap.merge(terms[i], freqs[i], Integer::sum);
        }
    }
    First take the lists of terms and frequencies from the document's term vector, then look up each term in the category's term vector and add the document's term frequency to it. That is all.
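    The "map from category to term vector" described above can be sketched without Lucene as follows. The class `CategoryVectorSketch` and its method names are hypothetical, and the parallel `terms`/`freqs` arrays simply stand in for what `getTerms()` and `getTermFrequencies()` would return.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical training-side container: category name -> (term -> summed frequency).
class CategoryVectorSketch {

    private final Map<String, Map<String, Integer>> categoryVectors = new HashMap<>();

    /** Adds one document's term frequencies to the vector of its category. */
    void addDocument(String category, String[] terms, int[] freqs) {
        Map<String, Integer> vector =
                categoryVectors.computeIfAbsent(category, k -> new HashMap<>());
        for (int i = 0; i < terms.length; i++) {
            vector.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    /** Returns the accumulated vector, or an empty map for an unknown category. */
    Map<String, Integer> vectorFor(String category) {
        return categoryVectors.getOrDefault(category, Collections.emptyMap());
    }

    public static void main(String[] args) {
        CategoryVectorSketch trainer = new CategoryVectorSketch();
        trainer.addDocument("technology", new String[] {"lucene", "search"}, new int[] {2, 1});
        trainer.addDocument("technology", new String[] {"search", "index"}, new int[] {3, 1});
        System.out.println(trainer.vectorFor("technology").get("search")); // prints 4
    }
}
```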

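    With a term vector for each category, we can now compute the angle between a document's vector and a category's vector:
    cos θ = (A · B) / (|A| |B|)
    A · B is the dot product: multiply the two vectors dimension by dimension, then sum the results.
    To simplify the computation, assume that a term's frequency in the document takes only two values, 0 or 1, meaning the term is absent or present.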
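    Under the 0/1 simplification the formula reduces nicely: the dot product is the sum of the category frequencies of the terms present in the document, and |A| is the square root of the number of distinct document terms. A minimal sketch, where the class name `CosineSketch` and the choice to return 0 for an empty vector are assumptions:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper: cosine between a binary document vector and a category vector.
class CosineSketch {

    /**
     * @param docTerms       distinct terms of the document (each has weight 1)
     * @param categoryVector term -> summed frequency for the category
     * @return cos of the angle between the two vectors, or 0 if either is empty
     */
    static double cosine(Set<String> docTerms, Map<String, Integer> categoryVector) {
        // A . B: only terms present in the document contribute, each with weight 1.
        double dot = 0.0;
        for (String term : docTerms) {
            dot += categoryVector.getOrDefault(term, 0);
        }
        // |A| = sqrt(number of distinct document terms), since every weight is 1.
        double docNorm = Math.sqrt(docTerms.size());
        // |B| = sqrt(sum of squared category frequencies).
        double categoryNormSquared = 0.0;
        for (int freq : categoryVector.values()) {
            categoryNormSquared += (double) freq * freq;
        }
        double denominator = docNorm * Math.sqrt(categoryNormSquared);
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }
}
```

    To classify a document, compute this value against every category vector and pick the category with the largest cosine, that is, the smallest angle.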

    3) MoreLikeThis

    For finding similar documents, Lucene also provides a more efficient facility, the MoreLikeThis class:

    http://lucene.apache.org/Java/1_9_1/api/org/apache/lucene/search/similar/MoreLikeThis.html

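    With the method above we could compute the cosine for every pair of documents and sort by cosine to find the most similar ones. The biggest problem is the amount of computation, which becomes nearly unacceptable when the number of documents is large. There are specialized techniques that optimize the cosine approach and greatly reduce the computation; that approach is precise, but it has a higher barrier to entry.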

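    The principle behind MoreLikeThis is simple: from a document we extract only its interesting terms (terms with a high tf×idf), then use Lucene to search for documents containing the same terms and treat those as similar documents. The advantage is efficiency; the drawback is lower accuracy. The class exposes many parameters that you can configure to control which terms are selected as interesting.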
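    The selection of interesting terms can be illustrated with a short stand-alone sketch. It is not the MoreLikeThis implementation: the class `InterestingTermsSketch`, its parameters, and the smoothed idf variant `ln(numDocs / (docFreq + 1)) + 1` are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "pick the terms with the highest tf x idf"; not Lucene code.
class InterestingTermsSketch {

    /**
     * @param termFreqs term -> frequency inside the source document (tf)
     * @param docFreqs  term -> number of documents in the collection containing it (df)
     * @param numDocs   total number of documents in the collection
     * @param maxTerms  how many terms to keep
     * @return up to maxTerms terms, highest tf x idf first
     */
    static List<String> interestingTerms(Map<String, Integer> termFreqs,
                                         Map<String, Integer> docFreqs,
                                         int numDocs,
                                         int maxTerms) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            // Assumed idf variant: rare terms get a large weight, common terms about 1.
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            scores.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        // Sort by descending score.
        terms.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        int keep = Math.max(0, Math.min(maxTerms, terms.size()));
        return new ArrayList<>(terms.subList(0, keep));
    }
}
```

    The returned terms would then be OR-ed into a query, in the same way as the subject terms in "Books like this", so only documents sharing the most distinctive terms are retrieved.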

     

  • Original article: https://www.cnblogs.com/bonelee/p/6604370.html