  • Lucene Learning Summary, Part 9: Lucene's Query Objects (3)

    6. FilteredQuery

    FilteredQuery has two member variables:

    • Query query: the query object.
    • Filter filter: it has a method DocIdSet getDocIdSet(IndexReader reader) that returns a set of document IDs; every result document must come from this set. Note that the document IDs the filter contains are not the ones to be filtered out, but the ones to keep after filtering.

    The result set of FilteredQuery is the same as taking the AND of the two, except that when scoring, FilteredQuery considers only the query part, not the filter part.

    There are many kinds of Filter, as follows:

    6.1 TermsFilter

    It has one member variable, Set<Term> terms = new TreeSet<Term>(); every document containing any term in the terms set belongs to the document ID set.

    Its getDocIdSet method is as follows:

      public DocIdSet getDocIdSet(IndexReader reader) throws IOException
      {
          // Create a bitset sized to the total number of documents in the index.
          OpenBitSet result = new OpenBitSet(reader.maxDoc());
          TermDocs td = reader.termDocs();
          try
          {
              // Walk each term's posting list and set its doc IDs in the bitset,
              // so the bitset ends up containing all matching documents.
              for (Iterator<Term> iter = terms.iterator(); iter.hasNext();)
              {
                  Term term = iter.next();
                  td.seek(term);
                  while (td.next())
                  {
                      result.set(td.doc());
                  }
              }
          }
          finally
          {
              td.close();
          }
          return result;
      }
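    The loop above effectively takes the union of the posting lists of all the filter's terms. A plain java.util.BitSet illustration of that union, with made-up posting lists (a conceptual sketch, not Lucene code):

```java
import java.util.BitSet;

public class TermsFilterSketch {
    // ORs every term's doc-ID list into one bitset, as getDocIdSet does.
    static BitSet union(int[][] postings, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        for (int[] termDocs : postings)
            for (int doc : termDocs)
                result.set(doc);
        return result;
    }

    public static void main(String[] args) {
        // Made-up posting lists: the doc IDs containing each term.
        int[][] postings = { {0, 3}, {3, 5}, {6} };
        System.out.println(union(postings, 8)); // {0, 3, 5, 6}
    }
}
```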

    6.2 BooleanFilter

    Like BooleanQuery, it holds should filters, must filters, and not filters. In getDocIdSet, it first ORs together the document ID sets of all the should filters, then applies ANDNOT with the not filters' sets, and finally ANDs with the must filters' sets to obtain the final document set.

    Its getDocIdSet method is as follows:

    public DocIdSet getDocIdSet(IndexReader reader) throws IOException
    {
      OpenBitSetDISI res = null;
      if (shouldFilters != null) {
        for (int i = 0; i < shouldFilters.size(); i++) {
          if (res == null) {
            res = new OpenBitSetDISI(getDISI(shouldFilters, i, reader), reader.maxDoc());
          } else {
            // OR the doc IDs of each should filter into the bitset.
            DocIdSet dis = shouldFilters.get(i).getDocIdSet(reader);
            if (dis instanceof OpenBitSet) {
              res.or((OpenBitSet) dis);
            } else {
              res.inPlaceOr(getDISI(shouldFilters, i, reader));
            }
          }
        }
      }
      if (notFilters != null) {
        for (int i = 0; i < notFilters.size(); i++) {
          if (res == null) {
            res = new OpenBitSetDISI(getDISI(notFilters, i, reader), reader.maxDoc());
            res.flip(0, reader.maxDoc());
          } else {
            // ANDNOT the doc IDs of each not filter out of the bitset.
            DocIdSet dis = notFilters.get(i).getDocIdSet(reader);
            if (dis instanceof OpenBitSet) {
              res.andNot((OpenBitSet) dis);
            } else {
              res.inPlaceNot(getDISI(notFilters, i, reader));
            }
          }
        }
      }
      if (mustFilters != null) {
        for (int i = 0; i < mustFilters.size(); i++) {
          if (res == null) {
            res = new OpenBitSetDISI(getDISI(mustFilters, i, reader), reader.maxDoc());
          } else {
            // AND the doc IDs of each must filter into the bitset.
            DocIdSet dis = mustFilters.get(i).getDocIdSet(reader);
            if (dis instanceof OpenBitSet) {
              res.and((OpenBitSet) dis);
            } else {
              res.inPlaceAnd(getDISI(mustFilters, i, reader));
            }
          }
        }
      }
      if (res != null)
        return finalResult(res, reader.maxDoc());
      return DocIdSet.EMPTY_DOCIDSET;
    }
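    The three phases can be illustrated with plain java.util.BitSet on made-up doc-ID sets (a conceptual sketch, not Lucene code): OR the shoulds, ANDNOT the nots, AND the musts:

```java
import java.util.BitSet;

public class BooleanFilterSketch {
    // Combines pre-computed doc-ID sets the way BooleanFilter does.
    static BitSet combine(BitSet[] shoulds, BitSet[] nots, BitSet[] musts, int maxDoc) {
        BitSet res = new BitSet(maxDoc);
        for (BitSet s : shoulds) res.or(s);     // OR all the shoulds...
        for (BitSet n : nots) res.andNot(n);    // ...remove the nots...
        for (BitSet m : musts) res.and(m);      // ...intersect with the musts.
        return res;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(8); a.set(0); a.set(1); a.set(2);
        BitSet b = new BitSet(8); b.set(2); b.set(3);
        BitSet not = new BitSet(8); not.set(1);
        BitSet must = new BitSet(8); must.set(0); must.set(2); must.set(3);
        BitSet res = combine(new BitSet[]{a, b}, new BitSet[]{not}, new BitSet[]{must}, 8);
        System.out.println(res); // {0, 2, 3}
    }
}
```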

    6.3 DuplicateFilter

    DuplicateFilter implements the following functionality:

    Suppose we have a batch of documents, each split into multiple pages, and each document has an id, but each page is indexed as a separate Document. Then at search time, when two pages of the same document both contain the keywords, that document's id appears twice in the result set, which is not what we want. DuplicateFilter designates a field, such as id, and keeps only one document among those sharing the same value in that field.

    DuplicateFilter has the following member variables:

    • String fieldName: the name of the field.
    • int keepMode: KM_USE_FIRST_OCCURRENCE keeps the first of the duplicate documents; KM_USE_LAST_OCCURRENCE keeps the last.
    • int processingMode:
      • PM_FULL_VALIDATION first sets all documents in the bitset to false, then sets the first document of each duplicate group to true.
      • PM_FAST_INVALIDATION first sets all documents in the bitset to true, then clears all but the first document of each duplicate group.
      • When every document contains the designated field, the two behave identically, except that the latter need not process terms with docFreq = 1 and is therefore faster.
      • When some documents lack the designated field, however, the latter has set every bit to true and never gets a chance to clear those documents, so they are allowed through; in practice this situation should be avoided.

    Its getDocIdSet method is as follows:

      public DocIdSet getDocIdSet(IndexReader reader) throws IOException
      {
          if (processingMode == PM_FAST_INVALIDATION)
          {
              return fastBits(reader);
          }
          else
          {
              return correctBits(reader);
          }
      }

      private OpenBitSet correctBits(IndexReader reader) throws IOException
      {
          OpenBitSet bits = new OpenBitSet(reader.maxDoc());
          Term startTerm = new Term(fieldName);
          TermEnum te = reader.terms(startTerm);
          if (te != null)
          {
              Term currTerm = te.term();
              // While the term belongs to the designated field...
              while ((currTerm != null) && (currTerm.field() == startTerm.field()))
              {
                  int lastDoc = -1;
                  // ...fetch all documents containing this term.
                  TermDocs td = reader.termDocs(currTerm);
                  if (td.next())
                  {
                      if (keepMode == KM_USE_FIRST_OCCURRENCE)
                      {
                          // Set the first document to true.
                          bits.set(td.doc());
                      }
                      else
                      {
                          do
                          {
                              lastDoc = td.doc();
                          } while (td.next());
                          bits.set(lastDoc); // Set the last document to true.
                      }
                  }
                  if (!te.next())
                  {
                      break;
                  }
                  currTerm = te.term();
              }
          }
          return bits;
      }

      private OpenBitSet fastBits(IndexReader reader) throws IOException
      {
          OpenBitSet bits = new OpenBitSet(reader.maxDoc());
          bits.set(0, reader.maxDoc()); // Set everything to true.
          Term startTerm = new Term(fieldName);
          TermEnum te = reader.terms(startTerm);
          if (te != null)
          {
              Term currTerm = te.term();
              // While the term belongs to the designated field...
              while ((currTerm != null) && (currTerm.field() == startTerm.field()))
              {
                  if (te.docFreq() > 1)
                  {
                      int lastDoc = -1;
                      // Fetch all documents containing this term.
                      TermDocs td = reader.termDocs(currTerm);
                      td.next();
                      if (keepMode == KM_USE_FIRST_OCCURRENCE)
                      {
                          // Skip the first document so it is not cleared.
                          td.next();
                      }
                      do
                      {
                          lastDoc = td.doc();
                          bits.clear(lastDoc); // Clear all the others.
                      } while (td.next());
                      if (keepMode == KM_USE_LAST_OCCURRENCE)
                      {
                          bits.set(lastDoc); // Set the last document back to true.
                      }
                  }
                  if (!te.next())
                  {
                      break;
                  }
                  currTerm = te.term();
              }
          }
          return bits;
      }
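    The difference between the two modes when some documents lack the field can be sketched with plain java.util.BitSet (a conceptual illustration with made-up duplicate groups, assuming KM_USE_FIRST_OCCURRENCE; not Lucene code):

```java
import java.util.BitSet;

public class DuplicateModesSketch {
    // PM_FULL_VALIDATION: start all-false, set the first doc of each duplicate group.
    static BitSet correctBits(int[][] groups, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int[] g : groups) bits.set(g[0]);
        return bits;
    }

    // PM_FAST_INVALIDATION: start all-true, clear every doc of a group except the first.
    static BitSet fastBits(int[][] groups, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        bits.set(0, maxDoc);
        for (int[] g : groups)
            for (int i = 1; i < g.length; i++) bits.clear(g[i]);
        return bits;
    }

    public static void main(String[] args) {
        // Docs 0,1 share id "1"; docs 2,3 share id "2"; doc 4 has no id field at all.
        int[][] groups = { {0, 1}, {2, 3} };
        System.out.println(correctBits(groups, 5)); // {0, 2}
        System.out.println(fastBits(groups, 5));    // {0, 2, 4} -- doc 4 slips through
    }
}
```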

    As an example, we index the following documents:

    File indexDir = new File("TestDuplicateFilter/index");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", "page 1: hello world", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", "page 2: hello world", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", "page 3: hello world", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", "page 1: hello world", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);

    doc = new Document();
    doc.add(new Field("id", "2", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", "page 2: hello world", Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    If we search with TermQuery tq = new TermQuery(new Term("contents","hello")), the results are:

    id : 1
    id : 1
    id : 1
    id : 2
    id : 2

    If we instead search as follows:

    File indexDir = new File("TestDuplicateFilter/index");
    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    TermQuery tq = new TermQuery(new Term("contents","hello"));
    DuplicateFilter filter = new DuplicateFilter("id");
    FilteredQuery query = new FilteredQuery(tq, filter);
    TopDocs docs = searcher.search(query, 50);
    for (ScoreDoc doc : docs.scoreDocs) {
      Document ldoc = reader.document(doc.doc);
      String id = ldoc.get("id");
      System.out.println("id : " + id);
    }

    then the results are:

    id : 1
    id : 2

    6.4 FieldCacheRangeFilter<T> and FieldCacheTermsFilter

    Before introducing the FieldCache-related filters, let us first introduce FieldCache.

    What FieldCache caches is not the content of stored fields but the term content of indexed fields. Terms in the index are of type String, yet values of other types can be indexed as strings, such as "1" or "2.3", and then retrieved at search time.

    FieldCache supports the following types:

    • byte[] getBytes (IndexReader reader, String field, ByteParser parser)
    • double[] getDoubles(IndexReader reader, String field, DoubleParser parser)
    • float[] getFloats (IndexReader reader, String field, FloatParser parser)
    • int[] getInts (IndexReader reader, String field, IntParser parser)
    • long[] getLongs(IndexReader reader, String field, LongParser parser)
    • short[] getShorts (IndexReader reader, String field, ShortParser parser)
    • String[] getStrings (IndexReader reader, String field)
    • StringIndex getStringIndex (IndexReader reader, String field)

    StringIndex contains two members:

    • String[] lookup: all terms, in dictionary order.
    • int[] order: indexed by document number; order[i] is the position in lookup of the term contained by document i.
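    A tiny made-up example of the two arrays: suppose three documents hold, in some field, the values "beijing", "shanghai", "beijing". The StringIndex layout would then look like this (plain arrays for illustration; slot 0 of lookup is reserved for null, i.e. documents with no term in the field):

```java
public class StringIndexSketch {
    // Resolves the term of a document the way FieldCache consumers do: lookup[order[doc]].
    static String termOf(String[] lookup, int[] order, int doc) {
        return lookup[order[doc]];
    }

    public static void main(String[] args) {
        // lookup: all terms in dictionary order; lookup[0] is reserved for null.
        String[] lookup = { null, "beijing", "shanghai" };
        // order[doc]: position in lookup of the term contained by document doc.
        int[] order = { 1, 2, 1 };
        for (int doc = 0; doc < order.length; doc++)
            System.out.println("doc " + doc + " -> " + termOf(lookup, order, doc));
        // doc 0 -> beijing
        // doc 1 -> shanghai
        // doc 2 -> beijing
    }
}
```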

    The default implementation of FieldCache is FieldCacheImpl, which has a member variable Map<Class<?>,Cache> caches mapping from type to Cache.

    private synchronized void init() {
      caches = new HashMap<Class<?>,Cache>(7);
      caches.put(Byte.TYPE, new ByteCache(this));
      caches.put(Short.TYPE, new ShortCache(this));
      caches.put(Integer.TYPE, new IntCache(this));
      caches.put(Float.TYPE, new FloatCache(this));
      caches.put(Long.TYPE, new LongCache(this));
      caches.put(Double.TYPE, new DoubleCache(this));
      caches.put(String.class, new StringCache(this));
      caches.put(StringIndex.class, new StringIndexCache(this));
    }

    Its implementation of getInts is as follows: it first fetches the IntCache for the Integer type, then looks up the int values in it by reader and by an Entry composed of field and parser.

    public int[] getInts(IndexReader reader, String field, IntParser parser) throws IOException {
      return (int[]) caches.get(Integer.TYPE).get(reader, new Entry(field, parser));
    }

    The parent class of all these caches, Cache, has a member variable Map<Object, Map<Entry, Object>> readerCache, whose key is the IndexReader and whose value is another Map; that inner Map is keyed by Entry (that is, by field), and its value is the cached data such as an int[]. (In other words, for this reader and this field there is an array of ints, one entry per document.)

    Cache's get method is as follows:

    public Object get(IndexReader reader, Entry key) throws IOException {
      Map<Entry,Object> innerCache;
      Object value;
      final Object readerKey = reader.getFieldCacheKey(); // Returns this, i.e. the IndexReader itself.
      synchronized (readerCache) {
        innerCache = readerCache.get(readerKey); // Look up the inner Map by IndexReader.
        if (innerCache == null) { // If absent, create a new Map.
          innerCache = new HashMap<Entry,Object>();
          readerCache.put(readerKey, innerCache);
          value = null;
        } else {
          value = innerCache.get(key); // The inner Map is keyed by Entry; its value is the cached data.
        }
        // On a cache miss, install a placeholder for the value to be created.
        if (value == null) {
          value = new CreationPlaceholder();
          innerCache.put(key, value);
        }
      }
      if (value instanceof CreationPlaceholder) {
        synchronized (value) {
          CreationPlaceholder progress = (CreationPlaceholder) value;
          if (progress.value == null) {
            progress.value = createValue(reader, key); // Create the cached value.
            synchronized (readerCache) {
              innerCache.put(key, progress.value);
            }
          }
          return progress.value;
        }
      }
      return value;
    }
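    The CreationPlaceholder dance lets the expensive createValue run outside the global readerCache lock while still ensuring each key is computed only once: competing threads synchronize on the per-key placeholder instead. A minimal standalone sketch of the same pattern (a hypothetical class with made-up names, not Lucene code):

```java
import java.util.HashMap;
import java.util.Map;

public class PlaceholderCacheSketch {
    static class Placeholder { Object value; }

    private final Map<String, Object> cache = new HashMap<String, Object>();
    int computations = 0; // Counts how often the expensive step actually runs.

    Object get(String key) {
        Object value;
        synchronized (cache) { // Short critical section: look up or install a placeholder.
            value = cache.get(key);
            if (value == null) {
                value = new Placeholder();
                cache.put(key, value);
            }
        }
        if (value instanceof Placeholder) {
            Placeholder p = (Placeholder) value;
            synchronized (p) { // Per-key lock: only one thread computes the value.
                if (p.value == null) {
                    p.value = expensiveCreate(key); // Runs outside the global lock.
                    synchronized (cache) {
                        cache.put(key, p.value); // Replace the placeholder with the real value.
                    }
                }
                return p.value;
            }
        }
        return value;
    }

    Object expensiveCreate(String key) {
        computations++;
        return key.toUpperCase();
    }

    public static void main(String[] args) {
        PlaceholderCacheSketch c = new PlaceholderCacheSketch();
        System.out.println(c.get("field")); // FIELD
        System.out.println(c.get("field")); // FIELD (cached)
        System.out.println(c.computations); // 1
    }
}
```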

    Cache's createValue method differs by type; we analyze only the IntCache and StringIndexCache implementations.

    IntCache's createValue method is as follows:

      protected Object createValue(IndexReader reader, Entry entryKey) throws IOException {
        Entry entry = entryKey;
        String field = entry.field;
        IntParser parser = (IntParser) entry.custom;
        int[] retArray = null;
        TermDocs termDocs = reader.termDocs();
        TermEnum termEnum = reader.terms(new Term(field));
        try {
          // Walk every term in the field, parse it with IntParser, and cache it:
          // the index into retArray is the doc ID, so retArray[i] is the int value of document i.
          do {
            Term term = termEnum.term();
            if (term == null || term.field() != field) break;
            int termval = parser.parseInt(term.text());
            if (retArray == null)
              retArray = new int[reader.maxDoc()];
            termDocs.seek(termEnum);
            while (termDocs.next()) {
              retArray[termDocs.doc()] = termval;
            }
          } while (termEnum.next());
        } catch (StopFillCacheException stop) {
        } finally {
          termDocs.close();
          termEnum.close();
        }
        if (retArray == null)
          retArray = new int[reader.maxDoc()];
        return retArray;
      }

    StringIndexCache's createValue method is as follows:

    protected Object createValue(IndexReader reader, Entry entryKey) throws IOException {
      String field = StringHelper.intern(entryKey.field);
      final int[] retArray = new int[reader.maxDoc()];
      String[] mterms = new String[reader.maxDoc()+1];
      TermDocs termDocs = reader.termDocs();
      TermEnum termEnum = reader.terms(new Term(field));
      int t = 0;
      mterms[t++] = null;
      try {
        do {
          Term term = termEnum.term();
          if (term == null || term.field() != field) break;
          mterms[t] = term.text(); // mterms[i] holds the string of the i-th term in dictionary order.
          termDocs.seek(termEnum);
          while (termDocs.next()) {
            retArray[termDocs.doc()] = t; // retArray[i] holds the position in mterms of the string contained by document i.
          }
          t++;
        } while (termEnum.next());
      } finally {
        termDocs.close();
        termEnum.close();
      }
      if (t == 0) {
        mterms = new String[1];
      } else if (t < mterms.length) {
        // Shrink mterms to the number of terms actually seen.
        String[] terms = new String[t];
        System.arraycopy(mterms, 0, terms, 0, t);
        mterms = terms;
      }
      StringIndex value = new StringIndex(retArray, mterms);
      return value;
    }

    FieldCacheRangeFilter supports ranges of various types; the int variant is created by the following method:

    public static FieldCacheRangeFilter<Integer> newIntRange(String field, FieldCache.IntParser parser, Integer lowerVal, Integer upperVal, boolean includeLower, boolean includeUpper) {
      return new FieldCacheRangeFilter<Integer>(field, parser, lowerVal, upperVal, includeLower, includeUpper) {
        @Override
        public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
          final int inclusiveLowerPoint, inclusiveUpperPoint;
          // Compute the lower bound.
          if (lowerVal != null) {
            int i = lowerVal.intValue();
            if (!includeLower && i == Integer.MAX_VALUE)
              return DocIdSet.EMPTY_DOCIDSET;
            inclusiveLowerPoint = includeLower ? i : (i + 1);
          } else {
            inclusiveLowerPoint = Integer.MIN_VALUE;
          }
          // Compute the upper bound.
          if (upperVal != null) {
            int i = upperVal.intValue();
            if (!includeUpper && i == Integer.MIN_VALUE)
              return DocIdSet.EMPTY_DOCIDSET;
            inclusiveUpperPoint = includeUpper ? i : (i - 1);
          } else {
            inclusiveUpperPoint = Integer.MAX_VALUE;
          }
          if (inclusiveLowerPoint > inclusiveUpperPoint)
            return DocIdSet.EMPTY_DOCIDSET;
          // Fetch the values from the cache; values[i] is the value of document i in this field.
          final int[] values = FieldCache.DEFAULT.getInts(reader, field, (FieldCache.IntParser) parser);
          return new FieldCacheDocIdSet(reader, (inclusiveLowerPoint <= 0 && inclusiveUpperPoint >= 0)) {
            @Override
            boolean matchDoc(int doc) {
              // A document matches only when its value lies within the interval.
              return values[doc] >= inclusiveLowerPoint && values[doc] <= inclusiveUpperPoint;
            }
          };
        }
      };
    }

    FieldCacheRangeFilter is functionally similar to NumericRangeFilter or TermRangeFilter, except that the latter two build their doc-ID bitsets from the index, while the former reads from the cache, which is faster.

    Likewise, FieldCacheTermsFilter is functionally similar to TermsFilter; again, the former uses the cache for speed.
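    The subtle step in newIntRange is converting possibly-exclusive bounds into an inclusive interval, returning an empty set when an exclusive bound would overflow. A standalone sketch of that logic over a cached values array (hypothetical names, not the Lucene API):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class RangeFilterSketch {
    // Converts [lowerVal, upperVal] with inclusive/exclusive flags into an inclusive
    // interval, then scans a cached values array for matching doc IDs.
    static int[] matchingDocs(int[] values, Integer lowerVal, Integer upperVal,
                              boolean includeLower, boolean includeUpper) {
        final int lo, hi;
        if (lowerVal != null) {
            if (!includeLower && lowerVal == Integer.MAX_VALUE) return new int[0]; // Would overflow.
            lo = includeLower ? lowerVal : lowerVal + 1;
        } else {
            lo = Integer.MIN_VALUE;
        }
        if (upperVal != null) {
            if (!includeUpper && upperVal == Integer.MIN_VALUE) return new int[0]; // Would underflow.
            hi = includeUpper ? upperVal : upperVal - 1;
        } else {
            hi = Integer.MAX_VALUE;
        }
        if (lo > hi) return new int[0]; // Empty interval: nothing can match.
        return IntStream.range(0, values.length)
                .filter(doc -> values[doc] >= lo && values[doc] <= hi)
                .toArray();
    }

    public static void main(String[] args) {
        int[] values = { 5, 10, 15, 20 }; // values[doc] = cached value of document doc.
        // Range (10, 20]: exclusive lower bound, inclusive upper -> docs 2 and 3.
        System.out.println(Arrays.toString(matchingDocs(values, 10, 20, false, true))); // [2, 3]
    }
}
```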

    6.5 MultiTermQueryWrapperFilter<Q>

    MultiTermQueryWrapperFilter has a member variable Q query; its getDocIdSet returns the bitset of document IDs satisfying that query.

    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
      final TermEnum enumerator = query.getEnum(reader);
      try {
        if (enumerator.term() == null)
          return DocIdSet.EMPTY_DOCIDSET;
        final OpenBitSet bitSet = new OpenBitSet(reader.maxDoc());
        final int[] docs = new int[32];
        final int[] freqs = new int[32];
        TermDocs termDocs = reader.termDocs();
        try {
          int termCount = 0;
          // Enumerate every term matching the query.
          do {
            Term term = enumerator.term();
            if (term == null)
              break;
            termCount++;
            termDocs.seek(term);
            while (true) {
              // Read each term's doc-ID list in batches and set the IDs in the bitset.
              final int count = termDocs.read(docs, freqs);
              if (count != 0) {
                for (int i = 0; i < count; i++) {
                  bitSet.set(docs[i]);
                }
              } else {
                break;
              }
            }
          } while (enumerator.next());
          query.incTotalNumberOfTerms(termCount);
        } finally {
          termDocs.close();
        }
        return bitSet;
      } finally {
        enumerator.close();
      }
    }

    MultiTermQueryWrapperFilter has three important subclasses:

    • NumericRangeFilter<T>: wraps a NumericRangeQuery.
    • PrefixFilter: wraps a PrefixQuery.
    • TermRangeFilter: wraps a TermRangeQuery.

    6.6 QueryWrapperFilter

    It wraps a query object; its getDocIdSet obtains all documents satisfying that query:

    public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
      final Weight weight = query.weight(new IndexSearcher(reader));
      return new DocIdSet() {
        public DocIdSetIterator iterator() throws IOException {
          return weight.scorer(reader, true, false); // The Scorer's next() yields the doc IDs one by one.
        }
      };
    }

    6.7 SpanFilter

    6.7.1 SpanQueryFilter

    It wraps a SpanQuery query as a filter. Besides producing doc IDs through getDocIdSet, its bitSpans method returns a SpanFilterResult that also carries position information, which can be used for filtering in FilteredQuery.

    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
      SpanFilterResult result = bitSpans(reader);
      return result.getDocIdSet();
    }

    public SpanFilterResult bitSpans(IndexReader reader) throws IOException {
      final OpenBitSet bits = new OpenBitSet(reader.maxDoc());
      Spans spans = query.getSpans(reader);
      List<SpanFilterResult.PositionInfo> tmp = new ArrayList<SpanFilterResult.PositionInfo>(20);
      int currentDoc = -1;
      SpanFilterResult.PositionInfo currentInfo = null;
      while (spans.next()) {
        // Put the doc ID into the bitset.
        int doc = spans.doc();
        bits.set(doc);
        if (currentDoc != doc) {
          currentInfo = new SpanFilterResult.PositionInfo(doc);
          tmp.add(currentInfo);
          currentDoc = doc;
        }
        // Record the start and end positions in the PositionInfo.
        currentInfo.addPosition(spans.start(), spans.end());
      }
      return new SpanFilterResult(bits, tmp);
    }

    6.7.2 CachingSpanFilter

    From Filter's interface DocIdSet getDocIdSet(IndexReader reader), we can see that a doc-ID bitset corresponds to a particular reader.

    From the earlier discussion of doc IDs, we know they are only meaningful within one open reader.

    CachingSpanFilter has a member variable Map<IndexReader,SpanFilterResult> cache mapping a reader to its SpanFilterResult, and another member variable SpanFilter filter used to compute the SpanFilterResult on a cache miss.

    Its getDocIdSet is as follows:

    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
      SpanFilterResult result = getCachedResult(reader);
      return result != null ? result.getDocIdSet() : null;
    }

    private SpanFilterResult getCachedResult(IndexReader reader) throws IOException {
      lock.lock();
      try {
        if (cache == null) {
          cache = new WeakHashMap<IndexReader,SpanFilterResult>();
        }
        // On a cache hit, return the cached result.
        final SpanFilterResult cached = cache.get(reader);
        if (cached != null) return cached;
      } finally {
        lock.unlock();
      }
      // On a cache miss, compute the result directly from the reader via the SpanFilter.
      final SpanFilterResult result = filter.bitSpans(reader);
      lock.lock();
      try {
        // Put the new result into the cache.
        cache.put(reader, result);
      } finally {
        lock.unlock();
      }
      return result;
    }

  • Original article: https://www.cnblogs.com/forfuture1978/p/1738805.html