
    Solr 4.8.0 Source Code Analysis (6): Non-Sorted Queries

    The previous article gave a brief overview of Solr's query flow; starting with this one, we look at the details of query execution. Queries fall into two kinds, sorted and non-sorted, and the two follow separate code paths, so this article covers the non-sorted path first.

    The query flow is driven by SolrIndexSearcher.getDocListC(QueryResult qr, QueryCommand cmd). As its name suggests (the trailing C stands for cache), this function manages the queryResultCache and, based on the query conditions, decides whether to take the sorted or the non-sorted path.

    /**
     * getDocList version that uses+populates query and filter caches.
     * In the event of a timeout, the cache is not populated.
     */
    private void getDocListC(QueryResult qr, QueryCommand cmd) throws IOException {
      DocListAndSet out = new DocListAndSet();
      qr.setDocListAndSet(out);
      QueryResultKey key = null;
      // For a query with an offset, Solr first fetches cmd.getOffset()+cmd.getLen()
      // doc ids and then takes a subset based on the offset, so maxDocRequested is
      // the number of docs actually fetched.
      int maxDocRequested = cmd.getOffset() + cmd.getLen();
      // check for overflow, and check for # docs in index
      if (maxDocRequested < 0 || maxDocRequested > maxDoc()) maxDocRequested = maxDoc(); // at most, fetch every doc id
      int supersetMaxDoc = maxDocRequested;
      DocList superset = null;

      int flags = cmd.getFlags();
      Query q = cmd.getQuery();
      if (q instanceof ExtendedQuery) {
        ExtendedQuery eq = (ExtendedQuery)q;
        if (!eq.getCache()) {
          flags |= (NO_CHECK_QCACHE | NO_SET_QCACHE | NO_CHECK_FILTERCACHE);
        }
      }

      // we can try and look up the complete query in the cache.
      // we can't do that if filter!=null though (we don't want to
      // do hashCode() and equals() for a big DocSet).
      // First check whether this query already has a result in the query result
      // cache; if so, return the cached result. Caching will get an article of its own.
      if (queryResultCache != null && cmd.getFilter()==null
          && (flags & (NO_CHECK_QCACHE|NO_SET_QCACHE)) != ((NO_CHECK_QCACHE|NO_SET_QCACHE)))
      {
        // all of the current flags can be reused during warming,
        // so set all of them on the cache key.
        key = new QueryResultKey(q, cmd.getFilterList(), cmd.getSort(), flags);
        if ((flags & NO_CHECK_QCACHE)==0) {
          superset = queryResultCache.get(key);

          if (superset != null) {
            // check that the cache entry has scores recorded if we need them
            if ((flags & GET_SCORES)==0 || superset.hasScores()) {
              // NOTE: subset() returns null if the DocList has fewer docs than
              // requested
              out.docList = superset.subset(cmd.getOffset(),cmd.getLen()); // on a cache hit, slice a subset out of it
            }
          }
          if (out.docList != null) {
            // found the docList in the cache... now check if we need the docset too.
            // OPT: possible future optimization - if the doclist contains all the matches,
            // use it to make the docset instead of rerunning the query.
            // fetch the docSet from the cache and hand it to the result
            if (out.docSet==null && ((flags & GET_DOCSET)!=0) ) {
              if (cmd.getFilterList()==null) {
                out.docSet = getDocSet(cmd.getQuery());
              } else {
                List<Query> newList = new ArrayList<>(cmd.getFilterList().size()+1);
                newList.add(cmd.getQuery());
                newList.addAll(cmd.getFilterList());
                out.docSet = getDocSet(newList);
              }
            }
            return;
          }
        }

        // If we are going to generate the result, bump up to the
        // next resultWindowSize for better caching.
        // round supersetMaxDoc up to a multiple of queryResultWindowSize
        if ((flags & NO_SET_QCACHE) == 0) {
          // handle 0 special case as well as avoid idiv in the common case.
          if (maxDocRequested < queryResultWindowSize) {
            supersetMaxDoc = queryResultWindowSize;
          } else {
            supersetMaxDoc = ((maxDocRequested -1)/queryResultWindowSize + 1)*queryResultWindowSize;
            if (supersetMaxDoc < 0) supersetMaxDoc = maxDocRequested;
          }
        } else {
          key = null;  // we won't be caching the result
        }
      }
      cmd.setSupersetMaxDoc(supersetMaxDoc);

      // OK, so now we need to generate an answer.
      // One way to do that would be to check if we have an unordered list
      // of results for the base query. If so, we can apply the filters and then
      // sort by the resulting set. This can only be used if:
      // - the sort doesn't contain score
      // - we don't want score returned.

      // check if we should try and use the filter cache
      boolean useFilterCache = false;
      if ((flags & (GET_SCORES|NO_CHECK_FILTERCACHE))==0 && useFilterForSortedQuery && cmd.getSort() != null && filterCache != null) {
        useFilterCache = true;
        SortField[] sfields = cmd.getSort().getSort();
        for (SortField sf : sfields) {
          if (sf.getType() == SortField.Type.SCORE) {
            useFilterCache = false;
            break;
          }
        }
      }

      if (useFilterCache) {
        // now actually use the filter cache.
        // for large filters that match few documents, this may be
        // slower than simply re-executing the query.
        if (out.docSet == null) {
          out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter());
          DocSet bigFilt = getDocSet(cmd.getFilterList());
          if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
        }
        // todo: there could be a sortDocSet that could take a list of
        // the filters instead of anding them first...
        // perhaps there should be a multi-docset-iterator
        sortDocSet(qr, cmd); // sorted query path
      } else {
        // do it the normal way...
        if ((flags & GET_DOCSET) != 0) {
          // this currently conflates returning the docset for the base query vs
          // the base query and all filters.
          DocSet qDocSet = getDocListAndSetNC(qr, cmd);
          // cache the docSet matching the query w/o filtering
          if (qDocSet != null && filterCache != null && !qr.isPartialResults()) filterCache.put(cmd.getQuery(), qDocSet);
        } else {
          getDocListNC(qr, cmd); // non-sorted query path -- the one this article follows
        }
        assert null != out.docList : "docList is null";
      }

      if (null == cmd.getCursorMark()) {
        // Kludge...
        // we can't use DocSlice.subset, even though it should be an identity op
        // because it gets confused by situations where there are lots of matches, but
        // less docs in the slice then were requested, (due to the cursor)
        // so we have to short circuit the call.
        // None of which is really a problem since we can't use caching with
        // cursors anyway, but it still looks weird to have to special case this
        // behavior based on this condition - hence the long explanation.
        superset = out.docList; // slice the final result by offset and len
        out.docList = superset.subset(cmd.getOffset(), cmd.getLen());
      } else {
        // sanity check our cursor assumptions
        assert null == superset : "cursor: superset isn't null";
        assert 0 == cmd.getOffset() : "cursor: command offset mismatch";
        assert 0 == out.docList.offset() : "cursor: docList offset mismatch";
        assert cmd.getLen() >= supersetMaxDoc : "cursor: superset len mismatch: " +
          cmd.getLen() + " vs " + supersetMaxDoc;
      }

      // lastly, put the superset in the cache if the size is less than or equal
      // to queryResultMaxDocsCached
      if (key != null && superset.size() <= queryResultMaxDocsCached && !qr.isPartialResults()) {
        queryResultCache.put(key, superset); // cache the superset if it holds at most queryResultMaxDocsCached docs
      }
    }
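
    The rounding of supersetMaxDoc to a multiple of queryResultWindowSize is what makes paging cache-friendly: requests for adjacent pages land inside the same cached superset. Here is a minimal standalone sketch of that arithmetic (the window size of 20 is a made-up example, not a Solr default):

    public class WindowRoundingSketch {
        // same rounding as getDocListC above: bump the requested doc count
        // up to the next multiple of queryResultWindowSize
        static int roundToWindow(int maxDocRequested, int queryResultWindowSize) {
            if (maxDocRequested < queryResultWindowSize) {
                return queryResultWindowSize;                 // covers the 0 case and avoids idiv
            }
            int superset = ((maxDocRequested - 1) / queryResultWindowSize + 1) * queryResultWindowSize;
            return superset < 0 ? maxDocRequested : superset; // int-overflow guard
        }

        public static void main(String[] args) {
            // offset=5, len=10 -> 15 docs requested -> a 20-doc superset is cached,
            // so a follow-up request for offset=10, len=10 hits the same cache entry
            System.out.println(roundToWindow(15, 20));  // 20
            System.out.println(roundToWindow(21, 20));  // 40
        }
    }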

    Now we enter the non-sorted branch, getDocListNC(), which internally calls straight into Lucene's IndexSearcher.search():

    final TopDocsCollector topCollector = buildTopDocsCollector(len, cmd);
    // buildTopDocsCollector creates a HitQueue of size offset + len (the query's
    // len); every time a doc matching the query is found, its doc id is pushed
    // into the HitQueue and the totalHits counter is incremented -- totalHits is
    // the total number of query results.
    Collector collector = topCollector;
    if (terminateEarly) {
      collector = new EarlyTerminatingCollector(collector, cmd.len);
    }
    if( timeAllowed > 0 ) {
      // TimeLimitingCollector works simply: the clock starts from the first
      // matching doc id; until timeAllowed elapses, matching doc ids keep being
      // put into the HitQueue; once timeAllowed is reached, it immediately throws
      // an exception and aborts the rest of the search. An important hint for
      // query tuning.
      collector = new TimeLimitingCollector(collector, TimeLimitingCollector.getGlobalCounter(), timeAllowed);
    }
    if (pf.postFilter != null) {
      pf.postFilter.setLastDelegate(collector);
      collector = pf.postFilter;
    }
    try {
      // enter Lucene's IndexSearcher.search()
      super.search(query, luceneFilter, collector);
      if(collector instanceof DelegatingCollector) {
        ((DelegatingCollector)collector).finish();
      }
    }
    catch( TimeLimitingCollector.TimeExceededException x ) {
      log.warn( "Query: " + query + "; " + x.getMessage() );
      qr.setPartialResults(true);
    }

    totalHits = topCollector.getTotalHits();        // the totalHits result
    TopDocs topDocs = topCollector.topDocs(0, len); // the doc ids held in the priority queue (HitQueue)
    populateNextCursorMarkFromTopDocs(qr, cmd, topDocs);

    maxScore = totalHits > 0 ? topDocs.getMaxScore() : 0.0f;
    nDocsReturned = topDocs.scoreDocs.length;
    ids = new int[nDocsReturned];
    scores = (cmd.getFlags()&GET_SCORES)!=0 ? new float[nDocsReturned] : null;
    for (int i=0; i<nDocsReturned; i++) {
      ScoreDoc scoreDoc = topDocs.scoreDocs[i];
      ids[i] = scoreDoc.doc;
      if (scores != null) scores[i] = scoreDoc.score;
    }
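
    Note how the snippet above composes behavior: each optional feature (early termination, time limiting, post filtering) wraps the previous Collector and delegates to it. A simplified sketch of that decorator pattern, with hypothetical stand-in classes rather than the real Lucene types:

    // Hypothetical, simplified stand-ins illustrating the wrapping above.
    interface MiniCollector {
        void collect(int doc);
    }

    class CountingCollector implements MiniCollector {        // innermost, like topCollector
        int totalHits = 0;
        public void collect(int doc) { totalHits++; }
    }

    class LimitingCollector implements MiniCollector {        // wrapper, like EarlyTerminatingCollector
        private final MiniCollector delegate;
        private final int max;
        private int seen = 0;
        LimitingCollector(MiniCollector delegate, int max) { this.delegate = delegate; this.max = max; }
        public void collect(int doc) {
            delegate.collect(doc);                            // pass the hit down the chain
            if (++seen >= max) throw new RuntimeException("early termination");
        }
    }

    // composition mirrors the Solr code: collector = new LimitingCollector(counting, 10);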
    This is how TimeLimitingCollector collects results: once timeAllowed is reached, it immediately throws an exception and aborts the rest of the search.
      /**
       * Calls {@link Collector#collect(int)} on the decorated {@link Collector}
       * unless the allowed time has passed, in which case it throws an exception.
       * 
       * @throws TimeExceededException
       *           if the time allowed has exceeded.
       */
      @Override
      public void collect(final int doc) throws IOException {
        final long time = clock.get();
        if (timeout < time) {
          if (greedy) {
            //System.out.println(this+"  greedy: before failing, collecting doc: "+(docBase + doc)+"  "+(time-t0));
            collector.collect(doc);
          }
          //System.out.println(this+"  failing on:  "+(docBase + doc)+"  "+(time-t0));
          throw new TimeExceededException( timeout-t0, time-t0, docBase + doc );   
        }
        //System.out.println(this+"  collecting: "+(docBase + doc)+"  "+(time-t0));
        collector.collect(doc);
      }
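
    The greedy flag above decides whether the document that trips the deadline is still collected before the exception flies. The same check in isolation, as a self-contained sketch (System.nanoTime() stands in for Lucene's global Counter clock; the 5 ms budget is made up):

    import java.util.ArrayList;
    import java.util.List;

    // Simplified stand-in for TimeLimitingCollector.collect(); not the real class.
    class DeadlineSketch {
        final long deadline = System.nanoTime() + 5_000_000L; // 5 ms budget, made-up value
        final boolean greedy = true;
        final List<Integer> hits = new ArrayList<>();

        void collect(int doc) {
            if (System.nanoTime() > deadline) {
                if (greedy) hits.add(doc);   // greedy: keep the in-flight doc, then abort
                throw new RuntimeException("time allowed exceeded at doc " + doc);
            }
            hits.add(doc);                   // within budget: collect normally
        }
    }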

    Next comes Lucene's search process:

    1. First, a Weight object is created for each query clause, and all of them are collected into ArrayList<Weight> weights. This step establishes each clause's weight, which is used later during scoring.

    public BooleanWeight(IndexSearcher searcher, boolean disableCoord)
      throws IOException {
      this.similarity = searcher.getSimilarity();
      this.disableCoord = disableCoord;
      weights = new ArrayList<>(clauses.size());
      for (int i = 0 ; i < clauses.size(); i++) {
        BooleanClause c = clauses.get(i);
        Weight w = c.getQuery().createWeight(searcher);
        weights.add(w);
        if (!c.isProhibited()) {
          maxCoord++;
        }
      }
    }
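
    The maxCoord counter incremented above (only for non-prohibited clauses) feeds Lucene's coord factor, which rewards documents that match more of the optional clauses. In DefaultSimilarity the factor is simply the matched fraction; a standalone sketch of the computation:

    public class CoordSketch {
        // Mirrors DefaultSimilarity.coord(overlap, maxOverlap), shown standalone.
        static float coord(int overlap, int maxOverlap) {
            return overlap / (float) maxOverlap;  // fraction of non-prohibited clauses matched
        }

        public static void main(String[] args) {
            // a doc matching 2 of 3 optional clauses has its score scaled by 2/3
            System.out.println(coord(2, 3));  // 0.6666667
        }
    }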

    2. Iterate over all segments, searching for matching doc ids one segment after another. AtomicReaderContext carries the segment's concrete information, including docBase and numDocs; this information is very useful when implementing query optimizations. Note that the collector here is the TopDocsCollector instance assigned in the code above.

    /**
     * Lower-level search API.
     * 
     * <p>
     * {@link Collector#collect(int)} is called for every document. <br>
     * 
     * <p>
     * NOTE: this method executes the searches on all given leaves exclusively.
     * To search across all the searchers leaves use {@link #leafContexts}.
     * 
     * @param leaves 
     *          the searchers leaves to execute the searches on
     * @param weight
     *          to match documents
     * @param collector
     *          to receive hits
     * @throws BooleanQuery.TooManyClauses If a query would exceed 
     *         {@link BooleanQuery#getMaxClauseCount()} clauses.
     */
    protected void search(List<AtomicReaderContext> leaves, Weight weight, Collector collector)
        throws IOException {

      // TODO: should we make this
      // threaded...?  the Collector could be sync'd?
      // always use single thread:
      for (AtomicReaderContext ctx : leaves) { // search each subreader
        try {
          collector.setNextReader(ctx);
        } catch (CollectionTerminatedException e) {
          // there is no doc of interest in this reader context
          // continue with the following leaf
          continue;
        }
        BulkScorer scorer = weight.bulkScorer(ctx, !collector.acceptsDocsOutOfOrder(), ctx.reader().getLiveDocs());
        if (scorer != null) {
          try {
            scorer.score(collector);
          } catch (CollectionTerminatedException e) {
            // collection was terminated prematurely
            // continue with the following leaf
          }
        }
      }
    }
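
    Because each leaf (segment) is searched independently, segment-local doc ids must be translated to index-global ids by adding the leaf's docBase; that is exactly what pqTop.doc = doc + docBase does in the collect() method shown further below. A small sketch with made-up segment sizes:

    public class DocBaseSketch {
        public static void main(String[] args) {
            // Hypothetical layout: three segments holding 100, 50, and 200 docs.
            // Each leaf's docBase equals the doc count of all earlier leaves.
            int[] segmentSizes = {100, 50, 200};
            int[] docBases = new int[segmentSizes.length];
            for (int i = 1; i < segmentSizes.length; i++) {
                docBases[i] = docBases[i - 1] + segmentSizes[i - 1];  // {0, 100, 150}
            }
            // local doc 7 in the third segment is global doc 150 + 7 = 157
            System.out.println("global doc id: " + (docBases[2] + 7));
        }
    }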

    3. Weight.bulkScorer builds the scorer that evaluates the query clauses. Lucene's multi-clause query optimization here is nicely done: it orders the clauses by term frequency, putting low-frequency clauses first and high-frequency ones last, which greatly speeds up multi-clause queries. That optimization will be covered in detail in a later article.
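
    The intuition behind that ordering, sketched over plain sorted int arrays rather than Lucene's real posting lists: let the rarest term generate candidates, and make the more common terms merely confirm them.

    import java.util.Arrays;
    import java.util.Comparator;

    public class CostOrderedIntersection {
        public static void main(String[] args) {
            // Postings for three hypothetical terms, each sorted by doc id.
            int[][] postings = {
                {1, 2, 3, 9, 10, 40, 41, 80},   // common term
                {2, 9, 40, 41, 77},             // rare term
                {0, 2, 5, 9, 40, 41, 99}
            };
            // Order lists by size, cheapest first (Lucene orders scorers by cost()).
            Arrays.sort(postings, Comparator.comparingInt(a -> a.length));
            // Intersect: the rare list drives; longer lists only confirm candidates.
            for (int doc : postings[0]) {
                boolean inAll = true;
                for (int i = 1; i < postings.length && inAll; i++) {
                    inAll = Arrays.binarySearch(postings[i], doc) >= 0;
                }
                if (inAll) System.out.println("match: doc " + doc);  // 2, 9, 40, 41
            }
        }
    }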

    4. Finally, Lucene runs the actual search through scorer.score(collector). The two functions below (from Weight's default BulkScorer) make it clear how Lucene gathers matches.

    @Override
    public boolean score(Collector collector, int max) throws IOException {
      // TODO: this may be sort of weird, when we are
      // embedded in a BooleanScorer, because we are
      // called for every chunk of 2048 documents.  But,
      // then, scorer is a FakeScorer in that case, so any
      // Collector doing something "interesting" in
      // setScorer will be forced to use BS2 anyways:
      collector.setScorer(scorer);
      if (max == DocIdSetIterator.NO_MORE_DOCS) {
        scoreAll(collector, scorer);
        return false;
      } else {
        int doc = scorer.docID();
        if (doc < 0) {
          doc = scorer.nextDoc();
        }
        return scoreRange(collector, scorer, doc, max);
      }
    }

    Lucene keeps pulling matching docs from the segment and feeding them to the collector (and thus into its HitQueue). Note that collector here is declared as Collector, the parent type of TopDocsCollector and friends, so scoreAll can collect doc ids not only for TopDocsCollector but for any other collection strategy as well; a minimal custom collector is sketched after the code below.

    static void scoreAll(Collector collector, Scorer scorer) throws IOException {
      int doc;
      while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        collector.collect(doc);
      }
    }
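
    Since scoreAll sees only the Collector contract, any implementation can be plugged in. Below is a minimal hit-counting collector sketched against the Lucene 4.x Collector API (verify the signatures against your Lucene version before relying on them):

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // A minimal Collector in the Lucene 4.x style: counts hits, ignores scores.
    public class HitCountCollector extends Collector {
        private int totalHits = 0;

        @Override public void setScorer(Scorer scorer) throws IOException { }        // scores not needed
        @Override public void collect(int doc) throws IOException { totalHits++; }   // called once per matching doc
        @Override public void setNextReader(AtomicReaderContext ctx) throws IOException { } // per-segment setup
        @Override public boolean acceptsDocsOutOfOrder() { return true; }            // order doesn't matter for counting

        public int getTotalHits() { return totalHits; }
    }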

    Stepping into collector.collect(doc) shows how TopDocsCollector tallies doc ids, exactly as described earlier.

    @Override
    public void collect(int doc) throws IOException {
      float score = scorer.score();

      // This collector cannot handle these scores:
      assert score != Float.NEGATIVE_INFINITY;
      assert !Float.isNaN(score);

      totalHits++;
      if (score <= pqTop.score) {
        // Since docs are returned in-order (i.e., increasing doc Id), a document
        // with equal score to pqTop.score cannot compete since HitQueue favors
        // documents with lower doc Ids. Therefore reject those docs too.
        return;
      }
      pqTop.doc = doc + docBase;
      pqTop.score = score;
      pqTop = pq.updateTop();
    }
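
    The pqTop/updateTop dance above is a bounded min-heap: pq holds the len best hits seen so far, pqTop is the worst of them, and a new hit enters only if it beats pqTop (Lucene's HitQueue additionally prefers lower doc ids on score ties). The same top-N logic expressed with java.util.PriorityQueue, as a sketch rather than Lucene's sentinel-filled HitQueue:

    import java.util.PriorityQueue;

    public class TopNSketch {
        public static void main(String[] args) {
            int n = 2;  // keep the 2 best hits; made-up len
            // min-heap ordered by score, so peek() is always the current worst hit
            PriorityQueue<float[]> pq = new PriorityQueue<>((a, b) -> Float.compare(a[0], b[0]));
            float[][] hits = { {0.9f, 3}, {0.2f, 7}, {0.5f, 11}, {0.7f, 20} };  // {score, doc}
            for (float[] hit : hits) {
                if (pq.size() < n) {
                    pq.add(hit);                       // queue not full yet
                } else if (hit[0] > pq.peek()[0]) {    // beats the current worst?
                    pq.poll();                         // evict it (like updateTop replacing pqTop)
                    pq.add(hit);
                }
            }
            pq.forEach(h -> System.out.println("doc " + (int) h[1] + " score " + h[0]));  // docs 3 and 20
        }
    }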
    Summary: this article walked through the non-sorted query flow in detail, mainly involving the classes QueryComponent, SolrIndexSearcher, TimeLimitingCollector, TopDocsCollector, IndexSearcher, BulkScorer, and Weight. For reasons of space, it did not cover how doc ids are actually fetched from a segment, nor how multi-clause queries are implemented; both will be detailed in the next article on multi-clause queries.
    Please credit http://www.cnblogs.com/rcfeng/ when reposting.