
    Solr 4.8.0 Source Code Analysis (5): An Overview of the Query Flow

    As covered in earlier installments, a Solr query arrives as an HTTP request that the Solr servlet accepts and processes, so the query flow starts in SolrDispatchFilter.doFilter(), which handles every kind of HTTP request. Solr supports many query parameters, such as q and fq; this chapter looks only at the /select handler and the q parameter. A query issued from the admin page looks like this: http://localhost:8080/solr/test/select?q=code%3A%E8%BE%BD*+AND+last_modified%3A%5B0+TO+1408454600265%5D+AND+id%3Acheng&wt=json&indent=true

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
      doFilter(request, response, chain, false);
    }
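The q parameter in the example URL above is percent-encoded. As a quick JDK-only sketch (this is not Solr code; the class name is ours), the raw parameter can be decoded back into the Lucene query syntax it carries:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeQuery {
    public static void main(String[] args) {
        String raw = "code%3A%E8%BE%BD*+AND+last_modified%3A%5B0+TO+1408454600265%5D+AND+id%3Acheng";
        // application/x-www-form-urlencoded: '+' is a space, %XX escapes are UTF-8 bytes
        String q = URLDecoder.decode(raw, StandardCharsets.UTF_8);
        System.out.println(q);
        // prints: code:辽* AND last_modified:[0 TO 1408454600265] AND id:cheng
    }
}
```

The decoded string is the query the rest of this article follows: a wildcard clause, a range clause, and a term clause ANDed together.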

    Since we care only about /select, the actual query starts from the code below; this.execute() is the query's entry point. Note the writeResponse() call: execute() only collects the doc ids that match the query, and writeResponse() later uses those doc ids to fetch the stored fields and write them into the response.

    // With a valid handler and a valid core...
    if( handler != null ) {
      // if not a /select, create the request
      if( solrReq == null ) {
        solrReq = parser.parse( core, path, req );
      }

      if (usingAliases) {
        processAliases(solrReq, aliases, collectionsList);
      }

      final Method reqMethod = Method.getMethod(req.getMethod());
      HttpCacheHeaderUtil.setCacheControlHeader(config, resp, reqMethod);
      // unless we have been explicitly told not to, do cache validation
      // if we fail cache validation, execute the query
      if (config.getHttpCachingConfig().isNever304() ||
          !HttpCacheHeaderUtil.doCacheHeaderValidation(solrReq, req, reqMethod, resp)) {
        SolrQueryResponse solrRsp = new SolrQueryResponse();
        /* even for HEAD requests, we need to execute the handler to
         * ensure we don't get an error (and to make sure the correct
         * QueryResponseWriter is selected and we get the correct
         * Content-Type)
         */
        SolrRequestInfo.setRequestInfo(new SolrRequestInfo(solrReq, solrRsp));
        this.execute( req, handler, solrReq, solrRsp );
        HttpCacheHeaderUtil.checkHttpCachingVeto(solrRsp, resp, reqMethod);
        // add info to http headers
        //TODO: See SOLR-232 and SOLR-267.
        /*try {
          NamedList solrRspHeader = solrRsp.getResponseHeader();
          for (int i=0; i<solrRspHeader.size(); i++) {
            ((javax.servlet.http.HttpServletResponse) response).addHeader(("Solr-" + solrRspHeader.getName(i)), String.valueOf(solrRspHeader.getVal(i)));
          }
        } catch (ClassCastException cce) {
          log.log(Level.WARNING, "exception adding response header log information", cce);
        }*/
        QueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);
        writeResponse(solrRsp, response, responseWriter, solrReq, reqMethod);
      }

    execute() then enters SolrCore.execute(). preDecorateResponse() pre-processes the response header information, postDecorateResponse() writes the elapsed time and results into the response, and handleRequest() carries the query forward.

    public void execute(SolrRequestHandler handler, SolrQueryRequest req, SolrQueryResponse rsp) {
      if (handler==null) {
        String msg = "Null Request Handler '" +
          req.getParams().get(CommonParams.QT) + "'";

        if (log.isWarnEnabled()) log.warn(logid + msg + ":" + req);

        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, msg);
      }

      preDecorateResponse(req, rsp);

      // TODO: this doesn't seem to be working correctly and causes problems with the example server and distrib (for example /spell)
      // if (req.getParams().getBool(ShardParams.IS_SHARD,false) && !(handler instanceof SearchHandler))
      //   throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,"isShard is only acceptable with search handlers");

      handler.handleRequest(req,rsp);
      postDecorateResponse(handler, req, rsp);

      if (log.isInfoEnabled() && rsp.getToLog().size() > 0) {
        log.info(rsp.getToLogAsString(logid));
      }
    }

    RequestHandlerBase.handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) in turn calls SearchHandler.handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp), and only at this point are the query components actually loaded.

    The following loop runs the search-related components: QueryComponent, FacetComponent, MoreLikeThisComponent, HighlightComponent, StatsComponent, DebugComponent, and ExpandComponent. Since this article deals only with querying, we step into QueryComponent.java.

    for( SearchComponent c : components ) {
        c.process(rb);
    }    

    Leaving aside how QueryComponent.java handles the query itself (the details come in later chapters; this one is an overview), QueryComponent.process(ResponseBuilder rb) calls SolrIndexSearcher.search(QueryResult qr, QueryCommand cmd) to run the query, then post-processes the returned results, chiefly in doFieldSortValues(rb, searcher) and doPrefetch(rb).

    // normal search result
    searcher.search(result,cmd);
    rb.setResult( result );

    ResultContext ctx = new ResultContext();
    ctx.docs = rb.getResults().docList;
    ctx.query = rb.getQuery();
    rsp.add("response", ctx);
    rsp.getToLog().add("hits", rb.getResults().docList.matches());

    if ( ! rb.req.getParams().getBool(ShardParams.IS_SHARD,false) ) {
      if (null != rb.getNextCursorMark()) {
        rb.rsp.add(CursorMarkParams.CURSOR_MARK_NEXT,
                   rb.getNextCursorMark().getSerializedTotem());
      }
    }
    doFieldSortValues(rb, searcher);
    doPrefetch(rb);

    SolrIndexSearcher.search() itself is simple: it just delegates to SolrIndexSearcher.getDocListC(), which, as the name suggests, returns the list of matching doc ids. This is where the real query begins. Before searching, Solr consults the queryResultCache, which stores key-value pairs mapping query conditions to query results. If the cache already holds this query condition, Solr returns the cached result directly; otherwise it runs the query normally and writes the condition/result pair into the cache. queryResultCache has a bounded capacity, configurable in the cache section of solrconfig.xml.

    // we can try and look up the complete query in the cache.
    // we can't do that if filter!=null though (we don't want to
    // do hashCode() and equals() for a big DocSet).
    if (queryResultCache != null && cmd.getFilter()==null
        && (flags & (NO_CHECK_QCACHE|NO_SET_QCACHE)) != ((NO_CHECK_QCACHE|NO_SET_QCACHE)))
    {
      // all of the current flags can be reused during warming,
      // so set all of them on the cache key.
      key = new QueryResultKey(q, cmd.getFilterList(), cmd.getSort(), flags);
      if ((flags & NO_CHECK_QCACHE)==0) {
        superset = queryResultCache.get(key);

        if (superset != null) {
          // check that the cache entry has scores recorded if we need them
          if ((flags & GET_SCORES)==0 || superset.hasScores()) {
            // NOTE: subset() returns null if the DocList has fewer docs than
            // requested
            out.docList = superset.subset(cmd.getOffset(),cmd.getLen());
          }
        }
        if (out.docList != null) {
          // found the docList in the cache... now check if we need the docset too.
          // OPT: possible future optimization - if the doclist contains all the matches,
          // use it to make the docset instead of rerunning the query.
          if (out.docSet==null && ((flags & GET_DOCSET)!=0) ) {
            if (cmd.getFilterList()==null) {
              out.docSet = getDocSet(cmd.getQuery());
            } else {
              List<Query> newList = new ArrayList<>(cmd.getFilterList().size()+1);
              newList.add(cmd.getQuery());
              newList.addAll(cmd.getFilterList());
              out.docSet = getDocSet(newList);
            }
          }
          return;
        }
      }

      // If we are going to generate the result, bump up to the
      // next resultWindowSize for better caching.

      if ((flags & NO_SET_QCACHE) == 0) {
        // handle 0 special case as well as avoid idiv in the common case.
        if (maxDocRequested < queryResultWindowSize) {
          supersetMaxDoc=queryResultWindowSize;
        } else {
          supersetMaxDoc = ((maxDocRequested -1)/queryResultWindowSize + 1)*queryResultWindowSize;
          if (supersetMaxDoc < 0) supersetMaxDoc=maxDocRequested;
        }
      } else {
        key = null;  // we won't be caching the result
      }
    }
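queryResultCache behaves like a size-bounded map from QueryResultKey to a cached DocList superset, evicting old entries when full. A minimal stand-in built on the JDK's LinkedHashMap shows the idea (an illustration only; Solr's real LRUCache, configured in solrconfig.xml, is more elaborate):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for queryResultCache: a capacity-bounded LRU map.
class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxSize;

    BoundedLruCache(int maxSize) {
        super(16, 0.75f, true); // accessOrder=true gives least-recently-used eviction
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
    }

    public static void main(String[] args) {
        BoundedLruCache<String, int[]> cache = new BoundedLruCache<>(2);
        cache.put("q=a", new int[]{1, 2});
        cache.put("q=b", new int[]{3});
        cache.get("q=a");               // touch "q=a" so "q=b" becomes eldest
        cache.put("q=c", new int[]{4}); // capacity 2: evicts "q=b"
        System.out.println(cache.keySet()); // prints: [q=a, q=c]
    }
}
```

The important property is the same as in Solr: a repeated lookup with an identical key skips the search entirely, and the cache never grows past its configured size.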

    If there is no matching cache entry, a normal query runs. The code takes either the sorted or the unsorted branch (the difference between the two is covered in a later article), and eventually reaches getDocListNC(qr, cmd) to continue the query. superset.subset() then trims the result: if you ask for start=20, rows=40, Solr actually fetches start=0, rows=60 (that is, at least start + rows results) and then slices out entries 20 through 60.

    if (useFilterCache) {
          // now actually use the filter cache.
          // for large filters that match few documents, this may be
          // slower than simply re-executing the query.
          if (out.docSet == null) {
            out.docSet = getDocSet(cmd.getQuery(),cmd.getFilter());
            DocSet bigFilt = getDocSet(cmd.getFilterList());
            if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
          }
          // todo: there could be a sortDocSet that could take a list of
          // the filters instead of anding them first...
          // perhaps there should be a multi-docset-iterator
          sortDocSet(qr, cmd);
        } else {
          // do it the normal way...
          if ((flags & GET_DOCSET)!=0) {
            // this currently conflates returning the docset for the base query vs
            // the base query and all filters.
            DocSet qDocSet = getDocListAndSetNC(qr,cmd);
            // cache the docSet matching the query w/o filtering
            if (qDocSet!=null && filterCache!=null && !qr.isPartialResults()) filterCache.put(cmd.getQuery(),qDocSet);
          } else {
            getDocListNC(qr,cmd);
          }
          assert null != out.docList : "docList is null";
        }
    
        if (null == cmd.getCursorMark()) {
          // Kludge...
          // we can't use DocSlice.subset, even though it should be an identity op
          // because it gets confused by situations where there are lots of matches, but
          // less docs in the slice then were requested, (due to the cursor)
          // so we have to short circuit the call.
          // None of which is really a problem since we can't use caching with
          // cursors anyway, but it still looks weird to have to special case this
          // behavior based on this condition - hence the long explanation.
          superset = out.docList;
          out.docList = superset.subset(cmd.getOffset(),cmd.getLen());
        } else {
          // sanity check our cursor assumptions
          assert null == superset : "cursor: superset isn't null";
          assert 0 == cmd.getOffset() : "cursor: command offset mismatch";
          assert 0 == out.docList.offset() : "cursor: docList offset mismatch";
          assert cmd.getLen() >= supersetMaxDoc : "cursor: superset len mismatch: " +
            cmd.getLen() + " vs " + supersetMaxDoc;
        }
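The rounding to the next resultWindowSize in the cached branch above can be checked in isolation. The helper below mirrors the arithmetic from getDocListC (the method name is ours, not Solr's):

```java
public class WindowSize {
    // Mirrors the rounding in SolrIndexSearcher.getDocListC: bump the requested
    // doc count up to the next multiple of queryResultWindowSize so that
    // consecutive pages hit the same cached superset.
    static int supersetMaxDoc(int maxDocRequested, int queryResultWindowSize) {
        if (maxDocRequested < queryResultWindowSize) {
            return queryResultWindowSize;
        }
        int v = ((maxDocRequested - 1) / queryResultWindowSize + 1) * queryResultWindowSize;
        return v < 0 ? maxDocRequested : v; // overflow guard, as in the original
    }

    public static void main(String[] args) {
        // start=20, rows=40 means 60 docs are needed
        System.out.println(supersetMaxDoc(60, 20)); // prints: 60
        System.out.println(supersetMaxDoc(60, 50)); // prints: 100
        System.out.println(supersetMaxDoc(5, 20));  // prints: 20
    }
}
```

With a window of 50, asking for docs 20..59 caches 100 docs, so a later request for docs 60..99 with the same query is served from the cache without re-running the search.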

    SolrIndexSearcher.getDocListNC(qr, cmd) defines a number of Collector inner classes, but they are not relevant to this chapter, so skip to the code below. Solr first builds a TopDocsCollector, which accumulates every document that matches the query. If the request sets timeAllowed, the query goes through the TimeLimitingCollector branch. TimeLimitingCollector is a Collector wrapper: with timeAllowed set to, say, 200 ms, the search returns within 200 ms of starting to collect results, whether or not collection is complete. Note that the search ultimately calls Lucene's IndexSearcher.search(); this is where we cross into Lucene. Finally, Solr reads the total hit count and the priority queue of top results out of the TopDocsCollector.

    final TopDocsCollector topCollector = buildTopDocsCollector(len, cmd);
    Collector collector = topCollector;
    if (terminateEarly) {
      collector = new EarlyTerminatingCollector(collector, cmd.len);
    }
    if( timeAllowed > 0 ) {
      collector = new TimeLimitingCollector(collector, TimeLimitingCollector.getGlobalCounter(), timeAllowed);
    }
    if (pf.postFilter != null) {
      pf.postFilter.setLastDelegate(collector);
      collector = pf.postFilter;
    }
    try {
      super.search(query, luceneFilter, collector);
      if(collector instanceof DelegatingCollector) {
        ((DelegatingCollector)collector).finish();
      }
    }
    catch( TimeLimitingCollector.TimeExceededException x ) {
      log.warn( "Query: " + query + "; " + x.getMessage() );
      qr.setPartialResults(true);
    }

    totalHits = topCollector.getTotalHits();
    TopDocs topDocs = topCollector.topDocs(0, len);
    populateNextCursorMarkFromTopDocs(qr, cmd, topDocs);

    maxScore = totalHits>0 ? topDocs.getMaxScore() : 0.0f;
    nDocsReturned = topDocs.scoreDocs.length;
    ids = new int[nDocsReturned];
    scores = (cmd.getFlags()&GET_SCORES)!=0 ? new float[nDocsReturned] : null;
    for (int i=0; i<nDocsReturned; i++) {
      ScoreDoc scoreDoc = topDocs.scoreDocs[i];
      ids[i] = scoreDoc.doc;
      if (scores != null) scores[i] = scoreDoc.score;
    }
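TimeLimitingCollector is a decorator: it forwards collect() calls to the wrapped Collector until the time budget expires, then aborts by throwing, which Solr catches to flag the response as partial. A self-contained sketch of that pattern (the names here are simplified inventions of ours; Lucene's real classes track time via a shared counter thread):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the decorator pattern behind TimeLimitingCollector.
interface SimpleCollector {
    void collect(int docId);
}

class TimeExceededException extends RuntimeException {}

class TimeLimitedCollector implements SimpleCollector {
    private final SimpleCollector delegate;
    private final long deadlineNanos;

    TimeLimitedCollector(SimpleCollector delegate, long timeAllowedMillis) {
        this.delegate = delegate;
        this.deadlineNanos = System.nanoTime() + timeAllowedMillis * 1_000_000L;
    }

    @Override
    public void collect(int docId) {
        if (System.nanoTime() > deadlineNanos) {
            throw new TimeExceededException(); // abort collection mid-stream
        }
        delegate.collect(docId);
    }
}

public class PartialResultsDemo {
    public static void main(String[] args) throws InterruptedException {
        List<Integer> hits = new ArrayList<>();
        SimpleCollector collector = new TimeLimitedCollector(hits::add, 50);
        boolean partial = false;
        try {
            for (int doc = 0; doc < 1000; doc++) {
                collector.collect(doc);
                if (doc == 10) Thread.sleep(100); // simulate a slow segment
            }
        } catch (TimeExceededException e) {
            partial = true; // where Solr would call qr.setPartialResults(true)
        }
        System.out.println("partial=" + partial + ", hits=" + hits.size());
        // prints: partial=true, hits=11
    }
}
```

Whatever was collected before the deadline is still returned, which is exactly why a timeAllowed query can come back with fewer hits than actually match.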

    Entering Lucene's IndexSearcher.search(), the code iterates over every segment; each AtomicReaderContext carries that segment's information, including its docBase and document count.

    For each segment, Weight.bulkScorer() reorganizes the query clauses: multiple OR conditions are combined into one, and multiple AND conditions are gathered into a list, which is then ordered by term frequency. Once the clauses are arranged, scorer.score(collector) pulls every matching doc id out of the segment (the details of how ids are retrieved are covered in later articles).

    /**
     * Lower-level search API.
     *
     * <p>
     * {@link Collector#collect(int)} is called for every document. <br>
     *
     * <p>
     * NOTE: this method executes the searches on all given leaves exclusively.
     * To search across all the searchers leaves use {@link #leafContexts}.
     *
     * @param leaves
     *          the searchers leaves to execute the searches on
     * @param weight
     *          to match documents
     * @param collector
     *          to receive hits
     * @throws BooleanQuery.TooManyClauses If a query would exceed
     *         {@link BooleanQuery#getMaxClauseCount()} clauses.
     */
    protected void search(List<AtomicReaderContext> leaves, Weight weight, Collector collector)
        throws IOException {

      // TODO: should we make this
      // threaded...?  the Collector could be sync'd?
      // always use single thread:
      for (AtomicReaderContext ctx : leaves) { // search each subreader
        try {
          collector.setNextReader(ctx);
        } catch (CollectionTerminatedException e) {
          // there is no doc of interest in this reader context
          // continue with the following leaf
          continue;
        }
        BulkScorer scorer = weight.bulkScorer(ctx, !collector.acceptsDocsOutOfOrder(), ctx.reader().getLiveDocs());
        if (scorer != null) {
          try {
            scorer.score(collector);
          } catch (CollectionTerminatedException e) {
            // collection was terminated prematurely
            // continue with the following leaf
          }
        }
      }
    }
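The clause reordering mentioned above is a cost-based optimization: in a conjunction (AND), leading with the clause that matches the fewest documents lets the scorer skip most of the index. A toy model of just the ordering step (this is not Lucene code, and the costs are invented for illustration):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model: each clause knows roughly how many docs it matches (its "cost").
record Clause(String term, long cost) {}

public class LeadWithCheapest {
    public static void main(String[] args) {
        List<Clause> conjunction = new ArrayList<>(List.of(
                new Clause("last_modified:[0 TO 1408454600265]", 500_000),
                new Clause("code:辽*", 1_200),
                new Clause("id:cheng", 1)));
        // Order as a conjunction scorer would: the cheapest clause drives iteration.
        conjunction.sort(Comparator.comparingLong(Clause::cost));
        System.out.println(conjunction.get(0).term()); // prints: id:cheng
    }
}
```

In the example query, id:cheng matches at most one document, so the scorer advances on it and merely verifies the other two clauses at each candidate instead of scanning the huge range clause.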

    At this point all the matching doc ids have been collected, but the query result needs to display all the fields, so Solr will later go back to the segments and fetch every stored field for each doc id; where exactly that happens will be described in detail in a later article.
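That two-phase shape, collecting matching doc ids first and only then materializing stored fields for the page being returned, can be sketched as follows (purely illustrative data, not Solr code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TwoPhaseFetch {
    public static void main(String[] args) {
        // Phase 1 (execute()/search()): only the matching internal doc ids.
        int[] ids = {7, 3, 42};

        // Stand-in for a segment's stored-field store.
        Map<Integer, Map<String, String>> stored = Map.of(
                3, Map.of("id", "cheng", "code", "辽A"),
                7, Map.of("id", "wang", "code", "辽B"),
                42, Map.of("id", "li", "code", "辽C"));

        // Phase 2 (writeResponse()): fetch stored fields only for returned ids.
        List<Map<String, String>> page = new ArrayList<>();
        for (int id : ids) {
            page.add(stored.get(id));
        }
        System.out.println(page.get(0).get("id")); // prints: wang
    }
}
```

Deferring the field fetch keeps the hot search loop working with plain ints; only the handful of docs in the final page pay the cost of reading stored fields.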

    Summary: Solr's query path is fairly convoluted, and there is plenty of room for optimization. This article only outlined the query flow; the details of each step will be elaborated in subsequent articles.

    Please credit the source when reposting: http://www.cnblogs.com/rcfeng/
    Original post: https://www.cnblogs.com/rcfeng/p/3923534.html