  • Using the Solr Suggest Component

    The main reason to use suggest instead of a regular search is speed. In general, we need the autosuggest feature to satisfy two main requirements:
     
    ■ It must be fast; there are few things that are more annoying than a clunky type-ahead solution that cannot keep up with users as they type. The Suggester must be able to update the suggestions as the user types each character, so milliseconds matter.
    ■ It should return ranked suggestions ordered by term frequency, as there is little benefit to suggesting rare terms that occur in only a few documents in your index, especially when the user has typed only a few characters.
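     
    The two requirements can be illustrated with a toy in-memory suggester. This is a minimal sketch using only the JDK, not how Lucene or Solr implement suggestions: the class name PrefixSuggestSketch, the sample terms and their weights are all made up for illustration. A sorted map gives a cheap prefix range scan, and the hits are then ordered by weight, highest first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy suggester: prefix lookup on a sorted map, ranked by weight (illustration only). */
final class PrefixSuggestSketch {
    // term -> weight (for example, the number of documents containing the term)
    private final TreeMap<String, Long> terms = new TreeMap<>();

    void add(String term, long weight) {
        terms.put(term, weight);
    }

    /** Returns up to num terms that start with prefix, highest weight first. */
    List<String> lookup(String prefix, int num) {
        // For ordinary text, every key in [prefix, prefix + '\uffff') starts with prefix.
        Map<String, Long> range =
            terms.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
        List<Map.Entry<String, Long>> hits = new ArrayList<>(range.entrySet());
        hits.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : hits) {
            if (out.size() == num) {
                break;
            }
            out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixSuggestSketch s = new PrefixSuggestSketch();
        s.add("solr", 120);
        s.add("solaris", 8);
        s.add("solid", 45);
        s.add("search", 300);
        System.out.println(s.lookup("sol", 2)); // prints [solr, solid]
    }
}
```

    The rare term "solaris" is dropped first when only two suggestions are requested, which is the behaviour the second requirement asks for.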
     
    Lucene Suggest
     
     
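    This part goes through the relevant source code of the AnalyzingInfixSuggester class and builds a test case to help understand the overall process. The suggester builds its own index internally; the main fields AnalyzingInfixSuggester writes to that index are: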
     
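    • text: the search keyword field; the user's input is matched against it. It is indexed as an analyzed text field, and buildDocument also keeps the original text in a BinaryDocValuesField of the same name;
    • exacttext: the same value as text, but indexed as a StringField (not analyzed) and not stored;
    • contexts: also used for filtering, but as a secondary filter condition;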
     
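    The index is first built from an InputIterator; the example uses a hand-written one. The InputIterator interface determines where the data for the suggest index comes from, and the value of every default field in that index has to be supplied by the user. The following concepts are involved: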
     
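    • key: the text to search on; the terms produced by analyzing the user's input are matched against this field;
    • contexts: a set of terms that are matched against the contexts field with a TermQuery. They add a restriction on top of the keyword so that the returned suggestions fit better, for example a category or group (searching for "shirt" within menswear only);
    • weight: a numeric value (a long); results are sorted by it in descending order;
    • payload: arbitrary extra data stored as a BytesRef (in effect a byte[] written to the index). After a lookup it is available through the payload property of the LookupResult and can be deserialized;
    • allTermsRequired: a lookup-time flag that controls whether every keyword the user typed has to match.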
     
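    A LookupResult contains the following:
    • key: the suggestion text, i.e. the key supplied through the InputIterator, returned to you as-is;
    • highlightKey: the key with highlighting applied, if highlighting was requested for the lookup;
    • value: the return value of the InputIterator's weight method, i.e. the weight of the suggestion; results are ordered by it;
    • payload: the payload returned by the InputIterator's payload method; it exists so that you can store whatever extra information you need;
    • contexts: likewise, the return value of the InputIterator's contexts method, returned unchanged.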
     
    Building the suggest index
     
    The Lucene suggester source shows that the suggester holds a SearcherManager and an IndexWriter internally. The index is built as follows:
     
    @Override
      public void build(InputIterator iter) throws IOException {
    
        if (searcherMgr != null) {
          searcherMgr.close();
          searcherMgr = null;
        }
    
        if (writer != null) {
          writer.close();
          writer = null;
        }
    
        boolean success = false;
        try {
          // First pass: build a temporary normal Lucene index,
          // just indexing the suggestions as they iterate:
          writer = new IndexWriter(dir,
                                   getIndexWriterConfig(getGramAnalyzer(), IndexWriterConfig.OpenMode.CREATE));
          //long t0 = System.nanoTime();
    
          // TODO: use threads?
          BytesRef text;
          while ((text = iter.next()) != null) {
            BytesRef payload;
            if (iter.hasPayloads()) {
              payload = iter.payload();
            } else {
              payload = null;
            }
    
            add(text, iter.contexts(), iter.weight(), payload);
          }
          // ... (the rest of build() is omitted in this excerpt)

    public void add(BytesRef text, Set<BytesRef> contexts, long weight, BytesRef payload) throws IOException {
        ensureOpen();
        writer.addDocument(buildDocument(text, contexts, weight, payload));
      }
    
     
     
    The key part is buildDocument: each suggestion is turned into an internal Document and written to the index:
     
    private Document buildDocument(BytesRef text, Set<BytesRef> contexts, long weight, BytesRef payload) throws IOException {
        String textString = text.utf8ToString();
        Document doc = new Document();
        FieldType ft = getTextFieldType();
        doc.add(new Field(TEXT_FIELD_NAME, textString, ft));
        doc.add(new Field("textgrams", textString, ft));
        doc.add(new StringField(EXACT_TEXT_FIELD_NAME, textString, Field.Store.NO));
        doc.add(new BinaryDocValuesField(TEXT_FIELD_NAME, text));
        doc.add(new NumericDocValuesField("weight", weight));
        if (payload != null) {
          doc.add(new BinaryDocValuesField("payloads", payload));
        }
        if (contexts != null) {
          for(BytesRef context : contexts) {
            doc.add(new StringField(CONTEXTS_FIELD_NAME, context, Field.Store.NO));
            doc.add(new SortedSetDocValuesField(CONTEXTS_FIELD_NAME, context));
          }
        }
        return doc;
      }
    
     
    Suggest lookup
     
    A suggest query goes through the lookup method. The SORT used during the lookup is defined on the weight field, in descending order:
     
    private static final Sort SORT = new Sort(new SortField("weight", SortField.Type.LONG, true));
    
     
    A fairly large BooleanQuery is built; how its clauses are combined depends on the allTermsRequired flag:
    if (allTermsRequired) {
          occur = BooleanClause.Occur.MUST;
        } else {
          occur = BooleanClause.Occur.SHOULD;
        }
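     
    The effect of the two Occur values can be modelled without Lucene. The sketch below is a simplified stand-in, assuming each document is just a set of terms; the class and method names are made up for illustration. MUST means every query term has to be present, while a query made only of SHOULD clauses matches when at least one term is present.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of MUST vs SHOULD clause semantics (not Lucene code). */
final class OccurSketch {
    /**
     * Returns true if docTerms satisfies the query terms:
     * all of them when allTermsRequired is true (MUST),
     * at least one of them otherwise (SHOULD).
     */
    static boolean matches(Set<String> docTerms, List<String> queryTerms,
                           boolean allTermsRequired) {
        if (allTermsRequired) {
            return docTerms.containsAll(queryTerms); // every clause behaves like MUST
        }
        for (String term : queryTerms) {             // every clause behaves like SHOULD
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("red", "cotton", "shirt"));
        List<String> query = Arrays.asList("red", "silk");
        System.out.println(matches(doc, query, true));  // prints false
        System.out.println(matches(doc, query, false)); // prints true
    }
}
```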
     
     
    The queryAnalyzer tokenizes the input, and each token is added to the final query as a single TermQuery. Note that these terms all target the text field:
     
    try (TokenStream ts = queryAnalyzer.tokenStream("", new StringReader(key.toString()))) {
          //long t0 = System.currentTimeMillis();
          ts.reset();
          final CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
          final OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
          String lastToken = null;
          query = new BooleanQuery.Builder();
          int maxEndOffset = -1;
          matchedTokens = new HashSet<>();
          while (ts.incrementToken()) {
            if (lastToken != null) {  
              matchedTokens.add(lastToken);
              query.add(new TermQuery(new Term(TEXT_FIELD_NAME, lastToken)), occur);
            }
            lastToken = termAtt.toString();
            if (lastToken != null) {
              maxEndOffset = Math.max(maxEndOffset, offsetAtt.endOffset());
            }
          }
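     
    The loop above adds every token except the last one as an exact TermQuery and keeps the last one in lastToken. As far as I can tell from the Lucene source, the code that follows (not shown in this excerpt) treats that last token as a prefix unless the input ended with discarded characters such as a space; take that as an assumption. The JDK-only sketch below models the split, using whitespace splitting as a stand-in for the analyzer; all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Models how typed input is split into exact tokens plus one trailing token. */
final class LastTokenSketch {
    final List<String> exactTokens = new ArrayList<>(); // matched as whole terms
    String lastToken;                                   // null when the input has no tokens
    boolean lastIsPrefix;                               // true when the user is still typing it

    static LastTokenSketch split(String input) {
        LastTokenSketch result = new LastTokenSketch();
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length - 1; i++) {
            result.exactTokens.add(tokens[i]);
        }
        result.lastToken = tokens[tokens.length - 1];
        // A trailing space means the last word is complete, so it is matched exactly.
        result.lastIsPrefix = !Character.isWhitespace(input.charAt(input.length() - 1));
        return result;
    }

    public static void main(String[] args) {
        LastTokenSketch typing = split("red shi");
        System.out.println(typing.exactTokens + " + " + typing.lastToken
            + " (prefix=" + typing.lastIsPrefix + ")"); // prints [red] + shi (prefix=true)
    }
}
```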
     
     
    In our example, looking up with contexts requires converting the region string into a set of BytesRef:
     
    Set<BytesRef> contexts = new HashSet<>();
            contexts.add(new BytesRef(region.getBytes("UTF8")));
            List<Lookup.LookupResult> results = suggester.lookup(name, contexts, 2, true, false);
     
     
    That covers the basic flow of the Lucene suggest component.
     
     
    Solr Suggest component
     
    For how the suggest component is defined and used in Solr, see: https://cwiki.apache.org/confluence/display/solr/Suggester
     
    First, define a searchComponent that provides the suggest functionality:
     
    <searchComponent name="suggest" class="solr.SuggestComponent">
        <lst name="suggester">
          <str name="name">default</str>
          <str name="lookupImpl">FuzzyLookupFactory</str>      
          <str name="dictionaryImpl">DocumentDictionaryFactory</str>
          <str name="field">suggest</str>
          <str name="weightField"></str>
          <str name="suggestAnalyzerFieldType">string</str>
          <str name="buildOnStartup">false</str>
        </lst>
      </searchComponent>
     
     
    A class diagram of the suggest components used here helps in understanding the overall process:
    [class diagram image not preserved]
     


     
     
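    A LookupFactory creates a Lucene suggester (Lookup) from the current SolrCore and the configuration. The InputIterator we use is provided by a Dictionary class. Both kinds of class have corresponding factory classes.
     
    Depending on our needs, we can choose different Suggester (Lookup) classes and combine them with a suitable Dictionary to produce the suggestions.
     
    The /suggest endpoint also has to be declared as a requestHandler so that it can respond to HTTP GET requests: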
     
      
    <requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler" 
                      startup="lazy" >
        <lst name="defaults">
          <str name="suggest">true</str>
          <str name="suggest.count">10</str>
        </lst>
        <arr name="components">
          <str>suggest</str>
        </arr>
      </requestHandler>
     
     
    To try out the different Suggester types, we can add test cases locally.
     
    In AnalyzingInfixSuggester, the InputIterator is consumed as follows:
     
    writer = new IndexWriter(dir,
                                   getIndexWriterConfig(getGramAnalyzer(), IndexWriterConfig.OpenMode.CREATE));
          BytesRef text;
          while ((text = iter.next()) != null) {
            BytesRef payload;
            if (iter.hasPayloads()) {
              payload = iter.payload();
            } else {
              payload = null;
            }
    
            add(text, iter.contexts(), iter.weight(), payload);
          }
     
     
     
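    A fieldType can have two analyzers, index and query, both configured in the fieldType. The main difference between the string and text types is analysis: a string value is not analyzed and is treated as a single token, while a text value is analyzed.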
     
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <!-- in this example, we will only use synonyms at query time
            <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
            -->
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>
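     
    The difference between a string field and a text field can be sketched in plain Java. This is a rough model only: splitting on non-alphanumeric characters and lowercasing approximates the tokenizer and LowerCaseFilter above, it ignores stop words and synonyms, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Rough model of "string" (no analysis) versus "text" (tokenize + lowercase). */
final class AnalysisSketch {
    /** A string field keeps the whole value as one token. */
    static List<String> asString(String value) {
        return Collections.singletonList(value);
    }

    /** A text field is split into tokens, and each token is lowercased. */
    static List<String> asText(String value) {
        List<String> tokens = new ArrayList<>();
        for (String token : value.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(asString("Solr Suggest Component")); // prints [Solr Suggest Component]
        System.out.println(asText("Solr Suggest Component"));   // prints [solr, suggest, component]
    }
}
```

    With the string type a suggestion only matches from the start of the whole value, whereas the text type lets individual words match.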
    
     
     
    Example use case
     
    Suppose we have a table of brand keywords and want to find a brand name by its pinyin. The following data import entity loads the table into Solr:
     
     <entity name="gt_brand" query="
    select brand_id, brand_name, brand_pinyin, brand_name_second, sort from gt_goods_brand
    " >
            <field column="brand_id" name="id"/>
            <field column="brand_name" name="brand_name"/>
            <field column="brand_pinyin" name="brand_pinyin"/>
            <field column="brand_name_second" name="brand_name_second"/>
            <field column="sort" name="sort"/>
        </entity>
     
     
    Here brand_pinyin is the key, sort is the weight, and brand_name is the text actually shown for a match:
     
    // Caution: build() recreates the index in the suggester's directory (OpenMode.CREATE,
    // see build() above); in real use give the suggester a directory separate from the source index.
    Directory indexDir = FSDirectory.open(Paths.get("/Users/xxx/develop/tools/solr-5.5.0/server/solr/suggest/data/index"));
    StandardAnalyzer analyzer = new StandardAnalyzer();
    AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(indexDir, analyzer);

    // Read key (brand_pinyin), weight (sort) and payload (brand_name) from the existing index
    DirectoryReader directoryReader = DirectoryReader.open(indexDir);
    DocumentDictionary documentDictionary = new DocumentDictionary(directoryReader, "brand_pinyin", "sort", "brand_name");
    suggester.build(documentDictionary.getEntryIterator());

    List<Lookup.LookupResult> results = suggester.lookup("nijiazhubao", 5, false, false);
    for (Lookup.LookupResult lookupResult : results) {
        // lookupResult.key is the matched pinyin, lookupResult.value its weight
        // utf8ToString() respects the BytesRef offset and length
        System.out.println(lookupResult.payload.utf8ToString());
    }
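     
    When decoding a payload, keep in mind that a BytesRef is a slice (bytes, offset, length) into a backing array that may be larger than the value itself; Lucene's BytesRef.utf8ToString() takes the offset and length into account. The sketch below models this with a hypothetical Slice class and made-up sample data, using only the JDK, to show how decoding the whole backing array can pick up unrelated bytes.

```java
import java.nio.charset.StandardCharsets;

/** Shows why a (bytes, offset, length) slice must be decoded with its offset and length. */
final class SliceDecodeSketch {
    /** Minimal stand-in for a byte slice; not Lucene's BytesRef. */
    static final class Slice {
        final byte[] bytes;
        final int offset;
        final int length;

        Slice(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Decodes the whole backing array, ignoring offset and length (the risky variant). */
    static String decodeWholeArray(Slice slice) {
        return new String(slice.bytes, StandardCharsets.UTF_8);
    }

    /** Decodes only the bytes that belong to the slice. */
    static String decodeSlice(Slice slice) {
        return new String(slice.bytes, slice.offset, slice.length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two values packed into one shared buffer; the slice covers only the second one.
        byte[] shared = "GucciPrada".getBytes(StandardCharsets.UTF_8);
        Slice second = new Slice(shared, 5, 5);
        System.out.println(decodeWholeArray(second)); // prints GucciPrada
        System.out.println(decodeSlice(second));      // prints Prada
    }
}
```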
     
     
    The matching suggester settings in the Solr configuration:
     
    <str name="field">brand_pinyin</str>
          <str name="weightField">sort</str>
          <str name="payloadField">brand_name</str>
          <str name="suggestAnalyzerFieldType">string</str>
          <str name="buildOnStartup">true</str>
    
     
    Note that the field being suggested on must have the matching analyzers (index and query) configured, otherwise no suggestions come back.
     


     
     
    Suggesting from two fields
     
     
    I tried defining several searchComponents, since a SearchHandler can list more than one searchComponent name, but this did not work:
      
    <searchComponent name="suggest" class="solr.SuggestComponent">
        <lst name="suggester">
          <str name="name">default</str>
          <str name="lookupImpl">FuzzyLookupFactory</str>      <!-- org.apache.solr.spelling.suggest.fst -->
          <str name="dictionaryImpl">DocumentDictionaryFactory</str>     <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory --> 
          <str name="field">category_name</str>
          <str name="weightField"></str>
          <str name="suggestAnalyzerFieldType">string</str>
        </lst>
      </searchComponent>
    
      <searchComponent name="suggest1" class="solr.SuggestComponent">
       <lst name="suggester">
          <str name="name">default</str>
          <str name="lookupImpl">FuzzyLookupFactory</str>      <!-- org.apache.solr.spelling.suggest.fst -->
          <str name="dictionaryImpl">DocumentDictionaryFactory</str>     <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory --> 
          <str name="field">brand_name</str>
          <str name="weightField"></str>
          <str name="suggestAnalyzerFieldType">string</str>
        </lst>
      </searchComponent>
    
      <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
        <lst name="defaults">
          <str name="suggest">true</str>
          <str name="suggest.count">5</str>
        </lst>
        <arr name="components">
          <str>suggest</str>
          <str>suggest1</str>
        </arr>
      </requestHandler>
    The following error occurred:
     
    suggest: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /Users/xxx/develop/tools/solr-5.5.0/server/solr/suggest/data/analyzingInfixSuggesterIndexDir/write.lock
    
      
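    This is in fact caused by indexPath: when several suggesters are configured, each one needs its own index directory (at least when AnalyzingInfixLookupFactory is used). The source shows that indexPath may be given relative to the core's data directory: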
     
    String indexPath = params.get(INDEX_PATH) != null
        ? params.get(INDEX_PATH).toString()
        : DEFAULT_INDEX_PATH;
        if (new File(indexPath).isAbsolute() == false) {
          indexPath = core.getDataDir() + File.separator + indexPath;
        }
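     
    The resolution logic above can be reproduced with the JDK alone to see why two suggesters without an explicit indexPath end up in the same directory. In this sketch the method name, the data directory and the "brandSuggestIndex" value are hypothetical; the default directory name is the one that appears in the lock error above.

```java
import java.io.File;

/** Reproduces the indexPath resolution: relative paths are placed under the data dir. */
final class IndexPathSketch {
    static String resolve(String configured, String defaultPath, String dataDir) {
        String indexPath = configured != null ? configured : defaultPath;
        if (!new File(indexPath).isAbsolute()) {
            indexPath = dataDir + File.separator + indexPath;
        }
        return indexPath;
    }

    public static void main(String[] args) {
        String dataDir = "/var/solr/data/suggest/data";            // hypothetical data dir
        String defaultPath = "analyzingInfixSuggesterIndexDir";
        // Two suggesters without indexPath resolve to the same directory (lock conflict).
        System.out.println(resolve(null, defaultPath, dataDir));
        System.out.println(resolve(null, defaultPath, dataDir));
        // Giving one of them its own relative indexPath separates the directories.
        System.out.println(resolve("brandSuggestIndex", defaultPath, dataDir));
    }
}
```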
     
     
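    However, after adding <str name="indexPath">xxx</str> the exception went away, but the lookup still did not work. The remaining option was to copy the several fields into one field and run suggest against that single field; see: http://stackoverflow.com/questions/7712606/solr-suggester-multiple-field-autocomplete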
     
    https://issues.apache.org/jira/browse/SOLR-5529 also proposes a solution, but I did not get it to work.
     
     
  • Original article: https://www.cnblogs.com/mmaa/p/5789863.html