  • Elasticsearch Analyzers

    Concepts:

      A full-text search engine analyzes every document to be indexed with some algorithm and extracts a number of Tokens from it; these algorithms are called Tokenizers. The tokens are then processed further, for example lowercased; these processing algorithms are called Token Filters. Each processed result is called a Term, and the number of times a document contains a given term is that term's Frequency. The engine builds an Inverted Index from terms to the original documents, so a source document can be found quickly from any of its terms. Before the text reaches the tokenizer it may need preprocessing, such as stripping HTML markup; these algorithms are called Character Filters. The whole analysis pipeline is called an Analyzer.
      An analyzer is a combination of three kinds of components executed in order (0..N Character Filters, exactly 1 Tokenizer, 0..N Token Filters):
        1) Character Filters preprocess the raw text, e.g. replacing every "&" with "and" or deleting every "?".
        2) The Tokenizer splits the text into tokens; e.g. "tom is a good doctor" becomes "tom", "is", "a", "good", "doctor".
        3) Token Filters turn the tokens into terms; e.g. a 1-gram filter could turn "tom" into "t", "o", "m". The resulting set of terms is what actually gets indexed (see the sketch below).
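
    A minimal sketch of the pipeline using the _analyze API (request-body syntax of recent ES versions; the sample text is made up):

    GET /_analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "Tom is a GOOD doctor"
    }

    The response lists the terms tom, is, a, good, doctor together with each term's position and offsets, which makes _analyze the quickest way to see what an analyzer actually emits.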

      

     

     The built-in analyzers and tokenizers are documented at: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

    An introduction to commonly used tokenizers: http://jingyan.baidu.com/article/cbcede071e1b9e02f40b4d19.html

    http://blog.csdn.net/i6448038/article/details/51614220

    http://blog.csdn.net/i6448038/article/details/51509439

    1 ES built-in analyzers

     

    analyzer            | logical name | description
    standard analyzer   | standard     | standard tokenizer, standard filter, lower case filter, stop filter
    simple analyzer     | simple       | lower case tokenizer
    stop analyzer       | stop         | lower case tokenizer, stop filter
    keyword analyzer    | keyword      | no tokenization; the whole input becomes one token (not_analyzed)
    whitespace analyzer | whitespace   | splits on whitespace
    pattern analyzer    | pattern      | regular-expression tokenization; splits on \W+ by default
    language analyzers  | lang         | one analyzer per supported language
    snowball analyzer   | snowball     | standard tokenizer, standard filter, lower case filter, stop filter, snowball filter
    custom analyzer     | custom       | one Tokenizer, zero or more Token Filters, zero or more Char Filters
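
    The behavioral differences are easy to see by analyzing the same text with each; for example, standard tokenizes and lowercases while keyword keeps the input whole (sample text made up):

    GET /_analyze
    { "analyzer": "standard", "text": "Quick Brown-Foxes" }
    --terms: quick, brown, foxes

    GET /_analyze
    { "analyzer": "keyword", "text": "Quick Brown-Foxes" }
    --a single term: Quick Brown-Foxes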

    2 ES built-in tokenizers

    tokenizer                | logical name   | description
    standard tokenizer       | standard       |
    edge ngram tokenizer     | edgeNGram      |
    keyword tokenizer        | keyword        | no tokenization; emits the whole input as one token
    letter tokenizer         | letter         | splits on non-letter characters
    lowercase tokenizer      | lowercase      | letter tokenizer plus lower case filter
    ngram tokenizer          | nGram          |
    whitespace tokenizer     | whitespace     | splits on whitespace
    pattern tokenizer        | pattern        | splits on a configurable regular expression
    uax email url tokenizer  | uax_url_email  | like standard, but does not split URLs and email addresses
    path hierarchy tokenizer | path_hierarchy | handles path-like strings such as /path/to/something
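
    For instance, path_hierarchy emits every ancestor path as a separate token, which is what makes it useful for file-system-like fields:

    GET /_analyze
    { "tokenizer": "path_hierarchy", "text": "/path/to/something" }
    --terms: /path, /path/to, /path/to/something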

    3 ES built-in token filters

    token filter             | logical name     | description
    standard filter          | standard         |
    ascii folding filter     | asciifolding     |
    length filter            | length           | removes tokens that are too long or too short
    lowercase filter         | lowercase        | lowercases tokens
    ngram filter             | nGram            |
    edge ngram filter        | edgeNGram        |
    porter stem filter       | porterStem       | Porter stemming algorithm
    shingle filter           | shingle          | builds combinations of adjacent tokens (token n-grams)
    stop filter              | stop             | removes stop words
    word delimiter filter    | word_delimiter   | splits a token into sub-words
    stemmer token filter     | stemmer          |
    stemmer override filter  | stemmer_override |
    keyword marker filter    | keyword_marker   |
    keyword repeat filter    | keyword_repeat   |
    kstem filter             | kstem            |
    snowball filter          | snowball         |
    phonetic filter          | phonetic         | provided by a plugin
    synonym filter           | synonym          | handles synonyms
    compound word filter     | dictionary_decompounder, hyphenation_decompounder | decomposes compound words
    reverse filter           | reverse          | reverses each token
    elision filter           | elision          | removes elisions (e.g. l'avion -> avion)
    truncate filter          | truncate         | truncates tokens to a fixed length
    unique filter            | unique           |
    pattern capture filter   | pattern_capture  |
    pattern replace filter   | pattern_replace  | replaces token text via a regular expression
    trim filter              | trim             | trims surrounding whitespace
    limit token count filter | limit            | limits the number of tokens
    hunspell filter          | hunspell         | dictionary stemming based on Hunspell dictionaries
    common grams filter      | common_grams     |
    normalization filter     | arabic_normalization, persian_normalization |
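
    The camelCase logical names (nGram, edgeNGram) are the old spellings; ES 6+ only accepts ngram and edge_ngram. Recent versions also let _analyze define a filter inline, which is a convenient way to try the n-gram filters used later in this post:

    GET /_analyze
    {
      "tokenizer": "keyword",
      "filter": [ { "type": "edge_ngram", "min_gram": 2, "max_gram": 5 } ],
      "text": "doctor"
    }
    --terms: do, doc, doct, docto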

    4 ES built-in character filters

    character filter            | logical name    | description
    mapping char filter         | mapping         | replaces characters according to a configured mapping
    html strip char filter      | html_strip      | strips HTML elements
    pattern replace char filter | pattern_replace | rewrites characters via a regular expression
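
    For example, html_strip removes the markup before the tokenizer ever sees it (sample text made up):

    GET /_analyze
    {
      "tokenizer": "standard",
      "char_filter": ["html_strip"],
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }
    --terms: I'm, so, happy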

     

    Commands for defining a custom pinyin analyzer:

    /test  --create the index  POST
    /test/_settings   --update the index settings  PUT
    {    --settings body
      "index": {
        "analysis": {
          "analyzer": {   --custom analyzer
            "pinyinanalyzer": {
              "tokenizer": "lishuai_pinyin", --uses the custom tokenizer defined below
              "filter": [  --token filters to apply
                "lowercase",
                "mynGramFilter" --the custom filter defined below
              ]
            }
          },
          "tokenizer": {  --custom tokenizer
            "lishuai_pinyin": {
              "type": "pinyin", --the tokenizer name registered by the pinyin plugin
              "first_letter": "prefix", --emit the first-letter form as a prefix
              "padding_char": "" 
            }
          },
          "filter": { --custom token filter
            "mynGramFilter": { --if a token is at least min_gram long it is split into grams of every length from min_gram up to max_gram
              "type": "nGram",
              "min_gram": "2", --tokens shorter than 2 characters are not split
              "max_gram": "5"  --the longest gram produced is 5 characters
            }
          }
        }
      }
    }
    "padding_char": " "  --after tokenization, tokens are separated by the given character
    "first_letter": "prefix"  --whether to emit the first letter of each character; if configured, 刘德华 is tokenized into ldh liu de hua
    
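    Once the settings are applied, the custom analyzer can be verified directly against the index. This assumes the elasticsearch-analysis-pinyin plugin is installed; the exact tokens depend on the plugin version:

    GET /test/_analyze
    { "analyzer": "pinyinanalyzer", "text": "刘德华" }
    --expect terms such as ldh, liu, de, hua plus their 2..5-character nGram combinations
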
    Set the mapping: POST
    /test/product/_mapping  --define the mapping for a type of the index
    {
      "product": {
        "properties": {
          "name": {
            "type": "string",
            "store": "yes",
            "index": "analyzed",
            "term_vector": "with_positions_offsets",  --additionally index each term's position and offsets
            "analyzer": "pinyinanalyzer"
          }
        }
      }
    }
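
    With the mapping in place, an indexed document goes through pinyinanalyzer, and a match query on the same field is analyzed with the same analyzer, so pinyin input can hit Chinese content. A hypothetical round trip (document id and content made up; recall depends on the plugin):

    PUT /test/product/1
    { "name": "刘德华" }

    GET /test/product/_search
    { "query": { "match": { "name": "刘德华" } } }
    --a pinyin form of the query should match as well, since the index and query sides share pinyinanalyzer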
     

    .NET NEST implementation:

    1) The ES mapping model and the custom tokenizer/filter classes

    [ElasticsearchType(Name = "associationtype")]
    public class AssociationInfo
    {
        // Analyzed with the custom pinyin analyzer; term vectors with positions and offsets are stored
        [String(Store = true, Index = FieldIndexOption.Analyzed, TermVector = TermVectorOption.WithPositionsOffsets, Analyzer = "mypinyinanalyzer")]
        public string keywords { get; set; }

        [String(Store = true, Index = FieldIndexOption.NotAnalyzed)]
        public string webSite { get; set; }
    }

    /// <summary>
    /// Custom pinyin tokenizer
    /// </summary>
    public class PinYinTokenizer : ITokenizer
    {
        public string Type
        {
            get { return "pinyin"; } // the tokenizer name registered by the pinyin plugin
        }

        public string Version
        {
            get { return string.Empty; }
            set { throw new NotImplementedException(); }
        }

        public string first_letter
        {
            get { return "prefix"; } // enable first-letter prefix matching
        }

        public string padding_char
        {
            get { return " "; }
        }

        public bool keep_full_pinyin
        {
            get { return true; }
        }
    }

    /// <summary>
    /// Custom pinyin token filter
    /// </summary>
    public class PinYinTF : ITokenFilter
    {
        /// <summary>
        /// n-gram filter
        /// </summary>
        public string Type
        {
            get { return "nGram"; }
        }

        public string Version
        {
            get { return ""; }
            set { throw new NotImplementedException(); }
        }

        public int min_gram
        {
            get { return 2; }
        }

        public int max_gram
        {
            get { return 5; }
        }
    }

    2) Creating the index

        /// <summary>
        /// Create the index
        /// </summary>
        public void CreateIndex()
        {
            // Built-in filters could be used instead:
            //List<string> filters = new List<string>() { "word_delimiter" };    // word_delimiter splits a token into sub-words; nGram is the n-gram filter
            // The custom tokenizer and filter defined above
            var pinYinTokenizer = new PinYinTokenizer();
            var pinYinTF = new PinYinTF();
            // A custom analyzer needs exactly 1 tokenizer, 0..n token filters and 0..n char filters
            var create = new CreateIndexDescriptor("shuaiindex").Settings(s => s.Analysis(a => // define a custom analyzer
                a.Tokenizers(ts => ts.UserDefined("pinyintoken", pinYinTokenizer))  // 1. register the user-defined pinyin tokenizer; user-defined TokenFilters and CharFilters are registered the same way
                 .TokenFilters(tf => tf.UserDefined("pinyintf", pinYinTF))
                 .Analyzers(c => c.Custom("mypinyinanalyzer", f =>  // 2. name the custom analyzer
                     f.Tokenizer("pinyintoken").Filters("pinyintf"))) // 3. wire it to the custom pinyin tokenizer and the custom token filter
                ))
                // apply the mapping
                .Mappings(map => map.Map<AssociationInfo>(m => m.AutoMap()));

            var client = ElasticSearchCommon.GetInstance().GetElasticClient();
            var rs = client.CreateIndex(create);
            client.IndexMany(CreateAssociationInfo(), "shuaiindex", "associationtype");
        }

        public List<AssociationInfo> CreateAssociationInfo()
        {
            List<AssociationInfo> AssociationInfos = new List<AssociationInfo>();
            AssociationInfos.Add(new AssociationInfo() { keywords = "牛奶牛肉", webSite = "1" });
            AssociationInfos.Add(new AssociationInfo() { keywords = "小牛", webSite = "1" });
            AssociationInfos.Add(new AssociationInfo() { keywords = "果苹", webSite = "1" });
            AssociationInfos.Add(new AssociationInfo() { keywords = "牛肉", webSite = "1" });
            AssociationInfos.Add(new AssociationInfo() { keywords = "牛肉干", webSite = "1" });
            AssociationInfos.Add(new AssociationInfo() { keywords = "牛奶", webSite = "1" });
            AssociationInfos.Add(new AssociationInfo() { keywords = "肥牛", webSite = "1" });
            AssociationInfos.Add(new AssociationInfo() { keywords = "牛头", webSite = "1" });
            AssociationInfos.Add(new AssociationInfo() { keywords = "苹果", webSite = "1" });
            AssociationInfos.Add(new AssociationInfo() { keywords = "车厘子", webSite = "1" });
            return AssociationInfos;
        }

    3) Querying

        public List<AssociationInfo> query(string key)
        {
            QueryContainer query = new QueryContainer();
            //query = Query<AssociationInfo>.Prefix(s => s.Field(f => f.keywords).Value(key));
            query = Query<AssociationInfo>.Term("keywords", key); // term does not analyze the input
            // match analyzes the query string and matches each resulting term
            //query = Query<AssociationInfo>.Match(m => m
            //        .Field(p => p.keywords)
            //        .Query(key)
            //        );
            var client = ElasticSearchCommon.GetInstance().GetElasticClient();

            SearchRequest request = new SearchRequest("shuaiindex", "associationtype");
            request.Query = query;
            //request.Analyzer = "mypinyinanalyzer";
            ISearchResponse<AssociationInfo> response = client.Search<AssociationInfo>(request);
            List<AssociationInfo> re = new List<AssociationInfo>();

            if (response.Documents.Count() > 0)
            {
                re = response.Documents.ToList();
            }
            return re;
        }
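
    The commented-out alternatives are worth noting: Term bypasses analysis and matches one indexed token exactly, Prefix matches indexed tokens by prefix, and Match first runs the query string through the field's analyzer. The raw query DSL behind the Match branch would look roughly like this (query text taken from the sample data):

    GET /shuaiindex/associationtype/_search
    {
      "query": { "match": { "keywords": "牛奶" } }
    }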
  • Original post: https://www.cnblogs.com/shaner/p/6340925.html