  • Installing Elasticsearch plugins: the pinyin plugin

    /**
     * CentOS 7.2 under VMware 12 (vm12)
     * elasticsearch 5.2.2
     */

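    When searching for products on Taobao, you may notice that Chinese characters, pinyin, or a mix of pinyin and Chinese characters all return the results you want. I looked into it today, and this kind of search is implemented with a pinyin search plugin: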

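    1) Installing ik was covered in an earlier post, so I won't repeat it here.

    2) Installing the plugin on ES 2.4 is very simple and much like ik: at the end, add the analysis configuration to elasticsearch.yml. I won't go into that either.

    Original blog post: http://blog.csdn.net/hhl2046/article/details/53319637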

    index:  
      analysis:  
        analyzer:  
          ik:  
            alias: [news_analyzer_ik,ik_analyzer]  
            type: org.elasticsearch.index.analysis.IkAnalyzerProvider  
          ik_analyzer_pinyin:        # analyzer name
            type: custom            # custom means a user-defined analyzer
            tokenizer: ik            # the component that splits text into tokens: ik
            filter: [synonym_test_filter,pinyin_mcl]  # post-process the tokens: synonyms and pinyin
        filter:  
          synonym_test_filter:  
            type: synonym_filter  
            synonyms_path: synonym.txt  
            dynamic_reload: true  
            reload_interval: 10s  
            expand: true  
          pinyin_mcl:  
            type: pinyin  
            first_letter: none  
            padding_char: ""  

    ik: https://github.com/medcl/elasticsearch-analysis-ik

    Pinyin analyzer: https://github.com/medcl/elasticsearch-analysis-pinyin

    Next, installing the pinyin analyzer for version 5.2.2:

    1. Download and build

    https://github.com/medcl/elasticsearch-analysis-pinyin
    mvn package

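    After a successful build, you can find elasticsearch-analysis-pinyin-5.2.2.zip under target/releases.

    2. Put the packaged zip file in the {ES_HOME}/plugins/pinyin/ directory and unzip it so that its contents sit at the root of that directory, then restart Elasticsearch so that it loads the plugin.

    3. Test: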

    curl -XPUT http://localhost:9200/medcl/ -d'
    {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "pinyin_analyzer" : {
                        "tokenizer" : "my_pinyin"
                        }
                },
                "tokenizer" : {
                    "my_pinyin" : {
                        "type" : "pinyin",
                        "keep_separate_first_letter" : false,
                        "keep_full_pinyin" : true,
                        "keep_original" : true,
                        "limit_first_letter_length" : 16,
                        "lowercase" : true,
                        "remove_duplicated_term" : true
                    }
                }
            }
        }
    }'
    curl "http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer"
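
    The text value in that URL is simply 刘德华 percent-encoded as UTF-8. As a minimal sketch (the helper name build_analyze_url is made up for illustration, and only the Python standard library is used), the same URL can be assembled like this:

```python
from urllib.parse import urlencode


def build_analyze_url(base_url, index, text, analyzer):
    """Return the _analyze URL with text and analyzer as query parameters."""
    # urlencode percent-encodes non-ASCII text as UTF-8 bytes by default.
    query = urlencode({"text": text, "analyzer": analyzer})
    return f"{base_url}/{index}/_analyze?{query}"


url = build_analyze_url("http://localhost:9200", "medcl", "刘德华", "pinyin_analyzer")
print(url)
```

    Python writes the hex digits in upper case; percent escapes are case-insensitive, so this is equivalent to the lower-case form used in the request.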

    The analysis result is:

    {
      "tokens" : [
        {
          "token" : "liu",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "de",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "hua",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "刘德华",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "ldh",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 4
        }
      ]
    }
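
    To read this output against the tokenizer options, here is a toy sketch of how the five tokens relate to one another. It is not the plugin's implementation: the per-character table PINYIN is hard-coded for this one example, and sketch_tokens and its parameters are hypothetical names that only mirror the option names. keep_first_letter is not set in the settings above; the sketch assumes it is on, which the plugin README documents as the default.

```python
# Hard-coded pinyin for the three characters in the example (assumption:
# a real analyzer looks these up in a full dictionary).
PINYIN = {"刘": "liu", "德": "de", "华": "hua"}


def sketch_tokens(text, keep_full_pinyin=True, keep_original=True,
                  keep_first_letter=True, limit_first_letter_length=16):
    """Mimic the token order seen in the sample output for a known string."""
    syllables = [PINYIN[ch] for ch in text]
    tokens = []
    if keep_full_pinyin:
        tokens.extend(syllables)      # one token per character
    if keep_original:
        tokens.append(text)           # the untouched input
    if keep_first_letter:
        initials = "".join(s[0] for s in syllables)
        tokens.append(initials[:limit_first_letter_length])
    return tokens


print(sketch_tokens("刘德华"))
```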

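    4. Configure IK + pinyin analysis

    Index settings. If the medcl index created in step 3 still exists, delete it first (for example with curl -XDELETE http://localhost:9200/medcl), because creating an index that already exists fails: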

    curl -XPUT "http://localhost:9200/medcl/" -d'
    {
        "index": {
            "analysis": {
                "analyzer": {
                    "ik_pinyin_analyzer": {
                        "type": "custom",
                        "tokenizer": "ik_smart",
                        "filter": ["my_pinyin", "word_delimiter"]
                    }
                },
                "filter": {
                    "my_pinyin": {
                        "type": "pinyin",
                        "first_letter": "prefix",
                        "padding_char": " "
                    }
                }
            }
        }
    }'
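
    The analyzer refers to its filters by name, so a misspelled filter name would otherwise only be noticed when Elasticsearch rejects the request. A small sketch in plain Python that checks the names before sending the body; BUILTIN_FILTERS is an assumed, deliberately incomplete list, and undefined_filters is a hypothetical helper:

```python
import json

# Built-in token filter referenced in this post (assumption: not a complete list).
BUILTIN_FILTERS = {"word_delimiter"}

settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"],
                }
            },
            "filter": {
                "my_pinyin": {
                    "type": "pinyin",
                    "first_letter": "prefix",
                    "padding_char": " ",
                }
            },
        }
    }
}


def undefined_filters(body):
    """Return filter names used by analyzers but neither defined nor built in."""
    analysis = body["index"]["analysis"]
    defined = set(analysis.get("filter", {})) | BUILTIN_FILTERS
    missing = []
    for analyzer in analysis.get("analyzer", {}).values():
        for name in analyzer.get("filter", []):
            if name not in defined and name not in missing:
                missing.append(name)
    return missing


print(undefined_filters(settings))
print(json.dumps(settings, ensure_ascii=False))
```

    An empty list means every filter the analyzer names is accounted for; the second line prints the body that would be passed to curl -d.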

    Create the mapping:

    curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
    {
        "folks": {
            "properties": {
                "name": {
                    "type": "keyword",
                    "fields": {
                        "pinyin": {
                            "type": "text",
                            "store": "no",
                            "term_vector": "with_positions_offsets",
                            "analyzer": "ik_pinyin_analyzer",
                            "boost": 10
                        }
                    }
                }
            }
        }
    }'
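
    The mapping defines name as a keyword field with a pinyin sub-field, which is why the searches that follow address it as name.pinyin. As an illustration only (the helper field_paths is hypothetical, and the mapping is trimmed to the keys the sketch needs), this walks the properties and lists the dotted paths that can be queried:

```python
mapping = {
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "analyzer": "ik_pinyin_analyzer",
                    }
                },
            }
        }
    }
}


def field_paths(properties):
    """Yield (path, type) pairs for fields and their multi-field sub-fields."""
    for name, spec in properties.items():
        yield name, spec.get("type")
        # Multi-fields live under "fields" and are addressed as parent.child.
        for sub_name, sub_spec in spec.get("fields", {}).items():
            yield f"{name}.{sub_name}", sub_spec.get("type")


paths = list(field_paths(mapping["folks"]["properties"]))
print(paths)
```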

    Add test documents:

    curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'
    curl -XPOST http://localhost:9200/medcl/folks/tina -d'{"name":"中华人民共和国国歌"}'

    Test the analysis results:

    Pinyin analysis:

    curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu"
    
    curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:de"
    
    curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:hua"
    
    curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh"

    ik analysis test:

    curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
    {
      "query": {
        "match": {
          "name.pinyin": "国歌"
        }
      },
      "highlight": {
        "fields": {
          "name.pinyin": {}
        }
      }
    }'

    ik + pinyin:

    curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
    {
      "query": {
        "match": {
          "name.pinyin": "zhonghua"
        }
      },
      "highlight": {
        "fields": {
          "name.pinyin": {}
        }
      }
    }'
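
    The two requests above differ only in the query text. A minimal sketch of a helper that builds that body (the name match_with_highlight is made up; sending the request is left out so the snippet needs no server):

```python
import json


def match_with_highlight(field, text):
    """Build a match query on `field` that also highlights the same field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


for text in ("国歌", "zhonghua"):
    body = match_with_highlight("name.pinyin", text)
    # ensure_ascii=False keeps Chinese characters readable in the output.
    print(json.dumps(body, ensure_ascii=False))
```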

    References:

    http://blog.csdn.net/napoay/article/details/53907921

    http://www.jianshu.com/p/653f7b33e63c

    https://github.com/medcl/elasticsearch-analysis-pinyin

    https://my.oschina.net/xiaohui249/blog/214505

  • Original article: https://www.cnblogs.com/wenbronk/p/6564962.html