Synonym search with Elasticsearch and the IK analyzer (repost)

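1. First, install Elasticsearch and the elasticsearch-analysis-ik analysis plugin.

2. Configure IK synonyms

Elasticsearch ships with a synonym token filter named synonym. To make IK and synonym work together, we define a new analyzer that uses IK as the tokenizer and synonym as a filter. It sounds complicated, but all it takes is one extra block of configuration.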

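Open config/elasticsearch.yml under the Elasticsearch installation directory and add the configuration below. Note that index-level settings in elasticsearch.yml are only honored by Elasticsearch versions before 5.0; from 5.0 on, the same analysis block has to go into the index settings instead.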

index:
  analysis:
    analyzer:
      ik_syno:
        type: custom
        tokenizer: ik_max_word
        filter: [my_synonym_filter]
      ik_syno_smart:
        type: custom
        tokenizer: ik_smart
        filter: [my_synonym_filter]
    filter:
      my_synonym_filter:
        type: synonym
        synonyms_path: analysis/synonym.txt
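
This configuration defines two new analyzers, ik_syno and ik_syno_smart, which correspond to IK's two tokenization strategies, ik_max_word and ik_smart. According to the IK documentation, the two differ as follows:
- ik_max_word: splits text at the finest granularity and exhausts every possible combination. For example, 「中华人民共和国国歌」 is split into 「中华人民共和国、中华人民、中华、华人、人民共和国、人民、人、民、共和国、共和、和、国国、国歌」;
- ik_smart: splits text at the coarsest granularity. For example, 「中华人民共和国国歌」 is split into 「中华人民共和国、国歌」.
Both ik_syno and ik_syno_smart pass the resulting tokens through the synonym filter to apply synonym conversion.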

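3. Create config/analysis/synonym.txt (the synonyms_path above is resolved relative to the config directory), enter your synonyms, and save the file as UTF-8. By default the synonym filter reads the Solr synonym format: one rule per line, either a comma-separated list of equivalent terms or an explicit mapping written with =>.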
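
As an illustration of the file format, the sketch below writes a small synonym file as UTF-8 and reads the rules back. The sample terms are hypothetical and should be replaced with your own, and parse_rules is a simplified reading of the Solr format for demonstration only, not the parser Elasticsearch uses.

```python
from pathlib import Path

# Hypothetical sample rules in the Solr synonym format; replace with your own.
SAMPLE_RULES = """\
# equivalent terms: any one of them matches the others
番茄,西红柿
土豆,马铃薯,洋芋
# explicit mapping: the left side is rewritten to the right side
电脑 => 计算机
"""


def write_synonyms(path):
    """Write the sample rules as UTF-8 (no BOM), creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(SAMPLE_RULES, encoding="utf-8")
    return path


def parse_rules(text):
    """Return a list of (inputs, outputs) pairs, one per rule line."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            inputs = [term.strip() for term in left.split(",")]
            outputs = [term.strip() for term in right.split(",")]
        else:
            # A plain comma list means every term maps to every term.
            inputs = outputs = [term.strip() for term in line.split(",")]
        rules.append((inputs, outputs))
    return rules
```

Calling write_synonyms("config/analysis/synonym.txt") from the Elasticsearch installation directory would place the file where the synonyms_path setting expects it.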

The synonym configuration is now complete. Restart Elasticsearch, then specify ik_syno or ik_syno_smart as the analyzer when searching.
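
One way to check that the new analyzers are picked up after the restart is to call the _analyze API and look at the tokens it returns. The following is a minimal sketch, assuming the host and the goodsindex index used in the mapping command in this post, and an Elasticsearch version that accepts a JSON body for _analyze. The function names are made up for illustration.

```python
import json
import urllib.request

ES_URL = "http://192.168.1.99:9200"  # host used in this post; adjust as needed


def build_analyze_request(index, analyzer, text):
    """Return (url, body) for an _analyze call against the given index."""
    url = f"{ES_URL}/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return url, body


def analyze(index, analyzer, text):
    """Send the request and return the token strings from the response."""
    url, body = build_analyze_request(index, analyzer, text)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [token["token"] for token in result["tokens"]]
```

If the synonym file is loaded, analyze("goodsindex", "ik_syno", some_term) should list the synonyms of some_term alongside the tokens IK produces for it.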

Next, create the mapping. The goodsindex index must already exist. Run the following curl command:

curl -XPOST http://192.168.1.99:9200/goodsindex/goods/_mapping -d '{
  "goods": {
    "_all": {
      "enabled": true,
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word",
      "term_vector": "no",
      "store": "false"
    },
    "properties": {
      "title": {
        "type": "string",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_syno",
        "search_analyzer": "ik_syno"
      },
      "content": {
        "type": "string",
        "term_vector": "with_positions_offsets",
        "analyzer": "ik_syno",
        "search_analyzer": "ik_syno"
      },
      "tags": {
        "type": "string",
        "term_vector": "no",
        "analyzer": "ik_syno",
        "search_analyzer": "ik_syno"
      },
      "slug": {
        "type": "string",
        "term_vector": "no"
      },
      "update_date": {
        "type": "date",
        "term_vector": "no",
        "index": "no"
      }
    }
  }
}'
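
The command above defines the field properties of the goods type in the goodsindex index. The title, content, and tags fields use ik_syno as their analyzer, which means they are tokenized with ik_max_word and then run through the synonym filter. The slug field specifies no analyzer, so it uses the default one. The update_date field is not indexed.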
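
With the mapping in place, a search against the title field could look like the sketch below. It assumes the host, index, type, and field names from the mapping command above; the function names are hypothetical. The mapping already sets ik_syno as the search analyzer, so the analyzer parameter of the match query is only needed to override it for a single request, here with ik_syno_smart.

```python
import json
import urllib.request

ES_URL = "http://192.168.1.99:9200"  # host used in this post; adjust as needed


def build_search_request(keyword, analyzer="ik_syno_smart"):
    """Return (url, body) for a match query on the title field."""
    url = f"{ES_URL}/goodsindex/goods/_search"
    query = {
        "query": {
            "match": {
                "title": {"query": keyword, "analyzer": analyzer}
            }
        }
    }
    return url, json.dumps(query).encode("utf-8")


def search_titles(keyword, analyzer="ik_syno_smart"):
    """Run the search and return the title of each hit."""
    url, body = build_search_request(keyword, analyzer)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        hits = json.load(response)["hits"]["hits"]
    return [hit["_source"].get("title") for hit in hits]
```

If the synonym rules are in effect, searching for one term of a synonym group should also return documents whose title contains only the other terms of that group.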

Original post: https://www.cnblogs.com/sandea/p/5744645.html