  • Elasticsearch Analyzers: Introduction and Usage

    Introduction and usage of analyzers

    What is an analyzer?

    A tool that takes a piece of user-input text and, following a defined set of rules, breaks it into multiple terms.

    Commonly used built-in analyzers

    standard analyzer、simple analyzer、whitespace analyzer、stop analyzer、language analyzer、pattern analyzer

    standard analyzer

    The standard analyzer is the default; it is used whenever no analyzer is specified.

    POST localhost:9200/_analyze
    
    Request body:
    {
        "analyzer":"standard",
        "text":"The best 3-points shooter is Curry!"
    }
    
    Response:
    {
        "tokens": [
            {
                "token": "the",
                "start_offset": 0,
                "end_offset": 3,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "best",
                "start_offset": 4,
                "end_offset": 8,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "3",
                "start_offset": 9,
                "end_offset": 10,
                "type": "<NUM>",
                "position": 2
            },
            {
                "token": "points",
                "start_offset": 11,
                "end_offset": 17,
                "type": "<ALPHANUM>",
                "position": 3
            },
            {
                "token": "shooter",
                "start_offset": 18,
                "end_offset": 25,
                "type": "<ALPHANUM>",
                "position": 4
            },
            {
                "token": "is",
                "start_offset": 26,
                "end_offset": 28,
                "type": "<ALPHANUM>",
                "position": 5
            },
            {
                "token": "curry",
                "start_offset": 29,
                "end_offset": 34,
                "type": "<ALPHANUM>",
                "position": 6
            }
        ]
    }
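    The tokenization above can be approximated with a rough sketch. This is not the real Lucene implementation (which follows the Unicode text segmentation algorithm); for plain ASCII text like this example, lowercasing and keeping runs of letters and digits produces the same tokens:

```python
import re

def standard_like(text):
    """Rough approximation of the standard analyzer for ASCII text:
    lowercase, then keep each run of letters/digits as a token."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(standard_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

    Note how "3-points" is split into two tokens, "3" and "points", because the hyphen is neither a letter nor a digit.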

    simple analyzer

    The simple analyzer splits the text into terms whenever it encounters any character that is not a letter, and all the terms are lowercased.

    POST localhost:9200/_analyze
    
    Request body:
    {
        "analyzer":"simple",
        "text":"The best 3-points shooter is Curry!"
    }
    
    Response:
    {
        "tokens": [
            {
                "token": "the",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0
            },
            {
                "token": "best",
                "start_offset": 4,
                "end_offset": 8,
                "type": "word",
                "position": 1
            },
            {
                "token": "points",
                "start_offset": 11,
                "end_offset": 17,
                "type": "word",
                "position": 2
            },
            {
                "token": "shooter",
                "start_offset": 18,
                "end_offset": 25,
                "type": "word",
                "position": 3
            },
            {
                "token": "is",
                "start_offset": 26,
                "end_offset": 28,
                "type": "word",
                "position": 4
            },
            {
                "token": "curry",
                "start_offset": 29,
                "end_offset": 34,
                "type": "word",
                "position": 5
            }
        ]
    }
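    A rough sketch of the same behavior (not the real implementation): keeping only runs of letters means the digit "3" disappears entirely, which is why the response above has no token for it.

```python
import re

def simple_like(text):
    # The simple analyzer keeps only runs of letters and lowercases them,
    # so "3" is dropped and "3-points" yields just "points".
    return re.findall(r"[a-z]+", text.lower())

print(simple_like("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```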

    whitespace analyzer

    The whitespace analyzer splits the text into terms whenever it encounters a whitespace character. Note that it does not lowercase the tokens.

    POST localhost:9200/_analyze
    
    Request body:
    {
        "analyzer":"whitespace",
        "text":"The best 3-points shooter is Curry!"
    }
    
    Response:
    {
        "tokens": [
            {
                "token": "The",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0
            },
            {
                "token": "best",
                "start_offset": 4,
                "end_offset": 8,
                "type": "word",
                "position": 1
            },
            {
                "token": "3-points",
                "start_offset": 9,
                "end_offset": 17,
                "type": "word",
                "position": 2
            },
            {
                "token": "shooter",
                "start_offset": 18,
                "end_offset": 25,
                "type": "word",
                "position": 3
            },
            {
                "token": "is",
                "start_offset": 26,
                "end_offset": 28,
                "type": "word",
                "position": 4
            },
            {
                "token": "Curry!",
                "start_offset": 29,
                "end_offset": 35,
                "type": "word",
                "position": 5
            }
        ]
    }
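    This one is simple enough that a plain split reproduces it exactly: case and punctuation are preserved, which is why "The", "3-points" and "Curry!" survive intact in the response above.

```python
def whitespace_like(text):
    # Split only on whitespace; no lowercasing, no punctuation stripping.
    return text.split()

print(whitespace_like("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```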

    stop analyzer

    The stop analyzer is much like the simple analyzer; the only difference is that it also removes stop words, using the English stop word list by default.

    stopwords: a predefined list of stop words, e.g. the, a, an, this, of, at, and so on.

    POST localhost:9200/_analyze
    
    Request body:
    {
        "analyzer":"stop",
        "text":"The best 3-points shooter is Curry!"
    }
    
    Response:
    {
        "tokens": [
            {
                "token": "best",
                "start_offset": 4,
                "end_offset": 8,
                "type": "word",
                "position": 1
            },
            {
                "token": "points",
                "start_offset": 11,
                "end_offset": 17,
                "type": "word",
                "position": 2
            },
            {
                "token": "shooter",
                "start_offset": 18,
                "end_offset": 25,
                "type": "word",
                "position": 3
            },
            {
                "token": "curry",
                "start_offset": 29,
                "end_offset": 34,
                "type": "word",
                "position": 5
            }
        ]
    }
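    Sketched as simple-analyzer tokenization followed by a stop word filter. The stop word set here is a small hand-picked subset for illustration, not the full English list the analyzer actually ships with:

```python
import re

# Partial stop word list for illustration only; the real analyzer uses
# Lucene's full English stop word set.
ENGLISH_STOPWORDS = {"the", "a", "an", "this", "of", "at", "is"}

def stop_like(text):
    # simple-analyzer tokenization, then drop stop words
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in ENGLISH_STOPWORDS]

print(stop_like("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```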

    language analyzer

    (Language-specific analyzers; for example, english is the English analyzer.) Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

    POST localhost:9200/_analyze
    
    Request body:
    {
        "analyzer":"english",
        "text":"The best 3-points shooter is Curry!"
    }
    
    Response:
    {
        "tokens": [
            {
                "token": "best",
                "start_offset": 4,
                "end_offset": 8,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "3",
                "start_offset": 9,
                "end_offset": 10,
                "type": "<NUM>",
                "position": 2
            },
            {
                "token": "point",
                "start_offset": 11,
                "end_offset": 17,
                "type": "<ALPHANUM>",
                "position": 3
            },
            {
                "token": "shooter",
                "start_offset": 18,
                "end_offset": 25,
                "type": "<ALPHANUM>",
                "position": 4
            },
            {
                "token": "curri",
                "start_offset": 29,
                "end_offset": 34,
                "type": "<ALPHANUM>",
                "position": 6
            }
        ]
    }
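    Notice the stemming in the response: "points" became "point" and "curry" became "curri". The real english analyzer removes English stop words and applies a proper stemmer; the two suffix rules below are a toy stand-in (nowhere near a full stemmer) that happens to reproduce the tokens for this particular sentence:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "of", "at"}  # partial list, illustration only

def toy_stem(token):
    # Toy suffix rules standing in for the real stemmer:
    # strip a plural "s", and rewrite a final "y" as "i" ("curry" -> "curri").
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    if token.endswith("y"):
        token = token[:-1] + "i"
    return token

def english_like(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

print(english_like("The best 3-points shooter is Curry!"))
# ['best', '3', 'point', 'shooter', 'curri']
```

    Stemming is why a search for "point" can match a document containing "points": both sides reduce to the same term at index and query time.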

    pattern analyzer

    Splits the text into terms using a regular expression; the default pattern is \W+ (runs of non-word characters).

    POST localhost:9200/_analyze
    
    Request body:
    {
        "analyzer":"pattern",
        "text":"The best 3-points shooter is Curry!"
    }
    
    Response:
    {
        "tokens": [
            {
                "token": "the",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0
            },
            {
                "token": "best",
                "start_offset": 4,
                "end_offset": 8,
                "type": "word",
                "position": 1
            },
            {
                "token": "3",
                "start_offset": 9,
                "end_offset": 10,
                "type": "word",
                "position": 2
            },
            {
                "token": "points",
                "start_offset": 11,
                "end_offset": 17,
                "type": "word",
                "position": 3
            },
            {
                "token": "shooter",
                "start_offset": 18,
                "end_offset": 25,
                "type": "word",
                "position": 4
            },
            {
                "token": "is",
                "start_offset": 26,
                "end_offset": 28,
                "type": "word",
                "position": 5
            },
            {
                "token": "curry",
                "start_offset": 29,
                "end_offset": 34,
                "type": "word",
                "position": 6
            }
        ]
    }
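    With the default \W+ pattern the result is the same as splitting on non-word characters and lowercasing, which a short sketch shows (the real analyzer is configurable with a custom pattern, lowercase flag, and stop words):

```python
import re

def pattern_like(text):
    # Default pattern analyzer: split on \W+ and lowercase;
    # filter out the empty string produced by the trailing "!".
    return [t for t in re.split(r"\W+", text.lower()) if t]

print(pattern_like("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```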

    Usage example

    PUT localhost:9200/my_index
    
    Request body:
    {
        "settings":{
            "analysis":{
                "analyzer":{
                    "my_analyzer":{
                        "type":"whitespace"
                    }
                }
            }
        },
        "mappings":{
            "properties":{
                "name":{
                    "type":"text"
                },
                "team_name":{
                    "type":"text"
                },
                "position":{
                    "type":"text"
                },
                "play_year":{
                    "type":"long"
                },
                "jerse_no":{
                    "type":"keyword"
                },
                "title":{
                    "type":"text",
                    "analyzer":"my_analyzer"
                }
            }
        }
    }
    
    Response:
    {
        "acknowledged": true,
        "shards_acknowledged": true,
        "index": "my_index"
    }
    
    PUT localhost:9200/my_index/_doc/1
    
    Request body:
    {
        "name":"库⾥里里",
        "team_name":"勇⼠士",
        "position":"控球后卫",
        "play_year":10,
        "jerse_no":"30",
        "title":"The best 3-points shooter is Curry!"
    }
    
    Response:
    {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_version": 1,
        "result": "created",
        "_shards": {
            "total": 2,
            "successful": 1,
            "failed": 0
        },
        "_seq_no": 0,
        "_primary_term": 1
    }
    
    POST localhost:9200/my_index/_search
    
    Request body:
    {
        "query":{
            "match":{
                "title":"Curry!"
            }
        }
    }
    
    Response:
    {
        "took": 2,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 1,
                "relation": "eq"
            },
            "max_score": 0.2876821,
            "hits": [
                {
                    "_index": "my_index",
                    "_type": "_doc",
                    "_id": "1",
                    "_score": 0.2876821,
                    "_source": {
                        "name": "库⾥里里",
                        "team_name": "勇⼠士",
                        "position": "控球后卫",
                        "play_year": 10,
                        "jerse_no": "30",
                        "title": "The best 3-points shooter is Curry!"
                    }
                }
            ]
        }
    }
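    The match query hits because both the indexed title and the query string go through the same my_analyzer (type: whitespace), so the token "Curry!" exists on both sides. A minimal in-memory sketch of that matching logic (not the Elasticsearch implementation):

```python
def analyze(text):
    # stand-in for my_analyzer (type: whitespace)
    return text.split()

# index time: tokens produced from the stored document's title field
indexed_tokens = set(analyze("The best 3-points shooter is Curry!"))

# search time: a match query analyzes the query string the same way,
# then looks for any query token among the indexed tokens
def match(query):
    return any(tok in indexed_tokens for tok in analyze(query))

print(match("Curry!"))  # True  - "Curry!" was indexed as one token
print(match("Curry"))   # False - no bare "Curry" token was ever indexed
```

    This also shows the trade-off of the whitespace analyzer: searching for "Curry" without the exclamation mark finds nothing, because punctuation was kept inside the indexed token.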

    Using common Chinese analyzers

    If the default standard analyzer is used on Chinese text, the text is split into individual Chinese characters.

    POST localhost:9200/_analyze
    
    Request body:
    {
        "analyzer": "standard",
        "text": "火箭明年总冠军"
    }
    
    Response:
    {
        "tokens": [
            {
                "token": "火",
                "start_offset": 0,
                "end_offset": 1,
                "type": "<IDEOGRAPHIC>",
                "position": 0
            },
            {
                "token": "箭",
                "start_offset": 1,
                "end_offset": 2,
                "type": "<IDEOGRAPHIC>",
                "position": 1
            },
            {
                "token": "明",
                "start_offset": 2,
                "end_offset": 3,
                "type": "<IDEOGRAPHIC>",
                "position": 2
            },
            {
                "token": "年",
                "start_offset": 3,
                "end_offset": 4,
                "type": "<IDEOGRAPHIC>",
                "position": 3
            },
            {
                "token": "总",
                "start_offset": 4,
                "end_offset": 5,
                "type": "<IDEOGRAPHIC>",
                "position": 4
            },
            {
                "token": "冠",
                "start_offset": 5,
                "end_offset": 6,
                "type": "<IDEOGRAPHIC>",
                "position": 5
            },
            {
                "token": "军",
                "start_offset": 6,
                "end_offset": 7,
                "type": "<IDEOGRAPHIC>",
                "position": 6
            }
        ]
    }
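    The one-token-per-character behavior can be sketched as below. The character range used is an assumption covering only the basic CJK Unified Ideographs block, which is enough for this example:

```python
import re

def standard_cjk_like(text):
    # For CJK input the standard analyzer emits one token per ideograph;
    # \u4e00-\u9fff is the basic CJK Unified Ideographs range.
    return re.findall(r"[\u4e00-\u9fff]", text)

print(standard_cjk_like("火箭明年总冠军"))
# ['火', '箭', '明', '年', '总', '冠', '军']
```

    Single-character tokens make Chinese search nearly useless for relevance, which is why a dedicated Chinese analyzer such as smartCN or IK is needed.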

    Common Chinese analyzers

    smartCN: a simple analyzer for Chinese or mixed Chinese-English text

    IK: a smarter, more user-friendly Chinese analyzer

    smartCn

    Installation: from the bin directory, run

    • On Linux: sh elasticsearch-plugin install analysis-smartcn
    • On Windows: .\elasticsearch-plugin install analysis-smartcn

    Restart Elasticsearch after installation.

    POST localhost:9200/_analyze
    
    Request body:
    {
        "analyzer": "smartcn",
        "text": "火箭明年总冠军"
    }
    
    Response:
    {
        "tokens": [
            {
                "token": "火箭",
                "start_offset": 0,
                "end_offset": 2,
                "type": "word",
                "position": 0
            },
            {
                "token": "明年",
                "start_offset": 2,
                "end_offset": 4,
                "type": "word",
                "position": 1
            },
            {
                "token": "总",
                "start_offset": 4,
                "end_offset": 5,
                "type": "word",
                "position": 2
            },
            {
                "token": "冠军",
                "start_offset": 5,
                "end_offset": 7,
                "type": "word",
                "position": 3
            }
        ]
    }

    IK analyzer

    Download: https://github.com/medcl/elasticsearch-analysis-ik/releases

    Installation: unzip into the plugins directory.

    Restart after installation.

    POST localhost:9200/_analyze
    
    Request body:
    {
        "analyzer": "ik_max_word",
        "text": "火箭明年总冠军"
    }
    
    Response:
    {
        "tokens": [
            {
                "token": "火箭",
                "start_offset": 0,
                "end_offset": 2,
                "type": "CN_WORD",
                "position": 0
            },
            {
                "token": "明年",
                "start_offset": 2,
                "end_offset": 4,
                "type": "CN_WORD",
                "position": 1
            },
            {
                "token": "总冠军",
                "start_offset": 4,
                "end_offset": 7,
                "type": "CN_WORD",
                "position": 2
            },
            {
                "token": "冠军",
                "start_offset": 5,
                "end_offset": 7,
                "type": "CN_WORD",
                "position": 3
            }
        ]
    }
  • Original article: https://www.cnblogs.com/jwen1994/p/12507793.html