  • Gulimall Study Notes: P122 ES Tokenization & Installing the IK Analyzer

    A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.

    For example, the whitespace tokenizer splits text whenever it encounters whitespace; it would split "I am zyn" into [I, am, zyn].

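    The tokenizer is also responsible for recording the order or position of each term (used for phrase and word proximity queries), as well as the start and end character offsets of the original word that the term represents (used for highlighting matched search content). Elasticsearch provides many built-in tokenizers, which can be used to build custom analyzers.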
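
    To make the position and offset bookkeeping concrete, here is a minimal Python sketch of a whitespace tokenizer. It is only an illustration, not how Elasticsearch implements it; the `Token` type and the `whitespace_tokenize` function are names made up for this example, with fields named after those in the `_analyze` response.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    token: str         # the term text
    start_offset: int  # index of the first character in the input
    end_offset: int    # index one past the last character
    position: int      # ordinal of the term in the token stream


def whitespace_tokenize(text: str) -> List[Token]:
    """Split on whitespace, recording offsets and positions (illustrative only)."""
    return [
        Token(m.group(), m.start(), m.end(), pos)
        for pos, m in enumerate(re.finditer(r"\S+", text))
    ]


for tok in whitespace_tokenize("I am zyn"):
    print(tok)
```

    The offsets are what a highlighter would use to mark the matched text in the original string, while the positions are what a phrase query would compare.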

    Documentation on text analysis: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/analysis.html

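     The standard tokenizer splits text on word boundaries (as defined by the Unicode Text Segmentation algorithm) and drops most punctuation; it does not simply split on spaces: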

    POST _analyze
    {
      "tokenizer": "standard",
      "text": "Hello the world."
    }

     Result:

    {
      "tokens" : [
        {
          "token" : "Hello",
          "start_offset" : 0,
          "end_offset" : 5,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "the",
          "start_offset" : 6,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "world",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 2
        }
      ]
    }
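
    The same request can also be sent from a script instead of the Kibana console. The following is a rough sketch using only the Python standard library; the helper names (`build_analyze_request`, `token_strings`, `analyze`) are invented for this example, and the host is the VM address used later in this post, so adjust it to your own setup.

```python
import json
import urllib.request

# Assumption: the ES node from this post; change to match your environment.
ES_URL = "http://192.168.56.10:9200"


def build_analyze_request(tokenizer, text, base_url=ES_URL):
    """Build a POST /_analyze request with a JSON body."""
    body = json.dumps({"tokenizer": tokenizer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response):
    """Extract just the token text from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(tokenizer, text, base_url=ES_URL):
    """Send the request and return the token texts (needs a reachable node)."""
    request = build_analyze_request(tokenizer, text, base_url)
    with urllib.request.urlopen(request) as resp:
        return token_strings(json.load(resp))
```

    With a reachable node, `analyze("standard", "Hello the world.")` should return the three token texts from the response above.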

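    However, the analyzers that ship with ES are all geared toward English (the standard tokenizer, for instance, splits Chinese text into single characters). For Chinese, a dedicated analyzer has to be installed separately.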

    IK analyzer: https://github.com/medcl/elasticsearch-analysis-ik/releases

    Check the ES version so that you can install the IK analyzer with the matching version number:

    [vagrant@10 ~]$ curl http://192.168.56.10:9200/
    {
      "name" : "3cafb1a4b1b3",
      "cluster_name" : "elasticsearch",
      "cluster_uuid" : "0cNA2l38RFK6LMHislSvNg",
      "version" : {
        "number" : "7.4.2",
        "build_flavor" : "default",
        "build_type" : "docker",
        "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
        "build_date" : "2019-10-28T20:40:44.881551Z",
        "build_snapshot" : false,
        "lucene_version" : "8.2.0",
        "minimum_wire_compatibility_version" : "6.8.0",
        "minimum_index_compatibility_version" : "6.0.0-beta1"
      },
      "tagline" : "You Know, for Search"
    }

     The v7.4.2 release is no longer listed on the GitHub releases page, but entering the download URL directly still worked:

    https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
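
    Matching the plugin to the node version can be scripted. The sketch below reads the version number from the root endpoint's JSON and fills it into the URL above. Note the assumption: the URL pattern is inferred from the single v7.4.2 link in this post, and other releases may not follow it. The function names are made up for the example.

```python
import json

# Assumption: URL pattern inferred from the v7.4.2 link in this post.
IK_RELEASES = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def es_version(root_response: str) -> str:
    """Pull version.number out of the JSON returned by GET / on the node."""
    return json.loads(root_response)["version"]["number"]


def ik_download_url(version: str) -> str:
    """Build the IK zip URL for a given ES version (pattern is an assumption)."""
    return f"{IK_RELEASES}/v{version}/elasticsearch-analysis-ik-{version}.zip"


# Trimmed-down copy of the curl output shown above.
sample = '{"version": {"number": "7.4.2"}}'
print(ik_download_url(es_version(sample)))
```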

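    After downloading, unzip it into the plugins directory inside the ES container (or into the host directory mapped to it); it is available once ES has been restarted.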

    For a free, licensed download of Xshell and Xftp, see: https://www.cnblogs.com/qingshan-tang/p/12855807.html

     The download was too slow on my machine, so I went back to installing from the command line instead.

    Connect to the virtual machine with vagrant ssh, switch to the root user with su root, and install wget:

    vagrant ssh
    su root            # enter the root password when prompted (vagrant here)
    yum install wget

     Once wget is installed, change to the ES plugins directory and download IK:

    [root@10 /]# cd /mydata/elasticsearch/plugins/
    [root@10 plugins]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip

    Install the unzip command, create an ik directory, extract the archive into it, and then delete the zip file:

    yum install unzip
    mkdir ik
    # the zip was downloaded into plugins/, so extract it from here into ik/
    unzip elasticsearch-analysis-ik-7.4.2.zip -d ik
    rm elasticsearch-analysis-ik-7.4.2.zip

    Make the ik directory readable, writable, and executable:

    [root@10 plugins]# chmod -R 777 ik/

    Check whether the IK analyzer has been installed:

    Open a shell inside the ES container:

    docker exec -it 3caf /bin/bash

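      List the installed plugins from inside the container (for example with bin/elasticsearch-plugin list). If ik appears in the output, the IK analyzer was installed successfully. Then exit the ES container and restart ES.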
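
    After the restart, the same check can also be made over HTTP through the `_cat/plugins` API. The sketch below is a hedged example: the helper names are invented, the host is the VM address from this post, and the default plugin name `analysis-ik` is an assumption about how the plugin reports itself, so compare it against what your node actually returns.

```python
import json
import urllib.request

# Assumption: the ES node from this post; change to match your environment.
ES_URL = "http://192.168.56.10:9200"


def fetch_plugins(base_url=ES_URL):
    """Return the rows of GET /_cat/plugins?format=json (needs a running node)."""
    with urllib.request.urlopen(f"{base_url}/_cat/plugins?format=json") as resp:
        return json.load(resp)


def plugin_installed(rows, component="analysis-ik"):
    """True if any row reports the given component name.

    The default name is an assumption; check it against your node's output.
    """
    return any(row.get("component") == component for row in rows)


# Illustrative rows only, not output captured from a real node.
example_rows = [{"name": "node-1", "component": "analysis-ik", "version": "7.4.2"}]
print(plugin_installed(example_rows))
```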

    Try it out:

    ik_smart: smart segmentation (the coarsest-grained split)

    POST _analyze
    {
      "tokenizer": "ik_smart",
      "text": "我是中国人"
    }

    Result:

    {
      "tokens" : [
        {
          "token" : "我",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "CN_CHAR",
          "position" : 0
        },
        {
          "token" : "是",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "CN_CHAR",
          "position" : 1
        },
        {
          "token" : "中国人",
          "start_offset" : 2,
          "end_offset" : 5,
          "type" : "CN_WORD",
          "position" : 2
        }
      ]
    }

    ik_max_word: the finest-grained split, emitting every word combination it can find

    POST _analyze
    {
      "tokenizer": "ik_max_word",
      "text": "我是中国人"
    }

    Result:

    {
      "tokens" : [
        {
          "token" : "我",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "CN_CHAR",
          "position" : 0
        },
        {
          "token" : "是",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "CN_CHAR",
          "position" : 1
        },
        {
          "token" : "中国人",
          "start_offset" : 2,
          "end_offset" : 5,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "中国",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "CN_WORD",
          "position" : 3
        },
        {
          "token" : "国人",
          "start_offset" : 3,
          "end_offset" : 5,
          "type" : "CN_WORD",
          "position" : 4
        }
      ]
    }
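
    The difference between the two modes is easiest to see by comparing the two results directly. The short sketch below copies the (token, start_offset, end_offset) triples from the two responses above and lists what ik_max_word adds; the helper names are made up for this example.

```python
# (token, start_offset, end_offset) triples copied from the two responses above.
smart = [("我", 0, 1), ("是", 1, 2), ("中国人", 2, 5)]
max_word = [("我", 0, 1), ("是", 1, 2), ("中国人", 2, 5), ("中国", 2, 4), ("国人", 3, 5)]


def extra_tokens(fine, coarse):
    """Tokens present in the fine-grained result but not in the coarse one."""
    coarse_set = set(coarse)
    return [t for t in fine if t not in coarse_set]


def overlaps(a, b):
    """True if the character ranges of two tokens intersect."""
    return a[1] < b[2] and b[1] < a[2]


extra = extra_tokens(max_word, smart)
print(extra)
# Every extra token lies inside the span already covered by "中国人".
print(all(overlaps(t, ("中国人", 2, 5)) for t in extra))
```

    For this input, ik_max_word returns everything ik_smart returns plus the shorter words nested inside 中国人, which is why its tokens have overlapping offsets.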
  • Original article: https://www.cnblogs.com/yanan7890/p/15613077.html