  • Elasticsearch Hello World (Part 2)

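        First, let's test tokenization, Chinese tokenization in particular, which has long been a pain point for traditional databases such as MySQL and SQL Server.

        Open a browser and go to http://localhost:5601, click Dev Tools, and enter the following in the Console tab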

    POST _analyze
    {
      "analyzer": "standard",
      "text":"Hello World ElasticSearch"
    }

        The response appears in the right-hand pane

    {
      "tokens": [
        {
          "token": "hello",
          "start_offset": 0,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "world",
          "start_offset": 6,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "elasticsearch",
          "start_offset": 12,
          "end_offset": 25,
          "type": "<ALPHANUM>",
          "position": 2
        }
      ]
    }
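        The same request can also be sent from outside Kibana. The following is only a minimal Python sketch, not something from the original post: it assumes the Elasticsearch HTTP API is reachable at http://localhost:9200 (the post only mentions Kibana on port 5601), and the names ES_URL, build_analyze_request and analyze are made up for illustration.

```python
import json
import urllib.request

# Assumption: Elasticsearch's HTTP API listens on localhost:9200.
# The post itself only mentions Kibana on port 5601.
ES_URL = "http://localhost:9200"


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build the same POST _analyze request that was typed into the Console."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request (needs a running node) and return the token strings."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

        With a node running, analyze("standard", "Hello World ElasticSearch") should return the three tokens listed in the response above.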

        Everything looks fine so far. Now let's add some Chinese.

    POST _analyze
    {
      "analyzer": "standard",
      "text":"ElasticSearch是一个很不错的全文检索软件。"
    }

        The result is

    {
      "tokens": [
        {
          "token": "elasticsearch",
          "start_offset": 0,
          "end_offset": 13,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "是",
          "start_offset": 13,
          "end_offset": 14,
          "type": "<IDEOGRAPHIC>",
          "position": 1
        },
        {
          "token": "一",
          "start_offset": 14,
          "end_offset": 15,
          "type": "<IDEOGRAPHIC>",
          "position": 2
        },
        {
          "token": "个",
          "start_offset": 15,
          "end_offset": 16,
          "type": "<IDEOGRAPHIC>",
          "position": 3
        },
        {
          "token": "很",
          "start_offset": 16,
          "end_offset": 17,
          "type": "<IDEOGRAPHIC>",
          "position": 4
        },
        {
          "token": "不",
          "start_offset": 17,
          "end_offset": 18,
          "type": "<IDEOGRAPHIC>",
          "position": 5
        },
        {
          "token": "错",
          "start_offset": 18,
          "end_offset": 19,
          "type": "<IDEOGRAPHIC>",
          "position": 6
        },
        {
          "token": "的",
          "start_offset": 19,
          "end_offset": 20,
          "type": "<IDEOGRAPHIC>",
          "position": 7
        },
        {
          "token": "全",
          "start_offset": 20,
          "end_offset": 21,
          "type": "<IDEOGRAPHIC>",
          "position": 8
        },
        {
          "token": "文",
          "start_offset": 21,
          "end_offset": 22,
          "type": "<IDEOGRAPHIC>",
          "position": 9
        },
        {
          "token": "检",
          "start_offset": 22,
          "end_offset": 23,
          "type": "<IDEOGRAPHIC>",
          "position": 10
        },
        {
          "token": "索",
          "start_offset": 23,
          "end_offset": 24,
          "type": "<IDEOGRAPHIC>",
          "position": 11
        },
        {
          "token": "软",
          "start_offset": 24,
          "end_offset": 25,
          "type": "<IDEOGRAPHIC>",
          "position": 12
        },
        {
          "token": "件",
          "start_offset": 25,
          "end_offset": 26,
          "type": "<IDEOGRAPHIC>",
          "position": 13
        }
      ]
    }

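        This is clearly unacceptable: splitting every Chinese character into its own token makes the result practically useless for search. A quick Google search suggested there is also an analyzer named chinese, but in my test it produced the same result. Searching further, the plugins commonly used for this are smartcn and IK Analyzer. Some articles and books recommend IK Analyzer, but they are all based on older versions of ES, and when I looked at IK Analyzer's GitHub repository it appeared to be abandoned. So I went with the officially recommended smartcn. Downloading and installing it works the same way as for any other plugin, and I again recommend installing from the offline package. After installation, the ES service most likely needs to be restarted for the plugin to take effect. Now let's try again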

    POST _analyze
    {
      "analyzer": "smartcn",
      "text":"ElasticSearch是一个很不错的全文检索软件。"
    }

        The result is

    {
      "tokens": [
        {
          "token": "elasticsearch",
          "start_offset": 0,
          "end_offset": 13,
          "type": "word",
          "position": 0
        },
        {
          "token": "是",
          "start_offset": 13,
          "end_offset": 14,
          "type": "word",
          "position": 1
        },
        {
          "token": "一个",
          "start_offset": 14,
          "end_offset": 16,
          "type": "word",
          "position": 2
        },
        {
          "token": "很",
          "start_offset": 16,
          "end_offset": 17,
          "type": "word",
          "position": 3
        },
        {
          "token": "不错",
          "start_offset": 17,
          "end_offset": 19,
          "type": "word",
          "position": 4
        },
        {
          "token": "的",
          "start_offset": 19,
          "end_offset": 20,
          "type": "word",
          "position": 5
        },
        {
          "token": "全文",
          "start_offset": 20,
          "end_offset": 22,
          "type": "word",
          "position": 6
        },
        {
          "token": "检索",
          "start_offset": 22,
          "end_offset": 24,
          "type": "word",
          "position": 7
        },
        {
          "token": "软件",
          "start_offset": 24,
          "end_offset": 26,
          "type": "word",
          "position": 8
        }
      ]
    }
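        To compare the two analyzers, the start_offset and end_offset values in each response can be used to slice the original text. The sketch below is only an illustration: the offsets are copied from the two responses above, and the helper name surface_forms is made up. Python string indices line up with the reported offsets here because every character in the sample sentence is in the Basic Multilingual Plane.

```python
TEXT = "ElasticSearch是一个很不错的全文检索软件。"

# (start_offset, end_offset) pairs copied from the two _analyze responses.
STANDARD_OFFSETS = [(0, 13)] + [(i, i + 1) for i in range(13, 26)]
SMARTCN_OFFSETS = [(0, 13), (13, 14), (14, 16), (16, 17), (17, 19),
                   (19, 20), (20, 22), (22, 24), (24, 26)]


def surface_forms(text, offsets):
    """Return the slice of the original text that each token covers."""
    return [text[start:end] for start, end in offsets]


standard_tokens = surface_forms(TEXT, STANDARD_OFFSETS)
smartcn_tokens = surface_forms(TEXT, SMARTCN_OFFSETS)

# standard: one token per Chinese character (14 tokens in total)
print(len(standard_tokens), standard_tokens)
# smartcn: word-level tokens such as 全文, 检索, 软件 (9 tokens in total)
print(len(smartcn_tokens), smartcn_tokens)
```

        The slices keep the original casing (ElasticSearch), whereas the token values in both responses are lowercased.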

        This looks a lot more reasonable now. :)

  • Original post: https://www.cnblogs.com/elasticsearch/p/6478799.html