Mapping and the Analysis Process in Elasticsearch

Mapping

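A mapping lets us define the structure of an index ourselves (otherwise Elasticsearch generates it for us automatically).

Each index has one mapping type (versions before 6.x allowed several per index). The examples below use the 6.x syntax with a mapping type named doc.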

Reference: https://www.cnblogs.com/Neeo/articles/10585039.html

Field data types:

1. Simple types:
*text	*keyword	date	long (integer)	double	boolean	ip

2. Types that reflect the hierarchical nature of JSON:
object		nested

3. Specialized types:
geo_point		geo_shape	  completion (corrections and suggestions)

Mapping examples

PUT a2
{
  "mappings": {
    "doc":{
      "properties":{
        "name":{
          "type":"text"
        },
        "age":{
          "type":"long"
        }
      }
    }
  }
}

GET a2/_mapping

POST a2/doc/1		# POST can also be used to create a document
{
  "name":"黄飞鸿",
  "age":19
}

GET a2/doc/1

The three dynamic settings

dynamic: true (dynamic mapping)

PUT a2
{
  "mappings": {
    "doc":{
      "dynamic":true,		# this is the key setting
      "properties":{
        "name":{
          "type":"text"
        },
        "age":{
          "type":"long"
        }
      }
    }
  }
}

POST a2/doc/1		# a document that uses exactly the mapped fields
{
  "name":"黄飞鸿",
  "age":19
}

POST a2/doc/2		# adds a city field that is not in the mapping
{
  "name":"李晓龙",
  "age":19,
  "city":"广州"
}

POST a2/doc/3		# omits the name field that is in the mapping
{
  "age":19,
  "city":"广州"
}

GET a2/doc/_search		# searching on the new field works
{
  "query": {
    "match": {
      "city": "广州"
    }
  }
}

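New fields may be added and mapped fields may be omitted. New fields are added to the mapping automatically and can be used as search conditions.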

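dynamic: false (static mapping, commonly used)

"dynamic":false,
New fields may be added and mapped fields may be omitted, but new fields are not analyzed or indexed and no mapping is created for them. They only appear in the returned documents, so they cannot be used as search conditions.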

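dynamic: strict (strict mode)

"dynamic":"strict",
New fields are not allowed (a document that contains an unmapped field is rejected), but mapped fields may still be omitted.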
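
The three settings can be compared side by side with a small Python model. This is only an illustrative sketch, not Elasticsearch code: the function index_document, the exception StrictDynamicMappingError, and the toy type-guessing rule are all assumptions made for the example.

```python
class StrictDynamicMappingError(Exception):
    """Hypothetical error for an unmapped field under dynamic: strict."""


def index_document(mapping, doc, dynamic):
    """Return the fields of doc that end up searchable.

    mapping: dict of field name -> type (extended in place when dynamic is True)
    dynamic: True, False, or "strict"
    """
    searchable = []
    for field, value in doc.items():
        if field in mapping:
            searchable.append(field)
        elif dynamic is True:
            # Toy type guess; the real detection rules are richer.
            mapping[field] = "long" if isinstance(value, int) else "text"
            searchable.append(field)
        elif dynamic is False:
            # The field stays in the stored document only:
            # no mapping entry is created and it is not searchable.
            continue
        else:
            raise StrictDynamicMappingError(f"unmapped field: {field}")
    return searchable


mapping = {"name": "text", "age": "long"}
print(index_document(mapping, {"age": 19, "city": "广州"}, True))
print(mapping)
```

With dynamic set to True the mapping gains a city entry; with False the same call leaves the mapping untouched and reports only age as searchable; with "strict" it raises.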

Other mapping settings

The index parameter

# index parameter
PUT a5
{
  "mappings": {
    "doc":{
      "dynamic":"strict",
      "properties":{
        "name":{
          "type":"text"
        },
        "age":{
          "type":"long",
          "index":true
        },
        "city":{
          "type":"text",
          "index":false
        }
      }
    }
  }
}


POST a5/doc/1
{
  "name":"李尔新",
  "age":19,
  "city":"长春"
}

POST a5/doc/2
{
  "name":"周子谦",
  "age":19,
  "city":"长春"
}

GET a5/doc/_search
{
  "query": {
    "match": {
      "city": "长春"
    }
  }
}

# If a field's index parameter is false, no index is built for that field, so it cannot be used as a search condition.

The copy_to parameter

PUT a6
{
  "mappings": {
    "doc":{
      "properties":{
        "name":{
          "type":"text",
          "copy_to":"full_name"		# copy this field's content into the full_name field
        },
        "age":{
          "type":"long",
          "copy_to":"full_name"
        },
        "full_name":{		# the full_name field
          "type":"text"
        }
      }
    }
  }
}

POST a6/doc/1
{
  "name":"周子谦",
  "age":19
}

GET a6/doc/_search
{
  "query": {
    "match": {
      "name": "周子谦"
    }
  }
}

GET a6/doc/_search
{
  "query": {
    "match": {
      "full_name": 19		# full_name matches on the name as well as the age
    }
  }
}

PUT a7
{
  "mappings": {
    "doc":{
      "properties":{
        "name":{
          "type":"text",
          "copy_to":["f1","f2"]		# a field can be copied to several fields
        },
        "age":{
          "type":"long"
        },
        "f1":{
          "type":"text"
        },
        "f2":{
          "type":"text"
        }
      }
    }
  }
}

POST a7/doc/1
{
  "name":"周子谦",
  "age":19
}

GET a7/doc/_search
{
  "query": {
    "match": {
      "f1": "周子谦"		# f1 and f2 can both stand in for name as the search condition
    }
  }
}
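
The effect of copy_to can be pictured as follows. This is a rough Python sketch under the assumption that each target field simply collects the values of its source fields at index time; expand_copy_to is a hypothetical helper, not part of any Elasticsearch client.

```python
def expand_copy_to(doc, copy_to):
    """Return the values indexed under each field once copy_to is applied.

    doc: the document being indexed
    copy_to: dict of source field -> target field name or list of names
    """
    indexed = {field: [value] for field, value in doc.items()}
    for source, targets in copy_to.items():
        if source not in doc:
            continue
        if isinstance(targets, str):
            targets = [targets]
        for target in targets:
            # The target field receives a copy of the source value.
            indexed.setdefault(target, []).append(doc[source])
    return indexed


doc = {"name": "周子谦", "age": 19}
print(expand_copy_to(doc, {"name": "full_name", "age": "full_name"}))
print(expand_copy_to(doc, {"name": ["f1", "f2"]}))
```

The original document is left unchanged; only the indexed view gains the extra fields, which is why full_name, f1, and f2 can be searched even though no document ever sends them.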

Object fields (properties)

# object field
PUT a8
{
  "mappings": {
    "doc":{
      "properties":{
        "name":{
          "type":"text"
        },
        "age":{
          "type":"long"
        },
        "info":{
          "properties":{
            "addr":{
              "type":"text"
            },
            "tel":{
              "type":"text"
            }
          }
        }
      }
    }
  }
}

PUT a8/doc/1
{
  "name":"王涛",
  "age":33,
  "info":{
    "addr":"长春",
    "tel":"10018"
  }
}

GET a8/doc/_search
{
  "query": {
    "match": {
      "info.addr": "长春"
    }
  }
}

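# Tip: the normal order is to PUT the mapping, POST the documents, and only then GET to search. You can also POST a document first, run GET a8/_mapping to see the mapping Elasticsearch generated automatically, then copy it and adjust it as needed.

The ignore_above parameter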

PUT w1
{
  "mappings": {
    "doc":{
      "properties":{
        "t1":{
          "type":"keyword",
          "ignore_above": 5		# set ignore_above
        },
        "t2":{
          "type":"keyword",
          "ignore_above": 10	# set ignore_above
        }
      }
    }
  }
}
PUT w1/doc/1
{
  "t1":"elk",
  "t2":"elasticsearch"
}
GET w1/doc/_search
{
  "query":{
    "term": {
      "t1": "elk"	# searching t1 returns a hit
    }
  }
}

GET w1/doc/_search
{
  "query": {
    "term": {
      "t2": "elasticsearch"   # searching t2 returns nothing: the value exceeds the maximum length
    }
  }
}

# ignore_above sets a maximum length: a string longer than the limit is not indexed at all.
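
The rule can be written down in a few lines of Python. This is a simplified sketch that assumes the limit is compared against the character count of the whole value; indexed_keywords is a hypothetical helper used only for illustration.

```python
def indexed_keywords(doc, ignore_above):
    """Return the keyword values that would be indexed.

    doc: dict of field name -> string value
    ignore_above: dict of field name -> maximum length
    """
    indexed = {}
    for field, value in doc.items():
        limit = ignore_above.get(field)
        # A value longer than the limit is skipped entirely, not truncated.
        if limit is None or len(value) <= limit:
            indexed[field] = value
    return indexed


doc = {"t1": "elk", "t2": "elasticsearch"}
print(indexed_keywords(doc, {"t1": 5, "t2": 10}))
```

Here "elk" (3 characters) fits under the limit of 5, while "elasticsearch" (13 characters) exceeds the limit of 10, so only t1 can be found by a term query.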

Index settings

PUT a9
{
  "settings": {
    "number_of_shards": 1,		# number of primary shards for the index
    "number_of_replicas": 0		# number of replica shards per primary shard
  }
}

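The analysis process

After a document is sent to Elasticsearch and before it is added to the inverted index, Elasticsearch runs a series of operations on it:

Character filtering: transform characters with character filters (special characters, e.g. & --> and)
Tokenization: split the text into individual tokens
Token filtering: transform each token with token filters
Token indexing: store the resulting tokens in the Lucene inverted index
Reference: https://www.cnblogs.com/Neeo/articles/10401392.html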
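
The four steps above can be sketched as a tiny Python pipeline. This is a toy model, not how Lucene is implemented: the function names, the regex tokenizer, and the three-word STOP_WORDS set are assumptions chosen to keep the example short.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "is", "the"}  # hypothetical stop word list


def char_filter(text):
    # Step 1: character filtering, e.g. & --> and
    return text.replace("&", " and ")


def tokenize(text):
    # Step 2: split the text into tokens (runs of word characters)
    return re.findall(r"\w+", text)


def token_filter(tokens):
    # Step 3: transform tokens (lowercase them, drop stop words)
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text):
    return token_filter(tokenize(char_filter(text)))


def build_inverted_index(docs):
    # Step 4: map every token to the ids of the documents containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return dict(index)


print(analyze("Tom & Jerry is a show"))
print(build_inverted_index({1: "Tom & Jerry", 2: "Jerry is the best"}))
```

A search for "jerry" then becomes a dictionary lookup that returns both document ids, which is the reason the query text has to pass through the same analysis as the indexed text.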

Analyzers

Standard analyzer

POST _analyze
{
  "analyzer": "standard",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

Simple analyzer (works poorly for Asian languages)

POST _analyze
{
  "analyzer": "simple",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

Whitespace analyzer

# splits on whitespace only...
POST _analyze
{
  "analyzer": "whitespace",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}
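
The difference between the simple and whitespace analyzers can be approximated in Python. This is a rough sketch, assuming the simple analyzer splits on anything that is not a letter and lowercases the result, while the whitespace analyzer only splits on whitespace; the regex is an approximation of "letter" and the function names are made up for the example.

```python
import re


def whitespace_analyze(text):
    # Split on runs of whitespace; punctuation and case are left untouched.
    return text.split()


def simple_analyze(text):
    # Keep runs of letters only (word characters minus digits and underscore),
    # then lowercase each token.
    letters = re.findall(r"[^\W\d_]+", text)
    return [token.lower() for token in letters]


text = "To be,  That is 莎士比亚"
print(whitespace_analyze(text))
print(simple_analyze(text))
```

In both cases the Chinese name stays a single token because it contains no whitespace or non-letter characters, which illustrates why these analyzers are a poor fit for languages that do not separate words with spaces.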

Stop analyzer

POST _analyze
{
  "analyzer": "stop",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

Stop words:
1. Function words: is, the, on...
2. Lexical words: want...

Keyword analyzer

# treats the entire field as a single token; rarely used...
POST _analyze
{
  "analyzer": "keyword",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

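Pattern analyzer

# Lets us specify a pattern for splitting text into tokens, but a custom analyzer that combines the existing pattern tokenizer with the token filters you need is usually the better choice.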
POST _analyze
{
  "analyzer": "pattern",
  "explain": false, 
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

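# Let's define a custom pattern analyzer, for example with a regex for splitting email addresses.
# Note that inside a JSON string the backslash in the regex must be escaped!
PUT pattern_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer":{
          "type":"pattern",
          "pattern":"\\W|_",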
          "lowercase":true
        }
      }
    }
  }
}
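
What this pattern does to an email address can be checked offline with Python's re module. This is only an approximation of the analyzer: pattern_analyze is a hypothetical helper, and re.ASCII is used on the assumption that \W in the analyzer's regex engine treats only ASCII letters, digits, and underscore as word characters.

```python
import re


def pattern_analyze(text, pattern=r"\W|_", lowercase=True):
    """Split text wherever the pattern matches and drop empty tokens."""
    # re.ASCII: \W matches every character outside [a-zA-Z0-9_]
    tokens = [t for t in re.split(pattern, text, flags=re.ASCII) if t]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return tokens


print(pattern_analyze("John_Smith@foo-bar.com"))
```

The address is split at the underscore, the @ sign, the hyphen, and the dot. Note that in Python source the raw string r"\W|_" needs no extra escaping, whereas the JSON body of the request needs the doubled backslash.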

Language and multi-language analyzers

# also rarely used...
POST _analyze
{
  "analyzer": "chinese",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

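Snowball analyzer

# Besides the standard tokenizer and token filter (the same as the standard analyzer), it uses the lowercase and stop token filters, and it also stems the text with the Snowball stemmer.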
POST _analyze
{
  "analyzer": "snowball",
  "text":"To be or not to be,  That is a question ———— 莎士比亚"
}

Character filters (char_filter)

HTML character filter
Mapping character filter (sensitive-word filter)
Pattern character filter
Reference: https://www.cnblogs.com/Neeo/articles/10613612.html

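Tokenizers (tokenizer)

Standard tokenizer
Keyword tokenizer
Letter tokenizer (splits on non-letter characters)
Lowercase tokenizer
Whitespace tokenizer
Pattern tokenizer
UAX URL email tokenizer *
Path hierarchy tokenizer *
Reference: https://www.cnblogs.com/Neeo/articles/10402742.html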

Token filters (token filter)

Common token filters
Custom token filters
Custom lowercase token filter
Reference: https://www.cnblogs.com/Neeo/articles/10403757.html

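The ik analyzer

An open-source, lightweight toolkit for Chinese word segmentation
Reference: https://www.cnblogs.com/Neeo/articles/10614012.html
Make sure the version of the ik plugin matches the Elasticsearch version
Unzip the package and place the files in the plugins directory under the Elasticsearch installation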

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "上海自来水来自海上"
}

# finer-grained segmentation
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "上海自来水来自海上"
}

PUT ik1
{
  "mappings": {
    "doc": {
      "dynamic": false,
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word"   # use the ik analyzer
        }
      }
    }
  }
}
# add documents
PUT ik1/doc/1
{
  "content":"今天是个好日子"
}
PUT ik1/doc/2
{
  "content":"心想的事儿都能成"
}
PUT ik1/doc/3
{
  "content":"我今天不活了"
}
# searching for a Chinese word now works
GET ik1/doc/_search
{
  "query": {
    "match": {
      "content": "今天"
    }
  }
}

Original article: https://www.cnblogs.com/straightup/p/13737584.html