Mapping in Elasticsearch

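Mapping is a central concept in ES. It determines how each field in an index is stored (its data type), which analyzer parses it, whether it has sub-fields, and whether its value is copied to other fields (copy_to).
In short, the mapping defines the characteristics of the fields in an index.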

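ES detects some field data types automatically.
Detection rules:
Integer -> long
Text -> text (an analyzed string)
Specially formatted string (e.g. 2018-01-01) -> the matching special type (e.g. date)
The literals true|false -> boolean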

1 Testing search
Test data:

PUT /test_index/test_type/1
{
  "post_date": "2018-01-01",
  "title": "my first title",
  "content": "this is my first content in this test",
  "author_id": 110
}

PUT /test_index/test_type/2
{
  "post_date": "2018-01-02",
  "title": "my second title",
  "content": "this is my second content in this test",
  "author_id": 110
}

PUT /test_index/test_type/3
{
  "post_date": "2018-01-03",
  "title": "my third title",
  "content": "this is my third content in this test",
  "author_id": 110
}

Test searches (on ES 6.3.1):

GET /test_index/test_type/_search?q=2018 # unexpected: only one document is returned
GET /test_index/test_type/_search?q=2018-01-01 # correct result
GET /test_index/test_type/_search?q=post_date:2018-01-01 # correct result
GET /test_index/test_type/_search?q=post_date:2018 # only one document is returned
GET /test_index/test_type/_search?q=this # correct result
GET /test_index/test_type/_search?q=content:this # correct result

Viewing the mapping: inspect the mapping of an index to check whether it meets your requirements.

GET /index_name/_mapping/type_name

GET /test_index/_mapping/test_type

{
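  "test_index": { # index name
    "mappings": { # start of the mapping definition
      "test_type": { # type name
        "properties": { # the field mappings
          "author_id": { # "field name": { mapping info }; mapping info covers sub-fields, data type, and analyzer
            "type": "long" # the field type is long integer
          },
          "content": {
            "type": "text", # the field type is text
            "fields": { # sub-field list: ES automatically creates a sub-field for this field, named parent_field.sub_field; for text fields the default sub-field is called keyword
              "keyword": {
                "type": "keyword", # a string type that is not analyzed
                "ignore_above": 256 # strings longer than 256 characters are not indexed in this sub-field
              }
            }
          },
          "post_date": {
            "type": "date" # date type, not analyzed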
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

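The mappings ES creates automatically follow fixed rules:

Text becomes type text with the standard analyzer. A sub-field named xxx.keyword is always created, of type keyword, with ignore_above set to 256 characters.
Integers become type long.
Strings in "yyyy-MM-dd" format become type date and are not analyzed.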

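Summary:
A mapping is the data structure and related configuration that is created, automatically or manually, for a type in an index.
Dynamic mapping: ES automatically creates the index, the type, and the type's mapping. The mapping records the data type of each field and settings such as how the field is analyzed.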

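Why are the search results inconsistent? When ES builds the mapping automatically, it assigns different data types to different fields, and different data types are analyzed and searched differently. That is why searches across all fields (the _all field) and searches on the post_date field return results that differ from what you might expect (the difference is larger in older versions).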

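In ES 6.x, a date value is not split into the tokens 2018, 01, and 01. It is indexed as a single value, and a partial query value such as 2018 is completed to a default date (2018-01-01), which is why such a query matches only one document.
These two behaviors, dates that must match exactly and text that can match on individual words, are known as exact value and full text search respectively.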
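
The difference between the two behaviors can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how ES is implemented: the helper names (tokenize, full_text_search, exact_search) are made up for illustration, the tokenizer is a plain regular expression, and date values are compared as whole strings.

```python
import re

# The three test documents from section 1 (only the fields used here).
docs = {
    1: {"post_date": "2018-01-01", "content": "this is my first content in this test"},
    2: {"post_date": "2018-01-02", "content": "this is my second content in this test"},
    3: {"post_date": "2018-01-03", "content": "this is my third content in this test"},
}

def tokenize(text):
    """Split text into lowercase word tokens (rough stand-in for an analyzer)."""
    return re.findall(r"\w+", text.lower())

# Full text: build an inverted index mapping each token to the documents containing it.
inverted = {}
for doc_id, doc in docs.items():
    for token in tokenize(doc["content"]):
        inverted.setdefault(token, set()).add(doc_id)

def full_text_search(term):
    """Return ids of documents whose content contains the token."""
    return sorted(inverted.get(term.lower(), set()))

def exact_search(field, value):
    """Return ids of documents whose field equals the value as a whole."""
    return sorted(doc_id for doc_id, doc in docs.items() if doc[field] == value)

print(full_text_search("this"))                   # [1, 2, 3]
print(full_text_search("second"))                 # [2]
print(exact_search("post_date", "2018-01-01"))    # [1]
```

A single word is enough to match a full text field, while the date field only matches when the whole value is equal.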

2 Testing analysis results

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "2018-01-01 my first title this is my first content in this test 110"
}

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "I Love You"
}
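
To get a feel for what these requests return without a running cluster, the following Python sketch approximates the standard analyzer as "split on non-word characters, then lowercase". That is an assumption made for illustration: the real standard analyzer follows Unicode text segmentation rules and handles many more cases. The function name standard_analyze is hypothetical.

```python
import re

def standard_analyze(text):
    """Rough approximation of the standard analyzer: word tokens, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]

print(standard_analyze("2018-01-01 my first title"))
# ['2018', '01', '01', 'my', 'first', 'title']
print(standard_analyze("I Love You"))
# ['i', 'love', 'you']
```

Note how the date string is broken into three tokens when it is analyzed as text, which is exactly what does not happen for a field of type date.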

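3 Core mapping data types

ES has many data types; only the commonly used ones are listed here.
String: text, keyword (string before 5.0)
Integer: byte, short, integer, long
Floating point: float, double
Boolean: boolean
Date: date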

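4 Type assignment in dynamic mapping

true or false -> boolean
123 -> long
123.123 -> float (double before 5.0)
2018-01-01 -> date
hello world -> text (string before 5.0), with a keyword sub-field
Of the types assigned automatically above, only text fields need an analyzer. The default is the standard analyzer.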
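
The assignment rules can be sketched as a small Python function. This is only an illustration under simplifying assumptions: the function name infer_type is hypothetical, date detection here only recognizes the yyyy-MM-dd form (ES accepts more formats by default), and floating-point numbers are reported as float, which is what ES 5.0 and later assign.

```python
import re
from datetime import datetime

def infer_type(value):
    """Guess the field type that dynamic mapping would assign to a JSON value."""
    if value is None:
        return None          # null values do not create a mapping
    if isinstance(value, bool):
        return "boolean"     # checked before int, because bool is a subclass of int
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            try:
                datetime.strptime(value, "%Y-%m-%d")   # reject impossible dates
                return "date"
            except ValueError:
                pass
        return "text"
    raise TypeError(f"unsupported value in this sketch: {value!r}")

for sample in (True, 123, 123.123, "2018-01-01", "hello world"):
    print(repr(sample), "->", infer_type(sample))
```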

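5 Custom mapping
When creating an index and type, you can specify the mapping yourself, that is, the type of each field and the analyzer applied to its data.
A manually created mapping can only be extended with new field mappings; the mapping of an existing field cannot be changed.
For example, given an index a with a type b, you can define a mapping for field f1 and later add one for field f2, but you cannot then modify the mapping of f1.
In practice the index is usually created manually, together with its definitions such as settings and mappings.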
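
The add-only rule can be modeled with a short Python sketch. It is a deliberately strict simplification: put_mapping and MappingConflictError are hypothetical names, and the sketch rejects every change to an existing field, whereas the real rules about which parameters may be updated are more detailed.

```python
class MappingConflictError(Exception):
    """Hypothetical error type used only in this sketch."""

def put_mapping(existing, new_properties):
    """Return a new mapping with the extra fields added; refuse to change existing ones."""
    for name, definition in new_properties.items():
        if name in existing and existing[name] != definition:
            raise MappingConflictError(
                f"cannot change the mapping of existing field [{name}]"
            )
    merged = dict(existing)
    merged.update(new_properties)
    return merged

mapping = put_mapping({}, {"f1": {"type": "text"}})
mapping = put_mapping(mapping, {"f2": {"type": "long"}})   # adding f2 is allowed
try:
    put_mapping(mapping, {"f1": {"type": "keyword"}})      # changing f1 is rejected
except MappingConflictError as err:
    print(err)
```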

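5.1 Specifying the mapping when creating an index
Syntax (if test_index already exists from the earlier tests, delete it first with DELETE /test_index):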

PUT /test_index
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "test_type":{
      "properties": {
        "author_id" : {
          "type": "byte",
          "index": false
        },
        "title" : {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "keyword" : {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "content" : {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "post_date" : {
          "type": "date"
        }
      }
    }
  }
}

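"index" - whether the field is indexed and therefore searchable. Allowed values: true | false
"analyzer" - the analyzer to use (ik_max_word requires the IK analysis plugin to be installed).
"type" - the field data type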

5.2 Adding a new field mapping to an existing index
Syntax:

PUT /test_index/_mapping/test_type
{
  "properties" : {
    "new_field" : { "type" : "text" , "analyzer" : "standard" }
  }
}

5.3 Testing the analyzers of different fields

GET /test_index/_analyze
{
  "field": "new_field",
  "text": "中华人民共和国国歌"
}

GET /test_index/_analyze
{
  "field": "content",
  "text": "中华人民共和国国歌"
}

6 Custom analyzers
ES lets you define custom analyzers for an index, built on top of the components ES provides.
Example 1:

PUT /test_analyzer
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer" : {
          "type" : "standard",
          "stopwords" : "_english_"
        }
      }
    }
  }
}

GET /test_analyzer/_analyze
{
  "analyzer": "my_analyzer",
  "text": "this is a test analyzer content"
}

GET /test_analyzer/_analyze
{
  "analyzer": "standard",
  "text": "this is a test analyzer content"
}

Example 2:

PUT /test_analyzer1
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter" : {
          "type" : "mapping",
          "mappings" : [ "&=>and"]
        }
      },
      "filter":{
        "my_stopwords_filter" :{
          "type" : "stop",
          "stopwords" : [ "the", "a" ]
        }
      },
      "analyzer" : {
        "my_second_analyzer" : {
          "type" : "custom",
          "char_filter" : [ "my_char_filter" ],
          "tokenizer" : "standard",
          "filter" : [ "lowercase", "my_stopwords_filter"]
        }
      }
    }
  }
}

GET /test_analyzer1/_analyze
{
  "analyzer": "my_second_analyzer",
  "text": "this is a test analyzer content & it is second analyzer"
}
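
The three stages of a custom analyzer (character filters, then a tokenizer, then token filters) can be imitated in Python to see what the request above should roughly produce. This is a sketch under assumptions: all function names are made up, and the standard tokenizer is approximated by a regular expression.

```python
import re

def mapping_char_filter(text, mappings):
    """Character filter stage: replace each source string with its target."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text

def standard_tokenize(text):
    """Tokenizer stage: rough stand-in for the standard tokenizer."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    """Token filter stage: lowercase every token."""
    return [token.lower() for token in tokens]

def stop_filter(tokens, stopwords):
    """Token filter stage: drop tokens that are stop words."""
    return [token for token in tokens if token not in stopwords]

def my_second_analyzer(text):
    """Chain the stages in the same order as the analyzer definition."""
    text = mapping_char_filter(text, {"&": "and"})
    tokens = standard_tokenize(text)
    tokens = lowercase_filter(tokens)
    return stop_filter(tokens, {"the", "a"})

print(my_second_analyzer("this is a test analyzer content & it is second analyzer"))
# ['this', 'is', 'test', 'analyzer', 'content', 'and', 'it', 'is', 'second', 'analyzer']
```

The character filter runs on the raw text before tokenization, which is why & survives as the token and instead of being discarded as punctuation.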

Custom analyzers are relatively rare in commercial projects, except in specialized domains such as biopharmaceuticals, aviation, and securities.

Using a custom analyzer: a custom analyzer can only be used in the index that defines it.

PUT test_analyzer/_mapping/test_type
{
  "properties": {
    "field_name" : {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

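7 Complex mapping definitions
ES can also define mappings for fields with more complex values, such as multi-value fields (one field holding several values, i.e. an array), empty fields (null or an empty array []), and object fields. These are the commonly used complex types, not a complete list.

7.1 multi field (multi-value field)
Array data: { "tags" : [ "tag1", "tag2" ] }
An array field is mapped like an ordinary field; the only requirement is that all values in the array have the same type.
Test: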

PUT /test_index/test_type/1
{
  "tags" : [ "tag1", "tag2", "tag3" ],
  "name" : "zhangsan"
}

GET /test_index/_mapping/test_type

Defining the mapping manually (this only works when test_index does not exist yet):

PUT /test_index
{
  "mappings" : {
    "test_type" : {
      "properties" : {
        "tags" : { "type" : "text" , "analyzer" : "standard" },
        "name" : { "type" : "text" , "analyzer" : "english" }
      }
    }
  }
}

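7.2 empty field
Empty data: null [] [null]
If empty data is indexed and ES creates the mapping automatically, no mapping entry is created for the empty field. Any field with a defined mapping, whatever its type, can hold empty data.
Test: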

PUT /test_index/test_type/1
{
  "name" : "zhangsan",
  "empty_field" : null
}

GET /test_index/_mapping/test_type
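
Which fields of a document would receive a dynamic mapping can be sketched as follows. The helper names is_empty and dynamically_mapped_fields are hypothetical, and the sketch only models the rule stated above (null, [] and [null] create no mapping).

```python
def is_empty(value):
    """Treat null, an empty array, and an array holding only nulls as empty."""
    if value is None:
        return True
    if isinstance(value, list):
        return all(item is None for item in value)
    return False

def dynamically_mapped_fields(doc):
    """Return the names of the fields that would get a mapping entry."""
    return [name for name, value in doc.items() if not is_empty(value)]

doc = {"name": "zhangsan", "empty_field": None}
print(dynamically_mapped_fields(doc))   # ['name']
```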

7.3 object field
Object data: { "address" : { "province" : "北京", "city" : "北京", "street" : "建材城西路" } }
When object data is indexed and ES creates the mapping automatically, ES defines a mapping for every field inside the object.
Test:

PUT /test_index/test_type/1
{
  "name" : "zhangsan" ,
  "age" : 20,
  "address" : {
    "province" : "beijing",
    "city" : "beijing",
    "street" : "jian chai cheng xi lu"
  }
}

GET /test_index/_mapping/test_type

Internally, ES stores object data in a flattened form. The test document above is stored roughly as:

{
  "name" : "zhangsan",
  "age" : 20,
  "address.province" : "beijing",
  "address.city" : "beijing",
  "address.street" : "jian chai cheng xi lu"
}
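
The flattening itself is easy to reproduce. The following Python sketch (the function name flatten is an assumption, and this only mirrors the naming scheme, not the actual storage format) joins nested keys with dots:

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into a single level with dotted field names."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))   # recurse into inner objects
        else:
            flat[path] = value
    return flat

doc = {
    "name": "zhangsan",
    "age": 20,
    "address": {
        "province": "beijing",
        "city": "beijing",
        "street": "jian chai cheng xi lu",
    },
}
print(flatten(doc))
```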

Defining the mapping manually:

PUT test_index
{
  "mappings": {
    "test_type":{
      "properties": {
        "name" : {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "age" : {
          "type": "byte"
        },
        "address" : {
          "properties": {
            "province" : {
              "type" : "text",
              "analyzer" : "ik_max_word"
            },
            "city" : {
              "type" : "text",
              "analyzer" : "ik_max_word"
            },
            "street" : {
              "type" : "text",
              "analyzer" : "ik_max_word"
            }
          }
        }
      }
    }
  }
}

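A more complex case is an array of objects. When ES creates the mapping automatically for such data, it creates mappings for the fields of the objects in the array. In the example below, ES creates mappings for the name and age fields of the objects in the emps array.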

PUT /test_index/test_type/1
{
  "dept_name" : "sales", 
  "emps" : [
    { "name" : "zhangsan", "age" : 20 },
    { "name" : "lisi", "age" : 21 },
    { "name" : "wangwu", "age" : 22 }
  ]
}

GET /test_index/_mapping/test_type

This data also has its own internal storage format, roughly as follows. (If the name values are analyzed into several tokens, the emps.name array holds more entries.)

{
  "dept_name" : "sales",
  "emps.name" : [ "zhangsan", "lisi", "wangwu" ],
  "emps.age" : [20, 21, 22]
}
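
A minimal Python sketch of this transformation, assuming a hypothetical helper flatten_object_array and ignoring analysis of the values, collects the values of each inner field into one list per dotted field name:

```python
def flatten_object_array(doc):
    """Turn arrays of objects into one value list per dotted field name."""
    flat = {}
    for key, value in doc.items():
        is_object_array = (
            isinstance(value, list)
            and len(value) > 0
            and all(isinstance(item, dict) for item in value)
        )
        if is_object_array:
            for item in value:
                for sub_key, sub_value in item.items():
                    flat.setdefault(f"{key}.{sub_key}", []).append(sub_value)
        else:
            flat[key] = value
    return flat

doc = {
    "dept_name": "sales",
    "emps": [
        {"name": "zhangsan", "age": 20},
        {"name": "lisi", "age": 21},
        {"name": "wangwu", "age": 22},
    ],
}
print(flatten_object_array(doc))
```

In this flattened form the link between a particular name and its age is no longer visible; only the per-field value lists remain.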

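8 The root object of a mapping
The root object of a mapping is the JSON object that defines one type when you set up the mapping of an index. It contains properties, metadata fields (_id, _source, _all), and settings such as analyzers. The field option include_in_all was removed in 6.x, and the _all field is removed in 7.x.
In the example below, the JSON object under test_type is the root object.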

PUT /test_index9
{
  "settings" : {
    "number_of_shards" : 2,
    "number_of_replicas" : 1
  },
  "mappings" : {
    "test_type" : {
      "properties" : {
        "post_date" : { "type" : "date" },
        "title" : { "type" : "text", "index" : false },
        "content" : { "type" : "text" , "analyzer" : "english" },
        "author_id" : { "type" : "integer" }
      },
      "_all" : { "enabled" : false },
      "_source" : { "enabled" : false }
    }
  }
}

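9 Customizing the dynamic mapping strategy
You can control how ES applies dynamic mapping: whether fields outside the mapping may be added to an index, how such fields are handled, and whether an object field may gain sub-fields that are not in the mapping.
The dynamic setting can be specified per type (and per object field) in a custom mapping. It accepts three values:
true (default) - unknown fields are mapped dynamically
false - unknown fields are not mapped; the data is kept in _source but is not indexed and cannot be searched
strict - unknown fields cause the request to fail with an error
Example: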

PUT /test_index
{
  "mappings": {
    "test_type" : {
      "dynamic" : "strict",
      "properties": {
        "field1" : {
          "type": "text"
        },
        "field2" : {
          "type": "object",
          "dynamic" : false
        }
      }
    }
  }
}
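
# Rejected: field3 is not in the mapping and the type uses dynamic strict
PUT /test_index/test_type/1
{
  "field1" :"aaa",
  "field3" : "bbb"
}

# Accepted: field2 uses dynamic false, so sub_f1 and sub_f2 are kept in _source but not mapped
PUT /test_index/test_type/1
{
  "field1" : "aaa",
  "field2" : {
    "sub_f1" : "sub1",
    "sub_f2" : "sub2"
  }
}

GET /test_index/test_type/1
GET /test_index/_mapping/test_type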
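
The effect of the three dynamic values can be simulated with a small Python sketch. It is a simplified model under assumptions: apply_dynamic and StrictDynamicMappingError are hypothetical names, every newly mapped scalar is recorded as text instead of running real type detection, and inner objects inherit the policy of their parent unless they set their own.

```python
class StrictDynamicMappingError(Exception):
    """Hypothetical error type standing in for the strict-mode rejection."""

def apply_dynamic(mapping, doc, inherited="true"):
    """Update mapping in place for doc, following the dynamic policy at each level."""
    policy = str(mapping.get("dynamic", inherited)).lower()
    properties = mapping.setdefault("properties", {})
    for name, value in doc.items():
        if name not in properties:
            if policy == "strict":
                raise StrictDynamicMappingError(f"unknown field [{name}]")
            if policy == "false":
                continue   # value stays in the source document but gets no mapping
            if isinstance(value, dict):
                properties[name] = {"type": "object", "properties": {}}
            else:
                properties[name] = {"type": "text"}   # placeholder type for this sketch
        if isinstance(value, dict):
            apply_dynamic(properties[name], value, policy)

mapping = {
    "dynamic": "strict",
    "properties": {
        "field1": {"type": "text"},
        "field2": {"type": "object", "dynamic": False},
    },
}

try:
    apply_dynamic(mapping, {"field1": "aaa", "field3": "bbb"})
except StrictDynamicMappingError as err:
    print("rejected:", err)

apply_dynamic(mapping, {"field1": "aaa", "field2": {"sub_f1": "sub1", "sub_f2": "sub2"}})
print(mapping["properties"]["field2"])   # no sub_f1 or sub_f2 entries were added
```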

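A custom dynamic mapping strategy is used fairly rarely, because it is hard to design a complete structure in advance that can still adapt as business requirements change.
When it is used, it is usually for fixed data structures that almost never change, such as the information on an ID card: name, date of birth, address, ID number, photo, issuing authority, and validity period.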

Original article: https://www.cnblogs.com/yucongblog/p/11965495.html