ES Mapping in Detail

1 mapping type

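Mapping

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.

For example, a mapping can define:
- which string fields should be treated as full-text fields
- which fields contain numbers, dates, or geolocations
- the format of date values
- whether the values of all fields in the document should be indexed into the _all field
- custom templates that control the mapping of dynamically added fields

Mapping types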

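Each index has one or more mapping types, which divide the documents in the index into logical groups (a mapping type is what is usually just called a type).

Each mapping type contains:

1. Meta-fields

Meta-fields customize how the metadata associated with a document is handled. Examples include _index, _type, _id, and _source.

2. Fields or properties

Each mapping type contains a list of fields, or properties, pertinent to that type.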

Field datatypes

Each field has a datatype, which can be:

1. A simple type

string, long, boolean, ip

2. A type that supports the hierarchical nature of JSON

object, nested

3. A specialised type

geo_point, geo_shape, completion

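Dynamic mapping

Fields and mapping types do not need to be defined before they are used, thanks to dynamic mapping.

With dynamic mapping, new mapping types and new field names are added automatically as documents are indexed.

Dynamic mapping rules can be configured to customize the mappings used for new types and new fields.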

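Explicit mappings

You know more about your data than ES can guess, so while dynamic mapping is useful for getting started, at some point you will want to specify your own explicit mappings.

Explicit mappings can be defined when an index is created, or mapping types and fields can be added to an existing index with the mapping API.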

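Updating existing mappings

Other than where documented, existing type and field mappings cannot be updated, because changing a mapping would invalidate documents that are already indexed. If a mapping has to change, create a new index with the correct mapping and reindex the data into it, rather than trying to update the existing mapping.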

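Fields are shared across mapping types

Mapping types group fields logically, but the fields in each mapping type are not independent of each other.

1. Rule:

Fields with:
1. the same name
2. in the same index
3. in different mapping types
map to the same field internally, and so must have the same mapping settings.

2. Exceptions:

There are a few exceptions. The parameters:
1. copy_to
2. dynamic
3. enabled
4. ignore_above
5. include_in_all
6. properties
may be set differently on each field that falls under the rule above.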

2 field datatypes

Basic types

1. String

String fields fall into two categories: full-text and keywords.

A full-text field has its content analyzed, while a keyword field can only be searched by its exact value.

Parameters:

analyzer, boost, doc_values, fielddata, fields, ignore_above, include_in_all, index, index_options, norms, null_value, position_increment_gap, store, search_analyzer, search_quote_analyzer, similarity, term_vector

2. Numeric

The numeric types are: long, integer, short, byte, double, float.

Parameters:

coerce, boost, doc_values, ignore_malformed, include_in_all, index, null_value, precision_step, store

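3. Date

JSON has no date datatype, so dates in ES can be:
- strings containing formatted dates, such as "2015-01-01" or "2015/01/01 12:10:30"
- a long number representing milliseconds since the epoch
- an integer representing seconds since the epoch
Internally, dates are converted to UTC and stored as a long number representing milliseconds since the epoch.

If no format is specified for a date field, the default format is used (in ES 2.x, strict_date_optional_time||epoch_millis).

Parameters:

boost, doc_values, format, ignore_malformed, include_in_all, index, null_value, precision_step, store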
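
The normalization described for dates can be sketched in Python. This only illustrates the idea and is not how ES implements it: the helper name to_epoch_millis, the two accepted string layouts, and the rule that a string without a zone is read as UTC are all assumptions made for this example.

```python
from datetime import datetime, timezone

# Assumed string layouts for this sketch; real ES formats are configurable.
FORMATS = ("%Y-%m-%d", "%Y/%m/%d %H:%M:%S")

def to_epoch_millis(value, unit="ms"):
    """Hypothetical helper: normalize a date input to UTC epoch milliseconds."""
    if isinstance(value, str):
        for fmt in FORMATS:
            try:
                parsed = datetime.strptime(value, fmt)
            except ValueError:
                continue
            # Assumption: a string without a zone is interpreted as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
            return int(parsed.timestamp() * 1000)
        raise ValueError("unsupported date string: " + value)
    # Numeric input is either milliseconds or seconds since the epoch.
    return int(value) if unit == "ms" else int(value) * 1000

print(to_epoch_millis("2015-01-01"))           # 1420070400000
print(to_epoch_millis(1420070400, unit="s"))   # 1420070400000
```

Whatever shape the input takes, the stored value ends up as the same kind of long, which is why range queries and sorting on dates behave like operations on numbers.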

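4. Boolean

False values: false, "false", "off", "no", "0", "" (empty string), 0, 0.0.

True values: anything that is not false.

Aggregations such as the terms aggregation use 1 and 0 for the key, and the strings "true" and "false" for key_as_string.

Boolean fields accessed in scripts always return 1 or 0.

Parameters:

boost, doc_values, index, null_value, store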
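
As a rough Python sketch of the truthiness rule listed above (the function names es_bool and script_value are made up for this illustration and are not part of any ES client):

```python
# String values the section above lists as false; any other string is true.
FALSE_STRINGS = {"false", "off", "no", "0", ""}

def es_bool(value):
    """Illustrative only: apply the false/true rule described above."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value != 0              # 0 and 0.0 are false
    if isinstance(value, str):
        return value not in FALSE_STRINGS
    raise TypeError("unsupported value: %r" % (value,))

def script_value(value):
    # Per the text above, a script sees a boolean field as 1 or 0.
    return 1 if es_bool(value) else 0

print(es_bool("off"), es_bool("yes"), script_value("no"))  # False True 0
```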

5. Binary

The binary type accepts a binary value as a Base64-encoded string. Binary fields are not stored by default and are not searchable.

Parameters: doc_values, store

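Complex types

1. Object

JSON documents are hierarchical: a document may contain objects, which in turn may contain inner objects. Internally, however, ES indexes an object as a flat list of key-value pairs.

For example: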
```
PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": { 
      "first": "John",
      "last":  "Smith"
    }
  }
}
```
is converted to:
```
{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"  // the hierarchy is expressed with "."
}
```
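
The flattening shown above can be reproduced with a few lines of Python. This is a minimal sketch of the idea, not ES code; the function name flatten is chosen for the example.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = prefix + key
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path.
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat

doc = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
print(flatten(doc))
```

Running it on the document from the example yields the same four dotted keys: region, manager.age, manager.name.first, and manager.name.last.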
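2. Array

All values in an array must be of the same datatype.
- an array of strings: [ "one", "two" ]
- an array of integers: [ 1, 2 ]
- an array of arrays: [ 1, [ 2, 3 ]], which is equivalent to [ 1, 2, 3 ]
- an array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]
When an array field is added dynamically, the first value in the array determines the field's datatype.

An array of objects is converted internally into flat, multi-valued fields. The next section explains this in detail.

For example: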
```
PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}
```
is converted to:
```
{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}
```
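3. Array of objects

Internally, ES merges all the elements (objects) of an object array, and each field of those objects is indexed as a multi-valued field.

As a result, the association between the fields inside each array element (object) is lost. The solution is to use the nested type.

For example: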
```
PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": [
    { 
      "first": "John",
      "last":  "Smith"
    },
    { 
      "first": "Bob",
      "last":  "Leo"
    }
    ]
  }
}
```
is converted to:
```
{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": [ "John", "Bob" ],
  "manager.name.last":  [ "Smith", "Leo" ]
}
// If we search with:
"bool": {
      "must": [
        { "match": { "manager.name.first": "John" }},   // John Smith
        { "match": { "manager.name.last": "Leo"}}       // Bob Leo
      ]
}
// the document is matched, even though nobody in it is named John Leo: the association inside the John Smith and Bob Leo pairs has been lost
```
Parameters:

dynamic, enabled, include_in_all, properties

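4. Nested

The nested type is a specialised version of the object type. It allows each element (object) of an object array to be queried independently of the others, that is, the elements are not merged into a single object.

Nested documents can be:
- queried with the nested query
- analyzed with the nested and reverse_nested aggregations
- sorted with nested sorting
- retrieved and highlighted with nested inner hits
For example: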
```
PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": [
    { 
      "first": "John",
      "last":  "Smith"
    },
    { 
      "first": "Bob",
      "last":  "Leo"
    }
    ]
  }
}
```
is converted to:
```
{
  "region":             "US",
  "manager.age":        30,
  {
      "manager.name.first": "John",
      "manager.name.last": "Smith"
  },
  {
      "manager.name.first": "Bob",
      "manager.name.last": "Leo" 
  }
}
// If we search with:
"bool": {
      "must": [
        { "match": { "manager.name.first": "John" }},   // John Smith
        { "match": { "manager.name.last": "Leo"}}       // Bob Leo
      ]
}
// this query does not match the document
```
Parameters:

dynamic, include_in_all, properties
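
The difference between a plain object array and a nested field can be modelled in Python. The sketch below is a simplification under stated assumptions: flat_match imitates matching against the merged multi-valued fields, nested_match imitates what a nested query does per element, and both function names are invented for this illustration.

```python
def flat_match(objects, first, last):
    # Object array: every field becomes one multi-valued bag, so the two
    # conditions may be satisfied by different array elements.
    firsts = {o["first"] for o in objects}
    lasts = {o["last"] for o in objects}
    return first in firsts and last in lasts

def nested_match(objects, first, last):
    # Nested: each element is checked on its own, so both conditions
    # must hold for the same element.
    return any(o["first"] == first and o["last"] == last for o in objects)

names = [{"first": "John", "last": "Smith"}, {"first": "Bob", "last": "Leo"}]
print(flat_match(names, "John", "Leo"))    # True: a cross-element false positive
print(nested_match(names, "John", "Leo"))  # False
print(nested_match(names, "Bob", "Leo"))   # True
```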

Specialised types

1. IPv4 type

An IPv4 field is really a long field: it accepts an IPv4 address and stores it as a long value.

Parameters:

boost, doc_values, include_in_all, index, null_value, precision_step, store
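
To see what storing an IPv4 address as a long means, here is a small sketch using Python's standard ipaddress module. The helper names are chosen for this example; the conversion is the ordinary 32-bit numeric value of the address.

```python
import ipaddress

def ipv4_to_long(address):
    # IPv4Address validates the dotted-quad string; int() gives its 32-bit value.
    return int(ipaddress.IPv4Address(address))

def long_to_ipv4(number):
    return str(ipaddress.IPv4Address(number))

print(ipv4_to_long("192.168.1.1"))   # 3232235777
print(long_to_ipv4(3232235777))      # 192.168.1.1
```

Because addresses become plain numbers, a range of addresses is simply a numeric range.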

3 Meta-Fields

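Identity meta-fields

_index
- When querying across multiple indices, it is sometimes desirable to add query clauses that apply only to documents of certain indices.
- The _index field can be used in term and terms queries, aggregations, scripts, and sorting.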
```
GET index_1,index_2/_search
{
  "query": {
    "terms": {
      "_index": ["index_1", "index_2"] 
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "_index", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_index": { 
        "order": "asc"
      }
    }
  ],
  "script_fields": {
    "index_name": {
      "script": "doc['_index']" 
    }
  }
}
```
_type
- The _type field makes searches restricted to specific types faster.
- The _type field can be used in queries, aggregations, scripts, and sorting.
```
GET my_index/type_*/_search
{
  "query": {
    "terms": {
      "_type": [ "type_1", "type_2" ] 
    }
  },
  "aggs": {
    "types": {
      "terms": {
        "field": "_type", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_type": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "type": {
      "script": "doc['_type']" 
    }
  }
}
```
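Document source meta-fields

_source

Description
- The _source field contains the original JSON document body.
- The _source field is not indexed, but it is stored, so its value can be returned by get and search requests.
Disabling the _source field
- The _source field can be disabled in the mapping.
- Disabling _source has side effects; for example, the update API can no longer be used.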
```
PUT tweets
{
  "mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      }
    }
  }
}
```
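Including and excluding fields from _source
- The includes and excludes parameters in the _source mapping control which fields are kept in, or dropped from, the stored _source.
- Fields are given as plain field names (full dotted paths), and the names accept wildcards.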
```
PUT logs
{
  "mappings": {
    "event": {
      "_source": {
        "includes": [
          "*.count",
          "meta.*"
        ],
        "excludes": [
          "meta.description",
          "meta.other.*"
        ]
      }
    }
  }
}
```
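
How the two pattern lists in the example above interact can be sketched with Python's fnmatch module. The combination rule used here (a path is kept when it matches some include pattern and no exclude pattern) is an assumption of this sketch, as is the helper name keep_in_source; the sample field paths are invented.

```python
from fnmatch import fnmatchcase

def keep_in_source(path, includes, excludes):
    """Assumed rule: kept if included (or no includes given) and not excluded."""
    included = not includes or any(fnmatchcase(path, p) for p in includes)
    excluded = any(fnmatchcase(path, p) for p in excludes)
    return included and not excluded

includes = ["*.count", "meta.*"]
excludes = ["meta.description", "meta.other.*"]

for path in ["requests.count", "meta.name", "meta.description",
             "meta.other.foo", "message"]:
    print(path, keep_in_source(path, includes, excludes))
```

With these patterns, requests.count and meta.name are kept, while meta.description, meta.other.foo, and message are dropped.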
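Indexing meta-fields

_all

Description
- The _all field concatenates the values of all the other fields into one big string. Whatever their datatype, the values are treated as strings in _all.
- Each index has only one _all field.
- The string is analyzed and indexed, but not stored. It can be searched, but not retrieved.
- Like any other string field, the _all field accepts parameters such as analyzer, term_vectors, index_options, and store.
- Building the _all field has a cost: it takes extra CPU cycles and disk space.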
```
GET my_index/_search
{
  "query": {
    "match": {
      "_all": "john smith 1970"
    }
  }
}
```
Querying the _all field
- The query_string and simple_query_string queries search the _all field by default, unless another field is specified.
```
GET _search
{
  "query": {
    "query_string": {
      "query": "john smith 1970"
    }
  }
}
```
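Disabling the _all field
- The _all field can be completely disabled in the mapping. Once it is disabled, the query_string and simple_query_string queries only work if a default field is specified.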
```
PUT my_index
{
  "mappings": {
    "my_type": {
      "_all": {
        "enabled": false 
      },
      "properties": {
        "content": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "index.query.default_field": "content" 
  },
}
```
Excluding fields from _all
- The include_in_all parameter in a field's mapping controls whether that field is included in the _all field.
```
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": { 
          "type": "date",
          "include_in_all": false
        }
      }
    }
  }
}
```
Storing the _all field
- The store parameter controls whether the _all field is stored.
```
PUT myindex
{
  "mappings": {
    "mytype": {
      "_all": {
        "store": true
      }
    }
  }
}
```
_field_names

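Description
- The _field_names field indexes the names of every field in a document that contains a non-null value.
- It is used by the exists and missing queries to find documents that do or do not contain a given field.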
```
GET my_index/_search
{
  "query": {
    "terms": {
      "_field_names": [ "title" ] 
    }
  },
  "aggs": {
    "Field names": {
      "terms": {
        "field": "_field_names", 
        "size": 10
      }
    }
  },
  "script_fields": {
    "Field names": {
      "script": "doc['_field_names']" 
    }
  }
}
```
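Routing meta-fields

_parent

Description
- Within a single index, a parent-child relationship between documents is established by declaring one type to be the parent of another.
- The parent and child types must be different types.
- The specified parent type must not exist yet: a type that already exists cannot become the parent of another type.
- Parent and child documents must be indexed on the same shard. The parent parameter given for a child document is used as its routing value, which ensures this.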
```
PUT my_index
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent" 
      }
    }
  }
}
```
_routing
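- The _routing field determines which shard a document is indexed into: shard_num = hash(routing) % num_primary_shards
- The default _routing value is the document's _id, or the ID of its parent document.
- A custom _routing value can be supplied with the routing parameter.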
```
GET my_index/_search
{
  "query": {
    "terms": {
      "_routing": [ "user1" ] 
    }
  },
  "aggs": {
    "Routing values": {
      "terms": {
        "field": "_routing", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_routing": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "Routing value": {
      "script": "doc['_routing']" 
    }
  }
}
```
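
The routing formula shard_num = hash(routing) % num_primary_shards can be tried out in Python. Note the assumptions: zlib.crc32 is only a stand-in hash for this sketch and is not the function ES uses, so the shard numbers will not agree with a real cluster, and the helper names shard_for and route are invented for the example.

```python
import zlib

def shard_for(routing, num_primary_shards):
    # Stand-in hash for illustration only; ES uses a different hash function.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

def route(doc_id, num_primary_shards, routing=None, parent=None):
    # A custom routing value wins, then the parent ID, then the document _id.
    value = routing or parent or doc_id
    return shard_for(value, num_primary_shards)

# A child routed by its parent's ID lands on the same shard as the parent.
print(route("child-1", 5, parent="parent-1") == route("parent-1", 5))  # True
# All documents sharing a custom routing value land on one shard.
print(route("doc-1", 5, routing="user1") == route("doc-2", 5, routing="user1"))  # True
```

This is also why parent and child documents end up together: the child is hashed with the parent's ID rather than its own.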
4 mapping setting

mapping type

Mappings are typically set in the following situations:

1. When a new index is created: mapping types are added and their field mappings defined
```
PUT twitter 
{
  "mappings": {
    "tweet": {
      "properties": {
        "message": {
          "type": "string"
        }
      }
    }
  }
}
```
2. When a new mapping type is added to an index, together with its field mappings
```
PUT twitter/_mapping/user 
{
  "properties": {
    "name": {
      "type": "string"
    }
  }
}
```
3. When new field mappings are added to an existing mapping type
```
PUT twitter/_mapping/tweet 
{
  "properties": {
    "user_name": {
      "type": "string"
    }
  }
}
```
Ways to set mappings

1. Give the complete mapping in the body of the PUT request
```
PUT twitter
{
  "mappings": {                         // the mappings object: mapping settings start here
    "tweet": {                          // the mapping type
      "properties": {                   // the properties of the mapping type
        "message": {                    // the mapping for the message field
          "type": "string"              // mapping parameter
        }
      }
    }
  }
}
```
When creating an index you can define index settings in addition to mapping types, for example a custom analyzer or the number of shards
```
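PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {        // assumed definition of the token filter the analyzer refers to
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {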
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
```
2. Specify the type in the URI of the PUT request, and give the settings for that type in the request body
```
PUT twitter/_mapping/user
{
  "properties": {                   // the properties of the mapping type
    "name": {                       // the mapping for the name field
      "type": "string"              // mapping parameter
    }
  }
}
```
3. A complete mapping type definition consists of meta-field settings and field (properties) settings
```
PUT my_index
{
  "mappings": {
    "type_1": {
      "properties": {...}           // properties settings
    },
    "type_2": {
      "_all": {                     // meta-field settings
        "enabled": false
      },
      "properties": {...}
    }
  }
}
```
5 dynamic mapping

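Overview

With ES, documents can be indexed without defining any mapping in advance. ES detects the type of each field automatically and sets up the mapping accordingly. This process is called dynamic mapping.

Dynamic mapping can be turned off with the following setting.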
```
PUT /_settings 
{
  "index.mapper.dynamic":false
}
```
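The rules of dynamic mapping can also be customized, in the following ways:
1. The default mapping (_default_ mapping)
2. Dynamic field mapping
3. Dynamic templates
4. Index templates
The first three are configured for the types of a particular index, while the fourth applies to every index that matches its condition.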

Default mapping

The default mapping is defined by using _default_ as the mapping type name.

It is applied to any new type added to the index.

It can be set when the index is created, or afterwards through the PUT mapping API.
```
PUT my_index
{
  "mappings": {
    "_default_": {
      "_all": {
        "enabled": false         // the default mapping disables the _all meta-field for every new type
      }
    },
    "user": {},
    "blogpost": {
      "_all": {
        "enabled": true     // overrides the _default_ setting and enables the _all field
      }
    }
  }
}
```
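Dynamic field mapping

By default, when ES finds a new field it detects the field's datatype and adds the field to the mapping type.

Settings control how fields are mapped dynamically, including date detection, numeric detection, and custom formats for detected dates.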
```
PUT my_index         // disable date detection
{
  "mappings": {
    "my_type": {
      "date_detection": false
    }
  }
}
PUT my_index       // customize the formats used for date detection
{
  "mappings": {
    "my_type": {
      "dynamic_date_formats": ["MM/dd/yyyy"]
    }
  }
}
PUT my_index        // enable numeric detection
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}
```
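Dynamic templates

A dynamic template is applied to a newly added field when the field satisfies the template's conditions.

The conditions are:
1. match_mapping_type tests the datatype detected for the new field
2. match, unmatch, and match_pattern test the name of the new field
3. path_match and path_unmatch test the full dotted path of the new field
Dynamic templates are given as an array, each element of which is one template with its own conditions. Templates are processed in order, and the first one whose conditions a new field satisfies is applied to that field.

Two special placeholders can be used in a template: {name} and {dynamic_type}. The former stands for the name of the field, the latter for the datatype that ES detected for it.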
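```
"dynamic_templates": [                 // an array; each element is one dynamic template
    {
      "my_template_name": {            // name of the dynamic template
        ...  match conditions ...      // conditions for applying the template
        "mapping": { ... }             // the mapping to apply
      }
    },
    ...                                // further elements define further templates
  ]
```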
```
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "named_analyzers": {
            "match_mapping_type": "string",
            "match": "*",
            "mapping": {
              "type": "string",
              "analyzer": "{name}"
            }
          }
        },
        {
          "no_doc_values": {
            "match_mapping_type":"*",
            "mapping": {
              "type": "{dynamic_type}",
              "doc_values": false
            }
          }
        }
      ]
    }
  }
}
```
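
To make the selection and placeholder substitution concrete, here is a simplified Python model of the two templates above. It is a sketch under assumptions, not ES's implementation: it handles only match_mapping_type, match, and unmatch with glob patterns (no match_pattern, path_match, or path_unmatch), and the function name apply_templates is invented for the example.

```python
from fnmatch import fnmatchcase

def apply_templates(templates, field_name, dynamic_type):
    """Return the mapping of the first matching template, or None."""
    for template in templates:
        for _name, spec in template.items():
            wanted_type = spec.get("match_mapping_type", "*")
            if wanted_type not in ("*", dynamic_type):
                continue
            if not fnmatchcase(field_name, spec.get("match", "*")):
                continue
            if "unmatch" in spec and fnmatchcase(field_name, spec["unmatch"]):
                continue
            # Substitute the two placeholders in every string value.
            return {
                key: (value.replace("{name}", field_name)
                           .replace("{dynamic_type}", dynamic_type)
                      if isinstance(value, str) else value)
                for key, value in spec["mapping"].items()
            }
    return None

templates = [
    {"named_analyzers": {"match_mapping_type": "string", "match": "*",
                         "mapping": {"type": "string", "analyzer": "{name}"}}},
    {"no_doc_values": {"match_mapping_type": "*",
                       "mapping": {"type": "{dynamic_type}", "doc_values": False}}},
]

print(apply_templates(templates, "english", "string"))
# {'type': 'string', 'analyzer': 'english'}
print(apply_templates(templates, "count", "long"))
# {'type': 'long', 'doc_values': False}
```

A string field named english is caught by the first template and gets the analyzer of the same name; any non-string field falls through to the second template and keeps its detected type with doc_values turned off.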
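Index templates

An index template checks whether a newly created index satisfies its condition (templates are only applied to new indices) and, if so, applies its settings and mappings to that index.

An index template contains index settings and mapping settings.

One special placeholder is available in index templates: {index}, which stands for the name of the index that matched.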
```
PUT /_template/template_1
{
  "template": "te*",                          // condition: which indices the template applies to
  "settings": {                               // index settings
    "number_of_shards": 1
  },
  "mappings": {                               // mapping settings
    "type1": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "host_name": {
          "type": "string",
          "index": "not_analyzed"
        },
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z YYYY"
        }
      }
    }
  }
}
```
Reference: https://www.cnblogs.com/licongyu/category/819588.html

See also:
http://blog.csdn.net/napoay

Original article: https://www.cnblogs.com/candlia/p/11920031.html
