How ES auto-generates IDs during bulk indexing

How a bulk request is processed:

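1. Iterate over all the requests and preprocess each one. This mainly means resolving the routing (if the mapping defines one) and the timestamp (the current time is used if the request does not carry one). If no id is specified and action.bulk.action.allow_id_generation is set to true, a base64UUID is generated automatically as the id, and the request's opType is set to CREATE, because a document with an ES-generated id is by default a create rather than an update. (Note: annoyingly, in the latest ES code I pulled from GitHub, the id auto-generation path no longer sets opType, so it appears to follow the same logic as a request with an explicit id; see https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/index/IndexRequest.java.)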

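2. Create a shardId --> Operation map, then iterate over all the requests again to work out which shardId each one should be sent to. The lookup works like this: if the request has a routing value, that value is used; otherwise the id is used. The chosen value is then hashed. The hash function defaults to Murmur3, though you can pick a different one through the index.legacy.routing.hash.type setting. The target shard is then: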

return MathUtils.mod(hash, indexMetaData.getNumberOfShards());

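That is, the hash modulo the total number of shards gives the shardId. The shardId is the map key, and the value is a collection of BulkItemRequest objects, each built from the request and its position (index) in the bulk request. The position is kept because the bulk response reports the result of each individual request. In short, all of this exists to group the requests by shard (for load balancing).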
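The grouping in step 2 can be sketched roughly as follows. This is a minimal illustration, not the ES implementation: the class ShardGroupingSketch, the Item type and the hash helper are hypothetical names, String.hashCode stands in for Murmur3, and Math.floorMod is used on the assumption that MathUtils.mod returns a non-negative remainder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShardGroupingSketch {

    /** One bulk item: its position in the bulk request plus the fields used for routing. */
    static final class Item {
        final int bulkIndex;   // position in the original bulk request, needed for the response
        final String id;
        final String routing;  // may be null

        Item(int bulkIndex, String id, String routing) {
            this.bulkIndex = bulkIndex;
            this.id = id;
            this.routing = routing;
        }
    }

    // Stand-in hash for illustration only; ES uses Murmur3 by default per the text above.
    static int hash(String value) {
        return value.hashCode();
    }

    // Assumption: a routing value, when present, replaces the id as the hashed value.
    static int shardId(String id, String routing, int numberOfShards) {
        String effectiveRouting = routing != null ? routing : id;
        // floorMod keeps the result in [0, numberOfShards) even for negative hashes.
        return Math.floorMod(hash(effectiveRouting), numberOfShards);
    }

    // Builds the shardId -> items map described in step 2.
    static Map<Integer, List<Item>> groupByShard(List<Item> items, int numberOfShards) {
        Map<Integer, List<Item>> byShard = new TreeMap<>();
        for (Item item : items) {
            int shard = shardId(item.id, item.routing, numberOfShards);
            byShard.computeIfAbsent(shard, k -> new ArrayList<>()).add(item);
        }
        return byShard;
    }
}
```

Because each Item keeps its bulkIndex, the per-shard results can later be put back in the order of the original bulk request.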

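3. Iterate over the map built above and create a bulkShardRequest for each group, carrying the configured consistencyLevel and timeout. The primary shard is looked up in the cluster state: if the primary is on the local node, the request is executed directly; otherwise it is forwarded to the node that holds that shard.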
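The local-versus-remote decision in step 3 boils down to a lookup and a comparison. The sketch below only illustrates that decision: ShardDispatchSketch, Target and the plain Map standing in for the cluster state are hypothetical, and no actual transport or execution is modeled.

```java
import java.util.Map;

class ShardDispatchSketch {

    /** Illustrative labels for the two outcomes described in step 3. */
    enum Target { EXECUTE_LOCALLY, FORWARD_TO_PRIMARY_NODE }

    // primaryNodeByShard is a stand-in for the cluster state: shardId -> node holding the primary.
    static Target route(int shardId, Map<Integer, String> primaryNodeByShard, String localNodeId) {
        String primaryNode = primaryNodeByShard.get(shardId);
        if (primaryNode == null) {
            throw new IllegalStateException("no primary known for shard " + shardId);
        }
        return primaryNode.equals(localNodeId)
                ? Target.EXECUTE_LOCALLY
                : Target.FORWARD_TO_PRIMARY_NODE;
    }
}
```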

The ID generation algorithm from step 1 above:

In ES 1.7.1 the code lives in the class org.elasticsearch.action.index.IndexRequest:
void org.elasticsearch.action.index.IndexRequest.process(MetaData metaData, @Nullable MappingMetaData mappingMd, boolean allowIdGeneration, String concreteIndex) throws ElasticsearchException {
    ............
    // generate id if not already provided and id generation is allowed
    if (allowIdGeneration) {
        if (id == null) {
            id(Strings.base64UUID());
            // since we generate the id, change it to CREATE
            opType(IndexRequest.OpType.CREATE);
            autoGeneratedId = true;
        }
    }
    ............
}
IndexRequest org.elasticsearch.action.index.IndexRequest.id(String id)

Sets the id of the indexed document. If not set, will be automatically generated.
Parameters:
id

String org.elasticsearch.common.Strings.base64UUID()

Generates a time-based UUID (similar to Flake IDs), which is preferred when generating an ID to be indexed into a Lucene index as primary key. The id is opaque and the implementation is free to change at any time!
/**
 * Generates a time-based UUID (similar to Flake IDs), which is preferred when generating an ID to be indexed into a Lucene index as
 * primary key. The id is opaque and the implementation is free to change at any time!
 */
public static String base64UUID() {
    return TIME_UUID_GENERATOR.getBase64UUID();
}
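To make "time-based UUID (similar to Flake IDs)" more concrete, here is a minimal sketch of the general idea: a timestamp, a per-process node identifier and a sequence number packed into bytes and Base64-encoded. It is not the code behind TIME_UUID_GENERATOR, whose layout is explicitly opaque; the class TimeBasedIdSketch, the 6+6+3 byte layout and the random node id are all assumptions made for illustration.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.concurrent.atomic.AtomicInteger;

class TimeBasedIdSketch {

    // Assumption: a random 6-byte value stands in for a machine identifier such as a MAC address.
    private static final byte[] NODE_ID = new byte[6];
    // Starts at a random point so restarts are unlikely to reuse recent sequence values.
    private static final AtomicInteger SEQUENCE = new AtomicInteger(new SecureRandom().nextInt());

    static {
        new SecureRandom().nextBytes(NODE_ID);
    }

    static String nextId() {
        return nextId(System.currentTimeMillis());
    }

    // Layout (hypothetical): 6 bytes timestamp | 6 bytes node id | 3 bytes sequence = 15 bytes.
    static String nextId(long timestampMillis) {
        int seq = SEQUENCE.incrementAndGet() & 0xFFFFFF; // keep the low 24 bits
        byte[] bytes = new byte[15];
        for (int i = 0; i < 6; i++) {
            // big-endian, so ids generated later sort after earlier ones byte-wise
            bytes[i] = (byte) (timestampMillis >>> (8 * (5 - i)));
        }
        System.arraycopy(NODE_ID, 0, bytes, 6, 6);
        bytes[12] = (byte) (seq >>> 16);
        bytes[13] = (byte) (seq >>> 8);
        bytes[14] = (byte) seq;
        // 15 bytes encode to exactly 20 URL-safe characters, so no padding is needed.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }
}
```

Ids generated close together in time share a common prefix, which is the property the text above describes as preferable for a Lucene primary key, whereas a fully random UUID shares no prefix structure at all.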
References:

https://discuss.elastic.co/t/generate-id/28536/2

https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing

https://github.com/elastic/elasticsearch/pull/7531/files The history of this change in ES can be seen here: ES originally used randomBase64UUID and later switched to Flake-like IDs for performance reasons.

http://xbib.org/elasticsearch/2.1.1/apidocs/org/elasticsearch/common/Strings.html

http://www.opscoder.info/es_indexprocess1.html A detailed walkthrough of bulk insertion.

Original article: https://www.cnblogs.com/bonelee/p/6075547.html