  • [Original] Big Data Basics: Elasticsearch (4) The ES Data Import Process

    1 Prepare the analyzer

    Built-in analyzers

    Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

    Chinese word segmentation

    smartcn

    Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html

    ik

    $ bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip

    Reference: https://github.com/medcl/elasticsearch-analysis-ik
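
    Once the plugin is installed, it can help to check how an analyzer splits a sample string before committing to it in a mapping. The sketch below builds a request for the _analyze API. It assumes the same local node as the curl examples in this post; the helper names and the sample text are made up for illustration, and analyze() only works against a running node with the ik plugin installed.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: same local node as the curl examples


def build_analyze_request(text, analyzer="ik_smart"):
    """Return (url, body_bytes) for POST /_analyze with the given analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return ES_URL + "/_analyze", body.encode("utf-8")


def analyze(text, analyzer="ik_smart"):
    """Send the request and return the token strings (needs a running node)."""
    url, body = build_analyze_request(text, analyzer)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        # The response has the shape {"tokens": [{"token": "...", ...}, ...]}
        return [t["token"] for t in json.load(resp)["tokens"]]
```

    Calling analyze("中华人民共和国国歌") once with "ik_smart" and once with "ik_max_word" shows the difference in granularity between the two ik analyzers.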

    Other plugins

    Reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/index.html

    2 Create the index: prepare the mapping and decide on shards and replicas

    # curl -XPUT -H 'Content-Type: application/json' http://localhost:9200/testdoc -d '
    {
      "settings": {
        "index.number_of_shards" : 10,
        "index.number_of_routing_shards" : 30,
        "index.number_of_replicas":1,
        "index.translog.durability": "async",
        "index.merge.scheduler.max_thread_count": 1,
        "index.refresh_interval": "30s"
      },
      "mappings": {
        "_doc": { 
          "_all": {
            "enabled": false
          },
          "_source": {
            "enabled": false
          },
          "properties": { 
            "title":    { "type": "text", "analyzer": "ik_smart"}, 
            "name":     { "type": "keyword", "doc_values": false}, 
            "age":      { "type": "integer", "index": false},  
            "created":  {
              "type":   "date", 
              "format": "strict_date_optional_time||epoch_millis"
            }
          }
        }
      }
    }'

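    Where:

    _source controls whether the original JSON document is stored
    _all controls whether a catch-all inverted index is built over the whole original JSON
    analyzer specifies the analyzer used to tokenize the field
    doc_values controls whether the field is also stored in columnar form
    index controls whether the field is indexed for search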

    The _source field stores the original JSON body of the document. If you don’t need access to it you can disable it.
    By default Elasticsearch indexes and adds doc values to most fields so that they can be searched and aggregated out of the box.
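
    To make the effect of these switches concrete, the sketch below walks over the four fields of the testdoc mapping and reports what each one ends up with. It is only an illustration of the defaults as I understand them (index and doc_values default to true, and text fields do not support doc_values); the function name field_capabilities is hypothetical and nothing here talks to Elasticsearch.

```python
# The field definitions copied from the testdoc mapping above.
PROPERTIES = {
    "title": {"type": "text", "analyzer": "ik_smart"},
    "name": {"type": "keyword", "doc_values": False},
    "age": {"type": "integer", "index": False},
    "created": {"type": "date", "format": "strict_date_optional_time||epoch_millis"},
}


def field_capabilities(field):
    """Summarize which structures a single field mapping enables."""
    is_text = field["type"] == "text"
    return {
        # "index" defaults to true: the field is indexed for search
        "indexed": field.get("index", True),
        # "doc_values" defaults to true, except for text fields, which do not support it
        "doc_values": False if is_text else field.get("doc_values", True),
        # only text fields are run through an analyzer
        "analyzed": is_text,
    }


for name, field in PROPERTIES.items():
    print(name, field_capabilities(field))
```

    With this mapping, name can be searched but not sorted or aggregated on, while age can be aggregated on but not searched.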

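    Data types

    Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

    There are two string types, text and keyword. The difference is that text fields are analyzed (tokenized), while keyword fields are not.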
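
    The difference can be pictured with a toy model of the terms that end up in the index. This is a deliberately simplified sketch: lowercasing and whitespace splitting stand in for a real analyzer, which tokenizes differently (especially for Chinese), and index_terms is a hypothetical helper, not an Elasticsearch API.

```python
def index_terms(value, field_type):
    """Toy model of the terms stored for a string value.

    Assumption: lowercasing plus whitespace splitting stands in for a real
    analyzer; the built-in and ik analyzers behave differently.
    """
    if field_type == "keyword":
        return [value]  # the whole string is kept as one exact term
    if field_type == "text":
        return value.lower().split()  # analyzed into several terms
    raise ValueError("unsupported type: " + field_type)


print(index_terms("Quick Brown Fox", "keyword"))
print(index_terms("Quick Brown Fox", "text"))
```

    A keyword field therefore only matches the exact original string, while a text field can match on any of its individual terms.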

    text

    Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html

    keyword

    Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html

    3 Import data

    3.1 Call the index API

    Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
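
    For more than a handful of documents, the bulk API is usually preferable to indexing one document per request. The sketch below builds the newline-delimited body for POST /_bulk. It assumes the testdoc index and the _doc type used in this post (types are specific to 6.x-style mappings), the same local node as the curl examples, and made-up sample documents; send_bulk needs a running node.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: same local node as the curl examples


def build_bulk_body(index, docs, doc_type="_doc"):
    """Build the newline-delimited JSON body for POST /_bulk.

    docs is an iterable of (doc_id, source_dict) pairs. Each document becomes
    an action line followed by its source line; the body ends with a newline.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def send_bulk(body):
    """POST the body to /_bulk and return the parsed response (needs a running node)."""
    req = urllib.request.Request(
        ES_URL + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

    The bulk response carries a top-level "errors" flag and a per-item status, so a caller should check both rather than rely on the HTTP status alone.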

    3.2 Prepare a Hive external table

    See: https://www.cnblogs.com/barneywill/p/10300951.html

    4 Test

    # curl -XPOST -H 'Content-Type: application/json' 'http://localhost:9200/_xpack/sql?format=txt' -d '{"query":"select * from testdoc limit 10"}'

    or

    # curl -XGET 'http://localhost:9200/testdoc/_search?q=*'

    5 Problems

    Error: all nodes failed

    2019-03-27 03:14:50,091 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.1:9200] failed (Read timed out); selected next node [192.168.0.1:9200]
    2019-03-27 03:15:50,148 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.2:9200] failed (Read timed out); selected next node [192.168.0.2:9200]
    2019-03-27 03:16:50,207 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [192.168.0.3:9200] failed (Read timed out); no other nodes left - aborting...
    2019-03-27 03:16:50,208 ERROR [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: Hit error while closing operators - failing tree
    2019-03-27 03:16:50,210 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: Hive Runtime Error while closing operators
            at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
            at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
            at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
            at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:422)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
            at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
    Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.0.1:9200, 192.168.0.2:9200, 192.168.0.3:9200]]
            at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:152)
            at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:398)
            at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:362)
            at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:366)
            at org.elasticsearch.hadoop.rest.RestClient.refresh(RestClient.java:267)
            at org.elasticsearch.hadoop.rest.bulk.BulkProcessor.close(BulkProcessor.java:550)
            at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:219)
            at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
            at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.close(EsHiveOutputFormat.java:74)
            at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:190)
            at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:1047)
            at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
            at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
            at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
            at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
            at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:189)
            ... 8 more

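    Solution: increase index.number_of_shards. This setting can only be specified when the index is created; the default is 5 in Elasticsearch 6.x.

    Error: es_rejected_execution_exception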

    Caused by: org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [70/1000]. Error sample (first [5] error messages):
            org.elasticsearch.hadoop.rest.EsHadoopRemoteException: es_rejected_execution_exception: rejected execution of processing of [7622922][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[test_indix][18]] containing [38] requests, target allocation id: iLlIBScJTxahse559pTINQ, primary term: 1 on EsThreadPoolExecutor[name = 1hxgYU_/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@ce11763[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 5686436]]

    Cause:

    thread_pool.write.queue_size

    For single-document index/delete/update and bulk requests. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.

    For fixed thread pools in general, queue_size controls the size of the queue of pending requests that have no threads to execute them. The generic default is -1, which means it is unbounded, but the write pool overrides this with the queue_size of 200 quoted above. When a request comes in and the queue is full, the request is aborted.

    Check the thread_pool statistics

    # curl 'http://localhost:9200/_nodes/stats?pretty'|grep '"write"' -A 7

    This usually happens when the write rate, concurrency, or load exceeds what ES can handle: requests that do not fit into the queue are rejected.
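
    The numbers in the error message (pool size = 32, queue capacity = 200) make the arithmetic easy to follow. The sketch below is a worst-case model, assuming that a burst of shard-level bulk requests arrives before any running request finishes; submit_burst is a hypothetical helper and not part of Elasticsearch.

```python
def submit_burst(num_requests, pool_size=32, queue_size=200):
    """Model a burst of requests hitting a fixed thread pool with a bounded queue.

    Assumption: all requests arrive before any running one finishes (worst case).
    Returns a (running, queued, rejected) tuple.
    """
    running = min(num_requests, pool_size)
    queued = min(num_requests - running, queue_size)
    rejected = num_requests - running - queued
    return running, queued, rejected


print(submit_burst(300))
```

    Under this model a node absorbs at most pool_size + queue_size = 232 concurrent shard-level requests, and everything beyond that is rejected with es_rejected_execution_exception.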

    Solutions:

    1) Tune the configuration

    index.refresh_interval: -1
    index.number_of_replicas: 0
    indices.memory.index_buffer_size: 40%
    thread_pool.write.queue_size: 1024

    See: https://www.cnblogs.com/barneywill/p/10615249.html
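
    The four settings above are not applied in the same place. As far as I know, index.refresh_interval and index.number_of_replicas are dynamic index settings that can be changed through PUT /<index>/_settings, while indices.memory.index_buffer_size and thread_pool.write.queue_size are static node settings that belong in elasticsearch.yml. The sketch below separates the two groups; the helper name is hypothetical, the URL assumes the local node from the curl examples, and the restore values are taken from the index creation request in section 2.

```python
import json

ES_URL = "http://localhost:9200"  # assumption: same local node as the curl examples

# Index-level settings are dynamic and can be changed on an existing index.
BULK_LOAD_SETTINGS = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
# Values from the index creation request in section 2, to restore after the import.
RESTORE_SETTINGS = {"index": {"refresh_interval": "30s", "number_of_replicas": 1}}

# Node-level settings are static: they go into elasticsearch.yml on every node
# and need a node restart to take effect.
NODE_SETTINGS_YML = (
    "indices.memory.index_buffer_size: 40%\n"
    "thread_pool.write.queue_size: 1024\n"
)


def build_settings_request(index, settings):
    """Return (method, url, body_bytes) for PUT /<index>/_settings."""
    url = ES_URL + "/" + index + "/_settings"
    return "PUT", url, json.dumps(settings).encode("utf-8")
```

    Sending BULK_LOAD_SETTINGS before the import and RESTORE_SETTINGS afterwards keeps the index searchable and replicated in normal operation while avoiding refresh and replication overhead during the load.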

    2) Reduce the write pressure
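
    On the client side this typically means smaller batches, fewer concurrent writers, and retrying rejected batches after a pause. With es-hadoop the corresponding knobs are, to my knowledge, es.batch.size.entries, es.batch.write.retry.count and es.batch.write.retry.wait. For a hand-written loader, the idea can be sketched as follows; send is a hypothetical callable that returns True when a batch was accepted and False when it was rejected (HTTP 429), and all names and default values are illustrative.

```python
import time


def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_with_backoff(send, batch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(batch); after a rejection, wait and retry with exponential backoff.

    `send` is a hypothetical callable returning True if the batch was accepted
    and False if it was rejected. Returns True on success, False after giving up.
    """
    for attempt in range(max_retries + 1):
        if send(batch):
            return True
        if attempt < max_retries:
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** attempt)
    return False
```

    A loader would then iterate over chunks(docs, 200) and pass each batch to send_with_backoff, so that a full write queue slows the client down instead of failing the whole job.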

  • Original article: https://www.cnblogs.com/barneywill/p/10600485.html