zoukankan      html  css  js  c++  java
  • 【原创】大叔经验分享(7)创建hive表时格式如何选择

    常用格式

    textfile

    需要定义分隔符,占用空间大,读写效率最低,非常容易发生冲突(分隔符)的一种格式,基本上只有需要导入数据的时候才会使用,比如导入csv文件;

    ROW FORMAT DELIMITED

      FIELDS TERMINATED BY 'u0001'

      LINES TERMINATED BY ' '

    STORED AS TEXTFILE

    json

    hive3.0后官方支持json格式,之前需要使用第三方,导入jar,http://www.congiu.net/hive-json-serde/

    add jar hdfs://nn/jarpath/json-udf-1.3.8-jar-with-dependencies.jar;
    add jar hdfs://nn/jarpath/json-serde-1.3.8-jar-with-dependencies.jar;

    占用空间最大,读写效率低,基本上只有需要导入数据的时候才会使用,比如导入json文件;

    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE

    xml

    http://central.maven.org/maven2/com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.0.0/hivexmlserde-1.0.0.0.jar

       CREATE TABLE xml_bank(customer_id STRING, income BIGINT, demographics 
       map<string,string>, financial map<string,string>)
       ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
       WITH SERDEPROPERTIES (
       "column.xpath.customer_id"="/record/@customer_id",
       "column.xpath.income"="/record/income/text()",
       "column.xpath.demographics"="/record/demographics/*",
       "column.xpath.financial"="/record/financial/*"
        )
        TBLPROPERTIES (
        "xmlinput.start"="<record customer",
        "xmlinput.end"="</record>"
        );

    lzo

    相比textfile多了lzo压缩,占用空间更小;

    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    STORED AS INPUTFORMAT
    'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

    orc

    列式存储,占用空间最小,非常适合用来做数仓;

    STORED AS ORC

    压缩

    STORED AS ORC TBLPROPERTIES ("orc.compression"="ZLIB")

    STORED AS ORC TBLPROPERTIES ("orc.compression"="SNAPPY")

    注意设置orc压缩格式前一定要先设置:

    set hive.exec.orc.compression.strategy=COMPRESSION;

    否则压缩不生效;

    parquet

    列式存储,占用空间居中,如果后期使用spark来处理,parquet是最佳格式;

     STORED AS PARQUET

    parquet+snappy

    STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY")

    对比测试

    测试表:test_table

    测试行数:10亿

    测试sql类型:aggregation

    测试sql:select col_1, count(1) from test_table group by col_1;

    测试结果

    fs

    hdfs

    kudu

    format

    textfile

    lzo

    parquet

    parquet snappy

    orc

    orc snappy

     

    capacity

    464.0 G

    169.4 G

    177.2 G

    111.3 G

    71.5 G

    65.7G

    184 G

    100%

    36%

    37%

    23%

    15%

    14%

    39%

    Hive2.3.4

    816 s

    711 s

    250 s

    158 s

    130 s

    127 s

    Hive2.3.4

    Tuning

    251 s

    163 s

    109 s

    96 s

    Hive2.3.4

    On

    spark2.4.0

    54 s

    47 s

    149 s

    138 s

    Spark2.1.1

    371 s

    293 s

    17 s

    16 s

    51 s

    Spark2.4.0

    496 s

    297 s

    16 s

    16 s

    21 s

    21 s

    Drill1.15.0

    59 s

    57 s

    75 s

    45 s

    Impala2.12

    15 s

    16 s

    Presto0.215

    25 s

    21 s

    13 s

    12 s

     

     

    • 从数据大小和查询效率上看,表现最好的是presto+orc+snappy;
    • hive下最佳格式为orc snappy,数据大小最小,并且查询最快;
    • hive切换engine为spark后,对parquet格式的查询有一些提升,但是占用相同资源的情况下,远不如直接使用spark sql快;
    • spark2.3以后对orc格式相比之前有很大优化,已经很接近parquet格式;
    • impala+parquet+hdfs的性能和impala+kudu差不多,kudu的好处是支持实时更新;
    • drill看起来没有必要;
    • spark2.4.0中的parquet为2.4,parquet从2.5开始支持column index,预计以后的spark版本对parquet的查询会更快;
    • impala对orc的支持从3.1开始作为实验功能的一部分;

    详细数据

    yarn 200g 50core

    1 hive-2.3.4

    set mapreduce.map.memory.mb=4096;
    set mapreduce.map.java.opts=-Xmx3072m;

    hive-textfile:
    Time taken: 816.202 seconds, Fetched: 32 row(s)
    Stage-Stage-1: Map: 1831 Reduce: 1009 Cumulative CPU: 27614.77 sec HDFS Read: 498267775168 HDFS Write: 88861 SUCCESS
    Total MapReduce CPU Time Spent: 0 days 7 hours 40 minutes 14 seconds 770 msec

    hive-lzo:
    Time taken: 711.266 seconds, Fetched: 32 row(s)
    Stage-Stage-1: Map: 183 Reduce: 711 Cumulative CPU: 13949.24 sec HDFS Read: 181881436157 HDFS Write: 62935 SUCCESS
    Total MapReduce CPU Time Spent: 0 days 3 hours 52 minutes 29 seconds 240 msec

    hive-orc:
    Time taken: 130.194 seconds, Fetched: 32 row(s)
    Stage-Stage-1: Map: 275 Reduce: 300 Cumulative CPU: 4368.67 sec HDFS Read: 626004573 HDFS Write: 27178 SUCCESS
    Total MapReduce CPU Time Spent: 0 days 1 hours 12 minutes 48 seconds 670 msec

    hive-orc snappy:
    Time taken: 127.803 seconds, Fetched: 32 row(s)
    Stage-Stage-1: Map: 191 Reduce: 276 Cumulative CPU: 4374.74 sec HDFS Read: 580889407 HDFS Write: 25090 SUCCESS
    Total MapReduce CPU Time Spent: 0 days 1 hours 12 minutes 54 seconds 740 msec

    hive-orc-tuning:
    Time taken: 109.539 seconds, Fetched: 32 row(s)
    Stage-Stage-1: Map: 275 Reduce: 300 Cumulative CPU: 3051.67 sec HDFS Read: 627064673 HDFS Write: 40321 SUCCESS
    Total MapReduce CPU Time Spent: 50 minutes 51 seconds 670 msec

    hive-orc snappy-tuning:
    Time taken: 94.135 seconds, Fetched: 32 row(s)
    Stage-Stage-1: Map: 191 Reduce: 276 Cumulative CPU: 2393.92 sec HDFS Read: 581727151 HDFS Write: 37201 SUCCESS
    Total MapReduce CPU Time Spent: 39 minutes 53 seconds 920 msec

    hive-parquet:
    Time taken: 250.786 seconds, Fetched: 32 row(s)
    Stage-Stage-1: Map: 642 Reduce: 744 Cumulative CPU: 10919.85 sec HDFS Read: 873784253 HDFS Write: 65806 SUCCESS
    Total MapReduce CPU Time Spent: 0 days 3 hours 1 minutes 59 seconds 850 msec

    hive-parquet snappy:
    Time taken: 158.009 seconds, Fetched: 32 row(s)
    Stage-Stage-1: Map: 367 Reduce: 467 Cumulative CPU: 6246.0 sec HDFS Read: 721915438 HDFS Write: 41707 SUCCESS
    Total MapReduce CPU Time Spent: 0 days 1 hours 44 minutes 6 seconds 0 msec

    2 hive-2.3.4 on spark-2.4.0

    set spark.driver.memory=4g;
    set spark.executor.memory=4g;
    set spark.executor.instances=10;

    hive on spark-parquet:
    Time taken: 54.446 seconds, Fetched: 32 row(s)

    hive on spark-parquet snappy:
    Time taken: 47.364 seconds, Fetched: 32 row(s)

    hive on spark-orc:
    Time taken: 149.901 seconds, Fetched: 32 row(s)

    hive on spark-orc snappy:
    Time taken: 138.844 seconds, Fetched: 32 row(s)

    3 impala-2.12

    MEM_LIMIT=20g * 3

    impala-parquet snappy:
    Fetched 32 row(s) in 15.10s
    +--------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------------------------+
    | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
    +--------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------------------------+
    | 04:EXCHANGE | 1 | 211.45us | 211.45us | 32 | 50 | 208.00 KB | 0 B | UNPARTITIONED |
    | 03:AGGREGATE | 3 | 2.58ms | 2.91ms | 32 | 50 | 34.03 MB | 128.00 MB | FINALIZE |
    | 02:EXCHANGE | 3 | 29.23us | 30.92us | 96 | 1.04B | 32.00 KB | 0 B | HASH(cpp_addr_province) |
    | 01:AGGREGATE | 3 | 13.29s | 13.97s | 96 | 1.04B | 34.05 MB | 128.00 MB | STREAMING |
    | 00:SCAN HDFS | 3 | 723.09ms | 760.01ms | 1.04B | 1.04B | 36.55 MB | 88.00 MB | temp.app_ba_userprofile_prop_nonpolar_view_ext_ps |
    +--------------+--------+----------+----------+-------+------------+-----------+---------------+---------------------------------------------------+

    impala-kudu:
    Fetched 32 row(s) in 15.61s

    4 drill-1.15

    10g+10g+1g+1g * 3

    drill-parquet:
    32 rows selected (59.501 seconds)

    drill-parquet snappy:
    32 rows selected (57.653 seconds)

    drill-orc:
    32 rows selected (75.749 seconds)

    drill-orc snappy:
    32 rows selected (45.323 seconds)

    5 spark-sql --master yarn --num-executors 10 --executor-memory 4g --driver-memory 4g

    5.1 spark-2.1.1

    spark sql-textfile:
    Time taken: 371.77 seconds, Fetched 32 row(s)

    spark sql-lzo:
    Time taken: 293.391 seconds, Fetched 32 row(s)

    spark sql-parquet:
    Time taken: 17.338 seconds, Fetched 32 row(s)

    spark sql-parquet snappy:
    Time taken: 16.609 seconds, Fetched 32 row(s)

    spark sql-orc:
    Time taken: 51.959 seconds, Fetched 32 row(s)


    5.2 spark-2.4.0

    spark sql-textfile:
    Time taken: 496.395 seconds, Fetched 32 row(s)

    spark sql-lzo:
    Time taken: 297.142 seconds, Fetched 32 row(s)

    spark sql-parquet:
    Time taken: 16.728 seconds, Fetched 32 row(s)

    spark sql-parquet snappy:
    Time taken: 16.879 seconds, Fetched 32 row(s)

    spark sql-orc:
    Time taken: 21.432 seconds, Fetched 32 row(s)

    spark sql-orc snappy:
    Time taken: 21.935 seconds, Fetched 32 row(s)

    6 presto

    presto-parquet:
    Splits: 3,182 total, 3,182 done (100.00%)
    0:25 [1.04B rows, 612MB] [42.2M rows/s, 24.9MB/s]

    presto-parquet snappy:
    Splits: 2,088 total, 2,088 done (100.00%)
    0:21 [1.04B rows, 584MB] [49.3M rows/s, 27.8MB/s

    presto-orc:
    Splits: 1,532 total, 1,532 done (100.00%)
    0:13 [1.04B rows, 850MB] [81.7M rows/s, 66.8MB/s]

    presto-orc snappy:
    Splits: 1,353 total, 1,353 done (100.00%)
    0:12 [1.04B rows, 1.13GB] [87.5M rows/s, 97.4MB/s]

  • 相关阅读:
    xml根据属性去重。如csprj去重
    table中td的内容换行。
    基于jq的表单填充
    c#包含类文件到csprj中
    t4 根据表名数组生成实体
    js中找string中重复项最多的字符个数
    一步步配置cordova android开发环境
    .net framework卸载工具
    Sql Server查询视图和表
    DbHelper.ttinclude 更新,查询视图和表
  • 原文地址:https://www.cnblogs.com/barneywill/p/10109508.html
Copyright © 2011-2022 走看看