• Hive Basics, Part 5

Comparison of Mainstream Hive File Storage Formats

1. Storage File Compression Ratio Test

1.1 Test data
https://github.com/liufengji/Compression_Format_Data

log.txt is 18.1 M in size
    1.2 TextFile
• Create a table stored as TextFile

    create table log_text (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
    )
row format delimited fields terminated by '\t'
stored as textfile;
• Load data into the table

    load data local inpath '/home/hadoop/log.txt' into table log_text ;
• Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_text;
    +------------------------------------------------+--+
    |                   DFS Output                   |
    +------------------------------------------------+--+
    | 18.1 M  /user/hive/warehouse/log_text/log.txt  |
    +------------------------------------------------+--+
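To confirm which storage format Hive recorded for the table, its metadata can be inspected; a minimal check (the exact output layout varies by Hive version):

-- the "Storage Information" section lists the SerDe and input/output formats;
-- for a TextFile table the InputFormat is org.apache.hadoop.mapred.TextInputFormat
describe formatted log_text;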
    1.3 Parquet
• Create a table stored as Parquet

    create table log_parquet  (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
    )
row format delimited fields terminated by '\t'
stored as parquet;
• Load data into the table (inserting from log_text lets Hive rewrite the rows as Parquet; LOAD DATA would only copy the text file as-is)

    insert into table log_parquet select * from log_text;
• Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_parquet;
    +----------------------------------------------------+--+
    |                     DFS Output                     |
    +----------------------------------------------------+--+
    | 13.1 M  /user/hive/warehouse/log_parquet/000000_0  |
    +----------------------------------------------------+--+
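Note that the file is named 000000_0 because the INSERT ... SELECT ran as a job and each task writes its own output file; a quick listing of the table directory shows this (path assumes the default warehouse location used above):

dfs -ls /user/hive/warehouse/log_parquet;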
    1.4 ORC
• Create a table stored as ORC

    create table log_orc  (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
    )
row format delimited fields terminated by '\t'
stored as orc;
• Load data into the table

    insert into table log_orc select * from log_text ;
• Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_orc;
    +-----------------------------------------------+--+
    |                  DFS Output                   |
    +-----------------------------------------------+--+
    | 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
    +-----------------------------------------------+--+
1.5 Compression ratio summary
ORC > Parquet > TextFile (ORC compresses best: 2.8 M vs. 13.1 M for Parquet and 18.1 M for the original TextFile)
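As a cross-check, the three table directories can be compared in a single listing (assuming the default warehouse path used above; any other tables in the warehouse will also appear):

dfs -du -h /user/hive/warehouse;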

2. Storage File Query Speed Test

    2.1 TextFile
    select count(*) from log_text;
    +---------+--+
    |   _c0   |
    +---------+--+
    | 100000  |
    +---------+--+
    1 row selected (16.99 seconds)
    2.2 Parquet
    select count(*) from log_parquet;
    +---------+--+
    |   _c0   |
    +---------+--+
    | 100000  |
    +---------+--+
    1 row selected (17.994 seconds)
    2.3 ORC
    select count(*) from log_orc;
    +---------+--+
    |   _c0   |
    +---------+--+
    | 100000  |
    +---------+--+
    1 row selected (15.943 seconds)
2.4 Query speed summary
ORC > TextFile > Parquet (15.9 s vs. 17.0 s vs. 18.0 s in this run; the gaps are small at this data size)
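A count(*) touches very little actual data on the columnar formats (ORC and Parquet keep row counts in file metadata), so on an 18 M input the timings above are dominated by job startup. A query that actually scans a column gives a more representative comparison; a hedged sketch using the columns defined above (run it against log_text, log_parquet and log_orc in turn):

-- reads the city_id column and aggregates; the columnar formats only need
-- to scan that single column, while TextFile must read whole rows
select city_id, count(*) as cnt
from log_orc
group by city_id;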

3. Combining Storage Format and Compression

3.1 Create an uncompressed ORC table
• 1. Create an uncompressed ORC table

    create table log_orc_none (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
    )
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="NONE");
• 2. Load data

    insert into table log_orc_none select * from log_text ;
• 3. Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_orc_none;
    +----------------------------------------------------+--+
    |                     DFS Output                     |
    +----------------------------------------------------+--+
    | 7.7 M  /user/hive/warehouse/log_orc_none/000000_0  |
    +----------------------------------------------------+--+
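To verify that the compression setting was actually recorded on the table, its table properties can be listed (SHOW TBLPROPERTIES is standard HiveQL):

-- orc.compress should show the value NONE for this table
show tblproperties log_orc_none;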
3.2 Create a Snappy-compressed ORC table
• 1. Create a Snappy-compressed ORC table

    create table log_orc_snappy (
    track_time string,
    url string,
    session_id string,
    referer string,
    ip string,
    end_user_id string,
    city_id string
    )
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="SNAPPY");
• 2. Load data

    insert into table log_orc_snappy select * from log_text ;
• 3. Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_orc_snappy;
    +------------------------------------------------------+--+
    |                      DFS Output                      |
    +------------------------------------------------------+--+
    | 3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0  |
    +------------------------------------------------------+--+
3.3 Create a ZLIB-compressed ORC table
• If no compression format is specified, ORC uses ZLIB compression by default

  • Refer to the log_orc table created above (an explicit variant is sketched at the end of this subsection)

• Check the size of the table's data

dfs -du -h /user/hive/warehouse/log_orc;
    +-----------------------------------------------+--+
    |                  DFS Output                   |
    +-----------------------------------------------+--+
    | 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
    +-----------------------------------------------+--+
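For clarity, the default can also be spelled out explicitly; a minimal sketch (the table name log_orc_zlib is illustrative, not part of the original test):

create table log_orc_zlib (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="ZLIB");

insert into table log_orc_zlib select * from log_text;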
3.4 Storage and compression summary
• ORC's default ZLIB compression yields an even smaller file than Snappy (2.8 M vs. 3.8 M here).

• In real-world projects, Hive tables are usually stored as ORC or Parquet.

• Because Snappy compresses and decompresses quickly, Snappy is usually chosen as the compression codec (a Parquet + Snappy variant is sketched below).
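If Parquet is chosen instead of ORC, Snappy can be applied the same way through a table property; a hedged sketch (the table name is illustrative, and parquet.compression is the property commonly honored by Hive's Parquet writer):

create table log_parquet_snappy (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet tblproperties("parquet.compression"="SNAPPY");

insert into table log_parquet_snappy select * from log_text;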

• Original article: https://www.cnblogs.com/lojun/p/11396793.html