  • File size differences across Hive table storage formats (data with no duplicates)

    -- Key point: the target tables contain no duplicate data

    -- dbName.num_result contains no duplicate records
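    -- A hedged sketch, not part of the original test: one way to check the
    -- no-duplicates premise is to compare the total row count with the distinct
    -- row count for the slice loaded by the insert statements in this script.
    -- Assumes num_result has exactly these four columns.
    select count(1) as total_rows
    from dbName.num_result
    where p_key='9' and p_key2='0';

    select count(1) as distinct_rows
    from (
      select distinct `key`, `value`, `p_key`, `p_key2`
      from dbName.num_result
      where p_key='9' and p_key2='0'
    ) t;
    -- If total_rows equals distinct_rows, the slice has no duplicate records.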
    -- Create the tables and load the data
    CREATE TABLE dbName.test_textfile(
      `key` string, 
      `value` string,
      `p_key` string, 
      `p_key2` string)
    STORED AS textfile
    ;
    insert overwrite table dbName.test_textfile select * from dbName.num_result where p_key='9' and p_key2='0';
    
    drop table if exists dbName.test_orcfile;
    CREATE TABLE dbName.test_orcfile(
      `key` string, 
      `value` string,
      `p_key` string, 
      `p_key2` string)
    STORED AS orc
    ;
    insert overwrite table dbName.test_orcfile select * from dbName.test_textfile;
    
    CREATE TABLE dbName.test_rcfile(
      `key` string, 
      `value` string,
      `p_key` string, 
      `p_key2` string)
    STORED AS rcfile
    ;
    insert overwrite table dbName.test_rcfile select * from dbName.test_textfile;
    
    CREATE TABLE dbName.test_parquet(
      `key` string, 
      `value` string,
      `p_key` string, 
      `p_key2` string)
    STORED AS parquet
    ;
    insert overwrite table dbName.test_parquet select * from dbName.test_textfile;
    
    -- Count the rows in each table
    select count(1) as cnt from dbName.test_textfile;
    select count(1) as cnt from dbName.test_orcfile;
    select count(1) as cnt from dbName.test_rcfile;
    select count(1) as cnt from dbName.test_parquet;
    
    -- Measure the file sizes (the output follows the four commands)
    dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_text*;
    dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_par*;
    dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_rc*;
    dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_orc*;
    1.0 G  3.1 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_textfile
    1.1 G  3.3 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_parquet
    984.0 M  2.9 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_rcfile
    470.0 M  1.4 G  hdfs://nameNode/user/hive/warehouse/dbName.db/test_orcfile

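    The results show that, with no duplicate data, Parquet's compression gains nothing here: the Parquet table takes about 5% more space than the textfile table. RCFile is only about 7% smaller than textfile. ORC compresses best by far, at roughly 44% of the textfile size.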
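
    One caveat, offered as an assumption rather than something this test measured: none of the CREATE TABLE statements sets a compression codec, so each format ran with its defaults. In many Hive releases ORC compresses with ZLIB by default while Parquet is written uncompressed, which may explain part of the gap. A minimal sketch for repeating the Parquet case with an explicit codec (the table name test_parquet_snappy and the choice of SNAPPY are hypothetical, not from the original test):

    -- Hypothetical variant of test_parquet with an explicit codec
    CREATE TABLE dbName.test_parquet_snappy(
      `key` string,
      `value` string,
      `p_key` string,
      `p_key2` string)
    STORED AS parquet
    TBLPROPERTIES ('parquet.compression'='SNAPPY')
    ;
    insert overwrite table dbName.test_parquet_snappy select * from dbName.test_textfile;
    -- Measure it the same way as the other tables
    dfs -du -s -h hdfs://nameservice1/user/hive/warehouse/dbName.db/test_parquet_snappy;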

    hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_text*;
    1110741501  3332224503  hdfs://nameNode/user/hive/warehouse/dbName.db/test_textfile
    hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_par*;
    1167366639  3502099917  hdfs://nameNode/user/hive/warehouse/dbName.db/test_parquet
    hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_rc*;
    1031774688  3095324064  hdfs://nameNode/user/hive/warehouse/dbName.db/test_rcfile
    hive (dbName)> dfs -du -s hdfs://nameNode/user/hive/warehouse/dbName.db/test_orc*;
    492795434  1478386302  hdfs://nameNode/user/hive/warehouse/dbName.db/test_orcfile
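
    In each output line the first column is the raw size in bytes and the second is the space consumed including replicas (here exactly three times the first, consistent with a replication factor of 3). As a small sketch, the size of each format relative to textfile can be computed from the byte counts above; it assumes a Hive version that allows SELECT without FROM (0.13 or later):

    select round(1167366639 / 1110741501, 3) as parquet_vs_text,
           round(1031774688 / 1110741501, 3) as rcfile_vs_text,
           round(492795434 / 1110741501, 3) as orc_vs_text;
    -- Approximately 1.051, 0.929 and 0.444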
  • Original article: https://www.cnblogs.com/chenzechao/p/10072555.html