  • Using Partitions in Hive

    References:
    http://blog.csdn.net/qiaochao911/article/details/8613988
    http://xm-king.iteye.com/blog/1088422
    A Hive partition is identified by a path, not by anything stored in the physical data. For example, daily data might be partitioned as pt=20121023, which shows up in the path as /hdfs/path/pt=20121023/data_files. The advantage of path-based identification is that fetching a specific partition only requires reading the files under that path; there is no need to scan the whole table.

    Hive partitions are static by default, but sometimes partitions need to be created dynamically. With product data, for instance, you might split the table into two partitions by whether the product is online, so that later queries for online products only have to read the online partition. Dynamic partitioning is switched on with:

        set hive.exec.dynamic.partition=true;
        set hive.exec.dynamic.partition.mode=nonstrict;

    An insert can then be written as:

        insert overwrite table tbl_name partition(pt, if_online)
        select field1, field2, ..., pt, if_online
        from tbl where xxx;

    Note that the dynamic partition columns must come last in the select list. Compare this with the static-partition form:

        insert overwrite table tbl_name partition(pt=20121023, if_online=1)
        select field1, field2, ..., fieldn
        from tbl where xxx;

    One subtle difference between dynamic and static partitions: a static partition is always created, whether or not the SELECT returns any rows, while a dynamic partition is created only when the SELECT produces at least one row. Different business scenarios may therefore call for different approaches.

    Another important caveat with dynamic partitions is that the configured reduce count is applied to every partition. Say a script sets mapred.reduce.tasks=100 and there are two partition columns, pt and if_online. If the result set contains pt=20121023 with if_online=0/1, then 100 reducers are allocated for pt=20121023/if_online=0 and another 100 for pt=20121023/if_online=1; that is, the namenode handles write operations for 200 files at the same time. With many partition values this becomes a disaster and can easily bring the namenode down, which is extremely dangerous. So when using dynamic partitions, know exactly which partition values will be produced and size the reduce count sensibly, as in the sketch below.
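    A minimal sketch of one way to keep reducer and file counts under control (the table and column names here are hypothetical, not from the example above): lower the reduce count and add DISTRIBUTE BY on the partition columns, so that each partition value is written by a single reducer.

        set hive.exec.dynamic.partition=true;
        set hive.exec.dynamic.partition.mode=nonstrict;
        set mapred.reduce.tasks=10;          -- keep the reducer count modest

        -- Hypothetical tables: products (partitioned by pt, if_online) and
        -- products_src with matching data columns. DISTRIBUTE BY routes all
        -- rows sharing the same (pt, if_online) to one reducer, so each
        -- dynamic partition is written by one reducer rather than by all.
        insert overwrite table products partition(pt, if_online)
        select id, name, pt, if_online
        from products_src
        distribute by pt, if_online;

    With this layout the number of simultaneous file writes stays roughly proportional to the number of partition values, not partitions × reducers.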
    Partitions let Hive divide a table at a coarse grain into different directories to make queries more efficient. Take log files with timestamps: if the files are placed into directories by date, records from the same day are stored under the same partition directory, and queries for a particular date run much faster. The following session demonstrates this on a stocks table:

        hive (economy)> describe stocks;
        OK
        exchange        string
        symbol          string
        ymd             string
        price_open      float
        price_high      float
        price_low       float
        price_close     float
        volumn          int
        price_adj_close float
        Time taken: 0.108 seconds
        hive (economy)> alter table partition_stocks add if not exists
                      > partition(exchange='NASDAQ',symbol='ABXA');
        OK
        Time taken: 1.834 seconds
        hive (economy)> show partitions partition_stocks;
        OK
        exchange=NASDAQ/symbol=ABXA
        Time taken: 0.155 seconds
        hive (economy)> insert overwrite table partition_stocks partition(exchange='NASDAQ',
                      > symbol='ABXA')
                      > select s.ymd,s.price_open,s.price_high,s.price_low,s.price_close,s.volumn,
                      > s.price_adj_close from stocks s
                      > where s.exchange='NASDAQ' and s.symbol='ABXA';
        Total MapReduce jobs = 3
        Launching Job 1 out of 3
        Number of reduce tasks is set to 0 since there's no reduce operator
        Starting Job = job_201303271617_0021, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303271617_0021
        Kill Command = /home/landen/UntarFile/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201303271617_0021
        Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
        2013-03-31 20:50:01,631 Stage-1 map = 0%, reduce = 0%
        2013-03-31 20:50:10,656 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 8.34 sec
        2013-03-31 20:50:13,665 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 8.34 sec
        2013-03-31 20:50:16,687 Stage-1 map = 52%, reduce = 0%, Cumulative CPU 8.34 sec
        2013-03-31 20:50:19,787 Stage-1 map = 57%, reduce = 0%, Cumulative CPU 8.34 sec
        2013-03-31 20:50:22,800 Stage-1 map = 84%, reduce = 0%, Cumulative CPU 8.34 sec
        2013-03-31 20:50:25,837 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 21.0 sec
        2013-03-31 20:50:31,888 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 21.0 sec
        MapReduce Total cumulative CPU time: 21 seconds 0 msec
        Ended Job = job_201303271617_0021
        Ended Job = 61034062, job is filtered out (removed at runtime).
        Ended Job = 914476855, job is filtered out (removed at runtime).
        Launching Job 3 out of 3
        Number of reduce tasks is set to 0 since there's no reduce operator
        Starting Job = job_201303271617_0022, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201303271617_0022
        Kill Command = /home/landen/UntarFile/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201303271617_0022
        Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
        2013-03-31 20:50:40,727 Stage-3 map = 0%, reduce = 0%
        2013-03-31 20:50:46,740 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
        2013-03-31 20:50:52,763 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 1.05 sec
        MapReduce Total cumulative CPU time: 1 seconds 50 msec
        Ended Job = job_201303271617_0022
        Loading data to table economy.partition_stocks partition (exchange=NASDAQ, symbol=ABXA)
        Deleted hdfs://localhost:9000/home/landen/UntarFile/hive-0.10.0/user/hive/warehouse/economy.db/partition_stocks/exchange=NASDAQ/symbol=ABXA
        Partition economy.partition_stocks{exchange=NASDAQ, symbol=ABXA} stats: [num_files: 1, num_rows: 0, total_size: 67679, raw_data_size: 0]
        Table economy.partition_stocks stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 67679, raw_data_size: 0]
        1608 Rows loaded to partition_stocks
        MapReduce Jobs Launched:
        Job 0: Map: 2   Cumulative CPU: 21.0 sec   HDFS Read: 481098497   HDFS Write: 67679   SUCCESS
        Job 1: Map: 1   Cumulative CPU: 1.05 sec   HDFS Read: 67989   HDFS Write: 67679   SUCCESS
        Total MapReduce CPU Time Spent: 22 seconds 50 msec
        OK
        Time taken: 66.143 seconds
        hive (economy)> select * from partition_stocks
                      > where exchange = 'NASDAQ' and symbol = 'ABXA';
        2005-03-16  7.97  7.97  7.91  7.92  300200   7.92  NASDAQ  ABXA
        2005-03-15  7.92  7.97  7.92  7.97  42500    7.97  NASDAQ  ABXA
        2005-03-14  7.93  7.99  7.92  7.92  109700   7.92  NASDAQ  ABXA
        2005-03-11  7.96  7.96  7.91  7.94  188200   7.94  NASDAQ  ABXA
        2005-03-10  7.99  8.0   7.93  7.96  1275900  7.96  NASDAQ  ABXA
        2005-03-09  8.01  8.02  7.78  8.0   198600   8.0   NASDAQ  ABXA
        2005-03-08  8.1   8.1   8.0   8.01  81600    8.01  NASDAQ  ABXA
        2005-03-07  8.05  8.11  8.05  8.11  70300    8.11  NASDAQ  ABXA
        2005-03-04  8.19  8.19  8.05  8.12  116300   8.12  NASDAQ  ABXA
        2005-03-03  8.09  8.2   8.09  8.2   64200    8.2   NASDAQ  ABXA
        2005-03-02  8.29  8.45  8.18  8.2   197600   8.2   NASDAQ  ABXA
        2005-03-01  8.07  8.35  8.06  8.25  332000   8.25  NASDAQ  ABXA
        2005-02-28  8.0   8.09  7.81  8.05  241000   8.05  NASDAQ  ABXA
        2005-02-25  7.55  8.3   7.55  8.1   1114100  8.1   NASDAQ  ABXA
        2005-02-24  7.01  7.3   7.0   7.3   174100   7.3   NASDAQ  ABXA
        2005-02-23  7.22  7.22  6.9   7.06  542100   7.06  NASDAQ  ABXA
        2005-02-22  7.56  7.56  7.25  7.26  209000   7.26  NASDAQ  ABXA
        2005-02-18  7.7   7.8   7.58  7.58  1525500  7.58  NASDAQ  ABXA
        2005-02-17  7.41  7.7   7.41  7.65  75200    7.65  NASDAQ  ABXA
        2005-02-16  7.51  7.51  7.38  7.45  1297600  7.45  NASDAQ  ABXA
        2005-02-15  7.58  7.61  7.41  7.56  248400   7.56  NASDAQ  ABXA
        2005-02-14  7.78  7.81  7.53  7.6   1487200  7.6   NASDAQ  ABXA
        2005-02-11  7.92  7.92  7.73  7.88  265400   7.88  NASDAQ  ABXA
        2005-02-10  8.03  8.03  7.91  7.93  205100   7.93  NASDAQ  ABXA
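    The session assumes partition_stocks was already created with exchange and symbol as partition columns. The DDL is not shown in the session; a plausible sketch (the data columns mirror describe stocks above, with the partition columns moved into PARTITIONED BY) would be:

        create table partition_stocks (
            ymd             string,
            price_open      float,
            price_high      float,
            price_low       float,
            price_close     float,
            volumn          int,
            price_adj_close float
        )
        partitioned by (exchange string, symbol string);

        -- Loading all exchange/symbol pairs at once with dynamic partitions,
        -- instead of one static-partition insert per pair (a sketch; note the
        -- partition-count limits discussed below):
        set hive.exec.dynamic.partition=true;
        set hive.exec.dynamic.partition.mode=nonstrict;
        insert overwrite table partition_stocks partition(exchange, symbol)
        select s.ymd, s.price_open, s.price_high, s.price_low, s.price_close,
               s.volumn, s.price_adj_close, s.exchange, s.symbol
        from stocks s;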

    Now let's look at the dynamic partitioning properties:
    1. hive.exec.dynamic.partition = false by default. Set to true to enable dynamic partitioning;
    2. hive.exec.dynamic.partition.mode = strict by default. Set to nonstrict to enable all partitions to be determined dynamically;
    3. hive.exec.max.dynamic.partitions.pernode = 100 by default. The maximum number of dynamic partitions that can be created by each mapper or reducer. Raises a fatal error if
    one mapper or reducer attempts to create more than this threshold;
    4. hive.exec.max.dynamic.partitions = 1000 by default. The total number of dynamic partitions that can be created by one statement with dynamic partitioning. Raises a fatal error
    if the limit is exceeded;
    5. hive.exec.max.created.files = 100000 by default. The maximum total number of files that can be created globally. A Hadoop counter is used to track the number of files created.
    Raises a fatal error if the limit is exceeded.
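    As a minimal sketch (the values are illustrative, not recommendations), a statement that legitimately creates many partitions can raise these limits up front:

        set hive.exec.dynamic.partition=true;
        set hive.exec.dynamic.partition.mode=nonstrict;
        set hive.exec.max.dynamic.partitions.pernode=500;   -- per mapper/reducer
        set hive.exec.max.dynamic.partitions=2000;          -- per statement
        set hive.exec.max.created.files=200000;             -- global file cap
        -- ...then run the dynamic-partition insert as usual.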
  • Original article: https://www.cnblogs.com/likai198981/p/2996366.html