  • Hive partitioned tables and bucketed tables

    1. Creating a partitioned table

    hive> create table weather_list(year int,data int) partitioned by (createtime string,area string) row format delimited fields terminated by ",";

    Modifying the table:

    hive> alter table weather_list change data new_data int;
    hive> alter table weather_list change year new_year int;

    1.1 Loading data

    hive> load data local inpath '/home/hadoop/sampler/w2.csv' into table weather_list partition(createtime='2011-01-01',area='bj');
    Loading data to table busdata.weather_list partition (createtime=2011-01-01, area=bj)
    OK
    Time taken: 1.455 seconds
    hive> load data local inpath '/home/hadoop/sampler/w3.csv' into table weather_list partition(createtime='2011-01-02',area='sc');
    Loading data to table busdata.weather_list partition (createtime=2011-01-02, area=sc)
    OK
    Time taken: 1.394 seconds
    hive> load data local inpath '/home/hadoop/sampler/w4.csv' into table weather_list partition(createtime='2011-01-03',area='tj');
    Loading data to table busdata.weather_list partition (createtime=2011-01-03, area=tj)
    OK
    Time taken: 1.568 seconds
    hive> load data local inpath '/home/hadoop/sampler/w4.csv' into table weather_list partition(createtime='2011-01-04',area='sz');
    Loading data to table busdata.weather_list partition (createtime=2011-01-04, area=sz)
    OK
    Time taken: 1.209 seconds
    hive> load data local inpath '/home/hadoop/sampler/w5.csv' into table weather_list partition(createtime='2011-01-05',area='gz');
    Loading data to table busdata.weather_list partition (createtime=2011-01-05, area=gz)
    OK
    Time taken: 1.148 seconds
    hive> load data local inpath '/home/hadoop/sampler/w5.csv' into table weather_list partition(createtime='2011-01-01',area='gz');
    Loading data to table busdata.weather_list partition (createtime=2011-01-01, area=gz)
    OK
    Time taken: 1.278 seconds

    Partition columns are reflected only in the storage directory layout; they have no relation to the fields actually stored inside the data files.

    hive> dfs -ls /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz;
    Found 6 items
    -rw-r--r--   1 hadoop supergroup      18018 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w1.csv
    -rw-r--r--   1 hadoop supergroup      18022 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w2.csv
    -rw-r--r--   1 hadoop supergroup      18028 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w3.csv
    -rw-r--r--   1 hadoop supergroup      18022 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w4.csv
    -rw-r--r--   1 hadoop supergroup      18027 2019-03-05 22:12 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w5.csv
    -rw-r--r--   1 hadoop supergroup      18027 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz/w5_copy_1.csv
    hive> dfs -ls /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01;
    Found 2 items
    drwxr-xr-x   - hadoop supergroup          0 2019-03-05 22:09 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=bj
    drwxr-xr-x   - hadoop supergroup          0 2019-03-05 22:14 /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz
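    The listings above show Hive's `key=value` directory convention: one subdirectory per partition column, nested in declaration order. A minimal sketch of that mapping (the helper function is hypothetical, not part of Hive):

```python
# Sketch: how a Hive partition spec maps to warehouse subdirectories.
# One "col=value" path segment per partition column, in declaration order.
def partition_path(table_root, partition_spec):
    segments = [f"{col}={val}" for col, val in partition_spec]
    return "/".join([table_root] + segments)

path = partition_path(
    "/hive/warehouse/busdata.db/weather_list",
    [("createtime", "2011-01-01"), ("area", "gz")],
)
print(path)
# /hive/warehouse/busdata.db/weather_list/createtime=2011-01-01/area=gz
```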

    1.2 Showing partition information

    hive> show partitions weather_list;
    OK
    createtime=2010-01-01/area=bj
    createtime=2010-01-01/area=sh
    createtime=2010-01-01/area=yn
    createtime=2010-01-02/area=sh
    createtime=2011-01-01/area=bj
    createtime=2011-01-01/area=gz
    createtime=2011-01-02/area=sc
    createtime=2011-01-03/area=tj
    createtime=2011-01-04/area=sz
    createtime=2011-01-05/area=gz
    Time taken: 0.584 seconds, Fetched: 10 row(s)

    1.3 Partition columns are regular columns of the table, but their values are not stored in the data files; they are read back from the directory names.

    hive> select * from weather_list where area='bj' limit 10;
    OK
    1999    71    2010-01-01    bj
    1994    57    2010-01-01    bj
    1995    33    2010-01-01    bj
    1993    44    2010-01-01    bj
    1994    99    2010-01-01    bj
    1994    83    2010-01-01    bj
    1995    59    2010-01-01    bj
    1991    32    2010-01-01    bj
    1992    74    2010-01-01    bj
    2000    56    2010-01-01    bj
    Time taken: 2.527 seconds, Fetched: 10 row(s)
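    Since the partition values live only in the directory names, reading a row means recombining file contents with values parsed from the file's path. A rough illustration of that parsing step (hypothetical helper, not Hive internals):

```python
# Sketch: recover partition column values from a warehouse file path.
# Only "key=value" path segments carry partition information; other
# segments (database dir, table dir, file name) are skipped.
def parse_partitions(path):
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

p = parse_partitions(
    "/hive/warehouse/busdata.db/weather_list/createtime=2010-01-01/area=bj/w1.csv"
)
print(p)  # {'createtime': '2010-01-01', 'area': 'bj'}
```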

     2. Bucketed tables

    2.1 Check the bucketing setting. Enabling hive.enforce.bucketing makes Hive bucket data automatically on insert: rows are assigned to buckets by hashing the bucket column and taking the remainder modulo the bucket count. Alternatively, you can bucket the data yourself before loading it.

    hive> set hive.enforce.bucketing;
    hive.enforce.bucketing=false
    
    hive> set hive.enforce.bucketing=true;
    
    hive> set hive.enforce.bucketing;
    hive.enforce.bucketing=true
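    The hash-and-modulo assignment can be sketched in a few lines. For an int column, classic Hive bucketing hashes to the value itself, which is why two buckets split rows by odd/even userid, as the bucket files later in this post show (note that newer Hive versions with bucketing_version=2 use Murmur3, so the exact distribution can differ):

```python
# Sketch of bucket assignment: hash the bucket column, then take the
# remainder modulo the bucket count. Assumes the classic identity hash
# for int columns, so 2 buckets split rows into even/odd userid.
def bucket_index(userid, num_buckets=2):
    return userid % num_buckets  # int hash == the value itself

buckets = {0: [], 1: []}
for uid in [1, 2, 3, 4, 5, 6]:
    buckets[bucket_index(uid)].append(uid)
print(buckets)  # {0: [2, 4, 6], 1: [1, 3, 5]}
```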

    2.2 Creating a bucketed table

    hive> create table bucket_userinfo(userid int,username string) clustered by (userid) sorted by (userid asc) into 2 buckets row format delimited fields terminated by ",";
    
    hive> desc formatted bucket_userinfo;
    OK
    # col_name                data_type               comment             
    userid                  int                                         
    username                string                                      
              
    # Detailed Table Information          
    Database:               busdata                  
    OwnerType:              USER                     
    Owner:                  hadoop                   
    CreateTime:             Wed Mar 06 23:11:37 CST 2019     
    LastAccessTime:         UNKNOWN                  
    Retention:              0                        
    Location:               hdfs://bigdata-senior01.home.com:9000/hive/warehouse/busdata.db/bucket_userinfo     
    Table Type:             MANAGED_TABLE            
    Table Parameters:          
        COLUMN_STATS_ACCURATE    {"BASIC_STATS":"true","COLUMN_STATS":{"userid":"true","username":"true"}}
        SORTBUCKETCOLSPREFIX    TRUE                
        bucketing_version       2                   
        numFiles                0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime    1551885097          
              
    # Storage Information          
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe     
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat     
    Compressed:             No                       
    Num Buckets:            2                        
    Bucket Columns:         [userid]                 
    Sort Columns:           [Order(col:userid, order:1)]     
    Storage Desc Params:          
        field.delim             ,                   
        serialization.format    ,                   
    Time taken: 0.379 seconds, Fetched: 34 row(s)

    2.3 Letting Hive bucket automatically; this applies when the source data has already been loaded into Hive.

    hive> insert overwrite table bucket_userinfo select userid,username from userinfo;
    
    Hive then launches a job that buckets the data during the insert. This example uses two buckets, so two files are ultimately produced, split by odd/even userid.
    hive> dfs -ls /hive/warehouse/busdata.db/bucket_userinfo;
    Found 2 items
    -rw-r--r--   1 hadoop supergroup        106 2019-03-06 23:13 /hive/warehouse/busdata.db/bucket_userinfo/000000_0
    -rw-r--r--   1 hadoop supergroup        103 2019-03-06 23:13 /hive/warehouse/busdata.db/bucket_userinfo/000001_0
    hive> dfs -cat /hive/warehouse/busdata.db/bucket_userinfo/000000_0;
    2,xu.dm
    4,user123
    6,user2
    8,user4
    10,user6
    14,user8
    16,user10
    18,user12
    20,user14
    22,soldier2
    24,soldier4
    hive> dfs -cat /hive/warehouse/busdata.db/bucket_userinfo/000001_0;
    1,admin
    3,myuser
    5,user1
    7,user3
    9,user5
    13,user7
    15,user9
    17,user11
    19,user13
    21,soldier1
    23,soldier3
    hive> select * from bucket_userinfo;
    OK
    2    xu.dm
    4    user123
    6    user2
    8    user4
    10    user6
    14    user8
    16    user10
    18    user12
    20    user14
    22    soldier2
    24    soldier4
    1    admin
    3    myuser
    5    user1
    7    user3
    9    user5
    13    user7
    15    user9
    17    user11
    19    user13
    21    soldier1
    23    soldier3
    Time taken: 0.238 seconds, Fetched: 22 row(s)

     2.4 Loading data from an external file; the result is the same as above

    hive> create table bucket_userinfo2(userid int,username string) clustered by (userid) sorted by (userid) into 2 buckets row format delimited fields terminated by ",";
    hive> load data local inpath '/home/hadoop/userinfo2.txt' into table bucket_userinfo2;
    hive> select * from bucket_userinfo2;
    OK
    2    xu.dm
    4    user123
    6    user2
    8    user4
    10    user6
    14    user8
    16    user10
    18    user12
    20    user14
    22    soldier2
    24    soldier4
    1    admin
    3    myuser
    5    user1
    7    user3
    9    user5
    13    user7
    15    user9
    17    user11
    19    user13
    21    soldier1
    23    soldier3
    hive> dfs -ls /hive/warehouse/busdata.db/bucket_userinfo2;
    Found 2 items
    -rw-r--r--   1 hadoop supergroup        106 2019-03-07 22:44 /hive/warehouse/busdata.db/bucket_userinfo2/000000_0
    -rw-r--r--   1 hadoop supergroup        103 2019-03-07 22:44 /hive/warehouse/busdata.db/bucket_userinfo2/000001_0

    2.5 Sampling bucketed data

    hive> select * from bucket_userinfo tablesample(bucket 1 out of 2 on userid);
    OK
    2    xu.dm
    6    user2
    10    user6
    20    user14
    3    myuser
    7    user3
    17    user11
    19    user13
    21    soldier1
    Time taken: 0.077 seconds, Fetched: 9 row(s)
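    The documented semantics of TABLESAMPLE(BUCKET x OUT OF y ON col) is to keep rows whose hash(col) % y equals x - 1. A sketch under the simple identity-hash assumption for ints (note the sampled rows above mix odd and even userids, so this Hive build evidently hashed with a different function, e.g. Murmur3 under bucketing_version=2):

```python
# Sketch of TABLESAMPLE(BUCKET x OUT OF y ON col): keep rows whose
# hash(col) % y == x - 1. Assumes the identity hash for int columns;
# the actual rows Hive selects depend on its hash implementation.
def sample_bucket(values, x, y):
    return [v for v in values if v % y == x - 1]

print(sample_bucket(range(1, 11), 1, 2))  # [2, 4, 6, 8, 10]
```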
  • Original article: https://www.cnblogs.com/asker009/p/10480427.html