  • Hive Basic Exercises, Part 2

    Below are basic Hive exercises; the list is being expanded over time.

    How many ways are there to export data from Hive, and how does each work?

    1.insert

    # Export goes either to the local filesystem or to HDFS; the output can also be formatted with a specified delimiter
    # Export to the local filesystem
    0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu;
    INFO  : Compiling command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394): insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394); Time taken: 0.107 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394): insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu
    INFO  : Query ID = hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394
    INFO  : Total jobs = 1
    INFO  : Launching Job 1 out of 1
    INFO  : Starting task [Stage-1:MAPRED] in serial mode
    INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
    INFO  : Starting Job = job_1573910690864_0002, Tracking URL = http://node01:8088/proxy/application_1573910690864_0002/
    INFO  : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job  -kill job_1573910690864_0002
    INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO  : 2019-11-16 22:19:40,957 Stage-1 map = 0%,  reduce = 0%
    INFO  : 2019-11-16 22:19:42,002 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.51 sec
    INFO  : MapReduce Total cumulative CPU time: 1 seconds 510 msec
    INFO  : Ended Job = job_1573910690864_0002
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Copying data to local directory /kkb/install/hivedatas/stu3 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-19-32_776_5008666227900564137-1/-mr-10000
    INFO  : MapReduce Jobs Launched:
    INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 1.51 sec   HDFS Read: 3381 HDFS Write: 285797 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 1 seconds 510 msec
    INFO  : Completed executing command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394); Time taken: 10.251 seconds
    INFO  : OK
    No rows affected (10.383 seconds)
    # Inspect the local file
    [hadoop@node01 /kkb/install/hivedatas/stu3]$ cat 000000_0
    1clyang
    
    # Export to HDFS
    0: jdbc:hive2://node01:10000> insert overwrite directory '/kkb/stu' select * from stu;
    INFO  : Compiling command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852): insert overwrite directory '/kkb/stu' select * from stu
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852); Time taken: 0.173 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852): insert overwrite directory '/kkb/stu' select * from stu
    INFO  : Query ID = hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852
    INFO  : Total jobs = 3
    INFO  : Launching Job 1 out of 3
    INFO  : Starting task [Stage-1:MAPRED] in serial mode
    INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
    INFO  : Starting Job = job_1573910690864_0003, Tracking URL = http://node01:8088/proxy/application_1573910690864_0003/
    INFO  : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job  -kill job_1573910690864_0003
    INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO  : 2019-11-16 22:24:13,962 Stage-1 map = 0%,  reduce = 0%
    INFO  : 2019-11-16 22:24:15,018 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.46 sec
    INFO  : MapReduce Total cumulative CPU time: 1 seconds 460 msec
    INFO  : Ended Job = job_1573910690864_0003
    INFO  : Starting task [Stage-6:CONDITIONAL] in serial mode
    INFO  : Stage-3 is selected by condition resolver.
    INFO  : Stage-2 is filtered out by condition resolver.
    INFO  : Stage-4 is filtered out by condition resolver.
    INFO  : Starting task [Stage-3:MOVE] in serial mode
    INFO  : Moving data to: hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10000 from hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10002
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Moving data to: /kkb/stu from hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10000
    INFO  : MapReduce Jobs Launched:
    INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 1.46 sec   HDFS Read: 3315 HDFS Write: 286719 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 1 seconds 460 msec
    INFO  : Completed executing command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852); Time taken: 9.044 seconds
    INFO  : OK
    # Inspect the HDFS output
    [hadoop@node01 /kkb/install/hivedatas/stu3]$ hdfs dfs -cat /kkb/stu/000000_0
    19/11/16 22:26:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    1clyang
    
    # A custom field delimiter can be specified for the export; local export is shown here
    0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu;
    INFO  : Compiling command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f): insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f); Time taken: 0.128 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f): insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu
    INFO  : Query ID = hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f
    INFO  : Total jobs = 1
    INFO  : Launching Job 1 out of 1
    INFO  : Starting task [Stage-1:MAPRED] in serial mode
    INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
    INFO  : Starting Job = job_1573910690864_0005, Tracking URL = http://node01:8088/proxy/application_1573910690864_0005/
    INFO  : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job  -kill job_1573910690864_0005
    INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO  : 2019-11-16 22:31:27,083 Stage-1 map = 0%,  reduce = 0%
    INFO  : 2019-11-16 22:31:28,139 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.93 sec
    INFO  : MapReduce Total cumulative CPU time: 1 seconds 930 msec
    INFO  : Ended Job = job_1573910690864_0005
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Copying data to local directory /kkb/install/hivedatas/stu4 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-31-20_415_1737902713220629568-1/-mr-10000
    INFO  : MapReduce Jobs Launched:
    INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 1.93 sec   HDFS Read: 3526 HDFS Write: 286073 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 1 seconds 930 msec
    INFO  : Completed executing command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f); Time taken: 8.707 seconds
    INFO  : OK
    # Inspect the local file; fields are now separated by @
    [hadoop@node01 /kkb/install/hivedatas/stu4]$ cat 000000_0
    1@clyang
    

    2. hadoop command

    Data stored through Hive lives on HDFS, so it can also be pulled straight to the local filesystem with the get command.

    hdfs dfs -get /user/hive/warehouse/student/student.txt /opt/bigdata/data

    3. Shell redirection export (overwrite or append)

    Use bin/hive -e '<SQL statement>' or bin/hive -f <SQL script> and redirect the output to a file, either overwriting (>) or appending (>>). The first form is demonstrated below; a script passed to -f is essentially just the same SQL statements saved in a file.
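    Only the -e form is demonstrated below, so here is a minimal sketch of the -f variant (the script file name export_stu.sql is illustrative, not from the original session):

    ```sql
    -- export_stu.sql: the same query as in the -e example
    select * from db_hive.stu;
    -- run from the shell, overwriting or appending exactly as with -e:
    --   bin/hive -f export_stu.sql >  /kkb/install/hivedatas/student2.txt
    --   bin/hive -f export_stu.sql >> /kkb/install/hivedatas/student2.txt
    ```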

    # Overwrite
    [hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' > /kkb/install/hivedatas/student2.txt
    ls: cannot access /kkb/install/spark/lib/spark-assembly-*.jar: No such file or directory
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    2019-11-16 22:37:46,342 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    19/11/16 22:37:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    
    Logging initialized using configuration in file:/kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
    OK
    Time taken: 6.966 seconds, Fetched: 1 row(s)
    You have new mail in /var/spool/mail/root
    # Inspect the result
    [hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
    stu.id	stu.name
    1	clyang
    # Append
    [hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' >> /kkb/install/hivedatas/student2.txt
    ls: cannot access /kkb/install/spark/lib/spark-assembly-*.jar: No such file or directory
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    2019-11-16 22:39:03,442 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    19/11/16 22:39:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    
    Logging initialized using configuration in file:/kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
    OK
    Time taken: 6.056 seconds, Fetched: 1 row(s)
    You have new mail in /var/spool/mail/root
    # Inspect the result after appending
    [hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
    stu.id	stu.name
    1	clyang
    stu.id	stu.name
    1	clyang
    

    4. export to HDFS

    # Export
    0: jdbc:hive2://node01:10000> export table stu to '/kkb/studentexport';
    INFO  : Compiling command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a): export table stu to '/kkb/studentexport'
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a); Time taken: 0.126 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a): export table stu to '/kkb/studentexport'
    INFO  : Starting task [Stage-0:COPY] in serial mode
    INFO  : Copying data from file:/tmp/hadoop/e951940a-bcb6-4cd4-be17-0baf5d13615f/hive_2019-11-05_09-43-30_802_7299251851779747447-1/-local-10000/_metadata to hdfs://node01:8020/kkb/studentexport
    INFO  : Copying file: file:/tmp/hadoop/e951940a-bcb6-4cd4-be17-0baf5d13615f/hive_2019-11-05_09-43-30_802_7299251851779747447-1/-local-10000/_metadata
    INFO  : Starting task [Stage-1:COPY] in serial mode
    INFO  : Copying data from hdfs://node01:8020/user/hive/warehouse/db_hive.db/stu to hdfs://node01:8020/kkb/studentexport/data
    INFO  : Copying file: hdfs://node01:8020/user/hive/warehouse/db_hive.db/stu/000000_0
    INFO  : Completed executing command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a); Time taken: 0.604 seconds
    INFO  : OK
    
    # Inspect the data
    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport
    19/11/17 20:29:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 2 items
    -rwxr-xr-x   3 anonymous supergroup       1330 2019-11-05 09:43 /kkb/studentexport/_metadata
    drwxr-xr-x   - anonymous supergroup          0 2019-11-05 09:43 /kkb/studentexport/data
    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport/data
    19/11/17 20:29:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 1 items
    -rwxr-xr-x   3 anonymous supergroup          9 2019-11-05 09:43 /kkb/studentexport/data/000000_0
    You have new mail in /var/spool/mail/root
    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -cat /kkb/studentexport/data/000000_0
    19/11/17 20:29:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    1clyang
    
    

    The difference between partitioning and bucketing

    Partitioning works at the directory level: data is stored in separate folders, one per partition value. Bucketing works at the file level: a single data set is split into several file segments by hashing a chosen column and taking the remainder. Each has its own use cases:

    (1) Partitioning is used to store data by date, day, or hour, so later queries can quickly locate the relevant data and avoid a slow full-table scan.

    (2) Bucketing is finer-grained storage: you specify the number of buckets n, and one data set is split into n files. For fast sampling queries, tablesample(bucket x out of y) selects specific buckets to read.

    A partitioned table may also be bucketed within each partition.
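    The two ideas above can be sketched together in HiveQL (the table and column names are illustrative, not from these exercises):

    ```sql
    -- Partition by day (directory level), then split each partition
    -- into 4 bucket files by hashing the id column (file level).
    create table user_action(
        id int,
        name string
    )
    partitioned by (dt string)
    clustered by (id) into 4 buckets
    row format delimited fields terminated by ',';

    -- Sample only bucket 1 of every 4 instead of scanning all files.
    select * from user_action tablesample(bucket 1 out of 4 on id);
    ```

    Rows land in bucket hash(id) % 4, so equal ids always end up in the same file, which is what makes bucket sampling and bucketed joins possible.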

    If data is uploaded directly into a partition directory on HDFS, what are the ways to associate that data with the partitioned table?

    When a partitioned table is created first and data is then loaded into a partition, the data lands in the corresponding partition directory and the table can be queried normally. But if the data is uploaded into a pre-prepared partition directory first and the partitioned table is created afterwards, queries return nothing, because the mapping between the partition data and the Hive table has not been established yet; it must be repaired with a command. Besides the repair command, there are two other methods.

    Method 1: msck repair table <table_name>

    Prepare the partition directory in advance and upload the data.

    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /mystudentdatas/month=11/
    19/11/17 12:36:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 1 items
    -rw-r--r--   3 hadoop supergroup        199 2019-11-17 12:36 /mystudentdatas/month=11/student.csv
    

    Create the table

    0: jdbc:hive2://node01:10000> create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/mystudentdatas';
    INFO  : Compiling command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4): create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/mystudentdatas'
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4); Time taken: 0.149 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4): create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/mystudentdatas'
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4); Time taken: 0.271 seconds
    INFO  : OK
    

    Repair the table with msck; once repaired, the partition mapping is established and the table's data becomes queryable.

    # Repair the table
    0: jdbc:hive2://node01:10000> msck repair table student_partition_me;
    INFO  : Compiling command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631): msck repair table student_partition_me
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631); Time taken: 0.011 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631): msck repair table student_partition_me
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631); Time taken: 0.263 seconds
    INFO  : OK
    No rows affected (0.311 seconds)
    # Query; the last column is the partition column month
    0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_me;
    INFO  : Compiling command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a): select id,name,year,gender,month from student_partition_me
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:year, type:string, comment:null), FieldSchema(name:gender, type:string, comment:null), FieldSchema(name:month, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a); Time taken: 0.133 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a): select id,name,year,gender,month from student_partition_me
    INFO  : Completed executing command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a); Time taken: 0.0 seconds
    INFO  : OK
    +-----+-------+-------------+---------+--------+--+
    | id  | name  |    year     | gender  | month  |
    +-----+-------+-------------+---------+--------+--+
    | 01  | 赵雷    | 1990-01-01  | 男       | 11     |
    | 02  | 钱电    | 1990-12-21  | 男       | 11     |
    | 03  | 孙风    | 1990-05-20  | 男       | 11     |
    | 04  | 李云    | 1990-08-06  | 男       | 11     |
    | 05  | 周梅    | 1991-12-01  | 女       | 11     |
    | 06  | 吴兰    | 1992-03-01  | 女       | 11     |
    | 07  | 郑竹    | 1989-07-01  | 女       | 11     |
    | 08  | 王菊    | 1990-01-20  | 女       | 11     |
    +-----+-------+-------------+---------+--------+--+
    8 rows selected (0.214 seconds)
    

    Method 2: alter table <table_name> add partition(col=xxx)

    Upload the data to HDFS

    # Note: the HDFS data directory is studentdatas this time
    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /studentdatas/month=12/
    19/11/17 16:51:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 1 items
    -rw-r--r--   3 hadoop supergroup        199 2019-11-17 16:51 /studentdatas/month=12/student.csv
    

    Create the table

    0: jdbc:hive2://node01:10000> create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/studentdatas';
    INFO  : Compiling command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699): create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/studentdatas'
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699); Time taken: 0.011 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699): create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/studentdatas'
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699); Time taken: 0.097 seconds
    INFO  : OK
    

    Register the partition with alter table

    0: jdbc:hive2://node01:10000> alter table student_partition_pa add partition(month='12');
    INFO  : Compiling command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662): alter table student_partition_pa add partition(month='12')
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662); Time taken: 0.051 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662): alter table student_partition_pa add partition(month='12')
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662); Time taken: 0.116 seconds
    INFO  : OK
    

    Query the data; it works.

    0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_pa;
    INFO  : Compiling command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb): select id,name,year,gender,month from student_partition_pa
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:year, type:string, comment:null), FieldSchema(name:gender, type:string, comment:null), FieldSchema(name:month, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb); Time taken: 0.092 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb): select id,name,year,gender,month from student_partition_pa
    INFO  : Completed executing command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb); Time taken: 0.001 seconds
    INFO  : OK
    +-----+-------+-------------+---------+--------+--+
    | id  | name  |    year     | gender  | month  |
    +-----+-------+-------------+---------+--------+--+
    | 01  | 赵雷    | 1990-01-01  | 男       | 12     |
    | 02  | 钱电    | 1990-12-21  | 男       | 12     |
    | 03  | 孙风    | 1990-05-20  | 男       | 12     |
    | 04  | 李云    | 1990-08-06  | 男       | 12     |
    | 05  | 周梅    | 1991-12-01  | 女       | 12     |
    | 06  | 吴兰    | 1992-03-01  | 女       | 12     |
    | 07  | 郑竹    | 1989-07-01  | 女       | 12     |
    | 08  | 王菊    | 1990-01-20  | 女       | 12     |
    +-----+-------+-------------+---------+--------+--+
    8 rows selected (0.162 seconds)
    
    

    Method 3: load data inpath '<hdfs_path>' into table <table_name> partition(<col>='xxx')

    Upload the data to HDFS

    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /
    19/11/17 17:21:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 13 items
    # person.txt has been uploaded to HDFS
    -rw-r--r--   3 hadoop      supergroup         68 2019-11-17 17:09 /person.txt
    
    
    

    Create the table

    0: jdbc:hive2://node01:10000> create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '	' collection items terminated by ',' location '/persondatas';
    INFO  : Compiling command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666): create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '	' collection items terminated by ',' location '/persondatas'
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666); Time taken: 0.023 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666): create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '	' collection items terminated by ',' location '/persondatas'
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666); Time taken: 0.101 seconds
    INFO  : OK
    
    

    Load the HDFS file into the partition directory

    0: jdbc:hive2://node01:10000> load data inpath '/person.txt' into table person_partition partition(age='25');
    INFO  : Compiling command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81): load data inpath '/person.txt' into table person_partition partition(age='25')
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81); Time taken: 0.082 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81): load data inpath '/person.txt' into table person_partition partition(age='25')
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Loading data to table myhive.person_partition partition (age=25) from hdfs://node01:8020/person.txt
    INFO  : Starting task [Stage-1:STATS] in serial mode
    INFO  : Partition myhive.person_partition{age=25} stats: [numFiles=1, numRows=0, totalSize=68, rawDataSize=0]
    INFO  : Completed executing command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81); Time taken: 0.382 seconds
    INFO  : OK
    
    

    Query the data; it works.

    0: jdbc:hive2://node01:10000> select * from person_partition;
    INFO  : Compiling command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392): select * from person_partition
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:person_partition.name, type:string, comment:null), FieldSchema(name:person_partition.citys, type:array<string>, comment:null), FieldSchema(name:person_partition.age, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392); Time taken: 0.099 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392): select * from person_partition
    INFO  : Completed executing command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392); Time taken: 0.001 seconds
    INFO  : OK
    +------------------------+----------------------------------------------+-----------------------+--+
    | person_partition.name  |            person_partition.citys            | person_partition.age  |
    +------------------------+----------------------------------------------+-----------------------+--+
    | yang                   | ["beijing","shanghai","tianjin","hangzhou"]  | 25                    |
    | messi                  | ["changchu","chengdu","wuhan"]               | 25                    |
    +------------------------+----------------------------------------------+-----------------------+--+
    
    

    Can data be loaded into a bucketed table directly with load?

    A bucketed table must split its data into several HDFS files by hashing a column and taking the remainder, and that split is computed from the column values of an ordinary intermediate table. It therefore cannot be populated with a direct load: a direct upload to HDFS would leave just one unsplit file. The file format of a bucketed table also shows it is the output of a MapReduce computation rather than the original file, which again means it cannot be filled by a direct HDFS upload.
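    The loading path described above (a plain staging table first, then insert ... select so that MapReduce computes the hash split) can be sketched as follows; the table names and the local path are illustrative:

    ```sql
    -- 1. Ordinary staging table; load can target this directly.
    create table stu_stage(id int, name string)
    row format delimited fields terminated by ',';
    load data local inpath '/path/to/stu.csv' into table stu_stage;  -- path is illustrative

    -- 2. Bucketed table; it must be filled via insert ... select.
    create table stu_buck(id int, name string)
    clustered by (id) into 4 buckets
    row format delimited fields terminated by ',';

    -- On Hive 1.x, enforce bucketing so the insert uses one reducer per bucket.
    set hive.enforce.bucketing=true;
    insert overwrite table stu_buck select id, name from stu_stage;
    ```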

    Partitioning improves query efficiency in Hive; are more partitions always better, and why?

    A Hive query ultimately runs as a MapReduce job. With too many partitions, the same volume of data is broken into many more small files and blocks, which in turn generates much more metadata (block locations, sizes, and so on) and puts heavy pressure on the NameNode.

    In addition, since Hive SQL is compiled into MapReduce jobs, each small partition file corresponds to a task, and each task to a JVM instance. Too many partitions spawn a large number of JVM instances, and the resulting frequent JVM creation and destruction degrades overall system performance.

    Reference:
    (1)https://www.cnblogs.com/tele-share/p/9829515.html

  • Original post: https://www.cnblogs.com/youngchaolin/p/11877986.html