zoukankan      html  css  js  c++  java
  • Hive基础练习二

    下面是hive基本练习,持续补充中。

    Hive导出数据有几种方式,如何导出数据

    1.insert

    # 分为导出到本地或者hdfs,还可以格式化输出,指定分隔符
    # 导出到本地
    0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu;
    INFO  : Compiling command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394): insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394); Time taken: 0.107 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394): insert overwrite local directory '/kkb/install/hivedatas/stu3' select * from stu
    INFO  : Query ID = hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394
    INFO  : Total jobs = 1
    INFO  : Launching Job 1 out of 1
    INFO  : Starting task [Stage-1:MAPRED] in serial mode
    INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
    INFO  : Starting Job = job_1573910690864_0002, Tracking URL = http://node01:8088/proxy/application_1573910690864_0002/
    INFO  : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job  -kill job_1573910690864_0002
    INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO  : 2019-11-16 22:19:40,957 Stage-1 map = 0%,  reduce = 0%
    INFO  : 2019-11-16 22:19:42,002 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.51 sec
    INFO  : MapReduce Total cumulative CPU time: 1 seconds 510 msec
    INFO  : Ended Job = job_1573910690864_0002
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Copying data to local directory /kkb/install/hivedatas/stu3 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-19-32_776_5008666227900564137-1/-mr-10000
    INFO  : MapReduce Jobs Launched:
    INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 1.51 sec   HDFS Read: 3381 HDFS Write: 285797 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 1 seconds 510 msec
    INFO  : Completed executing command(queryId=hadoop_20191116221919_74a3d6f7-5995-4a1e-b072-e30d6269d394); Time taken: 10.251 seconds
    INFO  : OK
    No rows affected (10.383 seconds)
    # 查看本地文件
    [hadoop@node01 /kkb/install/hivedatas/stu3]$ cat 000000_0
    1clyang
    
    # 导出到hdfs
    0: jdbc:hive2://node01:10000> insert overwrite directory '/kkb/stu' select * from stu;
    INFO  : Compiling command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852): insert overwrite directory '/kkb/stu' select * from stu
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852); Time taken: 0.173 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852): insert overwrite directory '/kkb/stu' select * from stu
    INFO  : Query ID = hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852
    INFO  : Total jobs = 3
    INFO  : Launching Job 1 out of 3
    INFO  : Starting task [Stage-1:MAPRED] in serial mode
    INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
    INFO  : Starting Job = job_1573910690864_0003, Tracking URL = http://node01:8088/proxy/application_1573910690864_0003/
    INFO  : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job  -kill job_1573910690864_0003
    INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO  : 2019-11-16 22:24:13,962 Stage-1 map = 0%,  reduce = 0%
    INFO  : 2019-11-16 22:24:15,018 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.46 sec
    INFO  : MapReduce Total cumulative CPU time: 1 seconds 460 msec
    INFO  : Ended Job = job_1573910690864_0003
    INFO  : Starting task [Stage-6:CONDITIONAL] in serial mode
    INFO  : Stage-3 is selected by condition resolver.
    INFO  : Stage-2 is filtered out by condition resolver.
    INFO  : Stage-4 is filtered out by condition resolver.
    INFO  : Starting task [Stage-3:MOVE] in serial mode
    INFO  : Moving data to: hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10000 from hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10002
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Moving data to: /kkb/stu from hdfs://node01:8020/kkb/stu/.hive-staging_hive_2019-11-16_22-24-06_937_5666063681275061436-1/-ext-10000
    INFO  : MapReduce Jobs Launched:
    INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 1.46 sec   HDFS Read: 3315 HDFS Write: 286719 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 1 seconds 460 msec
    INFO  : Completed executing command(queryId=hadoop_20191116222424_7b753364-9268-42e7-89fb-056424bc6852); Time taken: 9.044 seconds
    INFO  : OK
    # 查看hdfs
    [hadoop@node01 /kkb/install/hivedatas/stu3]$ hdfs dfs -cat /kkb/stu/000000_0
    19/11/16 22:26:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    1clyang
    
    # 可以指定导出本地格式化分隔符,以导出到本地为例
    0: jdbc:hive2://node01:10000> insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu;
    INFO  : Compiling command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f): insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:stu.id, type:int, comment:null), FieldSchema(name:stu.name, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f); Time taken: 0.128 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f): insert overwrite local directory '/kkb/install/hivedatas/stu4' row format delimited fields terminated by '@' select * from stu
    INFO  : Query ID = hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f
    INFO  : Total jobs = 1
    INFO  : Launching Job 1 out of 1
    INFO  : Starting task [Stage-1:MAPRED] in serial mode
    INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
    INFO  : Starting Job = job_1573910690864_0005, Tracking URL = http://node01:8088/proxy/application_1573910690864_0005/
    INFO  : Kill Command = /kkb/install/hadoop-2.6.0-cdh5.14.2//bin/hadoop job  -kill job_1573910690864_0005
    INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO  : 2019-11-16 22:31:27,083 Stage-1 map = 0%,  reduce = 0%
    INFO  : 2019-11-16 22:31:28,139 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.93 sec
    INFO  : MapReduce Total cumulative CPU time: 1 seconds 930 msec
    INFO  : Ended Job = job_1573910690864_0005
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Copying data to local directory /kkb/install/hivedatas/stu4 from hdfs://node01:8020/tmp/hive/anonymous/2d04ba8e-9799-4a31-a93d-557db4086e81/hive_2019-11-16_22-31-20_415_1737902713220629568-1/-mr-10000
    INFO  : MapReduce Jobs Launched:
    INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 1.93 sec   HDFS Read: 3526 HDFS Write: 286073 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 1 seconds 930 msec
    INFO  : Completed executing command(queryId=hadoop_20191116223131_ebe796bf-7dcd-4a30-bcba-c63b7366773f); Time taken: 8.707 seconds
    INFO  : OK
    # 查看本地文件,发现以@分隔
    [hadoop@node01 /kkb/install/hivedatas/stu4]$ cat 000000_0
    1@clyang
    

    2.hadoop命令

    数据使用hive保存后存在于hdfs,也可以直接从hdfs将数据拉到本地,使用get命令。

    hdfs dfs -get /user/hive/warehouse/student/student.txt /opt/bigdata/data

    3.bash shell覆盖追加导出

    使用bin/hive -e sql语句或者bin/hive -f sql脚本,将数据覆盖或者追加导出,这里以前者为例,另外sql脚本本质上主要还是sql语句。

    # 覆盖写
    [hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' > /kkb/install/hivedatas/student2.txt
    ls: cannot access /kkb/install/spark/lib/spark-assembly-*.jar: No such file or directory
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    2019-11-16 22:37:46,342 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    19/11/16 22:37:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    
    Logging initialized using configuration in file:/kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
    OK
    Time taken: 6.966 seconds, Fetched: 1 row(s)
    You have new mail in /var/spool/mail/root
    # 查看结果
    [hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
    stu.id	stu.name
    1	clyang
    # 追加写
    [hadoop@node01 /kkb/install/hive-1.1.0-cdh5.14.2/bin]$ ./hive -e 'select * from db_hive.stu' >> /kkb/install/hivedatas/student2.txt
    ls: cannot access /kkb/install/spark/lib/spark-assembly-*.jar: No such file or directory
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/kkb/install/hbase-1.2.0-cdh5.14.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/kkb/install/hadoop-2.6.0-cdh5.14.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    2019-11-16 22:39:03,442 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    19/11/16 22:39:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    
    Logging initialized using configuration in file:/kkb/install/hive-1.1.0-cdh5.14.2/conf/hive-log4j.properties
    OK
    Time taken: 6.056 seconds, Fetched: 1 row(s)
    You have new mail in /var/spool/mail/root
    # 查看追加写后结果
    [hadoop@node01 /kkb/install/hivedatas]$ cat student2.txt
    stu.id	stu.name
    1	clyang
    stu.id	stu.name
    1	clyang
    

    4.export导出到hdfs

    # 导出
    0: jdbc:hive2://node01:10000> export table stu to '/kkb/studentexport';
    INFO  : Compiling command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a): export table stu to '/kkb/studentexport'
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a); Time taken: 0.126 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a): export table stu to '/kkb/studentexport'
    INFO  : Starting task [Stage-0:COPY] in serial mode
    INFO  : Copying data from file:/tmp/hadoop/e951940a-bcb6-4cd4-be17-0baf5d13615f/hive_2019-11-05_09-43-30_802_7299251851779747447-1/-local-10000/_metadata to hdfs://node01:8020/kkb/studentexport
    INFO  : Copying file: file:/tmp/hadoop/e951940a-bcb6-4cd4-be17-0baf5d13615f/hive_2019-11-05_09-43-30_802_7299251851779747447-1/-local-10000/_metadata
    INFO  : Starting task [Stage-1:COPY] in serial mode
    INFO  : Copying data from hdfs://node01:8020/user/hive/warehouse/db_hive.db/stu to hdfs://node01:8020/kkb/studentexport/data
    INFO  : Copying file: hdfs://node01:8020/user/hive/warehouse/db_hive.db/stu/000000_0
    INFO  : Completed executing command(queryId=hadoop_20191105094343_87d41d16-e4cd-43ac-9593-86e799d23a6a); Time taken: 0.604 seconds
    INFO  : OK
    
    # 查看数据
    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport
    19/11/17 20:29:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 2 items
    -rwxr-xr-x   3 anonymous supergroup       1330 2019-11-05 09:43 /kkb/studentexport/_metadata
    drwxr-xr-x   - anonymous supergroup          0 2019-11-05 09:43 /kkb/studentexport/data
    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /kkb/studentexport/data
    19/11/17 20:29:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 1 items
    -rwxr-xr-x   3 anonymous supergroup          9 2019-11-05 09:43 /kkb/studentexport/data/000000_0
    You have new mail in /var/spool/mail/root
    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -cat /kkb/studentexport/data/000000_0
    19/11/17 20:29:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    1clyang
    
    

    分区和分桶的区别

    分区是文件夹范畴的,就是按照文件夹区分,来存储文件,分桶是文件范畴的,将一个文件根据某个字段按hash取余,拆分为几个文件片段保存,它们都有各自的应用场景:

    (1)分区用在按照日期,按照天,或者小时来保存数据,后面查询可以根据需求快速定位到数据,避免了速度慢的全表扫描查询。

    (2)分桶则是更加细粒度的存储,可以指定桶的个数n,这样一份文件保存会划分为n份,如果想快速查找可以用tablesample(bucket x out of y)来指定要抽样查询的桶表。

    另外分区表里面可能有分桶表。

    将数据直接上传到分区目录(hdfs)上,让分区表和数据产生关联有哪些方式?

    当创建分区表并将数据导入到分区后,发现导入的数据就保存在对应的分区目录下,并且可以正常查询表内容。如果先将数据导入到事先准备好的分区,然后再创建分区表,是查不到数据的,因为还没有建立分区表数据和hive表的映射关系,需要使用命令来修复,此外还有2种方法。

    方法1 msck repair table 表名

    提前准备好分区,并将数据上传。

    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /mystudentdatas/month=11/
    19/11/17 12:36:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 1 items
    -rw-r--r--   3 hadoop supergroup        199 2019-11-17 12:36 /mystudentdatas/month=11/student.csv
    

    创建表格

    0: jdbc:hive2://node01:10000> create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/mystudentdatas';
    INFO  : Compiling command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4): create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/mystudentdatas'
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4); Time taken: 0.149 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4): create table student_partition_me(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/mystudentdatas'
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117123838_5b1f3eaf-f2f2-4b2e-b87f-2fdd8415f9d4); Time taken: 0.271 seconds
    INFO  : OK
    

    修复表格,使用msck,修复后就可以查看到表的数据了,映射关系建立。

    # 修复表格
    0: jdbc:hive2://node01:10000> msck repair table student_partition_me;
    INFO  : Compiling command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631): msck repair table student_partition_me
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631); Time taken: 0.011 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631): msck repair table student_partition_me
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117124141_f09531b3-29fd-48a4-95c7-bec7018cf631); Time taken: 0.263 seconds
    INFO  : OK
    No rows affected (0.311 seconds)
    # 查询,最后字段为分区字段month
    0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_me;
    INFO  : Compiling command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a): select id,name,year,gender,month from student_partition_me
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:year, type:string, comment:null), FieldSchema(name:gender, type:string, comment:null), FieldSchema(name:month, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a); Time taken: 0.133 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a): select id,name,year,gender,month from student_partition_me
    INFO  : Completed executing command(queryId=hadoop_20191117161313_257c8b24-4e53-4690-b343-b5f532c43e1a); Time taken: 0.0 seconds
    INFO  : OK
    +-----+-------+-------------+---------+--------+--+
    | id  | name  |    year     | gender  | month  |
    +-----+-------+-------------+---------+--------+--+
    | 01  | 赵雷    | 1990-01-01  | 男       | 11     |
    | 02  | 钱电    | 1990-12-21  | 男       | 11     |
    | 03  | 孙风    | 1990-05-20  | 男       | 11     |
    | 04  | 李云    | 1990-08-06  | 男       | 11     |
    | 05  | 周梅    | 1991-12-01  | 女       | 11     |
    | 06  | 吴兰    | 1992-03-01  | 女       | 11     |
    | 07  | 郑竹    | 1989-07-01  | 女       | 11     |
    | 08  | 王菊    | 1990-01-20  | 女       | 11     |
    +-----+-------+-------------+---------+--------+--+
    8 rows selected (0.214 seconds)
    

    方法2 alter table 表名 add partition(col=xxx)

    将数据上传到hdfs

    # 注意这里hdfs数据目录换成studentdatas
    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /studentdatas/month=12/
    19/11/17 16:51:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 1 items
    -rw-r--r--   3 hadoop supergroup        199 2019-11-17 16:51 /studentdatas/month=12/student.csv
    

    创建表格

    0: jdbc:hive2://node01:10000> create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/studentdatas';
    INFO  : Compiling command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699): create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/studentdatas'
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699); Time taken: 0.011 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699): create table student_partition_pa(id string,name string,year string,gender string) partitioned by(month string) row format delimited fields terminated by '	' location '/studentdatas'
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117164141_666aa048-6fec-43fc-9bb1-4ea1ebd51699); Time taken: 0.097 seconds
    INFO  : OK
    

    使用alter table指定分区

    0: jdbc:hive2://node01:10000> alter table student_partition_pa add partition(month='12');
    INFO  : Compiling command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662): alter table student_partition_pa add partition(month='12')
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662); Time taken: 0.051 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662): alter table student_partition_pa add partition(month='12')
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117164242_c4ad4e93-7357-46a4-a59e-20d8e66bb662); Time taken: 0.116 seconds
    INFO  : OK
    

    查询数据,ok

    0: jdbc:hive2://node01:10000> select id,name,year,gender,month from student_partition_pa;
    INFO  : Compiling command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb): select id,name,year,gender,month from student_partition_pa
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:year, type:string, comment:null), FieldSchema(name:gender, type:string, comment:null), FieldSchema(name:month, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb); Time taken: 0.092 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb): select id,name,year,gender,month from student_partition_pa
    INFO  : Completed executing command(queryId=hadoop_20191117170101_3b98c6e3-9756-4e54-b5e9-d3361351bceb); Time taken: 0.001 seconds
    INFO  : OK
    +-----+-------+-------------+---------+--------+--+
    | id  | name  |    year     | gender  | month  |
    +-----+-------+-------------+---------+--------+--+
    | 01  | 赵雷    | 1990-01-01  | 男       | 12     |
    | 02  | 钱电    | 1990-12-21  | 男       | 12     |
    | 03  | 孙风    | 1990-05-20  | 男       | 12     |
    | 04  | 李云    | 1990-08-06  | 男       | 12     |
    | 05  | 周梅    | 1991-12-01  | 女       | 12     |
    | 06  | 吴兰    | 1992-03-01  | 女       | 12     |
    | 07  | 郑竹    | 1989-07-01  | 女       | 12     |
    | 08  | 王菊    | 1990-01-20  | 女       | 12     |
    +-----+-------+-------------+---------+--------+--+
    8 rows selected (0.162 seconds)
    
    

    方法3 load data inpath 'hdfs文件路径' into table 表名 partition(col名='xxx')

    数据上传到hdfs

    [hadoop@node01 /kkb/install/hivedatas]$ hdfs dfs -ls /
    19/11/17 17:21:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 13 items
    # 上传person.txt到hdfs
    -rw-r--r--   3 hadoop      supergroup         68 2019-11-17 17:09 /person.txt
    
    
    

    创建表格

    0: jdbc:hive2://node01:10000> create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '	' collection items terminated by ',' location '/persondatas';
    INFO  : Compiling command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666): create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '	' collection items terminated by ',' location '/persondatas'
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666); Time taken: 0.023 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666): create table person_partition(name string,citys array<string>) partitioned by(age string) row format delimited fields terminated by '	' collection items terminated by ',' location '/persondatas'
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20191117171313_ce57a983-f4c2-4147-a94b-0c91ae143666); Time taken: 0.101 seconds
    INFO  : OK
    
    

    将hdfs上文件加载到分区目录下

    0: jdbc:hive2://node01:10000> load data inpath '/person.txt' into table person_partition partition(age='25');
    INFO  : Compiling command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81): load data inpath '/person.txt' into table person_partition partition(age='25')
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81); Time taken: 0.082 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81): load data inpath '/person.txt' into table person_partition partition(age='25')
    INFO  : Starting task [Stage-0:MOVE] in serial mode
    INFO  : Loading data to table myhive.person_partition partition (age=25) from hdfs://node01:8020/person.txt
    INFO  : Starting task [Stage-1:STATS] in serial mode
    INFO  : Partition myhive.person_partition{age=25} stats: [numFiles=1, numRows=0, totalSize=68, rawDataSize=0]
    INFO  : Completed executing command(queryId=hadoop_20191117172222_1f130af4-c5bd-465c-8720-4a0a32273f81); Time taken: 0.382 seconds
    INFO  : OK
    
    

    查询数据,ok

    0: jdbc:hive2://node01:10000> select * from person_partition;
    INFO  : Compiling command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392): select * from person_partition
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:person_partition.name, type:string, comment:null), FieldSchema(name:person_partition.citys, type:array<string>, comment:null), FieldSchema(name:person_partition.age, type:string, comment:null)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392); Time taken: 0.099 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392): select * from person_partition
    INFO  : Completed executing command(queryId=hadoop_20191117172222_24fff923-f365-48ae-a9bf-02fdee10b392); Time taken: 0.001 seconds
    INFO  : OK
    +------------------------+----------------------------------------------+-----------------------+--+
    | person_partition.name  |            person_partition.citys            | person_partition.age  |
    +------------------------+----------------------------------------------+-----------------------+--+
    | yang                   | ["beijing","shanghai","tianjin","hangzhou"]  | 25                    |
    | messi                  | ["changchu","chengdu","wuhan"]               | 25                    |
    +------------------------+----------------------------------------------+-----------------------+--+
    
    

    分桶表是否可以通过直接load将数据导入?

    桶表需要根据某个字段进行hash取余然后拆分数据保存为不同的文件保存到hdfs,需要通过普通中间表的字段值来计算拆分,因此不能直接load导入,直接导入到hdfs只有一个文件。另外通过桶表的文件类型可以看出,它不是原来的格式了,是一个mr计算后的文件,因此也说明不能用hdfs直接导入。

    hive中分区可以提高查询效率,分区是否越多越好,为什么?

    hive查询本质上是执行MapReduce任务,如果分区太多,同样体量的数据会产生更多的小文件block块,则会产生的更多的元数据(块的位置、大小等信息),这样对namenode来说压力很大。

    另外hive sql会转化为mapreduce任务,分区的一个小文件会对应一个的task,一个task对应一个JVM实例,过多的分区会产生大量的JVM实例,导致JVM频繁的创建与销毁,会降低系统整体性能。

    参考博文:
    (1)https://www.cnblogs.com/tele-share/p/9829515.html

  • 相关阅读:
    使用 asp.net mvc和 jQuery UI 控件包
    ServiceStack.Redis 使用教程
    HTC T8878刷机手册
    Entity Framework CodeFirst 文章汇集
    2011年Mono发展历程
    日志管理实用程序LogExpert
    使用 NuGet 管理项目库
    WCF 4.0路由服务Routing Service
    精进不休 .NET 4.0 (1) asp.net 4.0 新特性之web.config的改进, ViewStateMode, ClientIDMode, EnablePersistedSelection, 控件的其它一些改进
    精进不休 .NET 4.0 (7) ADO.NET Entity Framework 4.0 新特性
  • 原文地址:https://www.cnblogs.com/youngchaolin/p/11877986.html
Copyright © 2011-2022 走看看