  • [Hadoop Big Data] Importing and Exporting Data in Hive

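    Hive is a data warehouse tool for big data environments. It runs on Hadoop and executes MapReduce jobs expressed as SQL, which makes it well suited to full-scan queries and analysis over large volumes of data.

    This article describes how to import and export data from the Hive CLI: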

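    Importing data

    Method 1: import directly from the local file system

    On my machine there is a file named test1.txt. It contains three columns of data, and the columns are separated by a tab character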

    [root@localhost conf]# cat /usr/tmp/test1.txt
    1	a1	b1
    2	a2	b2
    3	a3	b3
    4	a4	b4
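
If you do not have such a file yet, the following is a minimal sketch of how one could be generated and checked before loading. The temporary directory and the variable names are assumptions for illustration; the article itself uses /usr/tmp/test1.txt.

```shell
# Create a small tab-separated sample file shaped like test1.txt
# (written to a temporary directory; the path is only illustrative).
workdir="$(mktemp -d)"
sample="$workdir/test1.txt"

# printf reuses the format string for each group of three arguments,
# so this writes four lines of three tab-separated fields.
printf '%s\t%s\t%s\n' 1 a1 b1 2 a2 b2 3 a3 b3 4 a4 b4 > "$sample"

# Sanity check: count lines that do NOT have exactly 3 tab-separated fields.
bad_lines="$(awk -F '\t' 'NF != 3' "$sample" | wc -l | tr -d ' ')"
echo "lines with a wrong field count: $bad_lines"
```

Checking the field count up front is useful because Hive does not reject malformed rows at load time; missing fields simply show up as NULL when the table is queried.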
    

    Create the table:

    >create table test1(a string,b string,c string)
    >row format delimited
    >fields terminated by '	'
    >stored as textfile;
    

    Import the data:

    load data local inpath '/usr/tmp/test1.txt' overwrite into table test1;
    

    Here, local inpath indicates that the path is on the local machine,
    and overwrite means the loaded data replaces the existing contents of the table.
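
To make the effect of overwrite concrete, here is a small sketch that writes both variants of the statement into a HiveQL script. The script file name is hypothetical, and the block only generates and prints the script; it does not run Hive.

```shell
# Write a small HiveQL script (hypothetical file name) contrasting the two load modes.
scriptdir="$(mktemp -d)"
hql="$scriptdir/load_test1.hql"

cat > "$hql" <<'EOF'
-- Replace whatever test1 currently holds with the rows from the file.
load data local inpath '/usr/tmp/test1.txt' overwrite into table test1;
-- Without OVERWRITE the file is added to the data already in the table.
load data local inpath '/usr/tmp/test1.txt' into table test1;
EOF

# Show the generated script.
cat "$hql"
```

The script could then be executed with hive -f "$hql". Run in this order, test1 should end up holding the file's rows twice: once from the overwriting load and once more from the appending load.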

    Method 2: import from a file in HDFS

    First upload the data to HDFS

    hadoop fs -put /usr/tmp/test1.txt /test1.txt
    

    View test1.txt from within Hive

    hive> dfs -cat /test1.txt;
    1	a1	b1
    2	a2	b2
    3	a3	b3
    4	a4	b4
    
    

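    Create the table test2 the same way as before. The import command differs slightly: there is no local keyword, so the path refers to HDFS. Note that loading from HDFS moves the source file into the table's directory rather than copying it: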

    load data inpath '/test1.txt' overwrite into table test2;
    

    Method 3: import with a query-based insert into

    First define the table; here we create a partitioned table directly

    hive> create table test3(a string,b string,c string) partitioned by (d string) row format delimited fields terminated by '	' stored as textfile;
    OK
    Time taken: 0.109 seconds
    hive> describe test3;
    OK
    a                   	string              	                    
    b                   	string              	                    
    c                   	string              	                    
    d                   	string              	                    
    	 	 
    # Partition Information	 	 
    # col_name            	data_type           	comment             
    	 	 
    d                   	string              	                    
    Time taken: 0.071 seconds, Fetched: 9 row(s)
    

    Import data from a query directly into a fixed partition of the table:

    hive> insert into table test3 partition(d='aaaaaa') select * from test2;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160823212718_9cfdbea4-42fa-4267-ac46-9ac2c357f944
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2016-08-23 21:27:21,621 Stage-1 map = 100%,  reduce = 0%
    Ended Job = job_local1550375778_0001
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to directory hdfs://localhost:8020/user/hive/warehouse/test.db/test3/d=aaaaaa/.hive-staging_hive_2016-08-23_21-27-18_739_4058721562930266873-1/-ext-10000
    Loading data to table test.test3 partition (d=aaaaaa)
    MapReduce Jobs Launched: 
    Stage-Stage-1:  HDFS Read: 248 HDFS Write: 175 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 3.647 seconds
    

    Check the result with a query

    hive> select * from test3;
    OK
    1	a1	b1	aaaaaa
    2	a2	b2	aaaaaa
    3	a3	b3	aaaaaa
    4	a4	b4	aaaaaa
    Time taken: 0.264 seconds, Fetched: 4 row(s)
    

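    PS: data can also be inserted using dynamic partitions, where the partition value is taken from the last column of the query result: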

    insert into table test4 partition(c) select * from test2;
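
The statement above presupposes a table test4 whose definition is not shown in the article. The sketch below fills that gap under stated assumptions: the test4 schema (columns a and b, partitioned by c) is inferred from the directory listing, and the two set commands are the Hive properties that allow an insert in which every partition column is dynamic. The block only writes the HiveQL to a hypothetical script file.

```shell
# Generate a HiveQL script (hypothetical file name) for a fully dynamic partition insert.
scriptdir="$(mktemp -d)"
hql="$scriptdir/dynamic_partition.hql"

cat > "$hql" <<'EOF'
-- Allow dynamic partitions, including inserts with no static partition value.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Assumed schema: test4 is not defined in the article.
create table if not exists test4(a string, b string) partitioned by (c string) row format delimited fields terminated by '\t' stored as textfile;
-- The dynamic partition column must come last in the select list.
insert into table test4 partition(c) select a, b, c from test2;
EOF

cat "$hql"
```

It could be run with hive -f "$hql". In the default strict mode Hive refuses an insert that has no static partition value, which is why nonstrict is set here.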
    

    Each partition is stored as a directory named after the partition value:

    hive> dfs -ls /user/hive/warehouse/test.db/test4/;
    Found 4 items
    drwxr-xr-x   - root supergroup          0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b1
    drwxr-xr-x   - root supergroup          0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b2
    drwxr-xr-x   - root supergroup          0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b3
    drwxr-xr-x   - root supergroup          0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b4
    

    Method 4: create a table directly from a query

    Create the table from a query (create table ... as select):

    hive> create table test5 as select * from test4;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160823213944_03672168-bc56-43d7-aefb-cac03a6558bf
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2016-08-23 21:39:46,030 Stage-1 map = 100%,  reduce = 0%
    Ended Job = job_local855333165_0003
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to directory hdfs://localhost:8020/user/hive/warehouse/test.db/.hive-staging_hive_2016-08-23_21-39-44_259_5484795730585321098-1/-ext-10002
    Moving data to directory hdfs://localhost:8020/user/hive/warehouse/test.db/test5
    MapReduce Jobs Launched: 
    Stage-Stage-1:  HDFS Read: 600 HDFS Write: 466 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 2.184 seconds
    

    Check the result

    hive> select * from test5;
    OK
    1	a1	b1
    2	a2	b2
    3	a3	b3
    4	a4	b4
    Time taken: 0.147 seconds, Fetched: 4 row(s)
    

    Exporting data

    Export to a local file

    Run the command that exports to a local directory:

    hive> insert overwrite local directory '/usr/tmp/export' select * from test1;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160823221655_05b05863-6273-4bdd-aad2-e80d4982425d
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2016-08-23 22:16:57,028 Stage-1 map = 100%,  reduce = 0%
    Ended Job = job_local8632460_0005
    Moving data to local directory /usr/tmp/export
    MapReduce Jobs Launched: 
    Stage-Stage-1:  HDFS Read: 794 HDFS Write: 498 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 1.569 seconds
    hive> 
    
    

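    View the contents in the local directory (the fields look run together because Hive's default field delimiter, Ctrl-A, is a non-printing character):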

    [root@localhost export]# ll
    total 4
    -rw-r--r--. 1 root root 32 Aug 23 22:16 000000_0
    [root@localhost export]# cat 000000_0 
    1a1b1
    2a2b2
    3a3b3
    4a4b4
    [root@localhost export]# pwd
    /usr/tmp/export
    [root@localhost export]# 
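
To make an exported file readable with ordinary tools, the Ctrl-A delimiter can be converted to tabs. The following sketch simulates a two-row export file in a temporary directory (the directory and the .tsv file name are assumptions) and converts it with tr.

```shell
# Simulate an exported file that uses Hive's default field delimiter (Ctrl-A, octal 001).
export_dir="$(mktemp -d)"
printf '1\001a1\001b1\n2\001a2\001b2\n' > "$export_dir/000000_0"

# cat -v renders the non-printing delimiter as ^A so it becomes visible.
cat -v "$export_dir/000000_0"

# Translate every Ctrl-A into a tab and write a tab-separated copy.
tr '\001' '\t' < "$export_dir/000000_0" > "$export_dir/000000_0.tsv"
cat "$export_dir/000000_0.tsv"
```

Alternatively, Hive 0.11 and later accept a row format delimited fields terminated by '\t' clause between the directory path and the select in insert overwrite local directory, so the export can be written with tabs in the first place.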
    
    

    Export to HDFS

    hive> insert overwrite directory '/usr/tmp/test' select * from test1;
    WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    Query ID = root_20160823214217_e8c71bb9-a147-4518-8353-81f9adc54183
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks is set to 0 since there's no reduce operator
    Job running in-process (local Hadoop)
    2016-08-23 21:42:19,257 Stage-1 map = 100%,  reduce = 0%
    Ended Job = job_local628523792_0004
    Stage-3 is selected by condition resolver.
    Stage-2 is filtered out by condition resolver.
    Stage-4 is filtered out by condition resolver.
    Moving data to directory hdfs://localhost:8020/usr/tmp/test/.hive-staging_hive_2016-08-23_21-42-17_778_6818164305996247644-1/-ext-10000
    Moving data to directory /usr/tmp/test
    MapReduce Jobs Launched: 
    Stage-Stage-1:  HDFS Read: 730 HDFS Write: 498 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    Time taken: 1.594 seconds
    

    The export succeeded. The target is a directory, so list it first and then view the file inside it:

    hive> dfs -cat /usr/tmp/test;
    cat: `/usr/tmp/test': Is a directory
    Command failed with exit code = 1
    Query returned non-zero code: 1, cause: null
    hive> dfs -ls /usr/tmp/test;
    Found 1 items
    -rwxr-xr-x   3 root supergroup         32 2016-08-23 21:42 /usr/tmp/test/000000_0
    
    
    hive> dfs -cat /usr/tmp/test/000000_0;
    1a1b1
    2a2b2
    3a3b3
    4a4b4
    hive> 
    
    

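    Export to another table

    This works the same way as the query-based import shown earlier. Because test3 is partitioned, the target partition has to be specified: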

    insert into table test3 partition(d='aaaaaa') select * from test1;
    
  • Original article: https://www.cnblogs.com/xing901022/p/5801061.html