Hive Data Import and Export Methods

    I. Ways to Import Data into Hive

    First, here are the data files and Hive tables used to illustrate the import methods below.

    Import:

    1. Importing a local file into a Hive table;
    2. Importing from one Hive table into another Hive table;
    3. Importing an HDFS file into a Hive table;
    4. Importing from another table while creating a table;
    5. Importing a MySQL database into a Hive table via Sqoop; for examples see "Importing and exporting between MySQL and Hive via Sqoop" and "Periodically syncing Hive data from the big-data platform to Oracle".

    Export:

    1. Exporting a Hive table to the local file system;
    2. Exporting a Hive table to HDFS;
    3. Exporting a Hive table to a MySQL database via Sqoop.

    Hive tables:

    Create testA:

    CREATE TABLE testA (
        id INT,
        name STRING,
        area STRING
    )
    PARTITIONED BY (create_time STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    Create testB:

    CREATE TABLE testB (
        id INT,
        name STRING,
        area STRING,
        code STRING
    )
    PARTITIONED BY (create_time STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    Data file (sourceA.txt):

    1,fish1,SZ  
    2,fish2,SH  
    3,fish3,HZ  
    4,fish4,QD  
    5,fish5,SR  

    Data file (sourceB.txt):

    1,zy1,SZ,1001  
    2,zy2,SH,1002  
    3,zy3,HZ,1003  
    4,zy4,QD,1004  
    5,zy5,SR,1005  

    (1) Importing a local file into a Hive table

    hive> LOAD DATA LOCAL INPATH '/home/hadoop/sourceA.txt' INTO TABLE testA PARTITION(create_time='2015-07-08');  
    Copying data from file:/home/hadoop/sourceA.txt  
    Copying file: file:/home/hadoop/sourceA.txt  
    Loading data to table default.testa partition (create_time=2015-07-08)  
    Partition default.testa{create_time=2015-07-08} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]  
    OK  
    Time taken: 0.237 seconds  
    hive> LOAD DATA LOCAL INPATH '/home/hadoop/sourceB.txt' INTO TABLE testB PARTITION(create_time='2015-07-09');  
    Copying data from file:/home/hadoop/sourceB.txt  
    Copying file: file:/home/hadoop/sourceB.txt  
    Loading data to table default.testb partition (create_time=2015-07-09)  
    Partition default.testb{create_time=2015-07-09} stats: [numFiles=1, numRows=0, totalSize=73, rawDataSize=0]  
    OK  
    Time taken: 0.212 seconds  
    hive> select * from testA;  
    OK  
    1   fish1   SZ  2015-07-08  
    2   fish2   SH  2015-07-08  
    3   fish3   HZ  2015-07-08  
    4   fish4   QD  2015-07-08  
    5   fish5   SR  2015-07-08  
    Time taken: 0.029 seconds, Fetched: 5 row(s)  
    hive> select * from testB;  
    OK  
    1   zy1 SZ  1001    2015-07-09  
    2   zy2 SH  1002    2015-07-09  
    3   zy3 HZ  1003    2015-07-09  
    4   zy4 QD  1004    2015-07-09  
    5   zy5 SR  1005    2015-07-09  
    Time taken: 0.047 seconds, Fetched: 5 row(s)  
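    By default, LOAD DATA ... INTO TABLE appends to the target partition; repeating the same load adds a second copy of the file. Adding the OVERWRITE keyword replaces the partition's existing contents instead. A minimal sketch in the same style as above:

    hive> LOAD DATA LOCAL INPATH '/home/hadoop/sourceA.txt' OVERWRITE INTO TABLE testA PARTITION(create_time='2015-07-08');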

    (2) Importing from one Hive table into another

    Import data from testB into testA:

    hive> INSERT INTO TABLE testA PARTITION(create_time='2015-07-11') select id, name, area from testB where id = 1;  
    ...(output omitted)
    OK  
    Time taken: 14.744 seconds  
    hive> INSERT INTO TABLE testA PARTITION(create_time) select id, name, area, code from testB where id = 2;  
    <pre name="code" class="java">...(省略)  
    OKTime taken: 19.852 secondshive> select * from testA;OK2 zy2 SH 10021 fish1 SZ 2015-07-082 fish2 SH 2015-07-083 fish3 HZ 2015-07-084 fish4 QD 2015-07-085 fish5 SR 2015-07-081 zy1 SZ 2015-07-11Time taken: 0.032 seconds, Fetched: 7 row(s)

    Notes:

    1. The row in testB with id=1 is inserted into testA under the static partition create_time='2015-07-11'.

    2. The row in testB with id=2 is inserted into testA using dynamic partitioning: its create_time partition value is taken from the code column of that row (1002).
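    A note not shown in the session above: statement 2 relies on dynamic partitioning, which must be enabled beforehand, and nonstrict mode is needed because no static partition value is given:

    hive> SET hive.exec.dynamic.partition=true;
    hive> SET hive.exec.dynamic.partition.mode=nonstrict;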

    (3) Importing an HDFS file into a Hive table

    Upload sourceA.txt and sourceB.txt to HDFS, at /home/hadoop/sourceA.txt and /home/hadoop/sourceB.txt respectively.
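    For example, the files can be uploaded with hadoop fs -put (a minimal sketch, assuming the same prompt and paths as elsewhere in this article):

    [hadoop@hadoopcluster78 bin]$ ./hadoop fs -put /home/hadoop/sourceA.txt /home/hadoop/sourceA.txt
    [hadoop@hadoopcluster78 bin]$ ./hadoop fs -put /home/hadoop/sourceB.txt /home/hadoop/sourceB.txt

    Then load them from HDFS: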

    hive> LOAD DATA INPATH '/home/hadoop/sourceA.txt' INTO TABLE testA PARTITION(create_time='2015-07-08');  
    ...(output omitted)
    OK  
    Time taken: 0.237 seconds  
    hive> LOAD DATA INPATH '/home/hadoop/sourceB.txt' INTO TABLE testB PARTITION(create_time='2015-07-09');  
    <pre name="code" class="java">...(省略)  
    OK  
    Time taken: 0.212 seconds  
    hive> select * from testA;  
    OK  
    1   fish1   SZ  2015-07-08  
    2   fish2   SH  2015-07-08  
    3   fish3   HZ  2015-07-08  
    4   fish4   QD  2015-07-08  
    5   fish5   SR  2015-07-08  
    Time taken: 0.029 seconds, Fetched: 5 row(s)  
    hive> select * from testB;  
    OK  
    1   zy1 SZ  1001    2015-07-09  
    2   zy2 SH  1002    2015-07-09  
    3   zy3 HZ  1003    2015-07-09  
    4   zy4 QD  1004    2015-07-09  
    5   zy5 SR  1005    2015-07-09  
    Time taken: 0.047 seconds, Fetched: 5 row(s)  

    /home/hadoop/sourceA.txt was imported into table testA, and /home/hadoop/sourceB.txt into table testB. Note that without the LOCAL keyword, LOAD DATA INPATH moves the source file from its HDFS location into the table's warehouse directory instead of copying it.

    (4) Importing from another table while creating a table (CTAS)

    hive> create table testC as select name, code from testB;  
    Total jobs = 3  
    Launching Job 1 out of 3  
    Number of reduce tasks is set to 0 since there's no reduce operator  
    Starting Job = job_1449746265797_0106, Tracking URL = http://hadoopcluster79:8088/proxy/application_1449746265797_0106/  
    Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job  -kill job_1449746265797_0106  
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0  
    2015-12-24 16:40:17,981 Stage-1 map = 0%,  reduce = 0%  
    2015-12-24 16:40:23,115 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.11 sec  
    MapReduce Total cumulative CPU time: 1 seconds 110 msec  
    Ended Job = job_1449746265797_0106  
    Stage-4 is selected by condition resolver.  
    Stage-3 is filtered out by condition resolver.  
    Stage-5 is filtered out by condition resolver.  
    Moving data to: hdfs://hadoop2cluster/tmp/hive-root/hive_2015-12-24_16-40-09_983_6048680148773453194-1/-ext-10001  
    Moving data to: hdfs://hadoop2cluster/home/hadoop/hivedata/warehouse/testc  
    Table default.testc stats: [numFiles=1, numRows=0, totalSize=45, rawDataSize=0]  
    MapReduce Jobs Launched:   
    Job 0: Map: 1   Cumulative CPU: 1.11 sec   HDFS Read: 297 HDFS Write: 45 SUCCESS  
    Total MapReduce CPU Time Spent: 1 seconds 110 msec  
    OK  
    Time taken: 14.292 seconds  
    hive> desc testC;  
    OK  
    name                    string                                        
    code                    string                                        
    Time taken: 0.032 seconds, Fetched: 2 row(s)  
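    Note that a CTAS target table cannot be partitioned or external. When only the schema is wanted, CREATE TABLE ... LIKE copies a table's definition (including partition columns) without copying any data; a sketch with a hypothetical table name testD:

    hive> create table testD like testA;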

    II. Ways to Export Data from Hive

    (1) Exporting to the local file system

    hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/output' ROW FORMAT DELIMITED FIELDS TERMINATED by ',' select * from testA;  
    Total jobs = 1  
    Launching Job 1 out of 1  
    Number of reduce tasks is set to 0 since there's no reduce operator  
    Starting Job = job_1451024007879_0001, Tracking URL = http://hadoopcluster79:8088/proxy/application_1451024007879_0001/  
    Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job  -kill job_1451024007879_0001  
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0  
    2015-12-25 17:04:30,447 Stage-1 map = 0%,  reduce = 0%  
    2015-12-25 17:04:35,616 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.16 sec  
    MapReduce Total cumulative CPU time: 1 seconds 160 msec  
    Ended Job = job_1451024007879_0001  
    Copying data to local directory /home/hadoop/output  
    Copying data to local directory /home/hadoop/output  
    MapReduce Jobs Launched:   
    Job 0: Map: 1   Cumulative CPU: 1.16 sec   HDFS Read: 305 HDFS Write: 110 SUCCESS  
    Total MapReduce CPU Time Spent: 1 seconds 160 msec  
    OK  
    Time taken: 16.701 seconds  

    View the resulting data:

    [hadoop@hadoopcluster78 output]$ cat /home/hadoop/output/000000_0   
    1,fish1,SZ,2015-07-08  
    2,fish2,SH,2015-07-08  
    3,fish3,HZ,2015-07-08  
    4,fish4,QD,2015-07-08  
    5,fish5,SR,2015-07-08  

    INSERT OVERWRITE LOCAL DIRECTORY exports the data of table testA to the local directory /home/hadoop/output. Since the HQL statement runs as a MapReduce job, /home/hadoop/output is effectively the job's output path, and the result lands in a file named 000000_0.
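    When the same table must be exported to several destinations, Hive's multi-insert form scans the source only once; a sketch with hypothetical output directories:

    hive> FROM testA
        > INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/output1' select *
        > INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/output2' select id, name;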

    (2) Exporting to HDFS

    Exporting to HDFS is similar to exporting to the local file system: just drop the LOCAL keyword from the HQL statement.

    hive> INSERT OVERWRITE DIRECTORY '/home/hadoop/output' select * from testA;   
    Total jobs = 3  
    Launching Job 1 out of 3  
    Number of reduce tasks is set to 0 since there's no reduce operator  
    Starting Job = job_1451024007879_0002, Tracking URL = http://hadoopcluster79:8088/proxy/application_1451024007879_0002/  
    Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job  -kill job_1451024007879_0002  
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0  
    2015-12-25 17:08:51,034 Stage-1 map = 0%,  reduce = 0%  
    2015-12-25 17:08:59,313 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.4 sec  
    MapReduce Total cumulative CPU time: 1 seconds 400 msec  
    Ended Job = job_1451024007879_0002  
    Stage-3 is selected by condition resolver.  
    Stage-2 is filtered out by condition resolver.  
    Stage-4 is filtered out by condition resolver.  
    Moving data to: hdfs://hadoop2cluster/home/hadoop/hivedata/hive-hadoop/hive_2015-12-25_17-08-43_733_1768532778392261937-1/-ext-10000  
    Moving data to: /home/hadoop/output  
    MapReduce Jobs Launched:   
    Job 0: Map: 1   Cumulative CPU: 1.4 sec   HDFS Read: 305 HDFS Write: 110 SUCCESS  
    Total MapReduce CPU Time Spent: 1 seconds 400 msec  
    OK  
    Time taken: 16.667 seconds  

    View the HDFS output file:

    [hadoop@hadoopcluster78 bin]$ ./hadoop fs -cat /home/hadoop/output/000000_0  
    1fish1SZ2015-07-08  
    2fish2SH2015-07-08  
    3fish3HZ2015-07-08  
    4fish4QD2015-07-08  
    5fish5SR2015-07-08  
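    The fields appear to run together because no ROW FORMAT was specified, so Hive wrote its default field delimiter \001 (Ctrl-A), which the terminal does not display. Since Hive 0.11 the delimiter can be set on a directory export too; a minimal sketch:

    hive> INSERT OVERWRITE DIRECTORY '/home/hadoop/output' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select * from testA;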

    Other methods

    Data can also be exported with Hive's -e and -f command-line options.

    With -e, the SQL statement follows the option directly; the path after >> is the output file:

    [hadoop@hadoopcluster78 bin]$ ./hive -e "select * from testA" >> /home/hadoop/output/testA.txt  
    15/12/25 17:15:07 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead  
      
    Logging initialized using configuration in file:/home/hadoop/apache/hive-0.13.1/conf/hive-log4j.properties  
    OK  
    Time taken: 1.128 seconds, Fetched: 5 row(s)  
    [hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/testA.txt   
    1   fish1   SZ  2015-07-08  
    2   fish2   SH  2015-07-08  
    3   fish3   HZ  2015-07-08  
    4   fish4   QD  2015-07-08  
    5   fish5   SR  2015-07-08  

    With -f, the option is followed by a file containing the SQL statement; the path after >> is the output file.

    The SQL file:

    [hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/sql.sql   
    select * from testA  

    Run it with -f:

    [hadoop@hadoopcluster78 bin]$ ./hive -f /home/hadoop/output/sql.sql >> /home/hadoop/output/testB.txt  
    15/12/25 17:20:52 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead  
      
    Logging initialized using configuration in file:/home/hadoop/apache/hive-0.13.1/conf/hive-log4j.properties  
    OK  
    Time taken: 1.1 seconds, Fetched: 5 row(s)  

    View the result:

    [hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/testB.txt   
    1   fish1   SZ  2015-07-08  
    2   fish2   SH  2015-07-08  
    3   fish3   HZ  2015-07-08  
    4   fish4   QD  2015-07-08  
    5   fish5   SR  2015-07-08  
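    Tip: adding the -S (silent) option suppresses Hive's informational messages, which keeps scripted exports clean; a sketch:

    [hadoop@hadoopcluster78 bin]$ ./hive -S -e "select * from testA" >> /home/hadoop/output/testA.txt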