zoukankan      html  css  js  c++  java
  • hive外部表自动读取文件夹里的数据

    我们在创建表的时候可以指定external关键字创建外部表,外部表对应的文件存储在location指定的目录下,向该目录添加新文件的同时,该表也会读取到该文件(当然文件格式必须跟表定义的一致),删除外部表的同时并不会删除location指定目录下的文件.

    1.查看hdfs系统目录/user/hadoop1/myfile下文件

    [hadoop1@node1]$ hadoop fs -ls /user/hadoop1/myfile/
    Found 1 items
    -rw-r--r--   3 hadoop1 supergroup     567839 2014-10-29 16:50 /user/hadoop1/myfile/tb_class.txt
    

     2.创建外部表指向myfile目录下的文件

    hive (hxl)> create external table tb_class_info_external
              > (id int,
              > class_name string,
              > createtime timestamp ,
              > modifytime timestamp)
              > ROW FORMAT DELIMITED
              > FIELDS TERMINATED BY '|'
              > location '/user/hadoop1/myfile';
    OK
    Time taken: 0.083 seconds 
    

     注意这里的location指向的是hdfs系统上的路径,而不是本地机器上的路径,这里表tb_class_info_external会读取myfile目录下的所有文件

    3.查看外部表

    hive (hxl)> select count(1) from tb_class_info_external;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=
    Starting Job = job_201410300915_0009, Tracking URL = http://node1:50030/jobdetails.jsp?jobid=job_201410300915_0009
    Kill Command = /usr1/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=http://192.168.56.101:9001 -kill job_201410300915_0009
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2014-10-30 15:25:10,652 Stage-1 map = 0%,  reduce = 0%
    2014-10-30 15:25:12,664 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:25:13,671 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:25:14,682 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:25:15,690 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:25:16,697 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:25:17,704 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:25:18,710 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:25:19,718 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:25:20,725 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.21 sec
    2014-10-30 15:25:21,730 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.21 sec
    2014-10-30 15:25:22,737 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.21 sec
    MapReduce Total cumulative CPU time: 1 seconds 210 msec
    Ended Job = job_201410300915_0009
    MapReduce Jobs Launched:
    Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.21 sec   HDFS Read: 568052 HDFS Write: 6 SUCCESS
    Total MapReduce CPU Time Spent: 1 seconds 210 msec
    OK
    10001
    Time taken: 14.742 seconds 
    

     可以看到这里表记录数是10001,下面我们在myfile目录下添加另外一个文件tb_class_bak.txt

    4.在myfile目录下添加文本

    $hadoop fs -cp /user/hadoop1/myfile/tb_class.txt /user/hadoop1/myfile/tb_class_bak.txt
    

     5.再次查询表记录数

    hive (hxl)> select count(1) from tb_class_info_external;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=
    Starting Job = job_201410300915_0010, Tracking URL = http://node1:50030/jobdetails.jsp?jobid=job_201410300915_0010
    Kill Command = /usr1/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=http://192.168.56.101:9001 -kill job_201410300915_0010
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2014-10-30 15:32:02,275 Stage-1 map = 0%,  reduce = 0%
    2014-10-30 15:32:04,286 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:32:05,292 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:32:06,300 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:32:07,306 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:32:08,313 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:32:09,319 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:32:10,327 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:32:11,331 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.48 sec
    2014-10-30 15:32:12,338 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.16 sec
    2014-10-30 15:32:13,343 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.16 sec
    2014-10-30 15:32:14,350 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.16 sec
    MapReduce Total cumulative CPU time: 1 seconds 160 msec
    Ended Job = job_201410300915_0010
    MapReduce Jobs Launched:
    Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.16 sec   HDFS Read: 1135971 HDFS Write: 6 SUCCESS
    Total MapReduce CPU Time Spent: 1 seconds 160 msec
    OK
    20002
    Time taken: 14.665 seconds 
    

     可以看到记录数加倍了,那就说明表已经读取了新增加的文件

    6.删除表

    hive (hxl)> drop table tb_class_info_external;
    OK
    Time taken: 1.7 seconds
    

     表对应的文件并没有删除

    [hadoop1@node1]$ hadoop fs -ls /user/hadoop1/myfile/
    Found 2 items
    -rw-r--r--   3 hadoop1 supergroup     567839 2014-10-29 16:50 /user/hadoop1/myfile/tb_class.txt
    -rw-r--r--   3 hadoop1 supergroup     567839 2014-10-30 15:28 /user/hadoop1/myfile/tb_class_bak.txt
    
  • 相关阅读:
    BZOJ_3133_[Baltic2013]ballmachine_堆+倍增
    机器学习—朴素贝叶斯及其优化
    机器学习-输出一颗树
    机器学习-决策树
    KNN-综合应用
    KNN-机器学习算法
    [转载]Jupyter notebook调试
    机器学习-归一化
    神经网络模型及反向传播代码完全解析
    [转载]神经网络偏置项(bias)的设置及作用
  • 原文地址:https://www.cnblogs.com/zwgblog/p/5969827.html
Copyright © 2011-2022 走看看