  [Repost] An example of working with Hive data from Impala

    https://blog.csdn.net/wiborgite/article/details/78813342

    Background:

    This walkthrough uses the CDH QuickStart VM, a single VM that runs HDFS, YARN, HBase, Hive, Impala, and other components together.

    We load a text file from HDFS into Hive, sync the metadata, and then work with the data in Impala.

    ----------------------------------------------------------------- Linux shell operations -----------------------------------------------------------------

    1. Upload the data file from the local PC to /home/data in the VM

      [root@quickstart data]# pwd
      /home/data
      [root@quickstart data]# ls
      p10pco2a.dat  stock_data2.csv
      [root@quickstart data]# head p10pco2a.dat
      WOCE_P10,1993,279.479,-16.442,172.219,24.9544,34.8887,1.0035,363.551,2
      WOCE_P10,1993,279.480,-16.440,172.214,24.9554,34.8873,1.0035,363.736,2
      WOCE_P10,1993,279.480,-16.439,172.213,24.9564,34.8868,1.0033,363.585,2
      WOCE_P10,1993,279.481,-16.438,172.209,24.9583,34.8859,1.0035,363.459,2
      WOCE_P10,1993,279.481,-16.437,172.207,24.9594,34.8859,1.0033,363.543,2
      WOCE_P10,1993,279.481,-16.436,172.205,24.9604,34.8858,1.0035,363.432,2
      WOCE_P10,1993,279.489,-16.417,172.164,24.9743,34.8867,1.0036,362.967,2
      WOCE_P10,1993,279.490,-16.414,172.158,24.9742,34.8859,1.0035,362.960,2
      WOCE_P10,1993,279.491,-16.412,172.153,24.9747,34.8864,1.0033,362.998,2
      WOCE_P10,1993,279.492,-16.411,172.148,24.9734,34.8868,1.0031,363.022,2


    2. Upload /home/data/p10pco2a.dat to HDFS

      [root@quickstart data]# hdfs dfs -put p10pco2a.dat /tmp/
      [root@quickstart data]# hdfs dfs -ls /tmp
      -rw-r--r--   1 root supergroup     281014 2017-12-14 18:47 /tmp/p10pco2a.dat


    ----------------------------------------------------------------- Hive operations -----------------------------------------------------------------

    1. Start the Hive CLI

    # hive

    2. Create a database in Hive

    CREATE DATABASE weather;

    3. Create a table in Hive

      create table weather.weather_everydate_detail
      (
        section string,
        year    bigint,
        date    double,
        latim   double,
        longit  double,
        sur_tmp double,
        sur_sal double,
        atm_per double,
        xco2a   double,
        qf      bigint
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
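Each comma-separated field in p10pco2a.dat maps positionally onto one column of this table. A minimal Python sketch of that mapping (the column names and types come straight from the DDL above; Hive's bigint/double/string become int/float/str):

```python
# Parse one line of p10pco2a.dat into the columns declared in the Hive DDL above.
COLUMNS = [
    ("section", str), ("year", int), ("date", float), ("latim", float),
    ("longit", float), ("sur_tmp", float), ("sur_sal", float),
    ("atm_per", float), ("xco2a", float), ("qf", int),
]

def parse_line(line):
    """Split a comma-delimited record and convert each field to its column type."""
    fields = line.strip().split(",")
    return {name: conv(value) for (name, conv), value in zip(COLUMNS, fields)}

row = parse_line("WOCE_P10,1993,279.479,-16.442,172.219,24.9544,34.8887,1.0035,363.551,2")
print(row["section"], row["year"], row["sur_tmp"])  # WOCE_P10 1993 24.9544
```

This is exactly the interpretation ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' asks Hive's default SerDe to perform on each line.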



    4. Load the data from HDFS into the newly created Hive table (note that LOAD DATA INPATH moves the file out of /tmp into the table's warehouse directory, rather than copying it)

      hive> LOAD DATA INPATH '/tmp/p10pco2a.dat' INTO TABLE weather.weather_everydate_detail;
      Loading data to table weather.weather_everydate_detail
      Table weather.weather_everydate_detail stats: [numFiles=1, totalSize=281014]
      OK
      Time taken: 1.983 seconds


    5. Check the Hive table to confirm the data was loaded

      use weather;
      select * from weather.weather_everydate_detail limit 10;
      select count(*) from weather.weather_everydate_detail;

      hive> select * from weather.weather_everydate_detail limit 10;
      OK
      WOCE_P10 1993 279.479 -16.442 172.219 24.9544 34.8887 1.0035 363.551 2
      WOCE_P10 1993 279.48 -16.44 172.214 24.9554 34.8873 1.0035 363.736 2
      WOCE_P10 1993 279.48 -16.439 172.213 24.9564 34.8868 1.0033 363.585 2
      WOCE_P10 1993 279.481 -16.438 172.209 24.9583 34.8859 1.0035 363.459 2
      WOCE_P10 1993 279.481 -16.437 172.207 24.9594 34.8859 1.0033 363.543 2
      WOCE_P10 1993 279.481 -16.436 172.205 24.9604 34.8858 1.0035 363.432 2
      WOCE_P10 1993 279.489 -16.417 172.164 24.9743 34.8867 1.0036 362.967 2
      WOCE_P10 1993 279.49 -16.414 172.158 24.9742 34.8859 1.0035 362.96 2
      WOCE_P10 1993 279.491 -16.412 172.153 24.9747 34.8864 1.0033 362.998 2
      WOCE_P10 1993 279.492 -16.411 172.148 24.9734 34.8868 1.0031 363.022 2
      Time taken: 0.815 seconds, Fetched: 10 row(s)
      hive> select count(*) from weather.weather_everydate_detail;
      Query ID = root_20171214185454_c783708d-ad4b-46cc-9341-885c16a286fe
      Total jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks determined at compile time: 1
      In order to change the average load for a reducer (in bytes):
        set hive.exec.reducers.bytes.per.reducer=<number>
      In order to limit the maximum number of reducers:
        set hive.exec.reducers.max=<number>
      In order to set a constant number of reducers:
        set mapreduce.job.reduces=<number>
      Starting Job = job_1512525269046_0001, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1512525269046_0001/
      Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1512525269046_0001
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
      2017-12-14 18:55:27,386 Stage-1 map = 0%, reduce = 0%
      2017-12-14 18:56:11,337 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 39.36 sec
      2017-12-14 18:56:18,711 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 41.88 sec
      MapReduce Total cumulative CPU time: 41 seconds 880 msec
      Ended Job = job_1512525269046_0001
      MapReduce Jobs Launched:
      Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 41.88 sec  HDFS Read: 288541  HDFS Write: 5  SUCCESS
      Total MapReduce CPU Time Spent: 41 seconds 880 msec
      OK
      4018
      Time taken: 101.82 seconds, Fetched: 1 row(s)


    6. Run an ordinary query:

      hive> select * from weather_everydate_detail where sur_sal=34.8105;
      OK
      WOCE_P10 1993 312.148 34.602 141.951 24.0804 34.8105 1.0081 361.29 2
      WOCE_P10 1993 312.155 34.602 141.954 24.0638 34.8105 1.0079 360.386 2
      Time taken: 0.138 seconds, Fetched: 2 row(s)



    ----------------------------------------------------------------- Impala operations -----------------------------------------------------------------

    1. Start the Impala shell

    # impala-shell 

    2. Sync the metadata in Impala

      [quickstart.cloudera:21000] > INVALIDATE METADATA;
      Query: invalidate METADATA
      Query submitted at: 2017-12-14 19:01:12 (Coordinator: http://quickstart.cloudera:25000)
      Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=43460ace5d3a9971:9a50f46600000000

      Fetched 0 row(s) in 3.25s
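The sync-then-query pattern can also be scripted. A hypothetical sketch in the DB-API style used by Python Impala clients such as impyla; obtaining a cursor from a live impalad is assumed and not shown, so the helper just takes any DB-API cursor:

```python
# Hypothetical helper: sync Impala's metadata cache, then query a table.
# `cursor` is any DB-API cursor (e.g. from impyla's connect().cursor());
# connecting to the quickstart VM is assumed and omitted here.
def sync_and_count(cursor, table):
    # INVALIDATE METADATA makes Impala reload table metadata from the Hive
    # metastore, so tables created in Hive become visible to Impala.
    cursor.execute("INVALIDATE METADATA")
    cursor.execute("SELECT COUNT(*) FROM {}".format(table))
    return cursor.fetchone()[0]
```

Note that INVALIDATE METADATA discards and reloads all cached metadata; for a single table that Impala already knows about, a narrower REFRESH of that table is the cheaper option.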


    3. Inspect the Hive table's structure from Impala

      [quickstart.cloudera:21000] > use weather;
      Query: use weather
      [quickstart.cloudera:21000] > desc weather.weather_everydate_detail;
      Query: describe weather.weather_everydate_detail
      +---------+--------+---------+
      | name    | type   | comment |
      +---------+--------+---------+
      | section | string |         |
      | year    | bigint |         |
      | date    | double |         |
      | latim   | double |         |
      | longit  | double |         |
      | sur_tmp | double |         |
      | sur_sal | double |         |
      | atm_per | double |         |
      | xco2a   | double |         |
      | qf      | bigint |         |
      +---------+--------+---------+
      Fetched 10 row(s) in 3.70s


    4. Count the records

      [quickstart.cloudera:21000] > select count(*) from weather.weather_everydate_detail;
      Query: select count(*) from weather.weather_everydate_detail
      Query submitted at: 2017-12-14 19:03:11 (Coordinator: http://quickstart.cloudera:25000)
      Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=5542894eeb80e509:1f9ce37f00000000
      +----------+
      | count(*) |
      +----------+
      | 4018     |
      +----------+
      Fetched 1 row(s) in 2.51s

    Note: comparing the same count(*) query, Impala took 2.51s versus Hive's 101.82s; Impala's advantage here is substantial.
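    The gap is easy to quantify from the two wall-clock times reported in the transcripts above:

```python
# Wall-clock times for the same count(*) query, taken from the transcripts:
hive_seconds = 101.82    # Hive, compiled to a MapReduce job
impala_seconds = 2.51    # Impala, executed directly by impalad
speedup = hive_seconds / impala_seconds
print("Impala was about {:.0f}x faster".format(speedup))  # about 41x
```

Most of Hive's time here is MapReduce job startup overhead, which dwarfs the actual work on a 4018-row table; the ratio would look very different on a dataset large enough to amortize that startup cost.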

    5. Run an ordinary query

      [quickstart.cloudera:21000] > select * from weather_everydate_detail where sur_sal=34.8105;
      Query: select * from weather_everydate_detail where sur_sal=34.8105
      Query submitted at: 2017-12-14 19:20:27 (Coordinator: http://quickstart.cloudera:25000)
      Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=c14660ed0bda471f:d92fcf0e00000000
      +----------+------+---------+--------+---------+---------+---------+---------+---------+----+
      | section  | year | date    | latim  | longit  | sur_tmp | sur_sal | atm_per | xco2a   | qf |
      +----------+------+---------+--------+---------+---------+---------+---------+---------+----+
      | WOCE_P10 | 1993 | 312.148 | 34.602 | 141.951 | 24.0804 | 34.8105 | 1.0081  | 361.29  | 2  |
      | WOCE_P10 | 1993 | 312.155 | 34.602 | 141.954 | 24.0638 | 34.8105 | 1.0079  | 360.386 | 2  |
      +----------+------+---------+--------+---------+---------+---------+---------+---------+----+
      Fetched 2 row(s) in 0.25s

      [quickstart.cloudera:21000] > select * from weather_everydate_detail where sur_tmp=24.0804;
      Query: select * from weather_everydate_detail where sur_tmp=24.0804
      Query submitted at: 2017-12-14 23:15:32 (Coordinator: http://quickstart.cloudera:25000)
      Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=774e2b3b81f4eed7:8952b5b400000000
      +----------+------+---------+--------+---------+---------+---------+---------+--------+----+
      | section  | year | date    | latim  | longit  | sur_tmp | sur_sal | atm_per | xco2a  | qf |
      +----------+------+---------+--------+---------+---------+---------+---------+--------+----+
      | WOCE_P10 | 1993 | 312.148 | 34.602 | 141.951 | 24.0804 | 34.8105 | 1.0081  | 361.29 | 2  |
      +----------+------+---------+--------+---------+---------+---------+---------+--------+----+
      Fetched 1 row(s) in 3.86s
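    One caveat about these queries: sur_sal and sur_tmp are double columns, and filtering them with = only works here because the literal in the query and the text in the data file parse to exactly the same double. In general, equality comparison on floating-point values is fragile, and a small tolerance (or a BETWEEN range in SQL) is safer. A quick Python illustration:

```python
# Equality on doubles is fragile: the classic case where arithmetic produces
# a value that is "equal" mathematically but not bit-for-bit.
a = 0.1 + 0.2
print(a == 0.3)                  # False: a is 0.30000000000000004
# Safer: compare within a tolerance, the moral equivalent of
#   WHERE sur_sal BETWEEN 34.8105 - 1e-6 AND 34.8105 + 1e-6
tolerance = 1e-9
print(abs(a - 0.3) < tolerance)  # True
```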



    6. Conclusion

    For SQL that Hive has to compile into a MapReduce job, running the same query in Impala gives a clear speed advantage. But not every Hive query is compiled to MapReduce, and for those queries Impala has little advantage over Hive.

  • Original post: https://www.cnblogs.com/wincai/p/10431165.html