zoukankan      html  css  js  c++  java
  • Hadoop-Impala学习笔记之入门

    CDH quickstart vm包含了单节点的全套hadoop服务生态,可从https://www.cloudera.com/downloads/quickstart_vms/5-13.html下载。如下:

    对应的节点如下(不包含Cloudera Navigator):

    要学习完整的hadoop生态,最好是使用8C/32GB以上的服务器,4C/16GB勉强能跑、但是很勉强(最好使用2个以上节点)。

    impala 使用c++编写(Spark使用Scala编写),采用MPP架构(类似于MariaDB Columnstore,也就是之前的infinidb),由下列组件组成:

    Hue是一个Web智能查询分析器,能够进行语法提示,查询Impala、HDFS、HBase。如下:

    其中impala服务器由Impala Daemon(执行SQL)、Impala Statestore(监控Daemon状态)、Impala Catalog(将DDL变更传输给Daemon节点,避免了DDL通过Impala执行时运行REFRESH/INVALIDATE METADATA的必要,通过Hive时,仍然需要)组成。impala-shell和mysql客户端类似,执行SQL。

    Impala使用和Hive一样的元数据,其可以存储在mysql或postgresql中,称为metastore。

    Impala使用HDFS作为主要的存储底层,利用其冗余特性。

    Impala还支持Hbase作为存储,通过定义映射到Hbase的表,可以查询HBase表,还可以关联查询HBase和Impala表。

    impala可以使用Cloudera Manager或命令行启动:

    Cloudera Manager启动如下:

    命令行启动(这种方式启动CM是无法监控到其状态的,而且进程也略有不同):

    service impala-state-store start/restart/stop

    service impala-catalog start/restart/stop

    service impala-server start/restart/stop

    CM启动后进程如下:

    日志信息位于/var/log/impala,如下:

    配置可通过CM修改、也可以修改配置文件/etc/default/impala。

     impala客户端

    [root@quickstart impala]# impala-shell 
    Starting Impala Shell without Kerberos authentication
    Connected to quickstart.cloudera:21000
    Server version: impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4)
    ***********************************************************************************
    Welcome to the Impala shell.
    (Impala Shell v2.10.0-cdh5.13.0 (2511805) built on Wed Oct  4 10:55:37 PDT 2017)
    
    The HISTORY command lists all shell commands in chronological order.
    ***********************************************************************************
    [quickstart.cloudera:21000] > 

    全新的Impala实例包含2个库:default(新创建表的默认库)以及_impala_builtins。

    可通过show database/show table/select version()查看数据库信息(语法兼容SQL 92/MySQL,大部分NoSQL的参考实现,不同于Oracle)

    [quickstart.cloudera:21000] > select version();
    Query: select version()
    Query submitted at: 2019-04-06 04:59:25 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=ef4cbbf93a7662e5:f4e103f500000000
    +-------------------------------------------------------------------------------------------+
    | version()                                                                                 |
    +-------------------------------------------------------------------------------------------+
    | impalad version 2.10.0-cdh5.13.0 RELEASE (build 2511805f1eaa991df1460276c7e9f19d819cd4e4) |
    | Built on Wed Oct  4 10:55:37 PDT 2017                                                     |
    +-------------------------------------------------------------------------------------------+
    Fetched 1 row(s) in 0.34s
    [quickstart.cloudera:21000] > show databases;
    Query: show databases
    +------------------+----------------------------------------------+
    | name             | comment                                      |
    +------------------+----------------------------------------------+
    | _impala_builtins | System database for Impala builtin functions |
    | default          | Default Hive database                        |
    +------------------+----------------------------------------------+
    Fetched 2 row(s) in 0.01s
    [quickstart.cloudera:21000] > show tables;
    Query: show tables
    +------+
    | name |
    +------+
    | tab1 |
    | tab2 |
    | tab3 |
    +------+
    Fetched 3 row(s) in 0.02s
    [quickstart.cloudera:21000] > select * from tab1;
    Query: select * from tab1
    Query submitted at: 2019-04-06 05:05:57 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=dd4debde8a589622:81983b200000000
    +----+-------+------------+-------------------------------+
    | id | col_1 | col_2      | col_3                         |
    +----+-------+------------+-------------------------------+
    | 1  | true  | 123.123    | 2012-10-24 08:55:00           |
    | 2  | false | 1243.5     | 2012-10-25 13:40:00           |
    | 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
    | 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
    | 5  | true  | 243.325    | 1953-04-22 09:11:33           |
    +----+-------+------------+-------------------------------+
    Fetched 5 row(s) in 6.06s  -- 第一次访问特别慢,因为需要加载到内存
    [quickstart.cloudera:21000] > select * from tab1;
    Query: select * from tab1
    Query submitted at: 2019-04-06 05:06:05 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=21486d50db995bf2:a2d061b400000000
    +----+-------+------------+-------------------------------+
    | id | col_1 | col_2      | col_3                         |
    +----+-------+------------+-------------------------------+
    | 1  | true  | 123.123    | 2012-10-24 08:55:00           |
    | 2  | false | 1243.5     | 2012-10-25 13:40:00           |
    | 3  | false | 24453.325  | 2008-08-22 09:33:21.123000000 |
    | 4  | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
    | 5  | true  | 243.325    | 1953-04-22 09:11:33           |
    +----+-------+------------+-------------------------------+
    Fetched 5 row(s) in 0.26s
    [quickstart.cloudera:21000] > desc tab1;
    Query: describe tab1
    +-------+-----------+---------+
    | name  | type      | comment |
    +-------+-----------+---------+
    | id    | int       |         |
    | col_1 | boolean   |         |
    | col_2 | double    |         |
    | col_3 | timestamp |         |
    +-------+-----------+---------+
    Fetched 4 row(s) in 0.03s
    [quickstart.cloudera:21000] > select count(1) from tab1;
    Query: select count(1) from tab1
    Query submitted at: 2019-04-06 05:06:28 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9141cf84d4efd2e3:e667f1d00000000
    +----------+
    | count(1) |
    +----------+
    | 5        |
    +----------+
    Fetched 1 row(s) in 0.17s
    [quickstart.cloudera:21000] > create database my_first_impala_db;
    Query: create database my_first_impala_db
    Fetched 0 row(s) in 0.08s
    [quickstart.cloudera:21000] > create table t1 (x int);
    Query: create table t1 (x int)
    Fetched 0 row(s) in 0.08s
    [quickstart.cloudera:21000] > insert into t1 values (1), (3), (2), (4); --支持mysql语法
    Query: insert into t1 values (1), (3), (2), (4)
    Query submitted at: 2019-04-06 05:07:32 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=8c41d97919423a19:1f7aeeeb00000000
    Modified 4 row(s) in 5.17s
    [quickstart.cloudera:21000] > insert into t1 select * from t1;  --支持insert select
    Query: insert into t1 select * from t1
    Query submitted at: 2019-04-06 05:07:45 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=9c452b951b08da49:92d9fee100000000
    Modified 4 row(s) in 0.23s

    创建基于HDFS文件的表(需要先创建HDFS文件,并加载数据,参见Hadoop-HDFS学习笔记):

    [quickstart.cloudera:21000] > DROP TABLE IF EXISTS tab2;
    Query: drop TABLE IF EXISTS tab2
    [quickstart.cloudera:21000] > CREATE EXTERNAL TABLE tab2
                                > (
                                >    id INT,
                                >    col_1 BOOLEAN,
                                >    col_2 DOUBLE
                                > )
                                > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                                > LOCATION '/user/cloudera/sample_data/tab2';
    Query: create EXTERNAL TABLE tab2
    (
       id INT,
       col_1 BOOLEAN,
       col_2 DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/cloudera/sample_data/tab2'
    Fetched 0 row(s) in 0.09s
    [quickstart.cloudera:21000] > DROP TABLE IF EXISTS tab3;
    Query: drop TABLE IF EXISTS tab3
    [quickstart.cloudera:21000] > CREATE TABLE tab3
                                > (
                                >    id INT,
                                >    col_1 BOOLEAN,
                                >    col_2 DOUBLE,
                                >    month INT,
                                >    day INT
                                > )
                                > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    Query: create TABLE tab3
    (
       id INT,
       col_1 BOOLEAN,
       col_2 DOUBLE,
       month INT,
       day INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    Fetched 0 row(s) in 0.09s

     包含关联、子查询、聚合的SQL查询:

    SELECT tab2.*
    FROM tab2,
    (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
    FROM tab2, tab1
    WHERE tab1.id = tab2.id
    GROUP BY col_1) subquery1
    WHERE subquery1.max_col2 = tab2.col_2;

    查看SQL的执行计划:

    explain SELECT tab2.*
    FROM tab2,
    (SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
    FROM tab2, tab1
    WHERE tab1.id = tab2.id
    GROUP BY col_1) subquery1
    WHERE subquery1.max_col2 = tab2.col_2;
    +------------------------------------------------------------------------------------+
    | Explain String                                                                     |
    +------------------------------------------------------------------------------------+
    | Max Per-Host Resource Reservation: Memory=136.00MB                                 |
    | Per-Host Resource Estimates: Memory=4.34GB                                         |
    | WARNING: The following tables are missing relevant table and/or column statistics. |
    | default.tab1, default.tab2                                                         |
    |                                                                                    |
    | PLAN-ROOT SINK                                                                     |
    | |                                                                                  |
    | 10:EXCHANGE [UNPARTITIONED]                                                        |
    | |                                                                                  |
    | 05:HASH JOIN [INNER JOIN, BROADCAST]                                               |
    | |  hash predicates: tab2.col_2 = max(tab2.col_2)                                   |
    | |  runtime filters: RF000 <- max(tab2.col_2)                                       |
    | |                                                                                  |
    | |--09:EXCHANGE [BROADCAST]                                                         |
    | |  |                                                                               |
    | |  08:AGGREGATE [FINALIZE]                                                         |
    | |  |  output: max:merge(tab2.col_2)                                                |
    | |  |  group by: tab1.col_1                                                         |
    | |  |                                                                               |
    | |  07:EXCHANGE [HASH(tab1.col_1)]                                                  |
    | |  |                                                                               |
    | |  04:AGGREGATE [STREAMING]                                                        |
    | |  |  output: max(tab2.col_2)                                                      |
    | |  |  group by: tab1.col_1                                                         |
    | |  |                                                                               |
    | |  03:HASH JOIN [INNER JOIN, BROADCAST]                                            |
    | |  |  hash predicates: tab2.id = tab1.id                                           |
    | |  |  runtime filters: RF001 <- tab1.id                                            |
    | |  |                                                                               |
    | |  |--06:EXCHANGE [BROADCAST]                                                      |
    | |  |  |                                                                            |
    | |  |  02:SCAN HDFS [default.tab1]                                                  |
    | |  |     partitions=1/1 files=1 size=192B                                          |
    | |  |                                                                               |
    | |  01:SCAN HDFS [default.tab2]                                                     |
    | |     partitions=1/1 files=1 size=158B                                             |
    | |     runtime filters: RF001 -> tab2.id                                            |
    | |                                                                                  |
    | 00:SCAN HDFS [default.tab2]                                                        |
    |    partitions=1/1 files=1 size=158B                                                |
    |    runtime filters: RF000 -> tab2.col_2                                            |
    +------------------------------------------------------------------------------------+
    Fetched 41 row(s) in 0.05s

    创建基于Parquet的文件,并转换为内部分区表

    [quickstart.cloudera:21000] > USE airlines_data;
    Query: use airlines_data
    [quickstart.cloudera:21000] > CREATE EXTERNAL TABLE airlines_external
                                > LIKE PARQUET
                                > 'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
                                > STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines';
    Query: create EXTERNAL TABLE airlines_external
    LIKE PARQUET
    'hdfs:/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
    STORED AS PARQUET LOCATION 'hdfs:/user/impala/staging/airlines'
    WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/impala/staging'
    
    Fetched 0 row(s) in 0.82s
    [quickstart.cloudera:21000] > SHOW TABLE STATS airlines_external;
    Query: show TABLE STATS airlines_external
    +-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
    | #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                     |
    +-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
    | -1    | 8      | 1.34GB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines |
    +-------+--------+--------+--------------+-------------------+---------+-------------------+--------------------------------------------------------------+
    Fetched 1 row(s) in 3.89s
    [quickstart.cloudera:21000] > SHOW FILES IN airlines_external;
    Query: show FILES IN airlines_external
    +-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
    | Path                                                                                                                  | Size     | Partition |
    +-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
    | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq | 252.99MB |           |
    | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.1.parq | 13.43MB  |           |
    | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.0.parq  | 252.84MB |           |
    | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd984_501176748_data.1.parq  | 63.92MB  |           |
    | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd985_1199995767_data.0.parq | 183.64MB |           |
    | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd986_2086627597_data.0.parq | 240.04MB |           |
    | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd987_1048668565_data.0.parq | 211.35MB |           |
    | hdfs://quickstart.cloudera:8020/user/impala/staging/airlines/4345e5eef217aa1b-c8f16177f35fd988_1432111844_data.0.parq | 151.46MB |           |
    +-----------------------------------------------------------------------------------------------------------------------+----------+-----------+
    Fetched 8 row(s) in 0.04s
    [quickstart.cloudera:21000] > DESCRIBE airlines_external;
    Query: describe airlines_external
    +---------------------+--------+-----------------------------+
    | name                | type   | comment                     |
    +---------------------+--------+-----------------------------+
    | year                | int    | Inferred from Parquet file. |
    | month               | int    | Inferred from Parquet file. |
    | day                 | int    | Inferred from Parquet file. |
    | dayofweek           | int    | Inferred from Parquet file. |
    | dep_time            | int    | Inferred from Parquet file. |
    | crs_dep_time        | int    | Inferred from Parquet file. |

    -- 单表查询速度还是不错的。。。

    [quickstart.cloudera:21000] > SELECT COUNT(*) FROM airlines_external;
    Query: select COUNT(*) FROM airlines_external
    Query submitted at: 2019-04-06 07:08:33 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=a04cd594518cba3a:1666b4e00000000
    +-----------+
    | count(*) |
    +-----------+
    | 123534969 |
    +-----------+
    Fetched 1 row(s) in 0.33s
    [quickstart.cloudera:21000] > SElECT NDV(carrier), NDV(flight_num), NDV(tail_num),
    > NDV(origin), NDV(dest) FROM airlines_external;
    Query: select NDV(carrier), NDV(flight_num), NDV(tail_num),
    NDV(origin), NDV(dest) FROM airlines_external
    Query submitted at: 2019-04-06 07:08:53 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=df4776d3fa8c1f69:e4e09e6100000000
    +--------------+-----------------+---------------+-------------+-----------+
    | ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
    +--------------+-----------------+---------------+-------------+-----------+
    | 29 | 8463 | 3 | 342 | 349 |
    +--------------+-----------------+---------------+-------------+-----------+
    Fetched 1 row(s) in 9.33s
    [quickstart.cloudera:21000] > SELECT tail_num, COUNT(*) AS howmany FROM airlines_external
    > GROUP BY tail_num;
    Query: select tail_num, COUNT(*) AS howmany FROM airlines_external
    GROUP BY tail_num
    Query submitted at: 2019-04-06 07:09:19 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=7d43af89d18c600e:bc464f0f00000000
    +----------+-----------+
    | tail_num | howmany |
    +----------+-----------+
    | 715 | 1 |
    | 0 | 406405 |
    | 112 | 6562 |
    | NULL | 123122001 |
    +----------+-----------+
    Fetched 4 row(s) in 2.42s

    -- 大表关联,内存不足报错
    [quickstart.cloudera:21000] > SELECT DISTINCT dest FROM airlines_external
                                > WHERE dest NOT IN (SELECT origin FROM airlines_external);
    Query: select DISTINCT dest FROM airlines_external
    WHERE dest NOT IN (SELECT origin FROM airlines_external)
    Query submitted at: 2019-04-06 07:09:35 (Coordinator: http://quickstart.cloudera:25000)
    Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=314343c761a55f97:61ce9aa500000000
    WARNINGS: Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 314343c761a55f97:61ce9aa500000002
    Memory left in process limit: -328.00 KB
    Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
      Unclaimed reservations: Reservation=34.00 MB OtherMemory=0 Total=34.00 MB Peak=108.00 MB
      Fragment 314343c761a55f97:61ce9aa500000000: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB
        EXCHANGE_NODE (id=7): Total=0 Peak=0
        DataStreamRecvr: Total=0 Peak=0
        PLAN_ROOT_SINK: Total=0 Peak=0
        CodeGen: Total=0 Peak=0
      Fragment 314343c761a55f97:61ce9aa500000003: Reservation=0 OtherMemory=38.51 KB Total=38.51 KB Peak=383.65 KB
        AGGREGATION_NODE (id=6): Total=21.12 KB Peak=21.12 KB
          Exprs: Total=21.12 KB Peak=21.12 KB
        EXCHANGE_NODE (id=5): Total=0 Peak=0
        DataStreamRecvr: Total=0 Peak=0
        DataStreamSender (dst_id=7): Total=7.52 KB Peak=7.52 KB
        CodeGen: Total=1.86 KB Peak=347.00 KB
      Fragment 314343c761a55f97:61ce9aa500000002: Reservation=374.00 MB OtherMemory=14.72 MB Total=388.72 MB Peak=388.72 MB
        AGGREGATION_NODE (id=3): Reservation=34.00 MB OtherMemory=5.66 MB Total=39.66 MB Peak=39.66 MB
          Exprs: Total=21.12 KB Peak=21.12 KB
        HASH_JOIN_NODE (id=2): Reservation=340.00 MB OtherMemory=58.25 KB Total=340.06 MB Peak=340.09 MB
          Exprs: Total=21.12 KB Peak=21.12 KB
          Hash Join Builder (join_node_id=2): Total=21.12 KB Peak=29.12 KB
            Hash Join Builder (join_node_id=2) Exprs: Total=21.12 KB Peak=21.12 KB
        HDFS_SCAN_NODE (id=0): Total=8.98 MB Peak=9.27 MB
        EXCHANGE_NODE (id=4): Total=0 Peak=0
        DataStreamRecvr: Total=0 Peak=11.65 MB
        DataStreamSender (dst_id=5): Total=7.52 KB Peak=7.52 KB
        CodeGen: Total=12.80 KB Peak=2.00 MB
      Fragment 314343c761a55f97:61ce9aa500000001: Reservation=0 OtherMemory=0 Total=0 Peak=9.34 MB
        HDFS_SCAN_NODE (id=1): Total=0 Peak=9.32 MB
        DataStreamSender (dst_id=4): Total=0 Peak=7.52 KB
        CodeGen: Total=0 Peak=49.00 KBProcess: memory limit exceeded. Limit=512.00 MB Total=512.32 MB Peak=512.32 MB
      Buffer Pool: Free Buffers: Total=260.00 MB
      Buffer Pool: Clean Pages: Total=40.00 MB
      Buffer Pool: Unused Reservation: Total=-300.00 MB
      RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
      RequestPool=root.root: Total=0 Peak=139.93 MB
      RequestPool=root.cloudera: Total=184.00 B Peak=431.27 KB
        Query(a34bd5934157257d:2e53c5ce00000000): Reservation=0 ReservationLimit=409.60 MB OtherMemory=184.00 B Total=184.00 B Peak=431.27 KB
      RequestPool=root.hdfs: Total=422.76 MB Peak=423.60 MB
        Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
      Untracked Memory: Total=89.56 MB
    
    WARNING: The following tables are missing relevant table and/or column statistics.
    airlines_data.airlines_external
    Memory limit exceeded: Error occurred on backend quickstart.cloudera:22000 by fragment 314343c761a55f97:61ce9aa500000002
    Memory left in process limit: -328.00 KB
    Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
      Unclaimed reservations: Reservation=34.00 MB OtherMemory=0 Total=34.00 MB Peak=108.00 MB
      Fragment 314343c761a55f97:61ce9aa500000000: Reservation=0 OtherMemory=8.00 KB Total=8.00 KB Peak=8.00 KB
        EXCHANGE_NODE (id=7): Total=0 Peak=0
        DataStreamRecvr: Total=0 Peak=0
        PLAN_ROOT_SINK: Total=0 Peak=0
        CodeGen: Total=0 Peak=0
      Fragment 314343c761a55f97:61ce9aa500000003: Reservation=0 OtherMemory=38.51 KB Total=38.51 KB Peak=383.65 KB
        AGGREGATION_NODE (id=6): Total=21.12 KB Peak=21.12 KB
          Exprs: Total=21.12 KB Peak=21.12 KB
        EXCHANGE_NODE (id=5): Total=0 Peak=0
        DataStreamRecvr: Total=0 Peak=0
        DataStreamSender (dst_id=7): Total=7.52 KB Peak=7.52 KB
        CodeGen: Total=1.86 KB Peak=347.00 KB
      Fragment 314343c761a55f97:61ce9aa500000002: Reservation=374.00 MB OtherMemory=14.72 MB Total=388.72 MB Peak=388.72 MB
        AGGREGATION_NODE (id=3): Reservation=34.00 MB OtherMemory=5.66 MB Total=39.66 MB Peak=39.66 MB
          Exprs: Total=21.12 KB Peak=21.12 KB
        HASH_JOIN_NODE (id=2): Reservation=340.00 MB OtherMemory=58.25 KB Total=340.06 MB Peak=340.09 MB
          Exprs: Total=21.12 KB Peak=21.12 KB
          Hash Join Builder (join_node_id=2): Total=21.12 KB Peak=29.12 KB
            Hash Join Builder (join_node_id=2) Exprs: Total=21.12 KB Peak=21.12 KB
        HDFS_SCAN_NODE (id=0): Total=8.98 MB Peak=9.27 MB
        EXCHANGE_NODE (id=4): Total=0 Peak=0
        DataStreamRecvr: Total=0 Peak=11.65 MB
        DataStreamSender (dst_id=5): Total=7.52 KB Peak=7.52 KB
        CodeGen: Total=12.80 KB Peak=2.00 MB
      Fragment 314343c761a55f97:61ce9aa500000001: Reservation=0 OtherMemory=0 Total=0 Peak=9.34 MB
        HDFS_SCAN_NODE (id=1): Total=0 Peak=9.32 MB
        DataStreamSender (dst_id=4): Total=0 Peak=7.52 KB
        CodeGen: Total=0 Peak=49.00 KBProcess: memory limit exceeded. Limit=512.00 MB Total=512.32 MB Peak=512.32 MB
      Buffer Pool: Free Buffers: Total=260.00 MB
      Buffer Pool: Clean Pages: Total=40.00 MB
      Buffer Pool: Unused Reservation: Total=-300.00 MB
      RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
      RequestPool=root.root: Total=0 Peak=139.93 MB
      RequestPool=root.cloudera: Total=184.00 B Peak=431.27 KB
        Query(a34bd5934157257d:2e53c5ce00000000): Reservation=0 ReservationLimit=409.60 MB OtherMemory=184.00 B Total=184.00 B Peak=431.27 KB
      RequestPool=root.hdfs: Total=422.76 MB Peak=423.60 MB
        Query(314343c761a55f97:61ce9aa500000000): Reservation=408.00 MB ReservationLimit=409.60 MB OtherMemory=14.76 MB Total=422.76 MB Peak=423.60 MB
      Untracked Memory: Total=89.56 MB (1 of 3 similar)

    impala-server:25000可以查看语句执行进度:

    CREATE TABLE airlines_data.airlines
    (month INT,
    day INT,
    dayofweek INT,
    dep_time INT,
    crs_dep_time INT,
    arr_time INT,
    crs_arr_time INT,
    carrier STRING,
    flight_num INT,
    actual_elapsed_time INT,
    crs_elapsed_time INT,
    airtime INT,
    arrdelay INT,
    depdelay INT,
    origin STRING,
    dest STRING,
    distance INT,
    taxi_in INT,
    taxi_out INT,
    cancelled INT,
    cancellation_code STRING,
    diverted INT,
    carrier_delay INT,
    weather_delay INT,
    nas_delay INT,
    security_delay INT,
    late_aircraft_delay INT)
    PARTITIONED BY (year INT)
    STORED AS PARQUET;
    
    INSERT INTO airlines_data.airlines
    PARTITION (year)
    SELECT
    month,
    day,
    dayofweek,
    dep_time,
    crs_dep_time,
    arr_time,
    crs_arr_time,
    carrier,
    flight_num,
    actual_elapsed_time,
    crs_elapsed_time,
    airtime,
    arrdelay,
    depdelay,
    origin,
    dest,
    distance,
    taxi_in,
    taxi_out,
    cancelled,
    cancellation_code,
    diverted,
    carrier_delay,
    weather_delay,
    nas_delay,
    security_delay,
    late_aircraft_delay,
    year
    FROM airlines_data.airlines_external limit 1000000; -- 1亿的时候内存不足
    [quickstart.cloudera:21000] > SHOW TABLE STATS airlines;
    Query: show TABLE STATS airlines
    +-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
    | year  | #Rows   | #Files | Size     | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                                                |
    +-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
    | 1990  | 897600  | 1      | 6.62MB   | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=1990 |
    | 2002  | 49152   | 1      | 527.13KB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2002 |
    | 2003  | 53248   | 1      | 603.76KB | NOT CACHED   | NOT CACHED        | PARQUET | true              | hdfs://quickstart.cloudera:8020/user/hive/warehouse/airlines_data.db/airlines/year=2003 |
    | Total | 1000000 | 3      | 7.73MB   | 0B           |                   |         |                   |                                                                                         |
    +-------+---------+--------+----------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------------------------------------+
    Fetched 4 row(s) in 0.14s
    [quickstart.cloudera:21000] > COMPUTE INCREMENTAL STATS airlines;
    Query: compute INCREMENTAL STATS airlines
    WARNINGS: No partitions selected for incremental stats update
    
    Fetched 0 row(s) in 0.01s
    [quickstart.cloudera:21000] > COMPUTE STATS airlines;
    Query: compute STATS airlines
    +------------------------------------------+
    | summary                                  |
    +------------------------------------------+
    | Updated 3 partition(s) and 27 column(s). |
    +------------------------------------------+
    Fetched 1 row(s) in 2.02s

    TPCDS:https://github.com/cloudera/impala-tpcds-kit/tree/master/tpcds-gen(可生成TPC-DS测试数据集、同时包含TPC-DS测试,可生成10TB级别)

    官方自带的tpcds-kit customer数据无效了(主要是http://www.tpc.org/tpcds/dsgen/dsgen-download-files.asp链接无效了),脚本在https://github.com/sleberknight/impalascripts-0.6/blob/master/tpcds-setup.sh。

     常见问题:

    在使用Hadoop建立文件的时候,出现“Cannot create directory /user/hadoop/input. Name node is in safe mode.”问题的原因及解决方案
    解决方法:https://www.waitig.com/hadoop-name-node-is-in-safe-mode.html

    [cloudera@quickstart ~]$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
    put: Permission denied: user=cloudera, access=WRITE, inode="/user/cloudera/sample_data/tab1":hdfs:cloudera:drwxr-xr-x
    切换到HDFS用户即可,如下:
    -bash-4.1$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
    -bash-4.1$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
    -bash-4.1$ hdfs dfs -ls /user/cloudera/sample_data/tab1
    Found 1 items
    -rw-r--r-- 1 hdfs cloudera 192 2019-04-05 23:06 /user/cloudera/sample_data/tab1/tab1.csv

    WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://quickstart.cloudera:8020/user/cloudera/sample_data'
    报错分析:
    impala-shell运行的时候使用Impala用户,impala对hfds路径没有读写权限;

    问题处理:
    方法一:对hdfs的目录进行赋权:hadoop fs -chomd -R 777 path
    -bash-4.1$ hadoop fs -chmod -R 777 /user/cloudera/sample_data
    -bash-4.1$ exit
    logout

    方法二:创建hadoop用户组,然后将impala加入到hadoop用户组中,同时给impala用户定制权限

    [quickstart.cloudera:21000] > CREATE EXTERNAL TABLE airlines_external
    > LIKE PARQUET
    > 'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
    > STORED AS PARQUET LOCATION 'hdfs:staging/airlines';
    Query: create EXTERNAL TABLE airlines_external
    LIKE PARQUET
    'hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq'
    STORED AS PARQUET LOCATION 'hdfs:staging/airlines'
    ERROR: AnalysisException: null
    CAUSED BY: IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq
    CAUSED BY: URISyntaxException: Relative path in absolute URI: hdfs:staging/airlines/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq

    解决方法:

    路径无效,使用正确的路径,如hdfs:/user/impala/staging/airlines/XXX

  • 相关阅读:
    【图文并茂,点赞收藏哦!】重学巩固你的Vuejs知识体系
    RabbitMQ概念及控制台介绍
    RabbitMQ入门介绍及环境搭建
    C# 利用PdfSharp生成Pdf文件
    利用pdf.js在线展示PDF文档
    Python办公自动化之Excel转Word
    C#利用ServiceStack.Redis访问Redis
    Redis基础之事务
    Redis基础之配置文件
    第0天 | 12天搞定Pyhon,前言
  • 原文地址:https://www.cnblogs.com/zhjh256/p/10662733.html
Copyright © 2011-2022 走看看