  • Impala + HDFS + Parquet files

    [Create the directory]
    hdfs dfs -mkdir -p /user/hdfs/sample_data/parquet

    [Grant permissions]
    sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/hdfs/sample_data

    [Delete the directory]
    hdfs dfs -rm -r /user/hdfs/sample_data/parquet

    [Upload files]
    hdfs dfs -put -f device /user/hdfs/sample_data/parquet
    hdfs dfs -put -f metrics /user/hdfs/sample_data/parquet

    [List files]
    hdfs dfs -ls /user/hdfs/sample_data/parquet
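
Before uploading, it can help to confirm that the local files really are Parquet. A minimal sketch, assuming the local `device` and `metrics` directories contain the data files: every Parquet file starts and ends with the 4-byte magic `PAR1`, so checking those bytes catches obviously wrong files. The function names `hex_bytes` and `is_parquet` are hypothetical helpers, not part of any Hadoop tool.

```shell
#!/bin/sh
# Hypothetical helper: print stdin as a contiguous hex string.
hex_bytes() {
  od -An -tx1 | tr -d ' \n'
}

# Return 0 if the file starts and ends with the Parquet magic "PAR1".
is_parquet() {
  magic=50415231   # "PAR1" in hex
  [ -f "$1" ] || return 1
  [ "$(head -c 4 "$1" | hex_bytes)" = "$magic" ] &&
    [ "$(tail -c 4 "$1" | hex_bytes)" = "$magic" ]
}

# Report on every file passed as an argument; with no arguments, do nothing.
for f in "$@"; do
  if is_parquet "$f"; then
    echo "ok: $f"
  else
    echo "not parquet: $f"
  fi
done
```

Saved as, say, `check_parquet.sh` (a hypothetical name), it could be run as `sh check_parquet.sh device/* metrics/*`. This only checks the magic bytes; it does not validate the file's schema or footer.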

    [Create an Impala table, unpartitioned]
    DROP TABLE IF EXISTS device_parquet;
    CREATE EXTERNAL TABLE device_parquet
    (
    deviceId STRING,
    deviceName STRING,
    orgId STRING
    )
    STORED AS PARQUET
    LOCATION '/user/hdfs/sample_data/parquet/device';

    [Create an Impala table, partitioned] (after creating the table, add its partitions with the ALTER statements below)
    DROP TABLE IF EXISTS metrics_parquet;
    CREATE EXTERNAL TABLE metrics_parquet
    (
    deviceId STRING,
    reading BIGINT,
    time STRING
    )
    partitioned by (year string)
    STORED AS PARQUET
    LOCATION '/user/hdfs/sample_data/parquet/metrics';

    [Add table partitions]
    alter table metrics_parquet add partition (year="2017");
    alter table metrics_parquet add partition (year="2018");
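
A partition added without a `LOCATION` clause is read from the default directory `<table location>/<partition column>=<value>`, so the 2017 data is expected under `.../metrics/year=2017`. The sketch below only prints the HDFS commands that would stage one file per year into those directories; it does not run them. The local file names `metrics_<year>.parquet` and the function names are assumptions made for illustration.

```shell
#!/bin/sh
# Default directory for a partition added without LOCATION:
# <table location>/<partition column>=<value>
partition_dir() {
  printf '%s/%s=%s\n' "$1" "$2" "$3"
}

# Print (do not run) the HDFS commands that would stage one local file
# per year. The local file names metrics_<year>.parquet are hypothetical.
stage_commands() {
  table_dir=$1
  shift
  for y in "$@"; do
    dir=$(partition_dir "$table_dir" year "$y")
    echo "hdfs dfs -mkdir -p $dir"
    echo "hdfs dfs -put -f metrics_$y.parquet $dir/"
  done
}

stage_commands /user/hdfs/sample_data/parquet/metrics 2017 2018
```

Printing the commands first makes it easy to review the target paths before piping them to `sh` or copying them by hand.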

    [Drop partitions]
    alter table metrics_parquet drop partition (year="2017");
    alter table metrics_parquet drop partition (year="2018");

    [Show table partitions]
    show partitions metrics_parquet;

    [Query without specifying a partition]
    select
    T_3C75F1.`deviceId`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`),
    sum(T_3C75F1.`reading`),
    count(1)
    from (
      select device_parquet.deviceId, reading, metrics_parquet.time as time
      from device_parquet
      join metrics_parquet on device_parquet.deviceId = metrics_parquet.deviceId
    ) as `T_3C75F1`
    group by
    T_3C75F1.`deviceId`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`);

    Timing: with 50 rows in the device table and 100 million rows (261 MB) in the metrics table, the query above takes 135 seconds on average.

    [Query a specific partition]
    select
    T_3C75F1.`deviceId`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`),
    sum(T_3C75F1.`reading`),
    count(1)
    from (
      select device_parquet.deviceId, reading, metrics_parquet.time as time
      from device_parquet
      join metrics_parquet on device_parquet.deviceId = metrics_parquet.deviceId
      where metrics_parquet.year = '2017'
    ) as `T_3C75F1`
    group by
    T_3C75F1.`deviceId`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`);

    Timing: with 50 rows in the device table and 100 million rows (261 MB) in the metrics table, the query above takes 96 seconds on average.
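
Filtering on the partition column lets Impala skip the directories of the other partitions, which is where the difference between the two timings comes from. The two queries do not return the same rows, so this is a comparison of scan cost rather than of equivalent results. As a small worked calculation, the relative reduction from 135 to 96 seconds can be computed as follows; `pct_reduction` is a hypothetical helper name.

```shell
#!/bin/sh
# Percentage reduction in runtime going from $1 seconds to $2 seconds.
pct_reduction() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

pct_reduction 135 96   # prints 28.9
```

That is, restricting the query to the 2017 partition cut the average runtime by roughly 29% in this measurement.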

    [Query multiple partitions]
    select
    T_3C75F1.`deviceId`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`),
    sum(T_3C75F1.`reading`),
    count(1)
    from (
      select device_parquet.deviceId, reading, metrics_parquet.time as time
      from device_parquet
      join metrics_parquet on device_parquet.deviceId = metrics_parquet.deviceId
      where metrics_parquet.year in ('2017', '2018')
    ) as `T_3C75F1`
    group by
    T_3C75F1.`deviceId`,
    year(T_3C75F1.`time`),
    month(T_3C75F1.`time`);

    [Refresh data] (when the data in HDFS changes, run the following commands so that Impala picks up the changes)
    refresh device_parquet;
    refresh metrics_parquet;
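
The refresh can also be issued from a script after each upload. A minimal sketch, assuming `impala-shell` is installed on the machine running it: one statement is passed per `impala-shell -q` call. By default the script only prints the commands; `RUN=1` is a hypothetical opt-in switch introduced for this example, and `refresh_statement` is a hypothetical helper.

```shell
#!/bin/sh
# Build the REFRESH statement for one table.
refresh_statement() {
  printf 'REFRESH %s' "$1"
}

for t in device_parquet metrics_parquet; do
  stmt=$(refresh_statement "$t")
  if [ "${RUN:-0}" = 1 ]; then
    # Actually execute the statement (requires impala-shell on PATH).
    impala-shell -q "$stmt"
  else
    # Dry run: show what would be executed.
    echo "impala-shell -q \"$stmt\""
  fi
done
```

Connection options such as the target impalad host are omitted here and would need to be added for a real cluster.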

  • Original article: https://www.cnblogs.com/puroc/p/9233902.html