  • Hive query statements

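    I. Why Hive is a data warehouse

    1. Hive is constrained by HDFS, so ordinary (non-ACID) tables cannot insert, update, or delete individual records
    2. The MapReduce jobs underneath Hive take a long time to start, so Hive cannot match the sub-second queries of a traditional database and is only suited to offline analysis
    3. Hive does not support transactions by default and cannot meet OLTP requirements; for OLTP, choose HBase or Cassandra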

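    II. Installing Hive

    1. Every Hive client needs a metastore service that holds the metadata (table schemas, partition information); the metadata is usually stored in tables of a traditional relational database
    2. By default Hive stores the metadata in an embedded Derby database; because embedded Derby can only be used by one process at a time, two or more Hive CLIs cannot run operations concurrently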

    III. HQL data manipulation

    1. Loading a text file into a table

      LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
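
    As a rough illustration of the syntax above, the sketch below loads a local file into one partition. The file path is hypothetical, and the `employees` table is assumed to be partitioned by `country` and `state`, as in the later examples.

    ```sql
    -- LOCAL reads from the client's file system; without it the path is resolved on HDFS
    -- and the file is moved (not copied) into the table's directory.
    -- OVERWRITE replaces whatever the target partition already contains.
    -- '/tmp/employees_ca.txt' is a hypothetical path used only for illustration.
    LOAD DATA LOCAL INPATH '/tmp/employees_ca.txt'
    OVERWRITE INTO TABLE employees
    PARTITION (country = 'US', state = 'CA');
    ```

    LOAD DATA does not parse or validate the file, so its layout has to match the table's row format already.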
      
    2. Dynamic partition insert:

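       -- Dynamic partition insert: country and state take their values from the last two columns of the SELECT
       INSERT OVERWRITE TABLE employees PARTITION (country, state) SELECT * FROM staged_employees se;
       -- Create a table directly from a query result (CTAS)
       CREATE TABLE ca_employees AS SELECT name, salary FROM employees se WHERE se.state = 'CA';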
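
    A fully dynamic insert is normally rejected unless dynamic partitioning is enabled in nonstrict mode. The sketch below shows the two session settings and a mixed static/dynamic insert. It assumes `employees` has only the data columns `name` and `salary`, and that `staged_employees` has columns named `cnty` and `st`; those two column names are hypothetical.

    ```sql
    -- Allow dynamic partitions in this session.
    SET hive.exec.dynamic.partition=true;
    -- strict mode demands at least one static partition column; nonstrict allows all of them to be dynamic.
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- country is static, state is dynamic and is read from the last SELECT column.
    -- se.cnty and se.st are hypothetical column names.
    INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state)
    SELECT se.name, se.salary, se.st
    FROM staged_employees se
    WHERE se.cnty = 'US';
    ```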
      
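    3. Multiple inserts from a single query
      A FROM clause followed by several INSERT INTO clauses scans the source table only once while inserting into several tables or partitions, which is more efficient than running one query per insert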

      FROM from_statement  
      INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1  
      [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
      
      -- 1. Create a bucketed, partitioned table
      CREATE TABLE TESTA
         (person_name string, person_org_name string, level2_org_name string)
      PARTITIONED BY (import_time   string)
      CLUSTERED BY (person_name) INTO 8 BUCKETS
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS textfile;
      
      -- 2. Insert data into a partition
      from (select person_name, person_org_name, level2_org_name from iap_app_log_import_minute where import_time in ('2015-01-16-0000', '2015-01-200000')) applog
      insert into table testa partition(import_time = '2015-01')
      select applog.person_name,
             applog.person_org_name,
             applog.level2_org_name;
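
    The statement above uses only one INSERT clause. To show the single-scan, multiple-insert form, here is a minimal sketch that reads `iap_app_log_import_minute` once and writes two partitions of `testa`. The partition values and the `import_time` filter values are assumptions chosen for illustration.

    ```sql
    -- One scan of the source table feeds both INSERT clauses.
    FROM iap_app_log_import_minute applog
    INSERT INTO TABLE testa PARTITION (import_time = '2015-01-16')
      SELECT applog.person_name, applog.person_org_name, applog.level2_org_name
      WHERE applog.import_time = '2015-01-16-0000'
    INSERT INTO TABLE testa PARTITION (import_time = '2015-01-20')
      SELECT applog.person_name, applog.person_org_name, applog.level2_org_name
      WHERE applog.import_time = '2015-01-20-0000';
    ```

    Each INSERT clause carries its own WHERE filter, so the rows are routed to different targets without a second pass over the source.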
      

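    IV. Query statements

    1. sort by + distribute by vs. order by + group by
      (1) order by: the whole result set is globally ordered; all rows pass through a single reducer, so it is slow on large data
      (2) sort by: rows are ordered within each reducer; the output is globally ordered only when there is exactly one reducer (more efficient than order by)
      (3) distribute by: MapReduce hashes the map output keys and sends key-value pairs with the same hash to the same reducer, so rows with equal distribute by values land on the same reducer
      (4) cluster by: shorthand for distribute by plus sort by on the same columns; rows are ordered within each reducer, not globally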
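
    The four clauses can be compared side by side with a small sketch. It assumes the `employees` table from the earlier examples exposes `name`, `salary`, and `state`.

    ```sql
    -- Global order: every row goes through one reducer.
    SELECT name, salary FROM employees ORDER BY salary DESC;

    -- Per-reducer order only: each reducer's output is sorted, the combined output is not.
    SELECT name, salary FROM employees SORT BY salary DESC;

    -- Rows with the same state reach the same reducer, then are sorted inside it.
    SELECT name, salary, state FROM employees
    DISTRIBUTE BY state
    SORT BY state, salary DESC;

    -- Shorthand for DISTRIBUTE BY state SORT BY state (ascending only).
    SELECT name, salary, state FROM employees CLUSTER BY state;
    ```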

    2. Listing partitions

      show partitions employees;
      
      SHOW PARTITIONS employees PARTITION(country='US');
      
    3. Sampling a bucketed table with tablesample

      select * from testa tablesample(bucket 3 out of 10 on  person_name)
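
    `testa` was created with 8 buckets on `person_name`, while the query above samples out of 10. That works, but as far as the Hive documentation describes it, Hive can restrict the read to the matching bucket files only when the sampling column matches the CLUSTERED BY column and the denominator is a multiple or factor of the bucket count. A sketch aligned with the 8 buckets:

    ```sql
    -- Reads bucket 3 of the 8 physical buckets (about 1/8 of the rows).
    SELECT * FROM testa TABLESAMPLE(BUCKET 3 OUT OF 8 ON person_name);

    -- A denominator of 4 selects two of the eight buckets (about 1/4 of the rows).
    SELECT * FROM testa TABLESAMPLE(BUCKET 3 OUT OF 4 ON person_name);
    ```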
      
    4. lateral view

      pageid        adid_list
      contact_page  [3, 4, 5]
      front_page    [1, 2, 3]

      SELECT pageid, adid
          FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;

      pageid        adid
      contact_page  3
      contact_page  4
      contact_page  5
      front_page    1
      front_page    2
      front_page    3

    V. Other features

    1. Views: CREATE VIEW view_name AS select_statement
    2. Indexes:
    ```sql
    CREATE INDEX index_name ON TABLE base_table_name (col_name, ...) AS 'index.handler.class.name' [WITH DEFERRED REBUILD] 
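    -- WITH DEFERRED REBUILD creates the index empty; it is populated, and refreshed after the table data changes, by running ALTER INDEX ... REBUILD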
    [IDXPROPERTIES (property_name=property_value, ...)]
    [IN TABLE index_table_name]
    [PARTITIONED BY (col_name, ...)]
    [
      [ ROW FORMAT ...] STORED AS ...
      | STORED BY ...
    ]
    [LOCATION hdfs_path]
    [TBLPROPERTIES (...)]
    [COMMENT "index comment"]
    ```
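
    As a concrete instance of the template above, the sketch below builds a compact index on the `testa` table from section III. The index name `idx_testa_person` is hypothetical. Note that index support was removed in Hive 3.0, so this only applies to earlier releases.

    ```sql
    -- Create the index definition without building it yet.
    -- idx_testa_person is a hypothetical name.
    CREATE INDEX idx_testa_person
    ON TABLE testa (person_name)
    AS 'COMPACT'
    WITH DEFERRED REBUILD;

    -- Build the index; run this again after the table data changes.
    ALTER INDEX idx_testa_person ON testa REBUILD;

    -- Inspect and remove the index.
    SHOW FORMATTED INDEX ON testa;
    DROP INDEX IF EXISTS idx_testa_person ON testa;
    ```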
  • Original article: https://www.cnblogs.com/72808ljup/p/5220430.html