  • Hive Getting Started Tutorial

    Installing Hive

    Unlike many tutorials that start with concepts, I like to get things installed first and then use examples to introduce the concepts. So let's install Hive first.

    First check whether the appropriate yum repository is already installed. If not, set up the CDH yum repository as described in this tutorial: http://blog.csdn.net/nsrainbow/article/details/36629339
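
    With the repo in place, installing Hive on CDH is typically just a yum install plus starting the services. The sketch below is an assumption based on the standard CDH package names (hive, hive-metastore, hive-server2); check the CDH documentation for your exact version:

    $ sudo yum install -y hive hive-metastore hive-server2   # standard CDH package names
    $ sudo service hive-metastore start                      # metadata service
    $ sudo service hive-server2 start                        # query service
    $ hive                                                   # start the Hive CLI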

    What is Hive?

    Hive gives you a way to query data with SQL. But it is best not to use Hive for real-time queries: Hive works by translating each SQL statement into one or more MapReduce jobs, so it is very slow and quite resource-hungry. The official documentation says Hive is suited to high-latency scenarios.

    As a simple example, you can run a query like this:

    hive> select * from h_employee;
    OK
    1   1   peter
    2   2   paul
    Time taken: 9.289 seconds, Fetched: 2 row(s)

     Note that this h_employee is not necessarily an ordinary database table; as we'll see in the external-table section, it is actually mapped onto an HBase table.

    metastore

    Tables created in Hive are all metastore tables. They do not actually store the data themselves; they define the mapping between the real data and Hive, much like a table's meta information in a traditional database, hence the name metastore. For the actual storage there are four table types you can define:

    internal tables (the default), partitioned tables, bucketed tables, and external tables. As an example, here is a statement that creates an internal table:

    CREATE TABLE worker(id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

     This statement creates an internal table named worker. Internal is the default type, so no storage mode needs to be written. The fields are stored separated by commas ('\054' is the octal ASCII code for a comma; note the backslash).
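
    You can ask Hive to show the meta information it keeps for the table with describe. The output below is illustrative rather than copied from a real session:

    hive> describe worker;
    OK
    id                      int
    name                    string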

    Types supported in table definitions

    Primitive data types
    tinyint / smallint / int / bigint
    float / double
    boolean
    string

    Complex data types
    Array/Map/Struct

    There is no date / datetime type.
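
    As a quick sketch of the complex types (the table and column names here are made up for illustration), a declaration can look like this:

    CREATE TABLE employee_profile(
        id INT,
        skills ARRAY<STRING>,
        scores MAP<STRING, INT>,
        address STRUCT<city:STRING, street:STRING>)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003';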

    Where do the tables live once created?

    Under /user/hive/warehouse. You can check the location of the created tables through hdfs:

    $ hdfs dfs -ls /user/hive/warehouse
    Found 11 items
    drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee
    drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee2
    drwxrwxrwt   - wlsuser  supergroup          0 2014-12-04 17:21 /user/hive/warehouse/h_employee_export
    drwxrwxrwt   - root     supergroup          0 2014-08-18 09:20 /user/hive/warehouse/h_http_access_logs
    drwxrwxrwt   - root     supergroup          0 2014-06-30 10:15 /user/hive/warehouse/hbase_apache_access_log
    drwxrwxrwt   - username supergroup          0 2014-06-27 17:48 /user/hive/warehouse/hbase_table_1
    drwxrwxrwt   - username supergroup          0 2014-06-30 09:21 /user/hive/warehouse/hbase_table_2
    drwxrwxrwt   - username supergroup          0 2014-06-30 09:43 /user/hive/warehouse/hive_apache_accesslog
    drwxrwxrwt   - root     supergroup          0 2014-12-02 15:12 /user/hive/warehouse/hive_employee

     Each directory corresponds to one metastore table.

    Using the various Hive table types

    CREATE TABLE workers( id INT, name STRING)  
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

     A statement like this creates an internal table named workers, with comma as the delimiter ('\054' is octal ASCII for a comma).
    We can run show tables; to see how many tables there are. Many Hive statements imitate MySQL; when you don't know the syntax for something, the MySQL statement will usually just work. The odd one out is limit, which we'll get to later.

    hive> show tables;
    OK
    h_employee
    h_employee2
    h_employee_export
    h_http_access_logs
    hive_employee
    workers
    Time taken: 0.371 seconds, Fetched: 6 row(s)
    

      Once the table is built, let's try inserting a few rows. Be aware that Hive does not support single-row insert statements; inserts must be batched. So don't expect a statement like insert into workers values (1,'jack') to work. Hive supports two ways of inserting data: loading from a file, and inserting rows read from another table (insert from select; a sketch appears a bit later). Here I'll read from a file. First create a file named workers.csv:

    $ cat workers.csv
    1,jack
    2,terry
    3,michael

    Use LOAD DATA to import it into the Hive table:

    hive> LOAD DATA LOCAL INPATH '/home/alex/workers.csv' INTO TABLE workers;
    Copying data from file:/home/alex/workers.csv
    Copying file: file:/home/alex/workers.csv
    Loading data to table default.workers
    Table default.workers stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 25, raw_data_size: 0]
    OK
    Time taken: 0.655 seconds

    Note: don't leave out LOCAL. The difference between LOAD DATA LOCAL INPATH and LOAD DATA INPATH is that the former looks for the source file on your local disk, while the latter looks for it on HDFS. Adding OVERWRITE empties the table before the import, e.g. LOAD DATA LOCAL INPATH '/home/alex/workers.csv' OVERWRITE INTO TABLE workers; Now let's query the data:

    hive> select * from workers;
    OK
    1   jack
    2   terry
    3   michael
    Time taken: 0.177 seconds, Fetched: 3 row(s)
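
    For comparison, here is a hedged sketch of the non-LOCAL variant. The source file must already be in HDFS (the /tmp path is just an assumption here), and note that Hive moves the file into the table directory rather than copying it:

    $ hdfs dfs -put /home/alex/workers.csv /tmp/workers.csv
    hive> LOAD DATA INPATH '/tmp/workers.csv' OVERWRITE INTO TABLE workers;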

    Let's see how the internal table stores this after the import:

    # hdfs dfs -ls /user/hive/warehouse/workers/
    Found 1 items
    -rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv

    So it just copies the file in untouched! That's how crude it is! Let's experiment by adding another file, workers2.txt (I deliberately changed the extension; Hive actually ignores extensions):

    # cat workers2.txt 
    4,peter
    5,kate
    6,ted

    Import it:

    hive> LOAD DATA LOCAL INPATH '/home/alex/workers2.txt' INTO TABLE workers;
    Copying data from file:/home/alex/workers2.txt
    Copying file: file:/home/alex/workers2.txt
    Loading data to table default.workers
    Table default.workers stats: [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 46, raw_data_size: 0]
    OK
    Time taken: 0.79 seconds

    Look at the storage layout:

    # hdfs dfs -ls /user/hive/warehouse/workers/
    Found 2 items
    -rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv
    -rwxrwxrwt   2 root supergroup         21 2014-12-08 15:29 /user/hive/warehouse/workers/workers2.txt

    There's an extra workers2.txt now. Query again with SQL:

    hive> select * from workers;
    OK
    1   jack
    2   terry
    3   michael
    4   peter
    5   kate
    6   ted
    Time taken: 0.144 seconds, Fetched: 6 row(s)
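
    And here is the other insert path mentioned earlier, insert-from-select, as a minimal sketch. The workers_backup table is hypothetical; any table with a matching schema works:

    hive> CREATE TABLE workers_backup(id INT, name STRING)
        > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
    hive> INSERT OVERWRITE TABLE workers_backup SELECT id, name FROM workers;

    Unlike LOAD DATA, this actually runs a MapReduce job.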

    Partitioned tables

    Partitioned tables exist to speed up queries. Say you have a huge amount of data, but your use case is building daily reports on it: partition by day, and when you build the report for 2014-05-05 you only need to load that single day's data. Let's create a partitioned table and take a look:

    create table partition_employee(id int, name string) 
    partitioned by(daytime string) 
    row format delimited fields TERMINATED BY '\054';

    Notice that the partition key is a separate attribute, not one of the regular columns. First let's create two test data files, one for each day:

    # cat 2014-05-05
    22,kitty
    33,lily
    # cat 2014-05-06
    14,sami
    45,micky

    Import them into the partitioned table:

    hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-05' INTO TABLE partition_employee partition(daytime='2014-05-05');
    Copying data from file:/home/alex/2014-05-05
    Copying file: file:/home/alex/2014-05-05
    Loading data to table default.partition_employee partition (daytime=2014-05-05)
    Partition default.partition_employee{daytime=2014-05-05} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
    Table default.partition_employee stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
    OK
    Time taken: 1.154 seconds
    hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-06' INTO TABLE partition_employee partition(daytime='2014-05-06');
    Copying data from file:/home/alex/2014-05-06
    Copying file: file:/home/alex/2014-05-06
    Loading data to table default.partition_employee partition (daytime=2014-05-06)
    Partition default.partition_employee{daytime=2014-05-06} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
    Table default.partition_employee stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 42, raw_data_size: 0]
    OK
    Time taken: 0.763 seconds

    When importing, the target partition is specified with partition.
    When querying, specify the partition in the where clause:

    hive> select * from partition_employee where daytime='2014-05-05';
    OK
    22  kitty   2014-05-05
    33  lily    2014-05-05
    Time taken: 0.173 seconds, Fetched: 2 row(s)

    There is nothing special about my query syntax: Hive automatically detects whether the where clause involves the partition column. Comparison operators such as greater-than and less-than work too:

    hive> select * from partition_employee where daytime>='2014-05-05';
    OK
    22  kitty   2014-05-05
    33  lily    2014-05-05
    14  sami    2014-05-06
    45  micky   2014-05-06
    Time taken: 0.273 seconds, Fetched: 4 row(s)

    Let's look at the storage layout:

    # hdfs dfs -ls /user/hive/warehouse/partition_employee
    Found 2 items
    drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-05
    drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-06
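
    You can also list a table's partitions directly; the output (illustrative here) is one line per partition:

    hive> show partitions partition_employee;
    OK
    daytime=2014-05-05
    daytime=2014-05-06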

    Now let's try a two-level partitioned table:

    create table p_student(id int, name string) 
    partitioned by(daytime string,country string) 
    row format delimited fields TERMINATED BY '\054';

    Prepare some data to insert:

    # cat 2014-09-09-CN 
    1,tammy
    2,eric
    # cat 2014-09-10-CN 
    3,paul
    4,jolly
    # cat 2014-09-10-EN 
    44,ivan
    66,billy

    Import into Hive:

    hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-09-CN' INTO TABLE p_student partition(daytime='2014-09-09',country='CN');
    Copying data from file:/home/alex/2014-09-09-CN
    Copying file: file:/home/alex/2014-09-09-CN
    Loading data to table default.p_student partition (daytime=2014-09-09, country=CN)
    Partition default.p_student{daytime=2014-09-09, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
    Table default.p_student stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
    OK
    Time taken: 0.736 seconds
    hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-CN' INTO TABLE p_student partition(daytime='2014-09-10',country='CN');
    Copying data from file:/home/alex/2014-09-10-CN
    Copying file: file:/home/alex/2014-09-10-CN
    Loading data to table default.p_student partition (daytime=2014-09-10, country=CN)
    Partition default.p_student{daytime=2014-09-10, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
    Table default.p_student stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 38, raw_data_size: 0]
    OK
    Time taken: 0.691 seconds
    hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-EN' INTO TABLE p_student partition(daytime='2014-09-10',country='EN');
    Copying data from file:/home/alex/2014-09-10-EN
    Copying file: file:/home/alex/2014-09-10-EN
    Loading data to table default.p_student partition (daytime=2014-09-10, country=EN)
    Partition default.p_student{daytime=2014-09-10, country=EN} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
    Table default.p_student stats: [num_partitions: 3, num_files: 3, num_rows: 0, total_size: 59, raw_data_size: 0]
    OK
    Time taken: 0.622 seconds

    Look at the storage layout:

    # hdfs dfs -ls /user/hive/warehouse/p_student
    Found 2 items
    drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09
    drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-10
    # hdfs dfs -ls /user/hive/warehouse/p_student/daytime=2014-09-09
    Found 1 items
    drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09/country=CN

    Query the data:

    hive> select * from p_student;
    OK
    1   tammy   2014-09-09  CN
    2   eric    2014-09-09  CN
    3   paul    2014-09-10  CN
    4   jolly   2014-09-10  CN
    44  ivan    2014-09-10  EN
    66  billy   2014-09-10  EN
    Time taken: 0.228 seconds, Fetched: 6 row(s)
    hive> select * from p_student where daytime='2014-09-10' and country='EN';
    OK
    44  ivan    2014-09-10  EN
    66  billy   2014-09-10  EN
    Time taken: 0.224 seconds, Fetched: 2 row(s)
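
    Since each partition is just a directory, you can drop a single partition without touching the rest. A sketch; note that for an internal table this deletes that partition's data as well:

    hive> ALTER TABLE p_student DROP PARTITION (daytime='2014-09-09', country='CN');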

    Bucketed tables

     A bucketed table distributes rows into different "buckets" according to the hash of some column. Westerners have a habit of sorting things into a few labeled buckets, which is where the vivid name comes from. Bucketed tables are used specifically for sampling analysis.
    The example below is adapted from the official tutorial. Since partitioning and bucketing can be combined, this example uses both features at once:

    CREATE TABLE b_student(id INT, name STRING)
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(id) SORTED BY(name) INTO 4 BUCKETS
    row format delimited 
        fields TERMINATED BY '\054';

     This means the hash is computed on id, and rows within each bucket are stored sorted by name. I won't walk through the data preparation and import again; this is the data after the import:

    hive> select * from b_student;
    OK
    1   tammy   2014-09-09  CN
    2   eric    2014-09-09  CN
    3   paul    2014-09-10  CN
    4   jolly   2014-09-10  CN
    34  allen   2014-09-11  EN
    Time taken: 0.727 seconds, Fetched: 5 row(s)
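
    One caveat worth flagging: a plain LOAD DATA just copies files and does not actually hash rows into buckets. In Hive of this era, the usual way to get real bucketing is to enable enforcement and insert from another table. A sketch, reusing p_student from above (the partition values are arbitrary):

    hive> set hive.enforce.bucketing = true;
    hive> INSERT OVERWRITE TABLE b_student PARTITION(dt='2014-09-10', country='EN')
        > SELECT id, name FROM p_student WHERE daytime='2014-09-10' AND country='EN';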

     Sample one bucket's worth of data out of the 4 buckets:

    hive> select * from b_student tablesample(bucket 1 out of 4 on id);
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_1406097234796_0041, Tracking URL = http://hadoop01:8088/proxy/application_1406097234796_0041/
    Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1406097234796_0041
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    2014-12-08 17:35:56,995 Stage-1 map = 0%,  reduce = 0%
    2014-12-08 17:36:06,783 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec
    2014-12-08 17:36:07,845 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec
    MapReduce Total cumulative CPU time: 2 seconds 900 msec
    Ended Job = job_1406097234796_0041
    MapReduce Jobs Launched: 
    Job 0: Map: 1   Cumulative CPU: 2.9 sec   HDFS Read: 482 HDFS Write: 22 SUCCESS
    Total MapReduce CPU Time Spent: 2 seconds 900 msec
    OK
    4   jolly   2014-09-10  CN

    External tables

    An external table is one whose storage is not managed by Hive; it can rely on HBase for storage, for instance, with Hive providing only the mapping. I'll use HBase as the example.
    First create an HBase table named employee:

    hbase(main):005:0> create 'employee','info' 
    0 row(s) in 0.4740 seconds  
       
    => Hbase::Table - employee  
    hbase(main):006:0> put 'employee',1,'info:id',1  
    0 row(s) in 0.2080 seconds  
       
    hbase(main):008:0> scan 'employee' 
    ROW                                      COLUMN+CELL                                                                                                             
     1                                       column=info:id, timestamp=1417591291730, value=1                                                                        
    1 row(s) in 0.0610 seconds  
       
    hbase(main):009:0> put 'employee',1,'info:name','peter' 
    0 row(s) in 0.0220 seconds  
       
    hbase(main):010:0> scan 'employee' 
    ROW                                      COLUMN+CELL                                                                                                             
     1                                       column=info:id, timestamp=1417591291730, value=1                                                                        
     1                                       column=info:name, timestamp=1417591321072, value=peter                                                                  
    1 row(s) in 0.0450 seconds  
       
    hbase(main):011:0> put 'employee',2,'info:id',2  
    0 row(s) in 0.0370 seconds  
       
    hbase(main):012:0> put 'employee',2,'info:name','paul' 
    0 row(s) in 0.0180 seconds  
       
    hbase(main):013:0> scan 'employee' 
    ROW                                      COLUMN+CELL                                                                                                             
     1                                       column=info:id, timestamp=1417591291730, value=1                                                                        
     1                                       column=info:name, timestamp=1417591321072, value=peter                                                                  
     2                                       column=info:id, timestamp=1417591500179, value=2                                                                        
     2                                       column=info:name, timestamp=1417591512075, value=paul                                                                   
    2 row(s) in 0.0440 seconds

    Create an external table to map onto it:

    hive> CREATE EXTERNAL TABLE h_employee(key int, id int, name string)   
        > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
        > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:id,info:name")
        > TBLPROPERTIES ("hbase.table.name" = "employee");  
    OK  
    Time taken: 0.324 seconds  
    hive> select * from h_employee;  
    OK  
    1   1   peter  
    2   2   paul  
    Time taken: 1.129 seconds, Fetched: 2 row(s)
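
    One property of external tables worth knowing: dropping one only removes the Hive-side metadata, and the underlying storage is left alone.

    hive> DROP TABLE h_employee;  -- removes only the mapping; the HBase table 'employee' survives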

    Query syntax

    For the full syntax you can refer to the official manual at https://cwiki.apache.org/confluence/display/Hive/Tutorial ; here I'll only mention a few odd points.

    Limiting the number of rows

    To show x rows, it's still limit, for example:

    hive> select * from h_employee limit 1
        > ;
    OK
    1   1   peter
    Time taken: 0.284 seconds, Fetched: 1 row(s)

    But there is no support for a starting point, i.e. no offset. See the sketch below for a workaround.
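
    If you need offset-style paging, a common workaround (assuming Hive 0.11+ for windowing functions, and a column to order by) is row_number():

    hive> SELECT id, name FROM (
        >   SELECT id, name, row_number() OVER (ORDER BY id) AS rn FROM workers
        > ) t
        > WHERE t.rn BETWEEN 4 AND 6;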

    (Reposted from: http://www.2cto.com/database/201412/359250.html )
