zoukankan      html  css  js  c++  java
  • 【Hive】HiveQL实战之分析函数&窗口函数

    本篇主要介绍将存储到Mysql的示例数据库Sakila以Sqoop的方式导入到Hive,然后详细讲解Hive的分析函数和窗口函数。

    一 实战环境

    1 Hive版本
    hive> select version();
    OK
    2.3.3 r8a511e3f79b43d4be41cd231cf5c99e43b248383
    Time taken: 0.944 seconds, Fetched: 1 row(s)
    hive> 
    2 Mysql数据库
    mysql> select database();
    +------------+
    | database() |
    +------------+
    | sakila     |
    +------------+
    1 row in set (0.11 sec)
    注:本实战使用Mysql示例数据库,具体安装可参考:MySQL之示例数据库Sakila下载及安装。

    3 数据准备

    1)方法一:MySQL——>HDFS——>Hive
    --将Sakila下的City导入至HDFS
    [hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_upda
    te" --warehouse-dir /user/sqoop1 -m 1
    --查看已导入的数据
    [hadoop@strong ~]$ hdfs dfs -cat /user/sqoop1/city/part-m-00000
    --在Hive创建city表
    hive> create table city(city_id int, city string,country_id int,last_update string)
        > row format delimited 
        > fields terminated by ',';
    OK
    Time taken: 1.386 seconds
    --加载HDFS中的city文件
    hive> load data inpath '/user/sqoop1/city/' into table city;
    Loading data to table hive.city
    OK
    Time taken: 4.764 seconds
    --查看导入到Hive的数据
    hive> select *from city limit 10;
    OK
    1	A Corua (La Corua)	87	2006-02-15 04:45:25.0
    2	Abha	82	2006-02-15 04:45:25.0
    3	Abu Dhabi	101	2006-02-15 04:45:25.0
    4	Acua	60	2006-02-15 04:45:25.0
    5	Adana	97	2006-02-15 04:45:25.0
    6	Addis Abeba	31	2006-02-15 04:45:25.0
    7	Aden	107	2006-02-15 04:45:25.0
    8	Adoni	44	2006-02-15 04:45:25.0
    9	Ahmadnagar	44	2006-02-15 04:45:25.0
    10	Akishima	50	2006-02-15 04:45:25.0
    Time taken: 10.489 seconds, Fetched: 10 row(s)
    2)方法二:MySQL——>Hive
    --删除HDFS文件
    [hadoop@strong ~]$ hdfs dfs -rm -R /user/hadoop/city
    --删除City表
    hive> drop table city;
    OK
    Time taken: 0.995 seconds
    --将Sakila下的City导入至Hive下的City表
    [hadoop@strong ~]$ sqoop import --connect jdbc:mysql://strong.hadoop.com:3306/sakila --username root --password root --table city --columns "city_id,city,country_id,last_upda
    te" --hive-import --hive-database hive --create-hive-table --fields-terminated-by ',' -m 1
    注:会出现以下错误:Import failed: java.io.IOException: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf
    解决办法:[hadoop@strong ~]$ cp /usr/local/apache-hive-2.3.3-bin/lib/hive-common-2.3.3.jar /usr/local/sqoop-1.4.7/lib/
    --查看导入Hive的数据
    hive> select *from city limit 10;
    OK
    1	A Corua (La Corua)	87	2006-02-15 04:45:25.0
    2	Abha	82	2006-02-15 04:45:25.0
    3	Abu Dhabi	101	2006-02-15 04:45:25.0
    4	Acua	60	2006-02-15 04:45:25.0
    5	Adana	97	2006-02-15 04:45:25.0
    6	Addis Abeba	31	2006-02-15 04:45:25.0
    7	Aden	107	2006-02-15 04:45:25.0
    8	Adoni	44	2006-02-15 04:45:25.0
    9	Ahmadnagar	44	2006-02-15 04:45:25.0
    10	Akishima	50	2006-02-15 04:45:25.0
    Time taken: 0.544 seconds, Fetched: 10 row(s)
    --查看HDFS文件
    [hadoop@strong ~]$ hdfs dfs -ls /user/hadoop/city
    ls: `/user/hadoop/city': No such file or directory
    注:查看Sqoop Import执行的信息18/06/28 12:34:38 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.可看出不会创建HDFS文件,执行过程中HDFS文件只是一个过渡,是个临时文件,完成后即删除。

    二 Hive分析函数实战

    1 样例数据
    hive> select *from city where country_id in (101,102,108);
    OK
    city.city_id	city.city	city.country_id	city.last_update
    3	Abu Dhabi	101	2006-02-15 04:45:25.0
    12	al-Ayn	101	2006-02-15 04:45:25.0
    88	Bradford	102	2006-02-15 04:45:25.0
    149	Dundee	102	2006-02-15 04:45:25.0
    280	Kragujevac	108	2006-02-15 04:45:25.0
    312	London	102	2006-02-15 04:45:25.0
    368	Novi Sad	108	2006-02-15 04:45:25.0
    470	Sharja	101	2006-02-15 04:45:25.0
    494	Southampton	102	2006-02-15 04:45:25.0
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0
    496	Southport	102	2006-02-15 04:45:25.0
    500	Stockport	102	2006-02-15 04:45:25.0
    589	York	102	2006-02-15 04:45:25.0
    Time taken: 0.359 seconds, Fetched: 13 row(s)
    2 Row_Number函数
    hive> select t.*,row_number()over(partition by country_id) rnk from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	rnk
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
    470	Sharja	101	2006-02-15 04:45:25.0	2
    12	al-Ayn	101	2006-02-15 04:45:25.0	3
    589	York	102	2006-02-15 04:45:25.0	1
    500	Stockport	102	2006-02-15 04:45:25.0	2
    496	Southport	102	2006-02-15 04:45:25.0	3
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	4
    494	Southampton	102	2006-02-15 04:45:25.0	5
    312	London	102	2006-02-15 04:45:25.0	6
    149	Dundee	102	2006-02-15 04:45:25.0	7
    88	Bradford	102	2006-02-15 04:45:25.0	8
    280	Kragujevac	108	2006-02-15 04:45:25.0	1
    368	Novi Sad	108	2006-02-15 04:45:25.0	2
    Time taken: 63.74 seconds, Fetched: 13 row(s)
    3 Rank函数
    --加上partition by 
    hive> select t.*,rank()over(partition by country_id order by city_id) rnk from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	rnk
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
    12	al-Ayn	101	2006-02-15 04:45:25.0	2
    470	Sharja	101	2006-02-15 04:45:25.0	3
    88	Bradford	102	2006-02-15 04:45:25.0	1
    149	Dundee	102	2006-02-15 04:45:25.0	2
    312	London	102	2006-02-15 04:45:25.0	3
    494	Southampton	102	2006-02-15 04:45:25.0	4
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	5
    496	Southport	102	2006-02-15 04:45:25.0	6
    500	Stockport	102	2006-02-15 04:45:25.0	7
    589	York	102	2006-02-15 04:45:25.0	8
    280	Kragujevac	108	2006-02-15 04:45:25.0	1
    368	Novi Sad	108	2006-02-15 04:45:25.0	2
    Time taken: 57.047 seconds, Fetched: 13 row(s)
    --不加上partition by 
    hive> select t.*,rank()over(order by city_id) rnk from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	rnk
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
    12	al-Ayn	101	2006-02-15 04:45:25.0	2
    88	Bradford	102	2006-02-15 04:45:25.0	3
    149	Dundee	102	2006-02-15 04:45:25.0	4
    280	Kragujevac	108	2006-02-15 04:45:25.0	5
    312	London	102	2006-02-15 04:45:25.0	6
    368	Novi Sad	108	2006-02-15 04:45:25.0	7
    470	Sharja	101	2006-02-15 04:45:25.0	8
    494	Southampton	102	2006-02-15 04:45:25.0	9
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	10
    496	Southport	102	2006-02-15 04:45:25.0	11
    500	Stockport	102	2006-02-15 04:45:25.0	12
    589	York	102	2006-02-15 04:45:25.0	13
    Time taken: 73.697 seconds, Fetched: 13 row(s)
    --使用Rank排序会出现跳号的情况,即
    hive> select t.*,rank()over(order by country_id) rnk from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	rnk
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
    470	Sharja	101	2006-02-15 04:45:25.0	1
    12	al-Ayn	101	2006-02-15 04:45:25.0	1
    589	York	102	2006-02-15 04:45:25.0	4
    500	Stockport	102	2006-02-15 04:45:25.0	4
    496	Southport	102	2006-02-15 04:45:25.0	4
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	4
    494	Southampton	102	2006-02-15 04:45:25.0	4
    312	London	102	2006-02-15 04:45:25.0	4
    149	Dundee	102	2006-02-15 04:45:25.0	4
    88	Bradford	102	2006-02-15 04:45:25.0	4
    280	Kragujevac	108	2006-02-15 04:45:25.0	12
    368	Novi Sad	108	2006-02-15 04:45:25.0	12
    Time taken: 57.533 seconds, Fetched: 13 row(s)
    注:Rank函数排序会出现跳号。

    4 Dense_rank函数
    hive> select t.*,dense_rank()over(order by country_id) rnk from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	rnk
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	1
    470	Sharja	101	2006-02-15 04:45:25.0	1
    12	al-Ayn	101	2006-02-15 04:45:25.0	1
    589	York	102	2006-02-15 04:45:25.0	2
    500	Stockport	102	2006-02-15 04:45:25.0	2
    496	Southport	102	2006-02-15 04:45:25.0	2
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	2
    494	Southampton	102	2006-02-15 04:45:25.0	2
    312	London	102	2006-02-15 04:45:25.0	2
    149	Dundee	102	2006-02-15 04:45:25.0	2
    88	Bradford	102	2006-02-15 04:45:25.0	2
    280	Kragujevac	108	2006-02-15 04:45:25.0	3
    368	Novi Sad	108	2006-02-15 04:45:25.0	3
    Time taken: 60.528 seconds, Fetched: 13 row(s)
    注:Dense_rank函数排序不会出现跳号。

    5 Lead函数
    hive> select t.*, lead(city_id)over(partition by country_id order by city_id) next_city  from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	next_city
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	12
    12	al-Ayn	101	2006-02-15 04:45:25.0	470
    470	Sharja	101	2006-02-15 04:45:25.0	NULL
    88	Bradford	102	2006-02-15 04:45:25.0	149
    149	Dundee	102	2006-02-15 04:45:25.0	312
    312	London	102	2006-02-15 04:45:25.0	494
    494	Southampton	102	2006-02-15 04:45:25.0	495
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	496
    496	Southport	102	2006-02-15 04:45:25.0	500
    500	Stockport	102	2006-02-15 04:45:25.0	589
    589	York	102	2006-02-15 04:45:25.0	NULL
    280	Kragujevac	108	2006-02-15 04:45:25.0	368
    368	Novi Sad	108	2006-02-15 04:45:25.0	NULL
    Time taken: 59.491 seconds, Fetched: 13 row(s)
    注: lead(city_id)超出边界,将返回NULL值。

    6 LAG函数
    hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city  from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	prior_city
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	NULL
    12	al-Ayn	101	2006-02-15 04:45:25.0	3
    470	Sharja	101	2006-02-15 04:45:25.0	12
    88	Bradford	102	2006-02-15 04:45:25.0	NULL
    149	Dundee	102	2006-02-15 04:45:25.0	88
    312	London	102	2006-02-15 04:45:25.0	149
    494	Southampton	102	2006-02-15 04:45:25.0	312
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	494
    496	Southport	102	2006-02-15 04:45:25.0	495
    500	Stockport	102	2006-02-15 04:45:25.0	496
    589	York	102	2006-02-15 04:45:25.0	500
    280	Kragujevac	108	2006-02-15 04:45:25.0	NULL
    368	Novi Sad	108	2006-02-15 04:45:25.0	280
    Time taken: 56.631 seconds, Fetched: 13 row(s)
    --指定偏移量,默认偏移量为1
    hive> select t.*, lag(city_id,2,'No Data')over(partition by country_id order by city_id) next_city  from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	next_city
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	NULL
    12	al-Ayn	101	2006-02-15 04:45:25.0	NULL
    470	Sharja	101	2006-02-15 04:45:25.0	3
    88	Bradford	102	2006-02-15 04:45:25.0	NULL
    149	Dundee	102	2006-02-15 04:45:25.0	NULL
    312	London	102	2006-02-15 04:45:25.0	88
    494	Southampton	102	2006-02-15 04:45:25.0	149
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	312
    496	Southport	102	2006-02-15 04:45:25.0	494
    500	Stockport	102	2006-02-15 04:45:25.0	495
    589	York	102	2006-02-15 04:45:25.0	496
    280	Kragujevac	108	2006-02-15 04:45:25.0	NULL
    368	Novi Sad	108	2006-02-15 04:45:25.0	NULL
    Time taken: 66.487 seconds, Fetched: 13 row(s)
    --返回当前值的前一个和后一个
    hive> select t.*, lag(city_id)over(partition by country_id order by city_id) prior_city,lead(city_id)over(partition by country_id order by city_id) next_city  from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	prior_city	next_city
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	NULL	12
    12	al-Ayn	101	2006-02-15 04:45:25.0	3	470
    470	Sharja	101	2006-02-15 04:45:25.0	12	NULL
    88	Bradford	102	2006-02-15 04:45:25.0	NULL	149
    149	Dundee	102	2006-02-15 04:45:25.0	88	312
    312	London	102	2006-02-15 04:45:25.0	149	494
    494	Southampton	102	2006-02-15 04:45:25.0	312	495
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	494	496
    496	Southport	102	2006-02-15 04:45:25.0	495	500
    500	Stockport	102	2006-02-15 04:45:25.0	496	589
    589	York	102	2006-02-15 04:45:25.0	500	NULL
    280	Kragujevac	108	2006-02-15 04:45:25.0	NULL	368
    368	Novi Sad	108	2006-02-15 04:45:25.0	280	NULL
    Time taken: 52.446 seconds, Fetched: 13 row(s)
    7 First_Value函数
    hive> select t.*, first_value(city_id)over(partition by country_id order by city_id) first_city  from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	first_city
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	3
    12	al-Ayn	101	2006-02-15 04:45:25.0	3
    470	Sharja	101	2006-02-15 04:45:25.0	3
    88	Bradford	102	2006-02-15 04:45:25.0	88
    149	Dundee	102	2006-02-15 04:45:25.0	88
    312	London	102	2006-02-15 04:45:25.0	88
    494	Southampton	102	2006-02-15 04:45:25.0	88
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	88
    496	Southport	102	2006-02-15 04:45:25.0	88
    500	Stockport	102	2006-02-15 04:45:25.0	88
    589	York	102	2006-02-15 04:45:25.0	88
    280	Kragujevac	108	2006-02-15 04:45:25.0	280
    368	Novi Sad	108	2006-02-15 04:45:25.0	280
    Time taken: 78.285 seconds, Fetched: 13 row(s)
    8 Last_value函数
    hive> select t.*, last_value(city_id)over(partition by country_id order by city_id rows between unbounded preceding and unbounded following) last_city  from city t where country_id in(101,102,108);
    t.city_id	t.city	t.country_id	t.last_update	last_city
    3	Abu Dhabi	101	2006-02-15 04:45:25.0	470
    12	al-Ayn	101	2006-02-15 04:45:25.0	470
    470	Sharja	101	2006-02-15 04:45:25.0	470
    88	Bradford	102	2006-02-15 04:45:25.0	589
    149	Dundee	102	2006-02-15 04:45:25.0	589
    312	London	102	2006-02-15 04:45:25.0	589
    494	Southampton	102	2006-02-15 04:45:25.0	589
    495	Southend-on-Sea	102	2006-02-15 04:45:25.0	589
    496	Southport	102	2006-02-15 04:45:25.0	589
    500	Stockport	102	2006-02-15 04:45:25.0	589
    589	York	102	2006-02-15 04:45:25.0	589
    280	Kragujevac	108	2006-02-15 04:45:25.0	368
    368	Novi Sad	108	2006-02-15 04:45:25.0	368
    Time taken: 49.807 seconds, Fetched: 13 row(s)
    注:在使用first_value和last_value时需加上窗口rows between unbounded preceding and unbounded following,否则会出现预想不到的结果!!!因为默认是RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW。

    9 Count函数
    hive> select country_id,count(1) cqty from city where country_id in(101,102,108) group by country_id;
    country_id	cqty
    101	3
    102	8
    108	2
    Time taken: 56.987 seconds, Fetched: 3 row(s)
    10 Sum函数
    hive> select country_id,sum(city_id) csum from city where country_id in(101,102,108) group by country_id;
    country_id	csum
    101	485
    102	3123
    108	648
    Time taken: 57.993 seconds, Fetched: 3 row(s)
    11 Avg函数
    hive> select country_id,avg(city_id) cavg from city where country_id in(101,102,108) group by country_id;
    country_id	cavg
    101	161.66666666666666
    102	390.375
    108	324.0
    Time taken: 54.42 seconds, Fetched: 3 row(s)
    12 Max函数
    hive> select country_id,max(city_id) cmax from city where country_id in(101,102,108) group by country_id;
    country_id	cmax
    101	470
    102	589
    108	368
    Time taken: 58.02 seconds, Fetched: 3 row(s)
    13 Min函数
    hive> select country_id,min(city_id) cmax from city where country_id in(101,102,108) group by country_id;
    country_id	cmax
    101	3
    102	88
    108	280
    Time taken: 57.602 seconds, Fetched: 3 row(s)
    注:以上5个函数都可以在后面添加Over和窗口。
    窗口有如下规范:
    (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING



  • 相关阅读:
    ASP.NET MVC学习笔记-----ActionInvoker
    quartz启动报错
    THUSC 2021 游记
    C++下随机数的生成
    友链
    memset一些技巧
    CodeForces Round #705 总结&题解
    php计算两坐标距离
    vue中使用keepAlive组件缓存遇到的坑
    vue 中 keepAlive
  • 原文地址:https://www.cnblogs.com/alen-liu-sz/p/12975629.html
Copyright © 2011-2022 走看看