zoukankan      html  css  js  c++  java
  • Hive实战(1):Hive 函数(3) HiveSQL 高阶函数合集实战(一)窗口函数

    来源:https://mp.weixin.qq.com/s/PLWovsMDxO0wUrDTMaOh4w

    目录

    数据准备

      数据集

      建表语句

    窗口函数

      row_number:使用频率 ★★★★★

      rank :使用频率 ★★★★

      dense_rank:使用频率 ★★★★

      rank/dense_rank/row_number对比

      first_value:使用频率 ★★★

      last_value:使用频率 ★

      lead:使用频率 ★★

      lag:使用频率 ★★

    集合相关

      collect_set:使用频率 ★★★★★

      collect_list:使用频率 ★★★★★

      sort_array:使用频率 ★★★

      URL相关parse_url:使用频率 ★★★★

      reflect:使用频率 ★★

    JSON相关

      get_json_object:使用频率 ★★★★★

    列转行相关

      explode:使用频率 ★★★★★

    Cube相关

      GROUPING SETS:使用频率 ★

    字符相关

      concat:使用频率 ★★★★★

      concat_ws:使用频率 ★★★★★

      instr:使用频率 ★★★★

       length:使用频率 ★★★★★

      size:使用频率 ★★★★★

      trim:使用频率 ★★★★★

      regexp_replace:使用频率 ★★★★★

      regexp_extract:使用频率 ★★★★

      substring_index:使用频率 ★★

      条件判断if:使用频率 ★★★★★

      case when :使用频率 ★★★★★

      coalesce:使用频率 ★★★★★

    数值相关

      round:使用频率 ★★

      ceil:使用频率 ★★★

      floor:使用频率 ★★★

      hex:使用频率 ★

    时间相关(比较简单)

      from_unxitime:使用频率 ★★★★★

      unix_timestamp:使用频率 ★★★★★

      to_date:使用频率 ★★★★★

      year:使用频率 ★★★★★

      month:使用频率 ★★★★★

      day:使用频率 ★★★★★

      date_add:使用频率 ★★★★★

      date_sub:使用频率 ★★★★★

    数据准备

    数据集

     user1,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,10,2020-09-12 02:20:02,2020-09-12
     user1,https://blog.csdn.net/qq_28680977/article/details/108298276?k1=v1&k2=v2#Ref1,2,2020-09-11 11:20:12,2020-09-11
     user1,https://blog.csdn.net/qq_28680977/article/details/108295053?k1=v1&k2=v2#Ref1,4,2020-09-10 08:19:22,2020-09-10
     user1,https://blog.csdn.net/qq_28680977/article/details/108460523?k1=v1&k2=v2#Ref1,5,2020-08-12 19:20:22,2020-08-12
     user2,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,29,2020-04-04 12:23:22,2020-04-04
     user2,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,30,2020-05-15 12:34:23,2020-05-15
     user2,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,30,2020-05-15 13:34:23,2020-05-15
     user2,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,19,2020-05-16 19:03:32,2020-05-16
     user2,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,10,2020-05-17 06:20:22,2020-05-17
     user3,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,43,2020-04-12 08:02:22,2020-04-12
     user3,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,5,2020-08-02 08:10:22,2020-08-02
     user3,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,6,2020-08-02 10:10:22,2020-08-02
     user3,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,50,2020-08-12 12:23:22,2020-08-12
     user4,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,10,2020-04-12 11:20:22,2020-04-12
     user4,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,30,2020-03-12 10:20:22,2020-03-12
     user4,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,20,2020-02-12 20:26:43,2020-02-12
     user2,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,10,2020-04-12 19:12:36,2020-04-12
     user2,https://blog.csdn.net/qq_28680977/article/details/108161655?k1=v1&k2=v2#Ref1,40,2020-05-12 18:24:31,2020-05-12

    建表语句

    create table wedw_tmp.tmp_url_info(
      user_id string comment "用户id",
      visit_url string comment "访问url",
      visit_cnt int comment "浏览次数/pv",
     visit_time timestamp comment "浏览时间",
      visit_date string comment "浏览日期"
     )
     row format delimited
     fields terminated by ','
    stored as textfile;

    窗口函数

    row_number:使用频率 ★★★★★

    row_number函数通常用于分组统计组内的排名,然后进行后续的逻辑处理。

    注意:当遇到相同排名的时候,不会生成同样的序号,且中间不会空位

    该函数经常被大厂问到,可以参考

     1 -- 统计每个用户每天最近一次访问记录
     2 select 
     3  user_id,
     4  visit_time,
     5  visit_cnt
     6 from 
     7 (
     8  select
     9    *,
    10   row_number() over(partition by user_id,visit_date order by visit_time desc) as rank
    11  from wedw_tmp.tmp_url_info
    12 )t
    13 where rank=1
    14 order by user_id,visit_time
    15+----------+------------------------+------------+--+
    16| user_id  |       visit_time       | visit_cnt  |
    17+----------+------------------------+------------+--+
    18| user1    | 2020-08-12 19:20:22.0  | 5          |
    19| user1    | 2020-09-10 08:19:22.0  | 4          |
    20| user1    | 2020-09-11 11:20:12.0  | 2          |
    21| user1    | 2020-09-12 02:20:02.0  | 10         |
    22| user2    | 2020-04-04 12:23:22.0  | 29         |
    23| user2    | 2020-04-12 19:12:36.0  | 10         |
    24| user2    | 2020-05-12 18:24:31.0  | 40         |
    25| user2    | 2020-05-15 13:34:23.0  | 30         |  --该用户同一天访问了多次,但只取了最新一次访问记录
    26| user2    | 2020-05-16 19:03:32.0  | 19         |
    27| user2    | 2020-05-17 06:20:22.0  | 10         |
    28| user3    | 2020-04-12 08:02:22.0  | 43         |
    29| user3    | 2020-08-02 10:10:22.0  | 6          |
    30| user3    | 2020-08-12 12:23:22.0  | 50         |
    31| user4    | 2020-02-12 20:26:43.0  | 20         |
    32| user4    | 2020-03-12 10:20:22.0  | 30         |
    33| user4    | 2020-04-12 11:20:22.0  | 10         |
    34+----------+------------------------+------------+--+

    rank :使用频率 ★★★★

    和row_number功能一样,都是分组内统计排名,但是当出现同样排名的时候,中间会出现空位。这里给一个例子就可以很容易理解了

     1 select 
     2  user_id,
     3  visit_time,
     4  visit_date,
     5  rank() over(partition by user_id order by visit_date desc) as rank --每个用户按照访问时间倒排,通常用于统计用户最近一天的访问记录
     6 from wedw_tmp.tmp_url_info
     7 order by user_id,rank
     8+----------+------------------------+-------------+-------+--+
     9| user_id  |       visit_time       | visit_date  | rank  |
    10+----------+------------------------+-------------+-------+--+
    11| user1    | 2020-09-12 02:20:02.0  | 2020-09-12  | 1     |
    12| user1    | 2020-09-12 02:20:02.0  | 2020-09-12  | 1     | --同一天访问了两次,9月11号访问排名第三
    13| user1    | 2020-09-11 11:20:12.0  | 2020-09-11  | 3     |
    14| user1    | 2020-09-10 08:19:22.0  | 2020-09-10  | 4     |
    15| user1    | 2020-08-12 19:20:22.0  | 2020-08-12  | 5     |
    16| user2    | 2020-05-17 06:20:22.0  | 2020-05-17  | 1     |
    17| user2    | 2020-05-16 19:03:32.0  | 2020-05-16  | 2     |
    18| user2    | 2020-05-15 12:34:23.0  | 2020-05-15  | 3     |
    19| user2    | 2020-05-15 13:34:23.0  | 2020-05-15  | 3     |
    20| user2    | 2020-05-12 18:24:31.0  | 2020-05-12  | 5     |
    21| user2    | 2020-04-12 19:12:36.0  | 2020-04-12  | 6     |
    22| user2    | 2020-04-04 12:23:22.0  | 2020-04-04  | 7     |
    23| user3    | 2020-08-12 12:23:22.0  | 2020-08-12  | 1     |
    24| user3    | 2020-08-02 08:10:22.0  | 2020-08-02  | 2     |
    25| user3    | 2020-08-02 10:10:22.0  | 2020-08-02  | 2     |
    26| user3    | 2020-04-12 08:02:22.0  | 2020-04-12  | 4     |
    27| user4    | 2020-04-12 11:20:22.0  | 2020-04-12  | 1     |
    28| user4    | 2020-03-12 10:20:22.0  | 2020-03-12  | 2     |
    29| user4    | 2020-02-12 20:26:43.0  | 2020-02-12  | 3     |
    30+----------+------------------------+-------------+-------+--+

    dense_rank:使用频率 ★★★★

    和row_number以及rank功能一样,都是分组排名,但是该排名如果出现同次序的话,中间不会留下空位

     1 --还是以rank的sql为例子
     2 select 
     3  user_id,
     4  visit_time,
     5  visit_date,
     6  dense_rank() over(partition by user_id order by visit_date desc) as rank 
     7 from wedw_tmp.tmp_url_info
     8 order by user_id,rank
     9+----------+------------------------+-------------+-------+--+
    10| user_id  |       visit_time       | visit_date  | rank  |
    11+----------+------------------------+-------------+-------+--+
    12| user1    | 2020-09-12 02:20:02.0  | 2020-09-12  | 1     |
    13| user1    | 2020-09-12 02:20:02.0  | 2020-09-12  | 1     |
    14| user1    | 2020-09-11 11:20:12.0  | 2020-09-11  | 2     |--中间不会留下空缺
    15| user1    | 2020-09-10 08:19:22.0  | 2020-09-10  | 3     | 
    16| user1    | 2020-08-12 19:20:22.0  | 2020-08-12  | 4     |
    17| user2    | 2020-05-17 06:20:22.0  | 2020-05-17  | 1     |
    18| user2    | 2020-05-16 19:03:32.0  | 2020-05-16  | 2     |
    19| user2    | 2020-05-15 12:34:23.0  | 2020-05-15  | 3     |
    20| user2    | 2020-05-15 13:34:23.0  | 2020-05-15  | 3     |
    21| user2    | 2020-05-12 18:24:31.0  | 2020-05-12  | 4     |
    22| user2    | 2020-04-12 19:12:36.0  | 2020-04-12  | 5     |
    23| user2    | 2020-04-04 12:23:22.0  | 2020-04-04  | 6     |
    24| user3    | 2020-08-12 12:23:22.0  | 2020-08-12  | 1     |
    25| user3    | 2020-08-02 08:10:22.0  | 2020-08-02  | 2     |
    26| user3    | 2020-08-02 10:10:22.0  | 2020-08-02  | 2     |
    27| user3    | 2020-04-12 08:02:22.0  | 2020-04-12  | 3     |
    28| user4    | 2020-04-12 11:20:22.0  | 2020-04-12  | 1     |
    29| user4    | 2020-03-12 10:20:22.0  | 2020-03-12  | 2     |
    30| user4    | 2020-02-12 20:26:43.0  | 2020-02-12  | 3     |
    31+----------+------------------------+-------------+-------+--+

    rank/dense_rank/row_number对比

    相同点:都是分组排序

    不同点:

    1. Row_number:即便出现相同的排序,排名也不会一致,只会进行累加;即排序次序连续,但不会出现同一排名

    2. rank:当出现相同的排序时,中间会出现一个空缺,即分组内会出现同一个排名,但是排名次序是不连续的

    3. Dense_rank:当出现相同排序时,中间不会出现空缺,即分组内可能会出现同样的次序,且排序名次是连续的

    first_value:使用频率 ★★★

    按照分组排序取截止到当前行的第一个值;通常用于取最新记录或者最早的记录(根据排序字段进行变通即可)

    1 --仍然使用row_number的例子;方便读者理解
    2 select
    3 user_id,
    4 visit_time,
    5 visit_cnt,
    6 first_value(visit_time) over(partition by user_id order by visit_date desc) as first_value_time,
    7 row_number() over(partition by user_id order by visit_date desc) as rank
    8 from  wedw_tmp.tmp_url_info
    9 order by user_id,rank

    last_value:使用频率 ★

    按照分组排序取当前行的最后一个值;这个函数好像没啥卵用

    1 --仍然使用row_number的例子;方便读者理解
    2 select
    3 user_id,
    4 visit_time,
    5 visit_cnt,
    6 last_value(visit_time) over(partition by user_id order by visit_date desc) as first_value_time,
    7 row_number() over(partition by user_id order by visit_date desc) as rank
    8 from  wedw_tmp.tmp_url_info
    9 order by user_id,rank

    lead:使用频率 ★★

    LEAD(col,n,DEFAULT)用于取窗口内往下第n行值;通常用于行值填充;或者和指定行进行差值比较

    第一个参数为列名

    第二个参数为往下第n行(可选),

    第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

     1 select
     2 user_id,
     3 visit_time,
     4 visit_cnt,
     5 row_number() over(partition by user_id order by visit_date desc) as rank,
     6 lead(visit_time,1,'1700-01-01') over(partition by user_id order by visit_date desc) as lead_time
     7 from  wedw_tmp.tmp_url_info
     8 order by user_id
     9+----------+------------------------+------------+-------+------------------------+--+
    10| user_id  |       visit_time       | visit_cnt  | rank  |       lead_time        |
    11+----------+------------------------+------------+-------+------------------------+--+
    12| user1    | 2020-09-12 02:20:02.0  | 10         | 1     | 2020-09-12 02:20:02.0  | --取下一行的值作为当前值
    13| user1    | 2020-09-12 02:20:02.0  | 10         | 2     | 2020-09-11 11:20:12.0  |
    14| user1    | 2020-09-11 11:20:12.0  | 2          | 3     | 2020-09-10 08:19:22.0  |
    15| user1    | 2020-09-10 08:19:22.0  | 4          | 4     | 2020-08-12 19:20:22.0  |
    16| user1    | 2020-08-12 19:20:22.0  | 5          | 5     | 1700-01-01 00:00:00.0  | --这里是最后一条记录,则取默认值
    17| user2    | 2020-05-17 06:20:22.0  | 10         | 1     | 2020-05-16 19:03:32.0  |
    18| user2    | 2020-05-16 19:03:32.0  | 19         | 2     | 2020-05-15 12:34:23.0  |
    19| user2    | 2020-05-15 12:34:23.0  | 30         | 3     | 2020-05-15 13:34:23.0  |
    20| user2    | 2020-05-15 13:34:23.0  | 30         | 4     | 2020-05-12 18:24:31.0  |
    21| user2    | 2020-05-12 18:24:31.0  | 40         | 5     | 2020-04-12 19:12:36.0  |
    22| user2    | 2020-04-12 19:12:36.0  | 10         | 6     | 2020-04-04 12:23:22.0  |
    23| user2    | 2020-04-04 12:23:22.0  | 29         | 7     | 1700-01-01 00:00:00.0  |
    24| user3    | 2020-08-12 12:23:22.0  | 50         | 1     | 2020-08-02 08:10:22.0  |
    25| user3    | 2020-08-02 08:10:22.0  | 5          | 2     | 2020-08-02 10:10:22.0  |
    26| user3    | 2020-08-02 10:10:22.0  | 6          | 3     | 2020-04-12 08:02:22.0  |
    27| user3    | 2020-04-12 08:02:22.0  | 43         | 4     | 1700-01-01 00:00:00.0  |
    28| user4    | 2020-04-12 11:20:22.0  | 10         | 1     | 2020-03-12 10:20:22.0  |
    29| user4    | 2020-03-12 10:20:22.0  | 30         | 2     | 2020-02-12 20:26:43.0  |
    30| user4    | 2020-02-12 20:26:43.0  | 20         | 3     | 1700-01-01 00:00:00.0  |
    31+----------+------------------------+------------+-------+------------------------+--+

    lag:使用频率 ★★

    和lead功能一样,但是是取上n行的值作为当前行值

    1 select
    2 user_id,
    3 visit_time,
    4 visit_cnt,
    5 row_number() over(partition by user_id order by visit_date desc) as rank,
    6 lag(visit_time,1,'1700-01-01') over(partition by user_id order by visit_date desc) as lead_time
    7 from  wedw_tmp.tmp_url_info
    8 order by user_id

  • 相关阅读:
    linux Centos防火墙工具iptables的使用
    SpringBoot中的注解分析
    context:property-placeholder
    HTTPS(SSL)证书下载及配置
    Dubbo之多注册中心以及zookepeer服务的查看
    重要事情老是忘?别急~看这里
    重要事情老是忘?别急~看这里
    多态
    抽象类,接口_05
    常用类
  • 原文地址:https://www.cnblogs.com/qiu-hua/p/14224342.html
Copyright © 2011-2022 走看看