zoukankan      html  css  js  c++  java
  • 使用hql-统计连续登陆的三天及以上的用户

    @

    这个问题可以扩展到很多相似的问题:连续几个月充值会员、连续天数有商品卖出、连续打车、连续逾期……

    数据提供

     用户ID、登入日期
     user01,2018-02-28
     user01,2018-03-01
     user01,2018-03-02
     user01,2018-03-04
     user01,2018-03-05
     user01,2018-03-06
     user01,2018-03-07
     user02,2018-03-01
     user02,2018-03-02
     user02,2018-03-03
     user02,2018-03-06
    

    输出字段

    +---------+--------+-------------+-------------+--+
    |   uid   | times  | start_date  |  end_date   |
    +---------+--------+-------------+-------------+--+
    

    解法一

    先对每个用户的登录日期排序,然后拿第n行的日期,减第n-2行的日期,如果等于2,就说明连续三天登录了。

    解法二

    开窗,窗囗内部排序然后做差

    rownumber() oover
    

    建表

    create table wedw_dw.t_login_info(
     user_id string  COMMENT '用户ID'
    ,login_date date COMMENT '登录日期'
    )
    row format delimited fields terminated by ',';
    

    导数据

    hdfs dfs -put /test/login.txt /data/hive/test/wedw/dw/t_login_info/
    

    验证数据

    select * from wedw_dw.t_login_info;
    +----------+-------------+--+
    | user_id  | login_date  |
    +----------+-------------+--+
    | user01   | 2018-02-28  |
    | user01   | 2018-03-01  |
    | user01   | 2018-03-02  |
    | user01   | 2018-03-04  |
    | user01   | 2018-03-05  |
    | user01   | 2018-03-06  |
    | user01   | 2018-03-07  |
    | user02   | 2018-03-01  |
    | user02   | 2018-03-02  |
    | user02   | 2018-03-03  |
    | user02   | 2018-03-06  |
    +----------+-------------+--+
    

    解决方案-使用解法二

    select
     t2.user_id         as user_id,
     count(1)           as times,
     min(t2.login_date) as start_date,
     max(t2.login_date) as end_date
    from
    (
        select
         t1.user_id,
         t1.login_date,
         date_sub(t1.login_date,rn) as date_diff
        from
        (
            select
             user_id,
             login_date,
             row_number() over(partition by user_id order by login_date asc) as rn 
            from
            wedw_dw.t_login_info
        ) t1
    ) t2
    group by 
     t2.user_id, t2.date_diff
    having times >= 3;
    

    结果

    +----------+--------+-------------+-------------+--+
    | user_id  | times  | start_date  |  end_date   |
    +----------+--------+-------------+-------------+--+
    | user01   | 3      | 2018-02-28   | 2018-03-02  |
    | user01   | 4      | 2018-03-04  | 2018-03-07   |
    | user02   | 3      | 2018-03-01   | 2018-03-03  |
    +----------+--------+-------------+-------------+--+
    

    思路

    1. 先把数据按照用户id分组,根据登录日期排序
    select
    	user_id
    	,login_date
    	,row_number() over(partition by user_id order by login_date asc) as rn 
    	from
    	wedw_dw.t_login_info
    
    +----------+-------------+-----+--+
    | user_id  | login_date  | rn  |
    +----------+-------------+-----+--+
    | user01   | 2018-02-28  | 1   |
    | user01   | 2018-03-01  | 2   |
    | user01   | 2018-03-02  | 3   |
    | user01   | 2018-03-04  | 4   |
    | user01   | 2018-03-05  | 5   |
    | user01   | 2018-03-06  | 6   |
    | user01   | 2018-03-07  | 7   |
    | user02   | 2018-03-01  | 1   |
    | user02   | 2018-03-02  | 2   |
    | user02   | 2018-03-03  | 3   |
    | user02   | 2018-03-06  | 4   |
    +----------+-------------+-----+--+
    
    1. 用登录日期减去排序数字rn,得到的差值日期如果是相等的,则说明这两天肯定是连续的
    select
         t1.user_id
        ,t1.login_date
        ,date_sub(t1.login_date,rn) as date_diff
        from
        (
            select
             user_id
            ,login_date
            ,row_number() over(partition by user_id order by login_date asc) as rn 
            from
            wedw_dw.t_login_info
        ) t1
        ;
    
    
    +----------+-------------+-------------+--+
    | user_id  | login_date  |  date_diff  |
    +----------+-------------+-------------+--+
    | user01   | 2018-02-28  | 2018-02-27  |
    | user01   | 2018-03-01  | 2018-02-27  |
    | user01   | 2018-03-02  | 2018-02-27  |
    | user01   | 2018-03-04  | 2018-02-28  |
    | user01   | 2018-03-05  | 2018-02-28  |
    | user01   | 2018-03-06  | 2018-02-28  |
    | user01   | 2018-03-07  | 2018-02-28  |
    | user02   | 2018-03-01  | 2018-02-28  |
    | user02   | 2018-03-02  | 2018-02-28  |
    | user02   | 2018-03-03  | 2018-02-28  |
    | user02   | 2018-03-06  | 2018-03-02  |
    +----------+-------------+-------------+--+
    
    1. 根据user_id和日期差date_diff 分组,最小登录日期即为此次连续登录的开始日期start_date,最大登录日期即为结束日期end_date,登录次数即为分组后的count(1)
    select
     t2.user_id         as user_id
    ,count(1)           as times
    ,min(t2.login_date) as start_date
    ,max(t2.login_date) as end_date
    from
    (
        select
         t1.user_id
        ,t1.login_date
        ,date_sub(t1.login_date,rn) as date_diff
        from
        (
            select
             user_id
            ,login_date
            ,row_number() over(partition by user_id order by login_date asc) as rn 
            from
            wedw_dw.t_login_info
        ) t1
    ) t2
    group by 
     t2.user_id
    ,t2.date_diff
    having times >= 3
    ;
    
    +----------+--------+-------------+-------------+--+
    | user_id  | times  | start_date  |  end_date   |
    +----------+--------+-------------+-------------+--+
    | user01   | 3      | 2018-02-28   | 2018-03-02  |
    | user01    | 4      | 2018-03-04  | 2018-03-07  |
    | user02   | 3      | 2018-03-01   | 2018-03-03  |
    +----------+--------+-------------+-------------+--+
    
    如果真的不知道做什么 那就做好眼前的事情吧 你所希望的事情都会慢慢实现...
  • 相关阅读:
    几种类间关系:继承、实现、依赖、关联、聚合、组合及UML变现图(转)
    StarUML指导手册(转)
    hook(转)
    DancingLinks简介(转)
    (转)用宏获取函数名
    UML中关联和依赖的区别
    “智商平平”学软件(转)
    生产者消费者问题Semaphore
    [转] UML中关联、依赖、聚集等关系的异同
    最近要看的东西
  • 原文地址:https://www.cnblogs.com/sunbr/p/14423962.html
Copyright © 2011-2022 走看看