zoukankan      html  css  js  c++  java
  • Hive 窗口与分析型函数拓展

    1:ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW

    unbounded:无界限
    preceding:从分区第一行头开始,则为 unbounded。 N为:相对当前行向前的偏移量
    following :与preceding相反,到该分区结束,则为 unbounded。N为:相对当前行向后的偏移量
    current row:顾名思义,当前行,偏移量为0;

    2.CUME_DIST函数

    示例:

    CREATE TABLE t_employee (id INT, emp_name VARCHAR(20), dep_name VARCHAR(20),
    salary DECIMAL(7, 2), age DECIMAL(3, 0));
    
    INSERT INTO t_employee VALUES
    ( 1,  'Matthew', 'Management',  4500, 55),
    ( 2,  'Olivia',  'Management',  4400, 61),
    ( 3,  'Grace',   'Management',  4000, 42),
    ( 4,  'Jim',     'Production',  3700, 35),
    ( 5,  'Alice',   'Production',  3500, 24),
    ( 6,  'Michael', 'Production',  3600, 28),
    ( 7,  'Tom',     'Production',  3800, 35),
    ( 8,  'Kevin',   'Production',  4000, 52),
    ( 9,  'Elvis',   'Service',     4100, 40),
    (10,  'Sophia',  'Sales',       4300, 36),
    (11,  'Samantha','Sales',       4100, 38);
    

     

    累积分布:
    我们可以计算整个公司员工薪水的累积分布。如,4000 元的累计分布百分比是 0.55,表示有 55% 的员工薪资低于或等于 4000 元。计算时,我们先统计不同薪资的频数,再用窗口查询做一次累计求和操作:

    方法一:
    
    SELECT
      salary
      ,SUM(cnt) OVER (ORDER BY salary)
      / SUM(cnt) OVER (ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING
                       AND UNBOUNDED FOLLOWING)
    FROM (
      SELECT salary, count(*) AS cnt
      FROM t_employee
      GROUP BY salary
    ) a;
    
    方法二:
    我们还可以使用 Hive 提供的 CUME_DIST() 来完成相同的计算。PERCENT_RANK() 函数则可以百分比的形式展现薪资所在排名。
    
    SELECT
      salary
      ,CUME_DIST() OVER (ORDER BY salary) AS pct_cum
      ,PERCENT_RANK() OVER (ORDER BY salary) AS pct_rank
    FROM t_employee;
    

      结果示例:

     3.lag、lead函数 (统计环比,股票波峰波谷)

    示例一(统计波峰波谷):

    select * from 
    		(select id,time,price, "Peak" as t 
    			from (select id,price,time,t3.r1,t3.r2,t3.m t 
    				from (select id,price,`time`,maked m,
    				lag(price) over (partition by id order by `time`) lu,
    				lead(price) over (partition by id order by `time`) ll,
    			round(price-lag(price) over (partition by id order by `time`),2) r1,
    			round( price-lead(price) over (partition by id order by `time`),2) r2 
    			from t22) t3 where t3.r1>0 and t3.r2>0) t4 
    union all 
    		select id,time,price, "Botton" as t 
    			from (select id,price,time,t3.r1,t3.r2,t3.m t 
    				from (select id,price,`time`,maked m,
    					lag(price) over (partition by id order by `time`) lu,
    					lead(price) over (partition by id order by `time`) ll,
    			round(price-lag(price) over (partition by id order by `time`),2) r1,
    			round( price-lead(price) over (partition by id order by `time`),2) r2 
    			from t22) t3 where t3.r1<0 and t3.r2<0) t5) t6 order by time;
    

      

    示例二(left join统计波峰波谷):

    CREATE TABLE IF NOT EXISTS lp.study_2(
    `id` int commit "主键",
    `code` bigint commit "编号",
    `time` string commit "时间",
    `price` double commit "价格")
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY 'u0001'
    LINES TERMINATED BY '
    '
    STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
    

      

    实现sql:

     select a.id,a.price,
     case when b.price is null then "未知"
          when c.price is null then "未知"
          when a.price>b.price and a.price>c.price then "波峰"
          when a.price<b.price and a.price<c.price then "波谷"
     else "未知" end as mark
     from lp.study_2 a 
     left join 
     lp.study_2 b on a.code=b.code and b.id=a.id-1
     left join
     lp.study_2 c on a.code=c.code and c.id=a.id+1

     示例3(统计环比):

    select ds ,group_base, (num_1-lag(num_1) over (partition by group_base order by ds))/(lag(num_1) over (partition by group_base order by ds)) from 
    (select ds, count (role_uid) as num_1 , group_base
    from  
    dwb.dwb_active_role 
    where gn='${gn}' and ds>='2021-08-08' and ds<='2021-08-09' group by ds,group_base 
    ) a 

    参考:

    Hive分析窗口函数(一) SUM,AVG,MIN,MAX (ROWS BETWEEN UNBOUNDED PRECEDING and CURRENT ROW)

    Hive分析窗口函数(四) LAG,LEAD,FIRST_VALUE,LAST_VALUE

    Hive 窗口与分析型函数

    找出每个id在在⼀天之内所有的波峰与波⾕值

    用hive实现判断股票价格的波峰 波谷 (此方法不用开窗,效率快)

    https://www.jianshu.com/p/75a263d92b0d (hive各版本分析函数差异)

  • 相关阅读:
    为什么 要弄清楚 mysql int(5) int(11) bigint 自建mysql主键id python random 科学计数法
    .m2 epositoryorgspringframeworkspring-beans4.1.4.RELEASEspring-beans-4.1.4.RELEASE.jar!orgspringframeworkeansfactoryxmlspring-beans-4.1.xsd
    IoC COntainer Create Javabeans 可以通过读取beans.xml 文件来创建一个应用程序上下文对象 依赖反转
    Information:java: Errors occurred while compiling module 'spring'
    dynamic programming 动态规划
    collision weaknesses
    求字符串平均长度
    Servlet + JSP 时代
    这个世界上有各种各样的框架,设计这些五花八门框架的初衷到底是什么?我们该不该学习框架,该如何学习使用这些框架?
    开窗函数
  • 原文地址:https://www.cnblogs.com/-courage/p/15120094.html
Copyright © 2011-2022 走看看