  • Hive tuning

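    I. Conclusions

    Every window function carries its own ORDER BY

    • Conclusion: switch to a single global ORDER BY.

    Subqueries: trigger MR (MapReduce)

    • Conclusion: avoid subqueries where possible.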

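    Eliminate unneeded data as early as possible

    • [Hive] => inner join first (no MR), then where (triggers MR)
    • [MySQL] => where first (to filter rows), then on (where comes first; with several conditions they are applied right to left, so the one that removes the most rows should run first)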

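    Do three inner joins mean three MR jobs?

    • Conclusion: no MR job is launched, so inner join is fine to use.

    group by does not affect MR

    • Conclusion: group by does not hurt performance, so it is fine to use.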

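    in is worse than greater-than/less-than comparisons, because in has to do a full table scan

    • Conclusion: use range comparisons (>, <) instead of in.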
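    As a small illustration of this rewrite, the two forms might look like the sketch below. It reuses the date dimension table from the case study; the literal dates are only examples, and the two queries are equivalent only under the assumption that d_date holds plain calendar dates with no values between the two listed. Whether the range form is actually faster depends on the engine and table layout, so this is not a benchmark.

```sql
-- IN list: every value is enumerated explicitly
select date_sk
from dw_sales_source.dwd_dim_date
where d_date in ('2018-10-19', '2018-10-20');

-- Range predicate covering the same two adjacent dates
-- (assumes d_date contains plain dates only)
select date_sk
from dw_sales_source.dwd_dim_date
where d_date >= '2018-10-19' and d_date <= '2018-10-20';
```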

    from XXX insert XXX

    • Hive-specific syntax: the source (tmp) is loaded once up front and then reused by each insert.
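    A minimal sketch of this multi-insert form follows. The source table comes from the case study; the two target tables (dm_daily_count, dm_customer_amount) are hypothetical and would have to exist already with matching column types.

```sql
-- The FROM clause is written once; each INSERT reads from the same source
from (
  select date_sk, customer_sk, order_sk, order_amount
  from dw_sales_source.dwd_fact_sales_order
) tmp
-- hypothetical target: one row per date with its order count
insert overwrite table dm_daily_count
  select tmp.date_sk, count(tmp.order_sk)
  group by tmp.date_sk
-- hypothetical target: one row per customer with its total amount
insert overwrite table dm_customer_amount
  select tmp.customer_sk, sum(tmp.order_amount)
  group by tmp.customer_sk;
```

    Because both inserts share one FROM clause, the source is read once rather than once per target table.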

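    II. Case studies

    The DM layer from this post: https://www.cnblogs.com/sabertobih/p/13965010.html

    >>>

    Requirement: per day -> customer, product, date, order count, daily amount  && last two days -> order count, two-day amount

    <<<

    Original HQL: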

    select
    d_date,customer_sk,product_sk,
    `order_num`,
    `order_dailyamount`,
    sum(`order_dailyamount`) over(rows between 1 PRECEDING and current row) as recent_amount,
    sum(`order_num`) over(rows between 1 PRECEDING and current row) as recent_num
    from 
    (
    select 
    dss.d_date,
    d.customer_sk,
    d.product_sk,
    count(d.order_sk) as `order_num`,
    sum(d.order_amount) as `order_dailyamount`
    from 
    dw_sales_source.dwd_fact_sales_order d
    inner join dw_sales_source.dwd_dim_date dss 
    on d.date_sk = dss.date_sk
    group by 
    dss.d_date,d.customer_sk,d.product_sk
    order by dss.d_date
    )T

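    Improved:

    • To drop the subquery: sum(order_dailyamount) over() fails, because a select alias cannot be referenced at the same query level, but sum(sum(d.order_amount)) over() works.
    • The window functions repeat the same order by; move it out to one global order by.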
    select 
    dss.d_date,d.customer_sk,d.product_sk,
    count(d.order_sk) as order_num,
    sum(d.order_amount) as order_dailyamount,
    sum(sum(d.order_amount)) over(rows between 1 PRECEDING and current row) as recent_amount,
    sum(count(d.order_sk)) over(rows between 1 PRECEDING and current row) as recent_num
    from 
    dw_sales_source.dwd_fact_sales_order d
    inner join dw_sales_source.dwd_dim_date dss 
    on d.date_sk = dss.date_sk
    group by 
    dss.d_date,d.customer_sk,d.product_sk
    order by dss.d_date
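
    One point worth checking in this rewrite: a global order by sorts the final result after the window functions have been evaluated, so it does not define which row counts as "1 PRECEDING" inside the window. If the intent is "this day plus the previous recorded day for the same customer and product", a sketch that states the ordering inside the window could look like this. The partition by columns are an assumption about the intended grouping, not something the original states.

```sql
select
dss.d_date,d.customer_sk,d.product_sk,
count(d.order_sk) as order_num,
sum(d.order_amount) as order_dailyamount,
-- frame = current day's row plus the previous row of the same customer/product
sum(sum(d.order_amount)) over(partition by d.customer_sk,d.product_sk order by dss.d_date rows between 1 PRECEDING and current row) as recent_amount,
sum(count(d.order_sk)) over(partition by d.customer_sk,d.product_sk order by dss.d_date rows between 1 PRECEDING and current row) as recent_num
from
dw_sales_source.dwd_fact_sales_order d
inner join dw_sales_source.dwd_dim_date dss
on d.date_sk = dss.date_sk
group by
dss.d_date,d.customer_sk,d.product_sk
order by dss.d_date;
```

    Note that a rows frame counts rows, not calendar days: if a customer/product pair has no orders on some day, the "previous row" is the previous day that did have orders.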

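    >>>

    Requirement: 2018-10-20 -> customer, product, date, order count, daily amount  && last two days -> order count, two-day amount

    <<<

    Window function or group by?

    It depends on the requirement!

    • group by => one row per group
    • window function => a continuous, day-by-day series

    PS: for case when, see the post on pivoting rows to columns: https://www.cnblogs.com/sabertobih/p/13589760.html

    -- group by: counts the orders within each group; this kind of requirement suits group by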
    select 
    ddc.customer_sk,ddc.customer_number,ddc.customer_name,ddc.customer_street_address,ddc.custom_zip_code,ddc.customer_city,ddc.customer_state,   
    ddp.product_sk,ddp.product_code,ddp.product_name,ddp.product_category,
    ddd.d_date,ddd.d_month,ddd.d_month_name,ddd.d_quarter,ddd.d_year,
    sum(case when datediff('2018-10-20',ddd.d_date)=0 then 1 else 0 end) current_count,
    sum(case when datediff('2018-10-20',ddd.d_date)<=1 then 1 else 0 end) two_count,
    sum(case when datediff('2018-10-20',ddd.d_date)=0 then dfo.order_amount else 0 end) current_money,
    sum(case when datediff('2018-10-20',ddd.d_date)<=1 then dfo.order_amount else 0 end) two_money
    from dw_sales_source.dwd_fact_sales_order dfo
    inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
    inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
    inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
    where ddd.d_date>='2018-10-19' and ddd.d_date<='2018-10-20'
    group by 
    ddc.customer_sk,ddc.customer_number,ddc.customer_name,ddc.customer_street_address,ddc.custom_zip_code,ddc.customer_city,ddc.customer_state,   
    ddp.product_sk,ddp.product_code,ddp.product_name,ddp.product_category,
    ddd.d_date,ddd.d_month,ddd.d_month_name,ddd.d_quarter,ddd.d_year;
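
    A caveat on the query above: because ddd.d_date is part of the group by, each output row covers a single date, so the "two-day" sums never combine 2018-10-19 and 2018-10-20 in one row. If one row per customer and product is wanted, a reduced sketch (an assumption about the intent; the column list is trimmed and the aliases are illustrative) would leave the date columns out of the grouping:

```sql
select
dfo.customer_sk,
dfo.product_sk,
-- orders and amount on 2018-10-20 only
sum(case when datediff('2018-10-20',ddd.d_date)=0 then 1 else 0 end) as current_count,
sum(case when datediff('2018-10-20',ddd.d_date)=0 then dfo.order_amount else 0 end) as current_money,
-- orders and amount over both days kept by the where clause
count(dfo.order_sk) as two_count,
sum(dfo.order_amount) as two_money
from dw_sales_source.dwd_fact_sales_order dfo
inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
where ddd.d_date>='2018-10-19' and ddd.d_date<='2018-10-20'
group by dfo.customer_sk,dfo.product_sk;
```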
    
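    -- With a window function you still have to filter afterwards! Cumbersome, but very useful for a continuous series, e.g. a 3-day moving average on a stock chart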
    select 
    ddc.customer_sk,ddc.customer_number,ddc.customer_name,ddc.customer_street_address,ddc.custom_zip_code,ddc.customer_city,ddc.customer_state,   
    ddp.product_sk,ddp.product_code,ddp.product_name,ddp.product_category,
    ddd.d_date,ddd.d_month,ddd.d_month_name,ddd.d_quarter,ddd.d_year,
    count(dfo.order_sk) over(partition by dfo.customer_sk,dfo.product_sk,dfo.date_sk order by ddd.d_date rows between 1 PRECEDING and current row) 
    as recent_num
    from dw_sales_source.dwd_fact_sales_order dfo
    inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
    inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
    inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
    where 
    ddd.d_date>='2018-10-19' and ddd.d_date<='2018-10-20'
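
    For the continuous case mentioned in the comment above (a 3-day moving average), a minimal sketch could look like this. The table stock_daily and its columns are hypothetical, and it assumes one row per symbol per trading day.

```sql
-- Hypothetical table: stock_daily(symbol, trade_date, close_price)
select
symbol,
trade_date,
close_price,
-- average of the current row and the two rows before it, per symbol, in date order
avg(close_price) over(partition by symbol order by trade_date rows between 2 PRECEDING and current row) as ma3
from stock_daily;
```

    Every input row yields an output row here, which is why the window form suits day-by-day series while group by suits one summary row per group.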
  • Original article: https://www.cnblogs.com/sabertobih/p/14041854.html