  • Hive tuning

    1. Conclusions

    Every window function has its own order by

    • Conclusion: replace them with a single global order by

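    Subqueries: each one costs an MR job

    • Conclusion: avoid subqueries where possible

    Get rid of unneeded data first

    • [For Hive] => inner join first (no MR), then where (triggers MR)
    • [For MySQL] => where first (filter the data), then on (where comes first; with several conditions, they are evaluated right to left, so eliminate the largest chunk of data first)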

    Do three inner joins launch three MR jobs?

    • Conclusion: no MR job is launched, so inner join is fine to use
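    One plausible reason the joins add no extra MR stage is Hive's map-join conversion: when the dimension tables are small enough, they are loaded into memory on the map side and no shuffle is needed. The sketch below assumes that explanation and uses an illustrative threshold value; running explain on your own query shows how many stages it really produces.

```sql
-- Let Hive convert a common join into a map join when one side is small
set hive.auto.convert.join=true;
-- Size threshold in bytes for the "small" table (illustrative value, tune for your data)
set hive.mapjoin.smalltable.filesize=25000000;

-- Print the plan and check how many stages the three joins produce
explain
select dfo.order_sk, ddd.d_date, ddc.customer_name, ddp.product_name
from dw_sales_source.dwd_fact_sales_order dfo
inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk;
```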

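    group by does not affect MR

    • Conclusion: group by does not hurt performance, so it is fine to use

    in is worse than greater-than / less-than, because in can end up scanning the whole table

    • Conclusion: use range comparisons (> <) instead of in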
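    As a small illustration of that rewrite, here are the two forms side by side on the date dimension used later in this post. They are only equivalent because the listed values are contiguous; a non-contiguous in list cannot be replaced this way.

```sql
-- in list
select ddd.date_sk, ddd.d_date
from dwd_dim_date ddd
where ddd.d_date in ('2018-10-19', '2018-10-20');

-- same rows expressed as a range
select ddd.date_sk, ddd.d_date
from dwd_dim_date ddd
where ddd.d_date >= '2018-10-19' and ddd.d_date <= '2018-10-20';
```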

    from XXX insert XXX

    • Hive-only syntax: the source (tmp) is loaded once up front and shared by the inserts that follow
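    A minimal sketch of this multi-insert form, assuming two hypothetical target tables, dm_daily_amount and dm_customer_amount, whose columns match the select lists. The joined source is written once as tmp and both inserts read from it.

```sql
-- dm_daily_amount and dm_customer_amount are hypothetical target tables
from (
  select dfo.customer_sk, dfo.product_sk, ddd.d_date, dfo.order_amount
  from dw_sales_source.dwd_fact_sales_order dfo
  inner join dw_sales_source.dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
) tmp
-- first insert: amount per customer, product and day
insert overwrite table dm_daily_amount
  select tmp.customer_sk, tmp.product_sk, tmp.d_date, sum(tmp.order_amount)
  group by tmp.customer_sk, tmp.product_sk, tmp.d_date
-- second insert: amount per customer, reading the same source
insert overwrite table dm_customer_amount
  select tmp.customer_sk, sum(tmp.order_amount)
  group by tmp.customer_sk;
```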

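    2. Case study

    The DM layer from this post: https://www.cnblogs.com/sabertobih/p/13965010.html

    >>>

    Requirement: for the current day -> customer, product, date, order count, daily amount && for the last two days -> order count, two-day amount

    <<<

    Original HQL: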

    select
    d_date,customer_sk,product_sk,
    `order_num`,
    `order_dailyamount`,
    sum(`order_dailyamount`) over(rows between 1 PRECEDING and current row) as recent_amount,
    sum(`order_num`) over(rows between 1 PRECEDING and current row) as recent_num
    from 
    (
    select 
    dss.d_date,
    d.customer_sk,
    d.product_sk,
    count(d.order_sk) as `order_num`,
    sum(d.order_amount) as `order_dailyamount`
    from 
    dw_sales_source.dwd_fact_sales_order d
    inner join dw_sales_source.dwd_dim_date dss 
    on d.date_sk = dss.date_sk
    group by 
    dss.d_date,d.customer_sk,d.product_sk
    order by dss.d_date
    )T

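    Improved:

    • To drop the subquery: sum(order_dailyamount) over() is an error (a select alias cannot be referenced within the same select list), but sum(sum(d.order_amount)) over() works
    • The window functions had a repeated order by; move it to the global level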
    select 
    dss.d_date,d.customer_sk,d.product_sk,
    count(d.order_sk) as order_num,
    sum(d.order_amount) as order_dailyamount,
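    -- note: these frames have no order by of their own, and the global order by is applied after the window functions, so the row order inside the frame is not guaranteed
    sum(sum(d.order_amount)) over(rows between 1 PRECEDING and current row) as recent_amount,
    sum(count(d.order_sk)) over(rows between 1 PRECEDING and current row) as recent_num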
    from 
    dw_sales_source.dwd_fact_sales_order d
    inner join dw_sales_source.dwd_dim_date dss 
    on d.date_sk = dss.date_sk
    group by 
    dss.d_date,d.customer_sk,d.product_sk
    order by dss.d_date

     >>>

    Requirement: for 2018-10-20 -> customer, product, date, order count, daily amount && for the last two days -> order count, two-day amount

    <<<

    Window functions or group by?

    It depends on the requirement!

    • group by => one row per group
    • window function => a continuous, day-by-day series
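    The difference in output shape can be seen on the fact table alone. This is only a sketch using columns that appear in the queries of this post: group by collapses each customer to one row, while the window version keeps every order row and attaches the customer total to it.

```sql
-- group by: one output row per customer
select customer_sk, sum(order_amount) as total_amount
from dw_sales_source.dwd_fact_sales_order
group by customer_sk;

-- window function: every order row is kept, each carrying its customer's total
select customer_sk, order_sk, order_amount,
       sum(order_amount) over (partition by customer_sk) as total_amount
from dw_sales_source.dwd_fact_sales_order;
```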

    PS: case when? See the post on pivoting rows to columns: https://www.cnblogs.com/sabertobih/p/13589760.html

    -- group by: the number of orders in each group; this case is a good fit for group by
    select 
    ddc.customer_sk,ddc.customer_number,ddc.customer_name,ddc.customer_street_address,ddc.custom_zip_code,ddc.customer_city,ddc.customer_state,   
    ddp.product_sk,ddp.product_code,ddp.product_name,ddp.product_category,
    ddd.d_date,ddd.d_month,ddd.d_month_name,ddd.d_quarter,ddd.d_year,
    sum(case when datediff('2018-10-20',ddd.d_date)=0 then 1 else 0 end) current_count,
    sum(case when datediff('2018-10-20',ddd.d_date)<=1 then 1 else 0 end) two_count,
    sum(case when datediff('2018-10-20',ddd.d_date)=0 then dfo.order_amount else 0 end) current_money,
    sum(case when datediff('2018-10-20',ddd.d_date)<=1 then dfo.order_amount else 0 end) two_money
    from dw_sales_source.dwd_fact_sales_order dfo
    inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
    inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
    inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
    where ddd.d_date>='2018-10-19' and ddd.d_date<='2018-10-20'
    group by 
    ddc.customer_sk,ddc.customer_number,ddc.customer_name,ddc.customer_street_address,ddc.custom_zip_code,ddc.customer_city,ddc.customer_state,   
    ddp.product_sk,ddp.product_code,ddp.product_name,ddp.product_category,
    ddd.d_date,ddd.d_month,ddd.d_month_name,ddd.d_quarter,ddd.d_year;
    
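    -- With window functions you still have to filter afterwards, which is a hassle; but they are very useful for continuous series, e.g. a 3-day moving average on a stock chart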
    select 
    ddc.customer_sk,ddc.customer_number,ddc.customer_name,ddc.customer_street_address,ddc.custom_zip_code,ddc.customer_city,ddc.customer_state,   
    ddp.product_sk,ddp.product_code,ddp.product_name,ddp.product_category,
    ddd.d_date,ddd.d_month,ddd.d_month_name,ddd.d_quarter,ddd.d_year,
    count(dfo.order_sk) over(partition by dfo.customer_sk,dfo.product_sk,dfo.date_sk order by ddd.d_date rows between 1 PRECEDING and current row) 
    as recent_num
    from dw_sales_source.dwd_fact_sales_order dfo
    inner join dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
    inner join dwd_dim_customer ddc on dfo.customer_sk = ddc.customer_sk
    inner join dwd_dim_product ddp on dfo.product_sk = ddp.product_sk
    where 
    ddd.d_date>='2018-10-19' and ddd.d_date<='2018-10-20'
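    The extra filtering step mentioned in the comment above could look like the sketch below. It is not the original query: it aggregates per customer, product and day first, windows over the two days left by the where clause, and then keeps only the 2018-10-20 row in an outer query. Partitioning by customer and product only (not by date) is an assumption made here so that the frame can reach the previous day.

```sql
select t.customer_sk, t.product_sk, t.d_date,
       t.order_num, t.order_dailyamount, t.recent_num, t.recent_amount
from (
  select
    dfo.customer_sk, dfo.product_sk, ddd.d_date,
    count(dfo.order_sk) as order_num,
    sum(dfo.order_amount) as order_dailyamount,
    -- current day plus the previous row of the same customer and product
    sum(count(dfo.order_sk)) over(partition by dfo.customer_sk, dfo.product_sk
      order by ddd.d_date rows between 1 PRECEDING and current row) as recent_num,
    sum(sum(dfo.order_amount)) over(partition by dfo.customer_sk, dfo.product_sk
      order by ddd.d_date rows between 1 PRECEDING and current row) as recent_amount
  from dw_sales_source.dwd_fact_sales_order dfo
  inner join dw_sales_source.dwd_dim_date ddd on dfo.date_sk = ddd.date_sk
  -- only two days are read, so "1 PRECEDING" can only be 2018-10-19
  where ddd.d_date >= '2018-10-19' and ddd.d_date <= '2018-10-20'
  group by dfo.customer_sk, dfo.product_sk, ddd.d_date
) t
-- the filter the window version still needs: keep the target day only
where t.d_date = '2018-10-20';
```

    Customer and product pairs with no order on 2018-10-20 drop out of this result, which differs from the group by version above.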
  • Original article: https://www.cnblogs.com/sabertobih/p/14041854.html