zoukankan      html  css  js  c++  java
  • hive SQL优化之distribute by和sort by

    近期在优化hiveSQL。

    以下是一段排序,分组后取每组第一行记录的SQL

    1. INSERT OVERWRITE TABLE t_wa_funnel_distinct_temp PARTITION (pt='${SRCTIME}'
    2. SELECT 
    3.     bussiness_id, 
    4.     cookie_id, 
    5.     session_id, 
    6.     funnel_id, 
    7.     group_first(funnel_name) funnel_name, 
    8.     step_id, 
    9.     group_first(step_name) step_name, 
    10.     group_first(log_type) log_type, 
    11.     group_first(url_pattern) url_pattern, 
    12.     group_first(url) url, 
    13.     group_first(refer) refer, 
    14.     group_first(log_time) log_time, 
    15.     group_first(is_new_visitor) is_new_visitor, 
    16.     group_first(is_mobile_traffic) is_mobile_traffic, 
    17.     group_first(is_bounce) is_bounce, 
    18.     group_first(campaign_name) campaign_name, 
    19.     group_first(group_name) group_name, 
    20.     group_first(slot_name) slot_name, 
    21.     group_first(source_type) source_type, 
    22.     group_first(next_page) next_page, 
    23.     group_first(continent) continent, 
    24.     group_first(sub_continent_region) sub_continent_region, 
    25.     group_first(country) country, 
    26.     group_first(region) region, 
    27.     group_first(city) city, 
    28.     group_first(language) language, 
    29.     group_first(browser) browser, 
    30.     group_first(os) os, 
    31.     group_first(screen_color) screen_color, 
    32.     group_first(screen_resolution) screen_resolution, 
    33.     group_first(flash_version) flash_version, 
    34.     group_first(java) java, 
    35.     group_first(host) host 
    36. FROM 
    37.   SELECT 
    38.     FROM r_wa_funnel  
    39.     WHERE pt='${SRCTIME}' 
    40.     ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC 
    41. t1 
    42. GROUP BY pt, bussiness_id, cookie_id, session_id, funnel_id, step_id; 

    group_first: 自己定义函数。用户取每组第一个字段
    ${SRCTIME}: 由外部oozie调度传入, 作为时间分区,精确到小时.eg: 2011.11.01.21

    以下在hive上以
    SRCTIME = 2011.11.01.21 运行以上SQL. 2011.11.01.21小时分区记录数有10435486

    运行时间:

    从上面能够看出,reduce阶段仅仅有一个reduce, 这是由于ORDER BY是全局排序,hive仅仅能通过一个reduce进行排序
    从业务需求来看, 仅仅要按
    bussiness_id, cookie_id, session_id, funnel_id, step_id分组,组内按
    log_time升序排序就可以.

    OK, 这样能够採用hive提供的distribute by 和 sort by,这样能够充分利用hadoop资源, 在多个
    reduce中局部按log_time 排序

    优化有的hive代码:

    1. INSERT OVERWRITE TABLE t_wa_funnel_distinct PARTITION (pt='2011.11.01.21'
    2. SELECT 
    3.     bussiness_id, 
    4.     cookie_id, 
    5.     session_id, 
    6.     funnel_id, 
    7.     group_first(funnel_name) funnel_name, 
    8.     step_id, 
    9.     group_first(step_name) step_name, 
    10.     group_first(log_type) log_type, 
    11.     group_first(url_pattern) url_pattern, 
    12.     group_first(url) url, 
    13.     group_first(refer) refer, 
    14.     group_first(log_time) log_time, 
    15.     group_first(is_new_visitor) is_new_visitor, 
    16.     group_first(is_mobile_traffic) is_mobile_traffic, 
    17.     group_first(is_bounce) is_bounce, 
    18.     group_first(campaign_name) campaign_name, 
    19.     group_first(group_name) group_name, 
    20.     group_first(slot_name) slot_name, 
    21.     group_first(source_type) source_type, 
    22.     group_first(next_page) next_page, 
    23.     group_first(continent) continent, 
    24.     group_first(sub_continent_region) sub_continent_region, 
    25.     group_first(country) country, 
    26.     group_first(region) region, 
    27.     group_first(city) city, 
    28.     group_first(language) language, 
    29.     group_first(browser) browser, 
    30.     group_first(os) os, 
    31.     group_first(screen_color) screen_color, 
    32.     group_first(screen_resolution) screen_resolution, 
    33.     group_first(flash_version) flash_version, 
    34.     group_first(java) java, 
    35.     group_first(host) host 
    36. FROM 
    37.   SELECT 
    38.     FROM r_wa_funnel  
    39.     WHERE pt='2011.11.01.21' 
    40.     distribute by bussiness_id, cookie_id, session_id, funnel_id, step_id sort by log_time ASC 
    41. t1 
    42. GROUP BY bussiness_id, cookie_id, session_id, funnel_id, step_id; 

    运行时间:

    第一个须要运行6:43, 而优化有仅仅要运行0:35秒。性能得到大幅提升
  • 相关阅读:
    C++11写算法之顺序查找
    Codeforces Round #632 (Div. 2) 题解
    AtCoder Beginner Contest 161 题解
    Codeforces Round #630 (Div. 2) 部分题解
    2019-2020 ICPC Northwestern European Regional Programming Contest (NWERC 2019) 部分题解
    Codeforces Round #629 (Div. 3) 题解
    Educational Codeforces Round 84 (Rated for Div. 2) 部分题解
    AtCoder Beginner Contest 159 题解
    Codeforces Round #628 (Div. 2) 题解
    Codeforces Round #627 (Div. 3) 题解
  • 原文地址:https://www.cnblogs.com/yjbjingcha/p/7008130.html
Copyright © 2011-2022 走看看