zoukankan      html  css  js  c++  java
  • hive SQL优化之distribute by和sort by

    近期在优化hiveSQL。

    以下是一段排序,分组后取每组第一行记录的SQL

    1. INSERT OVERWRITE TABLE t_wa_funnel_distinct_temp PARTITION (pt='${SRCTIME}'
    2. SELECT 
    3.     bussiness_id, 
    4.     cookie_id, 
    5.     session_id, 
    6.     funnel_id, 
    7.     group_first(funnel_name) funnel_name, 
    8.     step_id, 
    9.     group_first(step_name) step_name, 
    10.     group_first(log_type) log_type, 
    11.     group_first(url_pattern) url_pattern, 
    12.     group_first(url) url, 
    13.     group_first(refer) refer, 
    14.     group_first(log_time) log_time, 
    15.     group_first(is_new_visitor) is_new_visitor, 
    16.     group_first(is_mobile_traffic) is_mobile_traffic, 
    17.     group_first(is_bounce) is_bounce, 
    18.     group_first(campaign_name) campaign_name, 
    19.     group_first(group_name) group_name, 
    20.     group_first(slot_name) slot_name, 
    21.     group_first(source_type) source_type, 
    22.     group_first(next_page) next_page, 
    23.     group_first(continent) continent, 
    24.     group_first(sub_continent_region) sub_continent_region, 
    25.     group_first(country) country, 
    26.     group_first(region) region, 
    27.     group_first(city) city, 
    28.     group_first(language) language, 
    29.     group_first(browser) browser, 
    30.     group_first(os) os, 
    31.     group_first(screen_color) screen_color, 
    32.     group_first(screen_resolution) screen_resolution, 
    33.     group_first(flash_version) flash_version, 
    34.     group_first(java) java, 
    35.     group_first(host) host 
    36. FROM 
    37.   SELECT 
    38.     FROM r_wa_funnel  
    39.     WHERE pt='${SRCTIME}' 
    40.     ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC 
    41. t1 
    42. GROUP BY pt, bussiness_id, cookie_id, session_id, funnel_id, step_id; 

    group_first: 自己定义函数。用户取每组第一个字段
    ${SRCTIME}: 由外部oozie调度传入, 作为时间分区,精确到小时.eg: 2011.11.01.21

    以下在hive上以
    SRCTIME = 2011.11.01.21 运行以上SQL. 2011.11.01.21小时分区记录数有10435486

    运行时间:

    从上面能够看出,reduce阶段仅仅有一个reduce, 这是由于ORDER BY是全局排序,hive仅仅能通过一个reduce进行排序
    从业务需求来看, 仅仅要按
    bussiness_id, cookie_id, session_id, funnel_id, step_id分组,组内按
    log_time升序排序就可以.

    OK, 这样能够採用hive提供的distribute by 和 sort by,这样能够充分利用hadoop资源, 在多个
    reduce中局部按log_time 排序

    优化有的hive代码:

    1. INSERT OVERWRITE TABLE t_wa_funnel_distinct PARTITION (pt='2011.11.01.21'
    2. SELECT 
    3.     bussiness_id, 
    4.     cookie_id, 
    5.     session_id, 
    6.     funnel_id, 
    7.     group_first(funnel_name) funnel_name, 
    8.     step_id, 
    9.     group_first(step_name) step_name, 
    10.     group_first(log_type) log_type, 
    11.     group_first(url_pattern) url_pattern, 
    12.     group_first(url) url, 
    13.     group_first(refer) refer, 
    14.     group_first(log_time) log_time, 
    15.     group_first(is_new_visitor) is_new_visitor, 
    16.     group_first(is_mobile_traffic) is_mobile_traffic, 
    17.     group_first(is_bounce) is_bounce, 
    18.     group_first(campaign_name) campaign_name, 
    19.     group_first(group_name) group_name, 
    20.     group_first(slot_name) slot_name, 
    21.     group_first(source_type) source_type, 
    22.     group_first(next_page) next_page, 
    23.     group_first(continent) continent, 
    24.     group_first(sub_continent_region) sub_continent_region, 
    25.     group_first(country) country, 
    26.     group_first(region) region, 
    27.     group_first(city) city, 
    28.     group_first(language) language, 
    29.     group_first(browser) browser, 
    30.     group_first(os) os, 
    31.     group_first(screen_color) screen_color, 
    32.     group_first(screen_resolution) screen_resolution, 
    33.     group_first(flash_version) flash_version, 
    34.     group_first(java) java, 
    35.     group_first(host) host 
    36. FROM 
    37.   SELECT 
    38.     FROM r_wa_funnel  
    39.     WHERE pt='2011.11.01.21' 
    40.     distribute by bussiness_id, cookie_id, session_id, funnel_id, step_id sort by log_time ASC 
    41. t1 
    42. GROUP BY bussiness_id, cookie_id, session_id, funnel_id, step_id; 

    运行时间:

    第一个须要运行6:43, 而优化有仅仅要运行0:35秒。性能得到大幅提升
  • 相关阅读:
    课堂作业04 2017.10.27
    课程作业 03 动手动脑 2017.10.20
    课程作业 03 2017.10.20
    HDU 3974 Assign the task
    POJ 2155 Matrix
    POJ 2481 Cows
    HDU 3038 How Many Answers Are Wrong
    CS Academy Array Removal
    POJ_1330 Nearest Common Ancestors LCA
    CF Round 427 D. Palindromic characteristics
  • 原文地址:https://www.cnblogs.com/yjbjingcha/p/7008130.html
Copyright © 2011-2022 走看看