zoukankan      html  css  js  c++  java
  • MapReduce执行计划及优化

    WordCount:

    select rank, count(*) cnt from city group by rank;

    Map与Reduce 都是一个JVM进程,可以理解为都是一个独立的应用程序

    MapReduce 框架的作用就是自动帮你把map端的输出,经过shuffle后变为reduce端的输入

    方法:

    public void map(Text value,Context context);

    public void reduce(Text key, Iterable<IntWritable> values, Context context)

    Shuffle:

     

    common join

    Map Join

     

    MapReduce

    word count:


    ### 执行计划
    explain select
    a.name,a.entity_id
    from
    tangbao.shop a
    join
    ods_order_org.instancedetail b
    on a.entity_id=b.entity_id
    where b.pt='20180501';

    +--------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +--------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-4 is a root stage |
    | Stage-3 depends on stages: Stage-4 |
    | Stage-0 depends on stages: Stage-3 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-4 |
    | Map Reduce Local Work |
    | Alias -> Map Local Tables: |
    | a |
    | Fetch Operator |
    | limit: -1 |
    | Alias -> Map Local Operator Tree: |
    | a |
    | TableScan |
    | alias: a |
    | filterExpr: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
    | HashTable Sink Operator |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 entity_id (type: string) |
    | |
    | Stage: Stage-3 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: b |
    | filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
    | Statistics: Num rows: 164539402 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
    | Map Join Operator |
    | condition map: |
    | Inner Join 0 to 1 |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 entity_id (type: string) |
    | outputColumnNames: _col9, _col38 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col9 (type: string), _col38 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | Local Work: |
    | Map Reduce Local Work |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +--------------------------------------------------------------------------------------------------------------------+--+

    explain select
    a.name,a.entity_id
    from
    tangbao.shop a
    left join
    ods_order_org.instancedetail b
    on a.entity_id=b.entity_id
    where b.pt='20180501';

    +--------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +--------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-4 is a root stage , consists of Stage-1 |
    | Stage-1 |
    | Stage-0 depends on stages: Stage-1 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-4 |
    | Conditional Operator |
    | |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string) |
    | sort order: + |
    | Map-reduce partition columns: entity_id (type: string) |
    | Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: name (type: string) |
    | TableScan |
    | alias: b |
    | Statistics: Num rows: 4821977155 Data size: 284496652145 Basic stats: PARTIAL Column stats: PARTIAL |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string) |
    | sort order: + |
    | Map-reduce partition columns: entity_id (type: string) |
    | Statistics: Num rows: 4821977155 Data size: 284496652145 Basic stats: PARTIAL Column stats: PARTIAL |
    | value expressions: pt (type: string) |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Left Outer Join0 to 1 |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 entity_id (type: string) |
    | outputColumnNames: _col9, _col38, _col117 |
    | Statistics: Num rows: 5304174985 Data size: 312946324142 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: (_col117 = '20180501') (type: boolean) |
    | Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col9 (type: string), _col38 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 2652087492 Data size: 156473162041 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +--------------------------------------------------------------------------------------------------------------------+--+

    explain select
    avg(fee),max(fee)
    from
    olap_pc_history_report.tmp_tp_p_join
    group by
    entity_id,curr_date,kindpay_id;

    +-------------------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +-------------------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-1 is a root stage |
    | Stage-0 depends on stages: Stage-1 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: tmp_tp_p_join |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string), fee (type: double) |
    | outputColumnNames: entity_id, curr_date, kindpay_id, fee |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | aggregations: avg(fee), max(fee) |
    | keys: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
    | mode: hash |
    | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    | sort order: +++ |
    | Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: _col3 (type: struct<count:bigint,sum:double,input:double>), _col4 (type: double) |
    | Reduce Operator Tree: |
    | Group By Operator |
    | aggregations: avg(VALUE._col0), max(VALUE._col1) |
    | keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
    | mode: mergepartial |
    | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col3 (type: double), _col4 (type: double) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +-------------------------------------------------------------------------------------------------------------------------------+--+

    set hive.map.aggr=false;

    set hive.groupby.skewindata=true;

    explain select
    avg(fee),max(fee)
    from
    olap_pc_history_report.tmp_tp_p_join
    group by
    entity_id,curr_date,kindpay_id;

    +-------------------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +-------------------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-1 is a root stage |
    | Stage-2 depends on stages: Stage-1 |
    | Stage-0 depends on stages: Stage-2 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: tmp_tp_p_join |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: curr_date (type: string), entity_id (type: string), fee (type: double), kindpay_id (type: string) |
    | outputColumnNames: curr_date, entity_id, fee, kindpay_id |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string), curr_date (type: string), kindpay_id (type: string) |
    | sort order: +++ |
    | Map-reduce partition columns: rand() (type: double) |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: fee (type: double) |
    | Reduce Operator Tree: |
    | Group By Operator |
    | aggregations: avg(VALUE._col0), max(VALUE._col0) |
    | keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
    | mode: partial1 |
    | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: true |
    | table: |
    | input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
    | |
    | Stage: Stage-2 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | Reduce Output Operator |
    | key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    | sort order: +++ |
    | Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
    | Statistics: Num rows: 155435158 Data size: 40145589195 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: _col3 (type: struct<count:bigint,sum:double,input:double>), _col4 (type: double) |
    | Reduce Operator Tree: |
    | Group By Operator |
    | aggregations: avg(VALUE._col0), max(VALUE._col1) |
    | keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
    | mode: final |
    | outputColumnNames: _col0, _col1, _col2, _col3, _col4 |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col3 (type: double), _col4 (type: double) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 77717579 Data size: 20072794597 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +-------------------------------------------------------------------------------------------------------------------------------+--+

    set hive.optimize.skewjoin=true;

    explain select
    a.name,a.entity_id
    from
    ods_shop_org.shop a
    join
    ods_order_org.instancedetail b
    on a.entity_id=b.entity_id
    where b.pt='20180501';

    +------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-7 is a root stage , consists of Stage-1 |
    | Stage-1 |
    | Stage-4 depends on stages: Stage-1 , consists of Stage-8 |
    | Stage-8 |
    | Stage-3 depends on stages: Stage-8 |
    | Stage-0 depends on stages: Stage-3 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-7 |
    | Conditional Operator |
    | |
    | Stage: Stage-1 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | filterExpr: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 443697 Data size: 88739406 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 221849 Data size: 44369803 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string) |
    | sort order: + |
    | Map-reduce partition columns: entity_id (type: string) |
    | Statistics: Num rows: 221849 Data size: 44369803 Basic stats: COMPLETE Column stats: NONE |
    | value expressions: name (type: string) |
    | TableScan |
    | alias: b |
    | filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
    | Statistics: Num rows: 164539402 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Output Operator |
    | key expressions: entity_id (type: string) |
    | sort order: + |
    | Map-reduce partition columns: entity_id (type: string) |
    | Statistics: Num rows: 82269701 Data size: 8226970112 Basic stats: COMPLETE Column stats: NONE |
    | Reduce Operator Tree: |
    | Join Operator |
    | condition map: |
    | Inner Join 0 to 1 |
    | handleSkewJoin: true |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 entity_id (type: string) |
    | outputColumnNames: _col9, _col38 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col9 (type: string), _col38 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | |
    | Stage: Stage-4 |
    | Conditional Operator |
    | |
    | Stage: Stage-8 |
    | Map Reduce Local Work |
    | Alias -> Map Local Tables: |
    | 1 |
    | Fetch Operator |
    | limit: -1 |
    | Alias -> Map Local Operator Tree: |
    | 1 |
    | TableScan |
    | HashTable Sink Operator |
    | keys: |
    | 0 reducesinkkey0 (type: string) |
    | 1 reducesinkkey0 (type: string) |
    | |
    | Stage: Stage-3 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | Map Join Operator |
    | condition map: |
    | Inner Join 0 to 1 |
    | keys: |
    | 0 reducesinkkey0 (type: string) |
    | 1 reducesinkkey0 (type: string) |
    | outputColumnNames: _col9, _col38 |
    | Select Operator |
    | expressions: _col9 (type: string), _col38 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 90496673 Data size: 9049667319 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    +------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +------------------------------------------------------------------------------------------------------------------+--+
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | Local Work: |
    | Map Reduce Local Work |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +------------------------------------------------------------------------------------------------------------------+--+

    explain select
    a.name,a.entity_id
    from
    ods_order_org.instancedetail a
    left semi join
    tangbao.shop b
    on a.entity_id=b.entity_id
    where a.pt='20180501';

    +--------------------------------------------------------------------------------------------------------------------+--+
    | Explain |
    +--------------------------------------------------------------------------------------------------------------------+--+
    | STAGE DEPENDENCIES: |
    | Stage-4 is a root stage |
    | Stage-3 depends on stages: Stage-4 |
    | Stage-0 depends on stages: Stage-3 |
    | |
    | STAGE PLANS: |
    | Stage: Stage-4 |
    | Map Reduce Local Work |
    | Alias -> Map Local Tables: |
    | b |
    | Fetch Operator |
    | limit: -1 |
    | Alias -> Map Local Operator Tree: |
    | b |
    | TableScan |
    | alias: b |
    | filterExpr: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 1000 Data size: 235780 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: entity_id (type: string) |
    | outputColumnNames: entity_id |
    | Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
    | Group By Operator |
    | keys: entity_id (type: string) |
    | mode: hash |
    | outputColumnNames: _col0 |
    | Statistics: Num rows: 500 Data size: 117890 Basic stats: COMPLETE Column stats: NONE |
    | HashTable Sink Operator |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 _col0 (type: string) |
    | |
    | Stage: Stage-3 |
    | Map Reduce |
    | Map Operator Tree: |
    | TableScan |
    | alias: a |
    | filterExpr: (entity_id is not null and (pt = '20180501')) (type: boolean) |
    | Statistics: Num rows: 82269701 Data size: 16453940224 Basic stats: COMPLETE Column stats: NONE |
    | Filter Operator |
    | predicate: entity_id is not null (type: boolean) |
    | Statistics: Num rows: 41134851 Data size: 8226970212 Basic stats: COMPLETE Column stats: NONE |
    | Map Join Operator |
    | condition map: |
    | Left Semi Join 0 to 1 |
    | keys: |
    | 0 entity_id (type: string) |
    | 1 _col0 (type: string) |
    | outputColumnNames: _col9, _col29 |
    | Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
    | Select Operator |
    | expressions: _col9 (type: string), _col29 (type: string) |
    | outputColumnNames: _col0, _col1 |
    | Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
    | File Output Operator |
    | compressed: false |
    | Statistics: Num rows: 45248337 Data size: 9049667429 Basic stats: COMPLETE Column stats: NONE |
    | table: |
    | input format: org.apache.hadoop.mapred.TextInputFormat |
    | output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
    | serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
    | Local Work: |
    | Map Reduce Local Work |
    | |
    | Stage: Stage-0 |
    | Fetch Operator |
    | limit: -1 |
    | Processor Tree: |
    | ListSink |
    | |
    +--------------------------------------------------------------------------------------------------------------------+--+



  • 相关阅读:
    Leetcode NO.110 Balanced Binary Tree 平衡二叉树
    Leetcode NO.226 Invert Binary Tree 翻转二叉树
    Leetcode NO.215 Kth Largest Element In An Array 数组中的第K个最大元素
    根据特征的浏览器判断
    Cygwin在打开在当前目录
    【转帖】科学对待 健康养猫 打造快乐孕妇
    解决chrome浏览器安装扩展、应用程序一直处在“检查中”的问题
    对【SQL SERVER 分布式事务解决方案】的心得补充
    关于“点击这里继续访问您选择的百度XXX”
    VBA一例:如何保持文本框焦点
  • 原文地址:https://www.cnblogs.com/hyc123-/p/9190515.html
Copyright © 2011-2022 走看看