zoukankan      html  css  js  c++  java
  • distinct用group by优化

    当数据量非常大,在同一个query中计算多个不相关列的distinct时,往往很容易出现数据倾斜现象,导致运行半天都不能得到结果。

    比如以下的SQL语句(a, b, c没有相关性):

    select distinct(a), distinct(b), distinct(c) from tableName;

    改进方案:把计算的distinct先group by,然后再计算:

    select distinct(t.a), distinct(t.b), distinct(t.c) 
    from(
        select a, b, c from tableName
        group by a, b, c;
    )t;

    原理:distinct是在reduce里面计算的,再进行merge,而group by的字段会在map这里分N路hash。group by越多,hash分的越随机。

    group by中avg以及sum的处理方式:

    select 
        count(distinct a) as a, count(distinct b) as b, 
        count(distinct c) as c, count(distinct d) as d,
        avg(e) as e, 
        sum(f) as f
    from tablename;

    处理方式如下:

    select 
        count(distinct a) as a, count(distinct b) as b, 
        count(distinct c) as c, count(distinct d) as d,
        sum(e)/sum(e_count) as e,  // avg的处理
        sum(f) as f  // sum的处理
    from (
        select v.a, v.b, v.c, v.d, 
            sum(v.e) as e, 
            sum(v.f) as f,
            sum(v.e_count) as e_count
        from(
            select 
                a, b, c, d, e, f,
                case when e=null then 0 else 1 end as e_count
            from tablename
        )v
        group by v.a, v.b, v.c, v.d
    )t;
     
  • 相关阅读:
    Visual C++ 2010 SP1 x86&x64
    MVC拦截
    自定义HTTP消息拦截
    转mysql半主从同步
    mysql主从搭建之诡异事件
    snapshot相关
    分布式系统唯一ID生成方案汇总
    mysql监控利器mysqlmtop部署安装
    mysql日常运维
    MySQL索引背后的数据结构及算法原理
  • 原文地址:https://www.cnblogs.com/yejinru/p/4154119.html
Copyright © 2011-2022 走看看