zoukankan      html  css  js  c++  java
  • Hive学习小记-(7)group by原理&tips

    原理

    参考:https://blog.csdn.net/u013668852/article/details/79866931

    group by多字段

    select rank, isonline, count(*) from city group by rank, isonline;

    将GroupBy的字段组合为map的输出key值,利用MapReduce的排序,在reduce阶段保存LastKey区分不同的key。MapReduce的过程如下(当然这里只是说明Reduce端的非Hash聚合过程)

    image

    group by单字段

    group by 单字段和多字段时的不同在于key上,以如下例子为例(出处太多):
    SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;

    hive> SELECT * FROM logs;
    a   苹果  5
    a   橙子  3
    a   苹果  2
    b   烧鸡  1
    
    hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
    a   10
    b   1

    其过程如下图:
    image

    默认设置了hive.map.aggr=true,所以会在mapper端先group by一次,最后再把结果merge起来,为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash,reducer是mergepartial。如果把hive.map.aggr=false,那将groupby放到reducer才做,他的mode是complete.

    image

    TIPS

    减少group by字段

    措施一:先group by ;再关联

    参考:https://blog.51cto.com/imysqldba/1364808

    措施二:若是其中有些字段是固定取值,可以限定死

    参考:https://www.cnblogs.com/hoojjack/p/7460574.html

    措施三:本身在其他字段取了group by 就已经能确保取值唯一的字段,如张三项目组的职责一定是洗碗,李四组的职责一定是刷盘子,可以给字段打上聚合函数max,min等:select proj_group, duty, sum(salary) from test_table group by proj_group, duty  可以改写为:select proj_group, max(duty), sum(salary) from test_table group by proj_group

    hive与MySQL group by的区别

    1.hive执行顺序group 在前,select在后;MySQL select在前,group 在后

    参考:https://blog.csdn.net/m0_48714980/article/details/109136659

    eg:

    --在year(date)起别名
    -- 1.在Mysql中顺利执行
    select count(*),year(date) a from student group by a ;
    
    -- 2.但在hive中出现报错
    select count(*),year(date) a from student group by a ;
    Invalid table alias or column reference ‘a’: (possible column names are: s_id, s_name, s_birth, s_sex
    
    -- 原因分析:select执行顺序中,group by 在select前先执行,所以此时别名在解析器中没法识别

    2.hive要将非聚合字段都加到group中;MySQL没加到group的字段可能会随机取:

    eg:

    mysql> select * from youhua1;
    +--------+--------------+----------------+----------------+
    | custid | jijin_change | baoxian_change | cunkuan_change |
    +--------+--------------+----------------+----------------+
    |      1 |          1.1 |            1.2 |            1.3 |
    |      2 |         2.67 |           2.34 |            2.1 |
    |      3 |          3.1 |            3.4 |           3.78 |
    |      3 |          1.1 |            2.3 |            7.8 |
    +--------+--------------+----------------+----------------+
    4 rows in set (0.00 sec)
    
    mysql> select custid,jijin_change,sum(baoxian_change) from youhua1 group by custid;;
    +--------+--------------+---------------------+
    | custid | jijin_change | sum(baoxian_change) |
    +--------+--------------+---------------------+
    |      1 |          1.1 |  1.2000000476837158 |
    |      2 |         2.67 |  2.3399999141693115 |
    |      3 |          3.1 |   5.700000047683716 |
    +--------+--------------+---------------------+
    3 rows in set (0.00 sec)
  • 相关阅读:
    windows安装psycopg2问题解决方案
    error: Setup script exited with error: Unable to find vcvarsall.bat
    python web server comparison
    重置mysql root密码
    Nginx + uWSGI + web.py 搭建示例
    《凉州曲》——吴践道
    C#代理服务器
    OpenGL 分形 Sierpinski镂垫
    Chap02_递归与分治_判断两棵二叉树是否同构
    OpenGL C#绘图环境配置
  • 原文地址:https://www.cnblogs.com/foolangirl/p/14208537.html
Copyright © 2011-2022 走看看