原理
参考:https://blog.csdn.net/u013668852/article/details/79866931
group by多字段
select rank, isonline, count(*) from city group by rank, isonline;
将GroupBy的字段组合为map的输出key值,利用MapReduce的排序,在reduce阶段保存LastKey区分不同的key。MapReduce的过程如下(当然这里只是说明Reduce端的非Hash聚合过程)
group by单字段
group by 单字段和多字段时的不同在于key上,以如下例子为例(出处太多):SELECT uid, SUM(COUNT) FROM logs GROUP BY uid;
hive> SELECT * FROM logs; a 苹果 5 a 橙子 3 a 苹果 2 b 烧鸡 1 hive> SELECT uid, SUM(COUNT) FROM logs GROUP BY uid; a 10 b 1
其过程如下图:
默认设置了hive.map.aggr=true,所以会在mapper端先group by一次,最后再把结果merge起来,为了减少reducer处理的数据量。注意看explain的mode是不一样的。mapper是hash,reducer是mergepartial。如果把hive.map.aggr=false,那将groupby放到reducer才做,他的mode是complete.
TIPS
减少group by字段
措施一:先group by ;再关联
参考:https://blog.51cto.com/imysqldba/1364808
措施二:若是其中有些字段是固定取值,可以限定死
参考:https://www.cnblogs.com/hoojjack/p/7460574.html
措施三:本身在其他字段取了group by 就已经能确保取值唯一的字段,如张三项目组的职责一定是洗碗,李四组的职责一定是刷盘子,可以给字段打上聚合函数max,min等:select proj_group, duty, sum(salary) from test_table group by proj_group, duty 可以改写为:select proj_group, max(duty), sum(salary) from test_table group by proj_group
hive与MySQL group by的区别
1.hive执行顺序group 在前,select在后;MySQL select在前,group 在后
参考:https://blog.csdn.net/m0_48714980/article/details/109136659
eg:
--在year(date)起别名 -- 1.在Mysql中顺利执行 select count(*),year(date) a from student group by a ; -- 2.但在hive中出现报错 select count(*),year(date) a from student group by a ; Invalid table alias or column reference ‘a’: (possible column names are: s_id, s_name, s_birth, s_sex -- 原因分析:select执行顺序中,group by 在select前先执行,所以此时别名在解析器中没法识别
2.hive要将非聚合字段都加到group中;MySQL没加到group的字段可能会随机取:
eg:
mysql> select * from youhua1; +--------+--------------+----------------+----------------+ | custid | jijin_change | baoxian_change | cunkuan_change | +--------+--------------+----------------+----------------+ | 1 | 1.1 | 1.2 | 1.3 | | 2 | 2.67 | 2.34 | 2.1 | | 3 | 3.1 | 3.4 | 3.78 | | 3 | 1.1 | 2.3 | 7.8 | +--------+--------------+----------------+----------------+ 4 rows in set (0.00 sec) mysql> select custid,jijin_change,sum(baoxian_change) from youhua1 group by custid;; +--------+--------------+---------------------+ | custid | jijin_change | sum(baoxian_change) | +--------+--------------+---------------------+ | 1 | 1.1 | 1.2000000476837158 | | 2 | 2.67 | 2.3399999141693115 | | 3 | 3.1 | 5.700000047683716 | +--------+--------------+---------------------+ 3 rows in set (0.00 sec)