Hive SQL练习之成绩分析
数据:[id, 学号,班级,科目,成绩]
1,1,1,yuwen,80
2,1,1,shuxue,85
3,2,1,yuwen,75
4,2,1,shuxue,70
5,3,1,yuwen,86
6,3,1,shuxue,72
7,4,2,yuwen,88
8,4,2,shuxue,99
9,5,2,yuwen,86
10,5,2,shuxue,94
11,6,2,yuwen,56
12,6,2,shuxue,96
题目:
(1)求每个班级前三名的同学(组内topN问题);
分析:按班级分组,同时取每个同学的平均成绩(或总分)还需要对学生分组。
1)按班级+学生分组,取每个同学平均成绩
select cid, sid, avg(score) avg_score from t_course group by cid, sid order by cid, avg_score desc;
+------+------+------------+
| cid | sid | avg_score |
+------+------+------------+
| 1 | 1 | 82.5 |
| 1 | 3 | 79.0 |
| 1 | 2 | 72.5 |
| 2 | 4 | 93.5 |
| 2 | 5 | 90.0 |
| 2 | 6 | 76.0 |
+------+------+------------+
注意,这里因为数据量太少(每个班就三名同学),直接按上面排序就取出了前三名学生。
这个题实际考察的是多分组(班级+学生),组内TopN(成绩前三名),应该使用dense_rank函数。
- 请自行回顾row_number、rank、dense_rank三个函数的区别。
例如取前两名,就得用上dense_rank
select * from (select cid, sid, avg(score) as avg_score, dense_rank() over (partition by cid order by avg(score) desc) as rank from t_course group by cid, sid order by cid) t where rank<=2;
+--------+--------+--------------+---------+
| t.cid | t.sid | t.avg_score | t.rank |
+--------+--------+--------------+---------+
| 1 | 1 | 82.5 | 1 |
| 1 | 3 | 79.0 | 2 |
| 2 | 4 | 93.5 | 1 |
| 2 | 5 | 90.0 | 2 |
+--------+--------+--------------+---------+
(2)求语文成绩比数学成绩高的同学,要求使用两种方式完成。
提示:case when,collect_set,concat,concat_set
本人自创方法(也是最简单的方法)
0: jdbc:hive2://bigboss3:10000> select * from(select id,sid,cid,score,lead(score) over() as shuxue from t_course)t where id%2=1 and score>shuxue;
+-------+--------+--------+----------+-----------+
| t.id | t.sid | t.cid | t.score | t.shuxue |
+-------+--------+--------+----------+-----------+
| 3 | 2 | 1 | 75 | 70 |
| 5 | 3 | 1 | 86 | 72 |
+-------+--------+--------+----------+-----------+
解析:每个学生都是相邻行科目与成绩不同,使用lead(score) over()
把成绩那一列整体向上提升一行,那么奇数行就有该同学语文和数学成绩,这样比起来不就很简单了吗,这个方法最简单了,也是我自己想的,得出答案也是正确的。
方法二:使用collect_set,concat,concat_set,str_to_map
collect_set函数将多行转换为一行,使用concat,concat_set,可以将字段拼接为map格式
0: jdbc:hive2://bigboss3:10000> select * from (select sid,cid,str_to_map(concat_ws(",",collect_set(sc))) as scmap from (select sid,cid,concat(course,":",score) as sc from t_course) t group by sid,cid)s where s.scmap["yuwen"]>s.scmap["shuxue"];
+--------+--------+-------------------------------+
| s.sid | s.cid | s.scmap |
+--------+--------+-------------------------------+
| 3 | 1 | {"yuwen":"86","shuxue":"72"} |
| 2 | 1 | {"yuwen":"75","shuxue":"70"} |
+--------+--------+-------------------------------+
2 rows selected (1.369 seconds)
解析:
1)使用concat方法,将学科和成绩进行拼接
select sid, concat(course, ":", score) from t_course;
+------+------------+
| sid | _c1 |
+------+------------+
| 1 | yuwen:80 |
| 1 | shuxue:85 |
| 2 | yuwen:75 |
| 2 | shuxue:70 |
2)使用collect_set函数,每个同学的多行成绩合并到一行
注意:多行变一行是聚集操作,需要分组
select sid, collect_set(concat(course, ":", score)) from t_course group by sid;
+------+---------------------------+
| sid | _c1 |
+------+---------------------------+
| 4 | ["yuwen:88","shuxue:99"] |
| 6 | ["yuwen:56","shuxue:96"] |
| 2 | ["yuwen:75","shuxue:70"] |
| 1 | ["yuwen:80","shuxue:85"] |
| 3 | ["yuwen:86","shuxue:72"] |
| 5 | ["yuwen:86","shuxue:94"] |
+------+---------------------------+
3)转换为map格式,方便获取成绩
注意:上面获取的["yuwen:88","shuxue:99"]是数组,无法直接转换为map,需要先使用concat_ws将数组转换为字符串
select sid, concat_ws(",", collect_set(concat(course, ":", score))) from t_course group by sid;
+------+---------------------+
| sid | _c1 |
+------+---------------------+
| 4 | yuwen:88,shuxue:99 |
| 6 | yuwen:56,shuxue:96 |
| 2 | yuwen:75,shuxue:70 |
再将字符串转换为map
select sid, str_to_map(concat_ws(",", collect_set(concat(course, ":", score)))) as score_map from t_course group by sid;
+------+-------------------------------+
| sid | score_map |
+------+-------------------------------+
| 4 | {"yuwen":"88","shuxue":"99"} |
| 6 | {"yuwen":"56","shuxue":"96"} |
| 2 | {"yuwen":"75","shuxue":"70"} |
| 1 | {"yuwen":"80","shuxue":"85"} |
| 3 | {"yuwen":"86","shuxue":"72"} |
| 5 | {"yuwen":"86","shuxue":"94"} |
+------+-------------------------------+
4)通过map访问学科和成绩,并进行过滤
select sid, score_map["yuwen"] as yuwen, score_map["shuxue"] as shuxue from (select sid, str_to_map(concat_ws(",", collect_set(concat(course, ":", score)))) as score_map from t_course group by sid) t where score_map["yuwen"]>score_map["shuxue"];
+------+--------+---------+
| sid | yuwen | shuxue |
+------+--------+---------+
| 2 | 75 | 70 |
| 3 | 86 | 72 |
+------+--------+---------+
方法三:case when
这个方法比较巧妙,一般想不出来
select sid, max(yuwen) as score_yw, max(shuxue) as score_sx from (select *, case course when "yuwen" then score else 0 end as yuwen, case course when "shuxue" then score else 0 end as shuxue from t_course) t group by sid having score_yw>score_sx;
+------+-----------+-----------+
| sid | score_yw | score_sx |
+------+-----------+-----------+
| 2 | 75 | 70 |
| 3 | 86 | 72 |
+------+-----------+-----------+
解析
思路:由于各科成绩在不同行,无法直接进行比较,可以使用case when新增yuwen、shuxue两列,然后再将各科成绩转换为一行进行比较。
1)新增yuwen、shuxue两列
select *, case course when "yuwen" then score else 0 end as yuwen, case course when "shuxue" then score else 0 end as shuxue from t_course;
+--------------+---------------+---------------+------------------+-----------------+--------+---------+
| t_course.id | t_course.sid | t_course.cid | t_course.course | t_course.score | yuwen | shuxue |
+--------------+---------------+---------------+------------------+-----------------+--------+---------+
| 1 | 1 | 1 | yuwen | 80 | 80 | 0 |
| 2 | 1 | 1 | shuxue | 85 | 0 | 85 |
| 3 | 2 | 1 | yuwen | 75 | 75 | 0 |
| 4 | 2 | 1 | shuxue | 70 | 0 | 70 |
| 5 | 3 | 1 | yuwen | 86 | 86 | 0 |
| 6 | 3 | 1 | shuxue | 72 | 0 | 72 |
| 7 | 4 | 2 | yuwen | 88 | 88 | 0 |
| 8 | 4 | 2 | shuxue | 99 | 0 | 99 |
| 9 | 5 | 2 | yuwen | 86 | 86 | 0 |
| 10 | 5 | 2 | shuxue | 94 | 0 | 94 |
| 11 | 6 | 2 | yuwen | 56 | 56 | 0 |
| 12 | 6 | 2 | shuxue | 96 | 0 | 96 |
+--------------+---------------+---------------+------------------+-----------------+--------+---------+
case course when "yuwen" then score else 0 end as yuwen解析:
对每一行,当course取值为"yuwen"时,则新增的yuwen字段值为score(语文成绩),否则(course值为shuxue),取值为0;
2)对每一个学生做分组,目的是将每个学生的成绩显示到一行
select sid, max(yuwen) as score_yw, max(shuxue) as score_sx from (select *, case course when "yuwen" then score else 0 end as yuwen, case course when "shuxue" then score else 0 end as shuxue from t_course) t group by sid order by sid;
+------+-----------+-----------+
| sid | score_yw | score_sx |
+------+-----------+-----------+
| 1 | 80 | 85 |
| 2 | 75 | 70 |
| 3 | 86 | 72 |
| 4 | 88 | 99 |
| 5 | 86 | 94 |
| 6 | 56 | 96 |
+------+-----------+-----------+
3)查找语文成绩更高的学生,注意分组条件过滤使用having,而不是where
select sid, max(yuwen) as score_yw, max(shuxue) as score_sx from (select *, case course when "yuwen" then score else 0 end as yuwen, case course when "shuxue" then score else 0 end as shuxue from t_course) t group by sid having score_yw>score_sx;
+------+-----------+-----------+
| sid | score_yw | score_sx |
+------+-----------+-----------+
| 2 | 75 | 70 |
| 3 | 86 | 72 |
+------+-----------+-----------+