Common Hive Questions and Answers


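1: Data skew

Theory

What can cause data skew in Hive, and what are the main ways to fix it?

Causes

1: Dirty or special data, where one kind of value (for example NULL) is heavily concentrated
2: Joins between a large table and a small table
3: Too many small files

Solutions

1: Keep dirty data out of the join, and give the special values a random key (at table-build time)
2: Use a map join to load the small table into memory.
3: Merge small files with set hive.merge.mapredfiles=true, or increase the number of map tasks (when the computation is heavy)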

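code

Solution 1: rows with a null id do not take part in the join
 For example: select a.* from log a
join users b
on a.id is not null and a.id = b.id
union all
select a.* from log a
where a.id is null;
-- both branches select a.* so that the union all column lists match
Solution 2: assign a random key to null values
 For example: select * from log a
left outer join users b
on
case when a.user_id is null
then concat('hive', rand())
else a.user_id end = b.user_id;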
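
To see why solution 2 helps, the following minimal Python sketch simulates how join keys are spread over reducers. The reducer count, the sample data, and the helper `reducer_load` are assumptions made for illustration, and `zlib.crc32` only stands in for Hive's own hash partitioning.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 4  # assumed reducer count, for illustration only

def reducer_load(keys):
    """Count how many rows each simulated reducer receives."""
    # crc32 is a stand-in for the hash partitioner, not Hive's actual function.
    return Counter(zlib.crc32(k.encode()) % NUM_REDUCERS for k in keys)

random.seed(0)
# 10,000 log rows: 90% have a null user_id, the rest have distinct ids.
user_ids = [None] * 9000 + [f"u{i}" for i in range(1000)]

# Without salting, every null key lands on the same reducer.
plain = reducer_load("NULL" if k is None else k for k in user_ids)

# With salting, each null gets its own random key, like concat('hive', rand()).
salted = reducer_load(
    f"hive{random.random()}" if k is None else k for k in user_ids
)

print("max rows on one reducer, plain :", max(plain.values()))
print("max rows on one reducer, salted:", max(salted.values()))
```

The salted keys never match a real `user_id`, so the null rows still come out of the left outer join unmatched; they are just no longer processed by a single reducer.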

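2: Swapping rows and columns

Rows to columns

students_info(stu_id,name,depart);
1, Zhang San, Chinese
1, Zhang San, Math
1, Zhang San, English
2, Li Si, Chinese
2, Li Si, Math
Goal:
1, Zhang San, Chinese|Math|English
2, Li Si, Chinese|Math

Answer

select stu_id,name,concat_ws('|',collect_set(depart)) as departs from students_info group by stu_id,name;
1: group by (every non-aggregated column, here stu_id and name, must be listed)
2: collect_set gathers each group's values into a de-duplicated set
3: concat_ws joins them with the separator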
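
The three steps above can be mirrored in plain Python, as a rough sketch of what the query computes. The sample rows are the ones from the question; note that this sketch keeps first-seen order, whereas Hive's collect_set makes no ordering promise.

```python
from collections import defaultdict

rows = [
    (1, "Zhang San", "Chinese"),
    (1, "Zhang San", "Math"),
    (1, "Zhang San", "English"),
    (2, "Li Si", "Chinese"),
    (2, "Li Si", "Math"),
]

# group by stu_id, name: one list per group, skipping duplicates
# to imitate collect_set.
groups = defaultdict(list)
for stu_id, name, depart in rows:
    if depart not in groups[(stu_id, name)]:
        groups[(stu_id, name)].append(depart)

# concat_ws('|', ...) joins each group's values with the separator.
result = [(stu_id, name, "|".join(departs))
          for (stu_id, name), departs in groups.items()]

for row in result:
    print(row)
```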

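Columns to rows

students_info(stu_id,name,departs);
1, Zhang San, Chinese|Math|English
2, Li Si, Chinese|Math
Goal:
1, Zhang San, Chinese
1, Zhang San, Math
1, Zhang San, English
2, Li Si, Chinese
2, Li Si, Math

Answer

select stu_id, name, depart from students_info lateral view explode(split(departs,'\\|')) t as depart;
1: Split the string into an array (split). This is unnecessary if the column is already an array type, e.g. Array [1,2,3]. The separator is a regular expression, so | must be escaped.
2: Expand the array into one row per element (explode)
3: Expose the expanded rows as a virtual view (remember the aliases) and use it in the query.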
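
As a rough Python counterpart of split plus explode, the sketch below turns each delimited string into one output row per element. It uses the sample rows from the question and is only meant to show the shape of the transformation.

```python
import re

rows = [
    (1, "Zhang San", "Chinese|Math|English"),
    (2, "Li Si", "Chinese|Math"),
]

exploded = [
    (stu_id, name, depart)
    for stu_id, name, departs in rows
    # Like Hive's split, re.split takes a regex, so | is escaped.
    for depart in re.split(r"\|", departs)
]

for row in exploded:
    print(row)
```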

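3: TopN

Massive data processing: find the largest 10,000 numbers among 1 billion. Name as many approaches as you know.
1: Full sort, which is expensive in storage (space complexity)
2: Divide and conquer: split the data into 100 parts and use quicksort partitioning (around a pivot)
3: Keep a container holding the first 10,000 numbers, then compare every later number against the smallest one kept and replace it when the new number is larger. This is a min-heap of size 10,000.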
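
Approach 3 can be sketched with Python's standard `heapq` module. The function name `top_n` and the random test data are illustrative assumptions; the point is that memory stays proportional to n rather than to the input size.

```python
import heapq
import random

def top_n(numbers, n):
    """Return the n largest values from an iterable, in descending order."""
    if n <= 0:
        return []
    heap = []  # min-heap holding the n largest values seen so far
    for x in numbers:
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the smallest kept value, so swap it in.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

random.seed(42)
data = [random.randint(0, 10**9) for _ in range(100_000)]
largest = top_n(data, 10)
print(largest)
```

Each element costs at most O(log n) heap work, so the whole pass is O(N log n) for N inputs, and the input can be streamed from disk instead of being loaded at once.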

4: Three consecutive login days

Which users logged in on three or more consecutive days during the past week?
pv_detail: uid, login_time;
101, 2021/1/1
101, 2021/1/2
101, 2021/1/3
102, 2021/1/3
103, 2021/1/3
103, 2021/1/4
101, 2021/1/5
102, 2021/1/6

Step 1 (order each uid's rows and generate a row number)

select uid,login_time,row_number() over (partition by uid order by login_time) as sort from pv_detail ;

101, 2021/1/1, 1
101, 2021/1/2, 2
101, 2021/1/3, 3
101, 2021/1/5, 4
102, 2021/1/3, 1
102, 2021/1/6, 2

Step 2 (subtract the row number from the date)

select uid,login_time,sort,date_sub(login_time,sort) as login_group from (
select uid,login_time,row_number() over (partition by uid order by login_time) as sort from pv_detail ) t;

101, 2021/1/1, 1, 2020/12/31
101, 2021/1/2, 2, 2020/12/31
101, 2021/1/3, 3, 2020/12/31
101, 2021/1/5, 4, 2021/1/1
102, 2021/1/3, 1, 2021/1/2
102, 2021/1/6, 2, 2021/1/4

Aggregate

select uid,min(login_time),max(login_time),date_sub(login_time,sort) as login_group,count(1) as continue_days
from (
select uid,login_time,row_number() over (partition by uid order by login_time) as sort from pv_detail ) t
group by uid ,date_sub(login_time,sort) ;

101, 2021/1/1, 2021/1/3, 2020/12/31, 3
101, 2021/1/5, 2021/1/5, 2021/1/1, 1

Three days or more

select distinct uid from (
select uid,min(login_time),max(login_time),date_sub(login_time,sort) as login_group,count(1) as continue_days
from (
select uid,login_time,row_number() over (partition by uid order by login_time) as sort from pv_detail ) t
group by uid ,date_sub(login_time,sort) ) g where continue_days >=3;
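
The date-minus-row-number trick can be checked outside Hive with a small Python sketch over the sample data above. The function name `users_with_streak` is an illustrative assumption, and the sketch drops duplicate same-day logins first, which the SQL version would also need if a user can log in more than once per day.

```python
from collections import defaultdict
from datetime import date, timedelta

logins = [
    (101, date(2021, 1, 1)), (101, date(2021, 1, 2)), (101, date(2021, 1, 3)),
    (102, date(2021, 1, 3)), (103, date(2021, 1, 3)), (103, date(2021, 1, 4)),
    (101, date(2021, 1, 5)), (102, date(2021, 1, 6)),
]

def users_with_streak(rows, min_days=3):
    """Return uids with at least min_days consecutive login dates."""
    days_by_uid = defaultdict(set)  # a set drops duplicate same-day logins
    for uid, day in rows:
        days_by_uid[uid].add(day)

    result = set()
    for uid, days in days_by_uid.items():
        streaks = defaultdict(int)
        # enumerate(..., start=1) plays the role of row_number().
        for rank, day in enumerate(sorted(days), start=1):
            # date minus rank is constant within a run of consecutive days.
            streaks[day - timedelta(days=rank)] += 1
        if max(streaks.values()) >= min_days:
            result.add(uid)
    return sorted(result)

streak_users = users_with_streak(logins)
print(streak_users)
```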

5: Student IDs and scores: select the IDs in the top 30% by score

score_info (id,score);

select id,score from (
select id,score, ntile(10) over (order by score desc) as level from score_info ) as a
where a.level <=3;
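
To make the ntile(10) step concrete, here is a minimal Python sketch of how rows are dealt into buckets. The helper `ntile` and the score data are hypothetical; it follows the usual SQL NTILE rule that, when the rows do not divide evenly, the earlier buckets each get one extra row.

```python
def ntile(sorted_rows, buckets):
    """Assign 1-based bucket numbers to already-sorted rows."""
    base, extra = divmod(len(sorted_rows), buckets)
    labels = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets each take one additional row.
        size = base + (1 if bucket <= extra else 0)
        labels.extend([bucket] * size)
    return list(zip(sorted_rows, labels))

# Hypothetical sample data: (id, score)
scores = list(enumerate([55, 91, 78, 62, 88, 70, 99, 45, 83, 67], start=1))

# order by score desc, then keep buckets 1 to 3 out of 10 (the top 30%).
ranked = sorted(scores, key=lambda r: r[1], reverse=True)
top_30_percent = [row for row, level in ntile(ranked, 10) if level <= 3]
print(top_30_percent)
```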

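6: Merge 5 sorted large files into a single sorted file.

Borrow the merge step from merge sort (k-way merge).
For each sorted large file, read its first element into memory and arrange these elements into an ordered list;
take the smallest element in the list and append it to the output file.
Then read the next element from the file that the smallest element came from and put it at the correct position in the list.
Repeat until every file has been read completely.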
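
The procedure above can be sketched in Python with a heap in place of the ordered list. In-memory lists stand in for the five files here, and `merge_sorted` is an illustrative name; with real files each source would be an iterator over lines, so only one element per file is held in memory.

```python
import heapq

_END = object()  # sentinel marking an exhausted source

def merge_sorted(*sources):
    """Lazily merge already-sorted iterables into one sorted stream."""
    iterators = [iter(src) for src in sources]
    heap = []
    # Seed the heap with the first element of every source.
    for idx, it in enumerate(iterators):
        first = next(it, _END)
        if first is not _END:
            heap.append((first, idx))
    heapq.heapify(heap)

    while heap:
        value, idx = heapq.heappop(heap)
        yield value
        # Refill from the source that produced the smallest element.
        nxt = next(iterators[idx], _END)
        if nxt is not _END:
            heapq.heappush(heap, (nxt, idx))

merged = list(merge_sorted([1, 4, 9], [2, 3, 10], [], [5], [0, 6]))
print(merged)
```

The standard library's `heapq.merge` implements the same idea and is usually the better choice in practice.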

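7: Merging two selects into one result table

[grouping sets(), with cube, with rollup]
1: Get the user distribution by sex, by city, and by level at the same time
grouping sets() aggregates by several different dimension combinations within one group by query; it is equivalent to a union all of the group by results for each combination. The combinations are listed inside the parentheses.
select sex, city, level, count(distinct user_id) from user_info group by sex,city,level grouping sets (sex,city,level);
2: Get the distribution by sex together with the city distribution within each sex
grouping__id (two underscores): indicates which grouping set a result row belongs to
select sex, city, count(distinct user_id), grouping__id from user_info group by sex,city grouping sets(sex,(sex,city));

Question: user distribution for every combination of sex, city, and level
with cube aggregates over all combinations of the group by dimensions
select sex, city, level, count(distinct user_id) from user_info group by sex,city,level with cube;
Question: compute the payment amount for each month together with the total payment amount for each year
with rollup aggregates hierarchically, starting from the leftmost dimension; its result is a subset of cube
select year(dt) as year, month(dt) as month, sum(pay_amount) as pay_total from user_trade where dt > "0" group by year(dt),month(dt) with rollup;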
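
The statement that grouping sets equals a union all of separate group by results can be illustrated with a short Python sketch. The `user_info` rows and the helper `grouping_sets` are hypothetical; columns that are not part of a grouping set come back as None, which is how they appear as NULL in Hive's output.

```python
from collections import defaultdict

# Hypothetical user_info rows: (user_id, sex, city)
users = [
    (1, "F", "Beijing"), (2, "F", "Tianjin"),
    (3, "M", "Beijing"), (4, "M", "Beijing"),
]
COLUMNS = ("sex", "city")

def grouping_sets(rows, sets):
    """Count distinct user_id per grouping set; unused columns become None."""
    result = []
    for dims in sets:
        seen = defaultdict(set)
        for user_id, sex, city in rows:
            record = {"sex": sex, "city": city}
            key = tuple(record[c] if c in dims else None for c in COLUMNS)
            seen[key].add(user_id)
        # Appending one group-by result per set mirrors the union all.
        result.extend((*key, len(ids)) for key, ids in seen.items())
    return result

# Same shape as: group by sex,city grouping sets(sex,(sex,city))
dist = grouping_sets(users, [("sex",), ("sex", "city")])
for row in dist:
    print(row)
```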

8: row_number over a nested subquery

Original log format:
uid url datetime
Find the top 100 URLs by number of visitors for each day.

select dt,url,cnt,top
from(
select t.*
,row_number() over(partition by dt order by cnt desc) as top
from
(
select url
,to_date(datetime) as dt
,count(distinct uid) as cnt
from log_info
group by url,to_date(datetime)
) t
)final
where final.top <=100;
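
The same two-level logic (aggregate per day and URL, then rank within each day) looks like this in a small Python sketch. The log rows and the function name `top_urls_per_day` are hypothetical, and ties are broken by URL here so that the result is deterministic.

```python
from collections import defaultdict

# Hypothetical log rows: (uid, url, dt)
logs = [
    ("u1", "/a", "2021-04-22"), ("u2", "/a", "2021-04-22"),
    ("u1", "/a", "2021-04-22"), ("u1", "/b", "2021-04-22"),
    ("u3", "/c", "2021-04-23"),
]

def top_urls_per_day(rows, n):
    """Return {dt: [(url, visitors), ...]} with the n most-visited URLs."""
    visitors = defaultdict(set)  # (dt, url) -> distinct uids
    for uid, url, dt in rows:
        visitors[(dt, url)].add(uid)

    per_day = defaultdict(list)
    for (dt, url), uids in visitors.items():
        per_day[dt].append((url, len(uids)))

    # Sorting inside each day and slicing plays the role of row_number() <= n.
    return {dt: sorted(urls, key=lambda r: (-r[1], r[0]))[:n]
            for dt, urls in per_day.items()}

top = top_urls_per_day(logs, 1)
print(top)
```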

9: Next-day retention rate of yesterday's logged-in users

select count(b.uid)/count(a.uid) from (select distinct datetime,uid from table_name where datetime ='2021-04-22') a left join (select distinct datetime,uid from table_name where datetime ='2021-04-23') b on a.uid=b.uid;
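
The query counts how many of day one's distinct users appear again on day two. A minimal Python sketch of the same ratio follows, with hypothetical login rows and an illustrative function name `retention`.

```python
# Hypothetical login rows: (dt, uid)
logins = [
    ("2021-04-22", 1), ("2021-04-22", 2), ("2021-04-22", 2),
    ("2021-04-22", 3), ("2021-04-22", 4),
    ("2021-04-23", 2), ("2021-04-23", 4), ("2021-04-23", 9),
]

def retention(rows, day, next_day):
    """Share of users active on `day` who are also active on `next_day`."""
    first = {uid for dt, uid in rows if dt == day}        # like subquery a
    second = {uid for dt, uid in rows if dt == next_day}  # like subquery b
    if not first:
        return 0.0
    return len(first & second) / len(first)

rate = retention(logins, "2021-04-22", "2021-04-23")
print(rate)
```

User 9 logs in only on the second day and does not affect the rate, which matches the left join starting from day one's users.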

10: Given a table with two fields for the x and y axes, add two columns that mark the peaks and troughs

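11: Linux commands

tail top ps du awk sort
du shows the disk space used by the specified directories or files
List processes: ps -ef | grep
Check what is using a port: lsof -i:8000
Follow a log file as it grows: tail -f
Write output to a log file: nohup java -jar x.jar > 1.log &
find searches for files
find -name
find -path
Check disk space: df -h
Check memory usage: free -m
sort sorts the contents of text files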

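10: Hive commands

from_unixtime(): converts a Unix timestamp in seconds to a formatted date-time string
unix_timestamp(): converts a date-time string to a Unix timestamp in seconds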
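
For readers more familiar with Python, the two functions behave roughly like the helpers below. This is only an analogy under stated assumptions: the sketch fixes the time zone to UTC and uses Python strftime patterns, whereas Hive works in the session time zone and takes Java-style patterns such as yyyy-MM-dd HH:mm:ss.

```python
from datetime import datetime, timezone

def from_unixtime(seconds, fmt="%Y-%m-%d %H:%M:%S"):
    """Format epoch seconds as a date-time string (UTC assumed)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)

def unix_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a date-time string (UTC assumed) into epoch seconds."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

print(from_unixtime(0))
print(unix_timestamp("1970-01-02 00:00:00"))
```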

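11: Hive data quality

1: Data quality is assessed on four dimensions: completeness, accuracy, consistency, and timeliness
2: Assurance system:
a: Completeness and accuracy: spot checks and field-content coverage rates
b: Consistency: metadata lineage analysis combined with data diffs
c: Timeliness: monitoring of risk points, offline DQC checks, and rule-based validation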

12: SQL

Table b (id, course, score, name):
1 Chinese 78 Zhang San
2 Math 85 Zhang San
3 Chinese 90 Li Si
4 Math 85 Li Si
6 English 90 Wang Wu

Names of students whose scores are all above 60
select name from b GROUP BY name HAVING min(score) > 60;

Number of students scoring above 60 in each course
select count(name),a.course from (select * from b where score > 60) as a group by a.course;

Rows that exist in table a but not in table b
SELECT a.key,a.value
FROM a
WHERE a.key not in (SELECT b.key FROM b);

select a.key,a.value
from a
left join b
on a.key=b.key
where b.key is null;
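
The anti-join (rows of a without a match in b) is easy to picture as a set lookup. The sketch below uses hypothetical key/value rows and is only meant to show what both queries return.

```python
# Hypothetical rows: (key, value)
a = [("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
b = [("k2", "x"), ("k4", "y")]

b_keys = {key for key, _ in b}
# Keep the rows of a whose key has no match in b.
only_in_a = [(key, value) for key, value in a if key not in b_keys]
print(only_in_a)
```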

12: Metadata management

Tools: Atlas
DataWorks

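13: Data modeling

Star schema (one fact table and N dimension tables; dimension tables join only to the fact table)
Snowflake schema (one fact table and N dimension tables; some dimension tables also join to other dimension tables)
Constellation schema: N star schemas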

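14: order by, sort by, distribute by, cluster by, and group by in Hive

order by sorts globally.
sort by does not sort globally; it sorts the data before it enters each reducer. So if you sort with sort by and set mapreduce.job.reduces>1, sort by only guarantees that each reducer's output is ordered, not that the overall result is.
distribute by is similar to the partition step in MapReduce: it partitions the data and is used together with sort by. distribute by controls how the map side splits data among the reducers. Hive distributes rows according to the columns after distribute by and the number of reducers, using a hash algorithm by default.
cluster by has the function of distribute by and also sorts by the same column. When the distribute by and sort by columns are the same, cluster by can be used instead,
i.e. cluster by col <==> distribute by col sort by col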
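
The difference between per-reducer order and global order can be simulated in a few lines of Python. The reducer count, the sample rows, and the helper `distribute_and_sort` are assumptions for illustration, and `zlib.crc32` merely stands in for Hive's hash function.

```python
import zlib

NUM_REDUCERS = 2  # assumed reducer count

def distribute_and_sort(rows, key):
    """Simulate `distribute by key sort by key` over NUM_REDUCERS reducers."""
    reducers = [[] for _ in range(NUM_REDUCERS)]
    for row in rows:
        # distribute by: rows with the same key hash to the same reducer.
        idx = zlib.crc32(str(key(row)).encode()) % NUM_REDUCERS
        reducers[idx].append(row)
    # sort by: each reducer sorts only its own rows.
    return [sorted(part, key=key) for part in reducers]

rows = [("c", 3), ("a", 1), ("b", 2), ("a", 4), ("d", 5), ("c", 6)]
parts = distribute_and_sort(rows, key=lambda r: r[0])
flat = [row for part in parts for row in part]

for i, part in enumerate(parts):
    print("reducer", i, part)
```

Every reducer's output is ordered and equal keys stay together, but concatenating the reducer outputs is not guaranteed to be globally sorted, which is the gap that order by closes.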

13: Spark

https://blog.csdn.net/qq_32595075/article/details/79918644

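6: Audience-pack tag table

users (uid bigint, tags array): e.g. 123, [1,2,3,.....], about 1,000 tags
tags (tag_id, tag_name, tag_type_id, tag_type_name):
1, Beijing, 101, region
2, 18, 201, age
3, technology, 301, interest
Audience pack: about 100 million users, selected by region, age, and interest (e.g. Beijing, Tianjin; 18, 20; anime)
Output: tag_type_id, tag_type_name, num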

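7: Produce a report showing the sales amount of each district

Orders table: city district, region, category, amount

8: Find the cities that, over the last 7 days, have more than 500 high-end stores

Trades table: trade_info (iterm_id,shop_id,sales,price,dt); stores table: shop_info (shop_id,provice,city)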

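9: Timing and ordering issues in data processing

10: How does Kafka ensure that records with the same id end up together?

11: Common Hive optimizations

12: What kinds of data modeling are there?

13: Normal forms

Interview preparation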

Created by cphmvp, email: cphmvp@163.com, crawler technology exchange QQ group: 167047843


Original URL: https://www.cnblogs.com/cphmvp/p/14674886.html
