Hive Study Notes 07: Common Optimizations

Daily basics:

1. Rows to columns:

case ... when ... then ... else ... end as xxx
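
As a minimal sketch of the pattern above, assuming a hypothetical table `scores(student_id, subject, score)` that is not part of the original notes: each `case when` picks out one subject, and the aggregate collapses the rows into one line per student.

```sql
-- Hypothetical table: scores(student_id string, subject string, score int)
select
    student_id,
    -- one output column per subject value
    max(case when subject = 'math'    then score else 0 end) as math_score,
    max(case when subject = 'english' then score else 0 end) as english_score
from scores
group by student_id;
```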

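2. Delimiters in a table definition:

"fields terminated by": the delimiter between one field and the next.
"collection items terminated by": the delimiter between the sub-elements (items) inside a single field.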
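
A short sketch of where these two clauses go in a table definition. The table name, columns, and delimiter characters are illustrative assumptions, not taken from the notes.

```sql
-- Hypothetical table; names and delimiters are illustrative
create table if not exists user_tags (
    user_id string,
    tags    array<string>
)
row format delimited
    fields terminated by '\t'
    collection items terminated by ','
stored as textfile;
```

With this definition, a text line holding `u001`, a tab, and `sports,music` would load as user_id = u001 and tags = ["sports", "music"].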

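3. Common partitions in a data warehouse

Data warehouse partitions: time (day), data source (app, m, pc)

  -- Database: user attributes, age, gender, favorites, purchase records
  -- Every day new users are added and existing information is modified; dt=20180922 holds a large amount of redundant information
  -- overwrite 7: run an overwrite every day, e.g. dt=20180922,
  -- each partition holds the full data up to that day; with 7 partitions there are 7 redundant copies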
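
The daily overwrite described above could look roughly like this. All table and column names (`user_profile`, `user_profile_staging`, `source_type`) are hypothetical; only the `dt` partition value comes from the notes.

```sql
-- Hypothetical table partitioned by day and by data source
create table if not exists user_profile (
    user_id string,
    age     int,
    gender  string
)
partitioned by (dt string, source_type string);

-- Rewrite one day's partition with the full snapshot up to that day
insert overwrite table user_profile
partition (dt = '20180922', source_type = 'app')
select user_id, age, gender
from user_profile_staging
where source_type = 'app';
```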

4. Showing the column header when viewing data in Hive:

set hive.cli.print.header = true;

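5. Bucketing: clustered by (xxx) into 4 buckets;

To use bucketing, this parameter must be set beforehand:
set hive.enforce.bucketing = true
Alternatively, set mapred.reduce.tasks yourself so that the number of reducers matches the number of buckets.

What buckets are for:
1. Data sampling, for example sampling on a column: select * from student tablesample(bucket x out of y on user_id)
Hive determines the sampling ratio from the size of y.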
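
A minimal end-to-end sketch of bucketing followed by sampling. It assumes the `student` table from the query above has `user_id` and `name` columns; `student_bucketed` is a hypothetical name.

```sql
-- Needed on older Hive versions; to my knowledge Hive 2.0 and later
-- always enforce bucketing and no longer use this setting
set hive.enforce.bucketing = true;

create table if not exists student_bucketed (
    user_id int,
    name    string
)
clustered by (user_id) into 4 buckets;

-- Populate the buckets from the unbucketed table
insert overwrite table student_bucketed
select user_id, name from student;

-- Read bucket 1 of 4, i.e. roughly a quarter of the rows
select * from student_bucketed tablesample (bucket 1 out of 4 on user_id);
```

Here x = 1 selects which bucket to read and y = 4 sets the fraction, so a larger y gives a smaller sample.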

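6. Hive optimization

1. The number of map tasks for a job depends on the input directory (how many files it holds and how large they are); the block size is set with set dfs.block.size

-- When there are too many small files, merge them to reduce the number of map tasks

--- set mapred.map.tasks = 10

--- Map-side aggregation: set hive.map.aggr=true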
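
The notes do not say which settings merge small files. One common combination is sketched below; the parameters exist in Hive, but choosing them and the byte values shown are assumptions to be tuned per cluster.

```sql
-- Combine small input files into larger splits before the map phase
set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Upper bound for one combined split, in bytes (illustrative value)
set mapred.max.split.size = 256000000;

-- Merge small output files at the end of map-only jobs
set hive.merge.mapfiles = true;
-- Merge small output files at the end of map-reduce jobs
set hive.merge.mapredfiles = true;
-- Target size of each merged file, in bytes (illustrative value)
set hive.merge.size.per.task = 256000000;

-- Partial aggregation on the map side, as in the note above
set hive.map.aggr = true;
```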

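Reduce optimization:
--- hive.exec.reducers.bytes.per.reducer= ; the amount of data each reduce task handles, third priority
--- hive.exec.reducers.max= ; the maximum number of reducers, highest priority
--- Set the number of reducers: set mapred.reduce.tasks = 10, second priority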
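
A sketch of how the three settings fit together. When mapred.reduce.tasks is left at -1, Hive estimates the reducer count as roughly the input size divided by hive.exec.reducers.bytes.per.reducer, capped at hive.exec.reducers.max. The numeric values are illustrative assumptions.

```sql
-- Let Hive estimate the reducer count (-1 means "not fixed")
set mapred.reduce.tasks = -1;
-- Target number of input bytes per reducer (illustrative value)
set hive.exec.reducers.bytes.per.reducer = 256000000;
-- Upper bound on the estimated reducer count (illustrative value)
set hive.exec.reducers.max = 100;

-- Alternative: fix the reducer count by hand
-- set mapred.reduce.tasks = 10;
```

With these values, an input of about 2,560,000,000 bytes would be estimated at around 10 reducers.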

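Cases that run with a single reducer:
-- order by (use distribute by + sort by, or cluster by, instead)
-- Cartesian product: a join b (with no on clause, or an invalid on condition, the join becomes a Cartesian join and triggers a single reducer; always avoid Cartesian products)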
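
The alternatives above can be sketched as follows, assuming hypothetical tables `orders(user_id, amount)` and `users(user_id, name)`. Note that distribute by + sort by sorts rows within each reducer only, so it replaces order by only when a total order over the whole result is not required.

```sql
-- Global sort: every row passes through a single reducer
select user_id, amount from orders order by amount desc;

-- Rows with the same user_id go to the same reducer
-- and are sorted inside that reducer
select user_id, amount from orders
distribute by user_id
sort by user_id, amount desc;

-- cluster by x is shorthand for distribute by x sort by x (ascending only)
select user_id, amount from orders cluster by user_id;

-- An explicit, valid on condition avoids the Cartesian product
select o.user_id, u.name, o.amount
from orders o
join users u on o.user_id = u.user_id;
```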

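Hive optimization:
- Partition conditions in the where clause take effect early, so there is no need to wrap them in a subquery; do the join and group by directly

- For a map join, put the small table first
- /*+MAPJOIN(TABLElist)*/: the listed tables must be small, under 1 GB or 50 records

- union all/distinct

- Doing union all first and then join, group by, or similar operations can effectively reduce the number of MR stages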
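
Two short sketches of the points above. The tables (`dim_city`, `user_log`, `click_log`, `order_log`) and their columns are hypothetical. Depending on the Hive version and settings, the MAPJOIN hint may be ignored in favor of automatic conversion (hive.auto.convert.join).

```sql
-- Partition filter directly in where; small table listed first
-- and named in the map join hint
select /*+ MAPJOIN(d) */ d.city_name, count(*) as cnt
from dim_city d
join user_log l on d.city_id = l.city_id
where l.dt = '20180922'
group by d.city_name;

-- union all first, then a single group by over the combined rows
select user_id, count(*) as events
from (
    select user_id from click_log where dt = '20180922'
    union all
    select user_id from order_log where dt = '20180922'
) t
group by user_id;
```

The second query aggregates once over the combined rows instead of aggregating each source separately and merging the results afterwards.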

Original article: https://www.cnblogs.com/students/p/10952776.html
