I. Fetch Task Conversion
Fetch task conversion means that for certain queries Hive does not need to run a MapReduce computation at all. For example: SELECT * FROM employees; in this case Hive can simply read the files under the storage directory of employees and print the results to the console.
<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have
      any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
</property>
① Check the default fetch conversion mode
hive (default)> set hive.fetch.task.conversion;
hive.fetch.task.conversion=more
② A select * does not launch an MR job
hive (default)> select * from score;
OK
score.name  score.subject  score.score
孙悟空      语文           87
孙悟空      数学           95
...omitted...
婷婷        数学           85
婷婷        英语           78
③ Disable fetch conversion
hive (default)> set hive.fetch.task.conversion=none;
④ Query again; this time a job must be launched
hive (default)> select * from score;
Query ID = atguigu_20200425011511_d4d9f365-e96c-48b2-9bf6-7818f69e18da
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1587748417298_0001)

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 4.48 s
--------------------------------------------------------------------------------
OK
score.name  score.subject  score.score
孙悟空      语文           87
孙悟空      数学           95
...omitted...
婷婷        数学           85
婷婷        英语           78
Time taken: 6.177 seconds, Fetched: 12 row(s)
II. Local Mode
Most Hadoop jobs need the full scalability Hadoop provides in order to process large data sets. Sometimes, however, Hive's input volume is very small. In those cases, the time spent launching the execution task for a query can far exceed the job's actual run time. For most such cases, Hive can handle the whole task on a single machine via local mode, which noticeably shortens execution time for small data sets.
Local mode has two preconditions: the total input size must not exceed hive.exec.mode.local.auto.inputbytes.max, and the number of input files must not exceed hive.exec.mode.local.auto.input.files.max.
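As a sketch, the two thresholds can also be tuned explicitly. The values below are the common defaults (128 MB and 4 files) rather than anything mandated by these notes; verify them on your build with `set <name>;`:

```sql
-- Enable automatic local mode, then the two thresholds that gate it
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- max total input bytes (128 MB, default)
set hive.exec.mode.local.auto.input.files.max=4;         -- max number of input files (default)
```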
① Set hive.exec.mode.local.auto to true to let Hive start this optimization automatically when appropriate.
hive (default)> set hive.exec.mode.local.auto=true;
② Test
hive (default)> select count(*) from score;
Automatically selecting local only mode for query
Query ID = atguigu_20200425012518_35634c83-8b18-4703-b36d-2dfdea881305
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-04-25 01:25:22,746 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local2060501220_0001
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 426 HDFS Write: 3 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
12
Time taken: 3.974 seconds, Fetched: 1 row(s)
III. Table Optimization
1. Small Table vs. Big Table Join
① Empty-key filtering: when the join key has many NULL rows that are abnormal data and not needed in the result, filter them out before the join so they never reach the reducers:
insert overwrite table jointable select n.* from (select * from nullidtable where id is not null) n left join ori o on n.id = o.id;
② Empty-key transformation: sometimes a key with many NULL rows does not correspond to abnormal data and must be included in the join result. In that case we can assign a random value to the NULL keys of the left table, so rows are distributed evenly across the reducers, preventing data skew and task failures:
insert overwrite table jointable select n.* from nullidtable n full join ori o on case when n.id is null then concat('hive', rand()) else n.id end = o.id;
3.MapJoin
① Enable automatic MapJoin conversion (default true):
set hive.auto.convert.join = true;
② Threshold between big and small tables (below ~25 MB a table is considered small by default):
set hive.mapjoin.smalltable.filesize=25000000;
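With both settings in place, a join against a sufficiently small table should be planned as a map-side join with no reduce phase. A hedged sketch using the nullidtable and ori tables from the examples above:

```sql
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;

-- If ori is under the ~25 MB threshold, Hive broadcasts it to every mapper
-- (a hash table in memory) and the join completes entirely on the map side.
select n.id, o.id
from nullidtable n
join ori o on n.id = o.id;
```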
4.group by
① Perform partial aggregation on the map side (default true):
set hive.map.aggr = true
② Number of entries aggregated on the map side before the aggregation is split (default 100000):
set hive.groupby.mapaggr.checkinterval = 100000
③ Load-balance when there is data skew (default false). When enabled, the query plan generates two MR jobs: the first distributes the group-by keys randomly across reducers for partial aggregation; the second completes the final aggregation by the group-by key:
set hive.groupby.skewindata = true
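As a sketch of the settings applied together, an aggregation such as the one below (using the score table queried earlier) would be compiled into the two-job plan described above when skew handling is on:

```sql
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

-- Job 1: keys scattered randomly across reducers for partial counts;
-- Job 2: partial counts combined by the real group-by key.
select subject, count(*) as cnt
from score
group by subject;
```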
5. count(distinct) Deduplication
On large data sets, count(distinct) is completed by a single reduce task, and the volume of data that one reducer must handle makes the whole job hard to finish:
select count(distinct id) from bigtable;
Replace it with group by followed by count (this costs one extra job, which pays off on large data sets):
select count(id) from (select id from bigtable group by id) a;
6. Cartesian Product
Avoid Cartesian products, i.e. joins without an ON condition or with an invalid one; Hive can only use one reducer to complete such a join, and strict mode rejects it outright.
7. Row/Column Filtering
For column filtering, select only the columns you need. For row filtering, push the predicate into the subquery so rows are filtered before the join:
select b.id from bigtable b join (select id from ori where id <= 10 ) o on b.id = o.id;
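For contrast, a hedged sketch of the anti-pattern: the same join written without an ON condition degenerates into a Cartesian product and is rejected when hive.mapred.mode=strict:

```sql
-- Anti-pattern: no join condition, so every row of bigtable pairs with
-- every row of ori; one reducer does all the work. Strict mode forbids this.
select b.id from bigtable b join ori o;
```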
8. Dynamic Partition Tuning
When inserting into a partitioned table, Hive can place each row into the right partition based on the value of the partition field (dynamic partitioning). The relevant settings:
① Enable the dynamic partition feature (default true):
set hive.exec.dynamic.partition=true
② Switch to non-strict mode (the dynamic partition mode defaults to strict, which requires at least one partition to be specified statically; nonstrict allows all partition fields to be dynamic):
set hive.exec.dynamic.partition.mode=nonstrict
③ Maximum total number of dynamic partitions that can be created across all nodes executing the MR job:
set hive.exec.max.dynamic.partitions=1000
④ Maximum number of dynamic partitions that can be created on each node executing the MR job. This must be sized to the actual data. For example, if the source data covers a full year, the day field has 365 distinct values, so the parameter must be set above 365; with the default of 100 the job fails. It can simply be set equal to the total dynamic partition limit:
set hive.exec.max.dynamic.partitions.pernode=100
⑤ Maximum number of HDFS files that can be created by the whole MR job:
set hive.exec.max.created.files=100000
⑥ Whether to throw an exception when an empty partition is generated (partition field is NULL). Usually does not need to be changed:
set hive.error.on.empty.partition=false
When inserting with dynamic partitioning, only the partition field needs to be named, not its value:
insert overwrite table ori_partitioned_target partition (p_time) select id, time, uid, keyword, url_rank, click_num, click_url, p_time from ori_partitioned;
When inserting with static partitioning, the concrete partition value must be specified:
insert overwrite table student partition(month='201708') select id, name from student where month='201709';
9. Bucketing
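The notes leave this section empty. As a minimal sketch under stated assumptions (the table name score_bucketed and the bucket count are illustrative; the columns come from the score table used earlier), a bucketed table clusters rows by the hash of a column, which enables bucket map joins and efficient sampling:

```sql
-- Hypothetical bucketed copy of score: 4 buckets hashed on name
create table score_bucketed(
    name    string,
    subject string,
    score   int
)
clustered by (name) into 4 buckets
row format delimited fields terminated by '\t';

-- On older Hive versions you may also need: set hive.enforce.bucketing=true;
insert overwrite table score_bucketed select name, subject, score from score;

-- Sampling reads only bucket 1 of the 4
select * from score_bucketed tablesample(bucket 1 out of 4 on name);
```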
10. Partitioning
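This section is also empty in the notes. A minimal sketch, mirroring the student table from the static-partition example above (the table name student_p is illustrative): each partition value maps to its own HDFS sub-directory, so queries that filter on the partition column scan only the matching directories:

```sql
-- Hypothetical partitioned table: one sub-directory per month value
create table student_p(
    id   int,
    name string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

insert overwrite table student_p partition(month='201709')
select id, name from student where month='201709';

-- Partition pruning: only the month=201709 directory is read
select * from student_p where month = '201709';
```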
IV. MR Optimization
1. Set the Map Count Reasonably
2. Merge Small Files
Combine small files before execution with CombineHiveInputFormat to reduce the number of maps (this is the default input format):
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
3. Increase the Map Count for Complex Files
When input files are large, the task logic is complex, and maps run slowly, consider adding maps by lowering the maximum split size so that each map handles less data:
set mapreduce.input.fileinputformat.split.maxsize=100;
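For context, FileInputFormat computes the split size as max(minsize, min(maxsize, blocksize)), so pushing maxsize below the block size yields more splits and therefore more maps. The value 100 above is an extreme demo value in bytes; a sketch with a more realistic setting (52428800 is an assumed, illustrative value):

```sql
-- Before: split size = block size, so one ~128 MB file => 1 map
select count(*) from bigtable;

-- Cap splits at 50 MB (illustrative); the same file now yields ~3 splits/maps
set mapreduce.input.fileinputformat.split.maxsize=52428800;
select count(*) from bigtable;
```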
4. Set the Reduce Count Reasonably
Method 1: let Hive derive the count from two parameters:
① Amount of data processed by each reduce (default 256 MB), parameter 1:
set hive.exec.reducers.bytes.per.reducer=256000000
② Maximum number of reduces per job (default 1009), parameter 2:
set hive.exec.reducers.max=1009
Method 2: set the reduce count directly, which overrides the derivation above:
set mapreduce.job.reduces = 15;
V. Parallel Execution
Hive converts a query into one or more stages; stages that have no dependency on each other can be executed in parallel:
set hive.exec.parallel=true;              -- enable parallel execution of stages
set hive.exec.parallel.thread.number=16;  -- maximum parallelism for one SQL, default 8
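A sketch of a query shape that benefits: the two branches of a union all are independent stages, so with hive.exec.parallel=true they can run concurrently (score is the table from the earlier examples; the subject values follow its sample data):

```sql
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;

-- The two scans below do not depend on each other,
-- so their stages may execute in parallel.
select name, score from score where subject = '数学'
union all
select name, score from score where subject = '语文';
```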
VI. Strict Mode
<property>
    <name>hive.mapred.mode</name>
    <value>strict</value>
    <description>
      The mode in which the Hive operations are being performed. In strict mode, some risky
      queries are not allowed to run. They include: Cartesian Product. No partition being
      picked up for a query. Comparing bigints and strings. Comparing bigints and doubles.
      Orderby without limit.
    </description>
</property>
VII. JVM Reuse
JVM reuse lets one JVM instance run several map or reduce tasks in sequence instead of starting a fresh JVM per task; it matters most for jobs with many small files or many short-lived tasks. It is configured in mapred-site.xml:
<property>
    <name>mapreduce.job.jvm.numtasks</name>
    <value>10</value>
    <description>
      How many tasks to run per jvm. If set to -1, there is no limit.
    </description>
</property>
VIII. Speculative Execution
Hadoop can launch backup attempts of slow map/reduce tasks and take whichever copy finishes first; it is controlled in mapred-site.xml:
<property>
    <name>mapreduce.map.speculative</name>
    <value>true</value>
    <description>If true, then multiple instances of some map tasks may be executed in parallel.</description>
</property>
<property>
    <name>mapreduce.reduce.speculative</name>
    <value>true</value>
    <description>If true, then multiple instances of some reduce tasks may be executed in parallel.</description>
</property>
Hive itself also provides a setting to control reduce-side speculative execution:
<property>
    <name>hive.mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
    <description>
      Whether speculative execution for reducers should be turned on.
    </description>
</property>
IX. Execution Plan (Explain)
View the detailed execution plan of a query with explain extended:
explain extended select deptno, avg(sal) avg_sal from emp group by deptno;