4. Hive optimization

1) Reducer parameters printed when a SQL job runs:

In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
One more reducer is created for every additional <number> bytes of input. For example, with <number>=1024 (1 KB), input under 1 KB gets a single reducer, and roughly 10 KB of input gets about 10 reducers.

set hive.exec.reducers.bytes.per.reducer=20000;
select user_id,count(1) as order_cnt from orders group by user_id limit 10;
--result: number of mappers: 1; number of reducers: 1009

In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>

set hive.exec.reducers.max=10;
-- number of mappers: 1; number of reducers: 10

In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>

set mapreduce.job.reduces=5;
--number of mappers: 1; number of reducers: 5
set mapreduce.job.reduces=15;
--number of mappers: 1; number of reducers: 15

These settings take effect only in your current session (window) or for the duration of the script/job that sets them.

2) A WHERE condition can make GROUP BY redundant, e.g. when the WHERE clause already pins the grouping column to a single value.
Map and reduce run as a coupled, synchronous process (synchronous: a phone call, you wait for the answer; asynchronous: a text message, you do not). A job's reduce can only complete after its map finishes, and when jobs 1 and 2 both feed job 3 (1+2 => 3), job 3's reduce has to wait for both. Progress output such as "map 60%, reduce 3%" only means the reducers have started pulling (shuffling) map output early, not that reduce logic runs before map is done.

3) Cases that always get a single reducer:

a. No GROUP BY:
set mapreduce.job.reduces=5;
select count(1) from orders where order_dow='0';
--number of mappers: 1; number of reducers: 1

b. ORDER BY (a global sort needs one reducer):
set mapreduce.job.reduces=5;
select user_id,order_dow from orders where order_dow='0' order by user_id limit 10;
-- number of mappers: 1; number of reducers: 1

c. Cartesian product (cross product). Say tmp_d holds the rows 1 2 3 4 5:
select * from tmp_d join (select * from tmp_d)t where tmp_d.user_id=t.user_id;
--the WHERE acts like an ON clause here; a join with no ON field association produces the cross product.
For three rows 1, 2, 3 the cross product is:
1 1
2 1
3 1
1 2
2 2
3 2
1 3
2 3
3 3
Cross products do appear in practice, e.g. recommendation recall (match): pairing users with products, picking from the whole catalog the small subset a user is likely to want (a candidate set of ~1000, later cut down to a top 10; say mother-and-baby users crossed against products). You need information from both the users side and the products side to do this coarse-grained filtering.

5) Map join:

select /*+ MAPJOIN(aisles) */ a.aisle as aisle,p.product_id as product_id
from aisles a join product p on a.aisle_id=p.aisle_id limit 10;

The small table is loaded into memory as a hash map {aisle_id: aisle}; each mapper streams the big table and looks up the key, roughly:

# sketch of what a map join does (file names are illustrative)
aisle_by_id = {}                      # small table -> in-memory dict {aisle_id: aisle}
for line in open('aisles.txt'):
    aisle_id, aisle = line.strip().split(' ')
    aisle_by_id[aisle_id] = aisle
for line in open('products.txt'):     # stream the big table
    ss = line.strip().split(' ')
    aisle_id, product_id = ss[0], ss[1]
    print('%s %s' % (aisle_by_id[aisle_id], product_id))

6) union all + distinct == union, but UNION ALL is cheaper:

--runtime: 74.712 seconds, 2 jobs
select count(*) c from (
select order_id,user_id,order_dow from orders where order_dow='0'
union all
select order_id,user_id,order_dow from orders where order_dow='0'
union all
select order_id,user_id,order_dow from orders where order_dow='1'
)t;

--runtime: 122.996 seconds, 3 jobs (the implicit distinct costs one more job)
select * from(
select order_id,user_id,order_dow from orders where order_dow='0'
union
select order_id,user_id,order_dow from orders where order_dow='0'
union
select order_id,user_id,order_dow from orders where order_dow='1')t;

7) set hive.groupby.skewindata=true;
Splits one map-reduce job into two. Typical skew: a junk key such as '-' (or '', -1, 0, null) in a 100-million-row table; all 60 million '-' rows land on one reducer while the other 29 reducers share the remaining 40 million. With the flag set:
1. rows are first distributed randomly across the reducers, each of which does a partial aggregation (count);
2. a final reduce aggregates those partial results (e.g. summing the ~2 million partial rows down to one row per key).

select add_to_cart_order,count(1) as cnt from order_products_prior group by add_to_cart_order limit 10;
select user_id,count(1) as cnt from order_products_prior group by user_id limit 10;
--without set hive.groupby.skewindata=true;
--Launching Job 1 out of 1
-- 1m 41s
--with set hive.groupby.skewindata=true;
--Launching Job 1 out of 2
-- 2m 50s

If the skew is not bad enough to keep reducers failing, leave this variable off; as the timings show, the extra job adds overhead. Set it only when one reducer's share of the data is so large that the job keeps breaking or runs far too long. Real case: a pre-dawn scheduled job for a one-week report ran 3 hours, so a cleaned base table due at 3 a.m. landed at 7 a.m., with about 70 downstream tasks waiting on it.

8) Number of MR jobs: joins that share the same join key are merged into one MR job; joins on different keys need separate jobs.

--Launching Job 1 out of 1 (both joins use ord.order_id)
select ord.order_id order_id, tra.product_id product_id, pri.reordered reordered
from orders ord
join train tra on ord.order_id=tra.order_id
join order_products_prior pri on ord.order_id=pri.order_id
limit 10;

--two MR jobs (one join on order_id, the other on product_id)
select ord.order_id, tra.product_id, pro.aisle_id
from orders ord
join trains tra on ord.order_id=tra.order_id
join products pro on tra.product_id=pro.product_id
limit 10;

9) /*+ STREAMTABLE(a) */ where a is the big table. Like the MAPJOIN hint it goes inside the SELECT; the difference is that it names the big table to stream rather than the small table to cache:

select /*+STREAMTABLE(pr)*/ ord.order_id,pr.product_id,pro.aisle_id
from orders ord
join order_products_prior pr on ord.order_id=pr.order_id
join products pro on pr.product_id=pro.product_id
limit 10;

10) LEFT OUTER JOIN:

select od.user_id,
       od.order_id,
       tr.product_id
from (select user_id,order_id,order_dow from orders limit 100)od
left outer join (select order_id,product_id,reordered from train)tr
on (od.order_id=tr.order_id and od.order_dow='0' and tr.reordered=1)
limit 30;
--join defaults to inner join
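A subtlety in the query above: with an outer join, a condition on the preserved (left) table written inside ON does not filter that table's rows; orders failing od.order_dow='0' are still returned, just with NULL train columns. If the intent is to keep only order_dow='0' orders, the filter belongs in WHERE. A sketch of that variant, using the same tables as above (tr.reordered=1 stays in ON, since moving a condition on the NULL-supplying side into WHERE would turn the join into an inner join):

select od.user_id, od.order_id, tr.product_id
from (select user_id,order_id,order_dow from orders limit 100)od
left outer join (select order_id,product_id,reordered from train)tr
on (od.order_id=tr.order_id and tr.reordered=1)
where od.order_dow='0'
limit 30;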
11) set hive.exec.parallel=true
By default the MR jobs generated for a single query run one after another even when they are independent. With this flag, independent jobs run concurrently: if jobs 1 and 2 both feed job 3 (1+2 => 3), jobs 1 and 2 can execute at the same time, while job 3's reduce still waits for both (the degree of parallelism is controlled by hive.exec.parallel.thread.number).

12) Dealing with skewed keys:

1. A junk key such as '-': if it carries no meaning, drop the data outright with a WHERE condition:
select age,count(1) from table_name where age<>'-' group by age;
2. Or spread a hot key across reducers by prefixing it with a random number ('-' becomes 1_-, 2_-, 3_-, ...), aggregate, then strip the prefix and aggregate again.

How do you locate which keys are skewed? Sample the data:

SELECT COUNT(1) FROM (SELECT * FROM lxw1 TABLESAMPLE (200 ROWS)) x;
SELECT * FROM udata TABLESAMPLE (50 PERCENT);
select * from table_name where col=xxx order by rand() limit num;
SELECT * FROM lxw1 TABLESAMPLE (30M);

Skew is a long-tail-data problem: a few head keys account for most of the rows.
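To put the sampling to work, a minimal sketch that reuses order_products_prior and user_id from the examples above (the 200000-row sample size is an arbitrary assumption): count key frequencies on a sample and list the heaviest keys, which are the skew candidates.

select user_id, count(1) as cnt
from (select user_id from order_products_prior tablesample (200000 rows)) s
group by user_id
order by cnt desc
limit 20;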