Hive Optimization

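1. Merging small files in Hive
The parameters below are found in hive-default under Hive's conf directory.
Merging output
Merge small output files. If a job writes too many small files, each small file occupies its own block, and the NameNode has to keep metadata for every file and block. Too many blocks therefore bloat the NameNode's tables; as the cluster grows and more jobs run, the block table becomes very large and seriously degrades NameNode performance.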

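-- merge small files at the end of a map-only job
set hive.merge.mapfiles=true;
-- merge small files at the end of a map-reduce job
set hive.merge.mapredfiles=true;
-- target size of the merged files, in bytes (the value must be a plain number, not an expression)
set hive.merge.size.per.task=256000000;
-- when the average output file size is below this value (in bytes), launch a separate map-reduce job to merge the files
set hive.merge.smallfiles.avgsize=16000000;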

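What we need to do is set hive.merge.smallfiles.avgsize. The suggested value here is 5000000 (about 5 MB): when the average size of the output files is below this value, Hive launches a separate map-reduce job to merge them.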
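
The threshold rule above can be sketched in a few lines of Python. This is only a simplified model of the decision, not Hive's implementation: the function names needs_merge and estimated_merged_files are hypothetical, and the merged-file count is a rough estimate that assumes files are packed to about hive.merge.size.per.task bytes each.

```python
import math


def needs_merge(file_sizes, avg_size_threshold=16_000_000):
    """True when the average output file size (bytes) is below the threshold.

    avg_size_threshold plays the role of hive.merge.smallfiles.avgsize.
    """
    if not file_sizes:
        return False
    return sum(file_sizes) / len(file_sizes) < avg_size_threshold


def estimated_merged_files(file_sizes, size_per_task=256_000_000):
    """Rough number of files after merging, at about size_per_task bytes each.

    size_per_task plays the role of hive.merge.size.per.task.
    """
    return max(1, math.ceil(sum(file_sizes) / size_per_task))


# 1,000 output files of 1 MB each: the average is below both thresholds.
small_outputs = [1_000_000] * 1000
print(needs_merge(small_outputs))             # True
print(needs_merge(small_outputs, 5_000_000))  # True
print(estimated_merged_files(small_outputs))  # 4

# 10 files of 8 MB each: merged with the 16 MB threshold, left alone with 5 MB.
medium_outputs = [8_000_000] * 10
print(needs_merge(medium_outputs))             # True
print(needs_merge(medium_outputs, 5_000_000))  # False
```

Lowering the threshold to 5000000 therefore means the extra merge job runs only when the outputs are very small on average.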

1.2 Increase the number of reducers to speed up Hive

set mapred.reduce.tasks=10;
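
For context, when mapred.reduce.tasks is not set explicitly, Hive estimates the reducer count from the input size. The sketch below models that estimate under stated assumptions: the parameters stand for hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max, the default values shown are those of recent Hive releases as far as I know (older releases used different ones), and the function name estimated_reducers is hypothetical.

```python
import math


def estimated_reducers(input_bytes,
                       bytes_per_reducer=256_000_000,
                       max_reducers=1009):
    """Estimate the reducer count as ceil(input / bytes_per_reducer), capped.

    bytes_per_reducer models hive.exec.reducers.bytes.per.reducer and
    max_reducers models hive.exec.reducers.max (assumed defaults).
    """
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


# 10 GiB of input under the assumed defaults.
print(estimated_reducers(10 * 1024**3))  # 42
```

Setting mapred.reduce.tasks to a fixed number, as above, bypasses this estimate.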

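2. Map join

Consider the following Hive SQL:
select t.a,f.b from A t join B f on ( f.a=t.a and f.ftime=20110802)

In this statement, table B has 3 billion rows and table A has only 100. The data in B is also heavily skewed: a single key accounts for 1.5 billion rows. The query runs very slowly, and the reduce phase fails with out-of-memory errors.

To solve this problem, consider a map join. It works as follows:

MAPJOIN reads the whole small table into memory and, in the map phase, matches each row of the other table against the in-memory table. Because the join is done in the map phase, the reduce phase is skipped, which also makes the query much faster.

As a result, no single reducer receives too much data because of the skew and fails. The original SQL can request a map join through a hint; the hint names the alias of the small table (t here):

select /*+ mapjoin(t)*/ f.a,f.b from A t join B f on ( f.a=t.a and f.ftime=20110802)
Rerunning the query this way is much faster than the original form.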
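
The idea behind a map join is a hash join in which the small table is held in memory. The following Python sketch illustrates that idea only; it is not Hive code, the function name map_join and the toy rows are made up for the example, and the column names a, b and ftime simply mirror the query above.

```python
def map_join(small_rows, big_rows, key):
    """Join by loading small_rows into a dict and streaming big_rows past it."""
    # Build phase: the whole small table goes into memory, keyed by the join key.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: each big-table row is matched on the spot, with no shuffle
    # and no reducer, so a skewed key never piles up on a single reducer.
    for big in big_rows:
        for small in lookup.get(big[key], []):
            yield small, big


A = [{"a": 1}, {"a": 2}]                       # small table
B = [                                          # big table (tiny here)
    {"a": 1, "b": "x", "ftime": 20110802},
    {"a": 1, "b": "y", "ftime": 20110802},
    {"a": 3, "b": "z", "ftime": 20110802},
    {"a": 2, "b": "w", "ftime": 20110801},
]

# The ftime condition is applied while streaming B.
b_filtered = (f for f in B if f["ftime"] == 20110802)
result = [(t["a"], f["b"]) for t, f in map_join(A, b_filtered, "a")]
print(result)  # [(1, 'x'), (1, 'y')]
```

The cost is memory: the approach only works while the small table fits in each mapper's memory.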

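3. Hive indexes
Indexing is a standard database technique, and Hive has supported it since version 0.7. Hive offers only limited indexing: unlike a traditional relational database it has no concept of keys, but users can create indexes on certain columns to speed up certain operations, and the index data for a table is stored in a separate table. Hive's indexing is still relatively immature and offers few options. It is, however, designed to be customizable through pluggable Java code, so users can extend it to meet their own needs. Note that index support was removed in Hive 3.0, so this section applies only to earlier versions.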

3.1
hive> create table user( id int, name string)
     ROW FORMAT DELIMITED
     FIELDS TERMINATED BY ' '
     STORED AS TEXTFILE;

[hadoop@hadoop110 ~]$ cat h1.txt
101     zs
102     ls
103     ww
901     zl
902     zz
903     ha

hive> load data local inpath '/home/hadoop/h1.txt'
      overwrite into table user;

Create the index
hive> create index user_index on table user(id)
     as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
     with deferred rebuild
     IN TABLE user_index_table;

hive> alter index user_index on user rebuild;

hive> select * from user_index_table limit 5;
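
To give a feel for what the index table holds after the rebuild, here is a simplified Python model of a compact index: for each indexed value it records the file the value appears in and the byte offsets of the matching rows. This is an illustration under assumptions, not Hive's implementation; the function name build_compact_index is hypothetical, and the sample file is assumed to use a single space as the field separator.

```python
from collections import defaultdict


def build_compact_index(files):
    """Map (indexed value, file name) to the byte offsets of matching rows.

    files is a dict of file name -> text content, one row per line, with the
    indexed integer column first.
    """
    index = defaultdict(list)
    for file_name, content in files.items():
        offset = 0
        for line in content.splitlines(keepends=True):
            if line.strip():
                user_id = int(line.split()[0])
                index[(user_id, file_name)].append(offset)
            # Advance by the encoded length so offsets are in bytes.
            offset += len(line.encode("utf-8"))
    return dict(index)


files = {"h1.txt": "101 zs\n102 ls\n103 ww\n901 zl\n902 zz\n903 ha\n"}
index = build_compact_index(files)
print(index[(901, "h1.txt")])  # [21]
```

A query filtering on id can then consult the index to find which files and offsets to read instead of scanning the whole table.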

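4.
Hive is a tool that parses SQL strings and turns them into MapReduce jobs that run on Hadoop. Write Hive SQL with the characteristics of distributed computing in mind; Hive differs from a traditional relational database, so drop the habits formed while developing against one.

Basic principles:
Filter data as early as possible to reduce the data volume at every stage. For partitioned tables, add a partition filter, and select only the columns you actually need.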

select ... from A
join B
on A.key = B.key
where A.userid>10
     and B.userid<10
        and A.dt='20120417'
        and B.dt='20120417';

This should be rewritten as:
select ... from (select ... from A
                 where dt='20120417'
                   and userid>10
                ) a
join (select ... from B
      where dt='20120417'
        and userid<10
     ) b
on a.key = b.key;
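
The effect of filtering before the join can be shown with a small Python model. It is only a sketch: the helper names join_on_key and make_rows and the generated rows are made up for illustration, and the filters mirror the userid and dt conditions of the query above. Both orders return the same rows, but far fewer rows enter the join when the filters run first.

```python
def join_on_key(left, right):
    """Inner join two lists of dict rows on their "key" field."""
    by_key = {}
    for r in right:
        by_key.setdefault(r["key"], []).append(r)
    return [(l, r) for l in left for r in by_key.get(l["key"], [])]


def make_rows(userid_shift):
    """20 toy rows; even keys fall in partition 20120417, odd ones in 20120416."""
    return [
        {"key": i, "userid": i - userid_shift,
         "dt": "20120417" if i % 2 == 0 else "20120416"}
        for i in range(20)
    ]


A = make_rows(0)
B = make_rows(10)

# Join everything first, then filter (the shape of the original query).
late = [
    (a, b) for a, b in join_on_key(A, B)
    if a["userid"] > 10 and b["userid"] < 10
    and a["dt"] == "20120417" and b["dt"] == "20120417"
]

# Filter each side first, then join (the shape of the rewritten query).
a_small = [a for a in A if a["dt"] == "20120417" and a["userid"] > 10]
b_small = [b for b in B if b["dt"] == "20120417" and b["userid"] < 10]
early = join_on_key(a_small, b_small)

print(late == early)                                 # True
print(len(A) + len(B), len(a_small) + len(b_small))  # 40 14
```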

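5. Experience with computations on the history database (that is, tune how you use it to the purpose at hand)
   Computing on and using the history database: partitioning

3. Keep operations as atomic as possible, and avoid packing complex logic into a single SQL statement.

Intermediate tables can be used to carry out complex logic.

4. Join operations: take care to put the small table on the left side of the join (currently, much of the TCL code puts the small table on the right).

Otherwise the join consumes large amounts of disk and memory, because Hive buffers the rows of the earlier tables and streams the last one.

5. If a union all has more than two branches, or each branch handles a large amount of data, split it into several insert into statements. In actual tests, execution time improved by 50%.
insert overwrite table tablename partition (dt= ....)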
select ..... from (
                   select ... from A
                   union all
                   select ... from B
                   union all
                   select ... from C
                               ) R
where ...;

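This can be rewritten as follows (insert into appends to the partition, so clear the partition first if the overwrite semantics of the original statement must be kept):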
insert into table tablename partition (dt= ....)
select .... from A
WHERE ...;

insert into table tablename partition (dt= ....)
select .... from B
WHERE ...;

insert into table tablename partition (dt= ....)
select .... from C
WHERE ...;

hive join

hive> create table a1(id int,name string)
    row format delimited
    fields terminated by ' '
    stored as textfile;

hive> create table a2(id int,city string)
    row format delimited
    fields terminated by ' '
    stored as textfile;

hive> create table a3(city string,level int)
    row format delimited
    fields terminated by ' '
    stored as textfile;

[hadoop@h91 ~]$ cat a1.txt
101     zs
102     ls
103     ww

[hadoop@h91 ~]$ cat a2.txt
101     bj
102     sh
109     sh

[hadoop@h91 ~]$ cat a3.txt
bj      99999
sh      11111
gz      22222

hive> load data local inpath '/home/hadoop/a1.txt' into table a1;

hive> load data local inpath '/home/hadoop/a2.txt' into table a2;

hive> load data local inpath '/home/hadoop/a3.txt' into table a3;

-----------------------------------------------------------------------
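1. Equi-join
hive> select a1.name,a2.city from a1 join a2 on(a1.id=a2.id);

With multiple conditions (id2 is an illustrative column; the sample tables above do not define it)
hive> select a1.name,a2.city from a1 join a2 on(a1.id=a2.id and a1.id2=a2.id2);

2. Multi-table join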
hive> select a1.name,a2.city,a3.level from a1 join a2 on(a1.id=a2.id) join a3 on(a3.city=a2.city);

3. Outer joins

Left outer join
hive> select a1.name,a2.city from a1 left outer join a2 on(a1.id=a2.id);

Right outer join
hive> select a1.name,a2.city from a1 right outer join a2 on(a1.id=a2.id);

Full outer join
hive> select a1.name,a2.city from a1 full outer join a2 on(a1.id=a2.id);
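
To make the difference between the join types concrete, the Python sketch below applies them to the a1 and a2 sample rows loaded earlier. It is a conceptual model, not Hive code: the function name join_tables is hypothetical, None stands in for SQL NULL, and the row order shown is that of this sketch (Hive does not guarantee any particular order).

```python
def join_tables(left, right, how="inner"):
    """Join (id, name) rows with (id, city) rows on id; how is inner/left/right/full."""
    right_by_id = {}
    for rid, city in right:
        right_by_id.setdefault(rid, []).append(city)

    rows = []
    matched_ids = set()
    for lid, name in left:
        if lid in right_by_id:
            matched_ids.add(lid)
            rows.extend((name, city) for city in right_by_id[lid])
        elif how in ("left", "full"):
            rows.append((name, None))   # keep unmatched left rows

    if how in ("right", "full"):
        for rid, city in right:
            if rid not in matched_ids:
                rows.append((None, city))  # keep unmatched right rows
    return rows


a1 = [(101, "zs"), (102, "ls"), (103, "ww")]
a2 = [(101, "bj"), (102, "sh"), (109, "sh")]

print(join_tables(a1, a2, "inner"))  # [('zs', 'bj'), ('ls', 'sh')]
print(join_tables(a1, a2, "left"))   # [('zs', 'bj'), ('ls', 'sh'), ('ww', None)]
print(join_tables(a1, a2, "right"))  # [('zs', 'bj'), ('ls', 'sh'), (None, 'sh')]
print(join_tables(a1, a2, "full"))   # [('zs', 'bj'), ('ls', 'sh'), ('ww', None), (None, 'sh')]
```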

4. Filtering with joins
(join first, then filter)
hive> select a1.name,a2.city from a1 join a2 on(a1.id=a2.id) where a1.id>101 and a2.id<105;

Or (filter in the join condition, before joining)
hive> select a1.name,a2.city from a1 join a2 on(a1.id=a2.id and a1.id>101 and a2.id<105);

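5. Map join
Purpose
1. With a map join, the comparison is done in the map phase, which relieves pressure on the reducers.
2. The small table is placed first and loaded first, then matched against the large table; large-table rows whose key has no match in the small table are discarded directly.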

hive:
https://www.csdn.net/article/2015-01-13/2823530

Original article: https://www.cnblogs.com/jieran/p/9038338.html