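  • Hive Indexes

    1. Overview of Hive Indexes

    The purpose of a Hive index is to speed up queries on specific columns of a Hive table.

    Without an index, a query such as 'WHERE tab1.col1 = 10' makes Hive load the entire table or partition and process every row. If an index exists on col1, only a portion of the files needs to be loaded and processed.

    As in traditional databases, the faster queries come at a cost: building the index consumes extra resources, and storing it takes additional disk space.

    Indexing was added in Hive 0.7.0, and bitmap indexes were added in Hive 0.8.0.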
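
    To make this concrete, the following HiveQL sketch shows where the index data lives. It assumes a table tab1 with a column col1, as in the query above; the index name tab1_col1_idx is hypothetical, and the index table name in the last statement assumes the default database and Hive's usual naming pattern, so substitute whatever name SHOW FORMATTED INDEX reports.

```sql
-- Hypothetical names: tab1, col1, tab1_col1_idx.
CREATE INDEX tab1_col1_idx ON TABLE tab1 (col1) AS 'COMPACT' WITH DEFERRED REBUILD;

-- Populate the index; with DEFERRED REBUILD it is empty until rebuilt.
ALTER INDEX tab1_col1_idx ON tab1 REBUILD;

-- Lists, among other things, the name of the table that stores the index.
SHOW FORMATTED INDEX ON tab1;

-- Assumed index table name. A compact index table is expected to hold the
-- indexed column plus _bucketname (data file) and _offsets (offsets in it).
DESCRIBE default__tab1_tab1_col1_idx__;
```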

    2. Index-Related Configuration Parameters

    hive.index.compact.file.ignore.hdfs

    Default Value: false

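    Added In: Hive 0.7.0 with HIVE-1889

    If enabled, the HDFS location stored in the index file is ignored at runtime, so the index remains usable even if the data has been moved. The default is false.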

    hive.optimize.index.filter

    Default Value: false

    Added In: Hive 0.8.0 with HIVE-1644

    Whether to use indexes automatically. The default is false.

    hive.optimize.index.filter.compact.minsize

    Default Value: 5368709120

    Added In: Hive 0.8.0 with HIVE-1644

    The minimum input size, in bytes, for which a compact index is used automatically.

    hive.optimize.index.filter.compact.maxsize

    Default Value: -1

    Added In: Hive 0.8.0 with HIVE-1644

    The maximum input size, in bytes, for which a compact index is used automatically. A negative value means infinity.

    hive.index.compact.query.max.size

    Default Value: 10737418240

    Added In: Hive 0.8.0 with HIVE-2096

    The maximum amount of data that a query using a compact index may read. The default is 10737418240 bytes; a negative value means infinity.

    hive.index.compact.query.max.entries

    Default Value: 10000000

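    Added In: Hive 0.8.0 with HIVE-2096

    The maximum number of index entries that may be read during a query that uses a compact index. The default is 10000000; a negative value means infinity.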

    hive.exec.concatenate.check.index

    Default Value: true

    Added In: Hive 0.8.0 with HIVE-2125

    If set to true, Hive throws an error when ALTER TABLE tbl_name CONCATENATE is run on a table or partition that has indexes. This helps users avoid having to drop and rebuild the indexes.

    hive.optimize.index.groupby

    Default Value: false

    Added In: Hive 0.8.1 with HIVE-1694

    hive.index.compact.binary.search

    Default Value: true

    Added In: Hive 0.8.1 with HIVE-2535

    Whether to use binary search to look up entries in the index table. The default is true.
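
    As a sketch of how these parameters fit together, the session settings below turn on automatic index use. The values are illustrative assumptions, not recommendations. Note that with the default minsize of 5368709120 bytes (5 GB), a compact index would not be applied automatically to smaller inputs.

```sql
-- Let the optimizer use indexes automatically (off by default).
SET hive.optimize.index.filter=true;

-- Illustrative value: lower the 5 GB default so smaller inputs qualify.
SET hive.optimize.index.filter.compact.minsize=0;

-- Keep the upper bound unlimited (a negative value means infinity).
SET hive.optimize.index.filter.compact.maxsize=-1;

-- Limits on what a single compact-index query may read (defaults shown).
SET hive.index.compact.query.max.size=10737418240;
SET hive.index.compact.query.max.entries=10000000;
```

    Settings issued with SET apply only to the current session; a permanent change would go into hive-site.xml instead.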

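    3. Index Examples

    Note: in Hive 0.12.0 and earlier, the index name is case-sensitive in CREATE INDEX and DROP INDEX statements, whereas ALTER INDEX requires a lowercase index name.

    This bug was fixed in Hive 0.13.0, which makes index names case-insensitive.

    For versions before Hive 0.13.0, it is best to use lowercase index names.

    Common uses of indexes are shown below: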

    A. Create/build, show, and drop an index

    create index table01_index on table table01 (column2) as 'COMPACT' with deferred rebuild;

    show index on table01;

    drop index table01_index on table01;

    B. Create then build, show formatted, and drop an index

    create index table02_index on table table02 (column3) as 'compact' with deferred rebuild;

    alter index table02_index on table02 rebuild;

    show formatted index on table02;

    drop index table02_index on table02;

    C. Create a bitmap index, then build, show, and drop it

    create index table03_index on table table03 (column4) as 'bitmap' with deferred rebuild;

    alter index table03_index on table03 rebuild;

    show formatted index on table03;

    drop index table03_index on table03;

    D. Create an index in a new table

    create index table04_index on table table04 (column5) as 'compact' with deferred rebuild in table table04_index_table;

    E. Create an index stored as RCFile

    create index table05_index on table table05 (column6) as 'compact' with deferred rebuild stored as rcfile;

    F. Create an index stored as TextFile

    create index table06_index on table table06 (column7) as 'compact' with deferred rebuild row format delimited fields terminated by '\t' stored as textfile;

    G. Create an index with index properties

    create index table07_index on table table07 (column8) as 'compact' with deferred rebuild idxproperties("prop1"="value1", "prop2"="value2");

    H. Create an index with table properties

    create index table08_index on table table08 (column9) as 'compact' with deferred rebuild tblproperties("prop3"="value3", "prop4"="value4");

    I. Drop an index if it exists

    drop index if exists table09_index on table09;

    J. Rebuild an index on a partition

    alter index table10_index on table10 partition (columnx='valueq', columny='valuer') rebuild;

    4. Index Test

    (1)  Count the rows in the table

    hive (hive)> select count(1) from userbook;

    4409365

    (2)  Query before any index is created on the table

    hive (hive)> select * from userbook where book_id = '15999998838';

    Query ID = hadoop_20150627165551_595da79a-0e27-453b-9142-7734912934c4

    Total jobs = 1

    Launching Job 1 out of 1

    Number of reduce tasks is set to 0 since there's no reduce operator

    Starting Job = job_1435392961740_0012, Tracking URL = http://gpmaster:8088/proxy/application_1435392961740_0012/

    Kill Command = /home/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1435392961740_0012

    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0

    2015-06-27 16:56:04,666 Stage-1 map = 0%,  reduce = 0%

    2015-06-27 16:56:28,974 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 4.36 sec

    2015-06-27 16:56:31,123 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 6.21 sec

    2015-06-27 16:56:34,698 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 7.37 sec

    MapReduce Total cumulative CPU time: 7 seconds 370 msec

    Ended Job = job_1435392961740_0012

    MapReduce Jobs Launched:

    Stage-Stage-1: Map: 2   Cumulative CPU: 7.37 sec   HDFS Read: 348355875 HDFS Write: 76 SUCCESS

    Total MapReduce CPU Time Spent: 7 seconds 370 msec

    OK

    userbook.book_id    userbook.book_name    userbook.author      userbook.public_date     userbook.address

    15999998838     uviWfFJ KwCrDOA    2009-12-27  3b74416d-eb69-48e2-9d0d-09275064691b

    Time taken: 45.678 seconds, Fetched: 1 row(s)

    (3)  Create the index

    hive (hive)> create index userbook_bookid_idx on table userbook(book_id) as 'COMPACT' WITH DEFERRED REBUILD;
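
    An index created WITH DEFERRED REBUILD holds no data until it is rebuilt, as in example B of section 3. The session shown here does not include a rebuild, so the statements below are a sketch of the step that would presumably be needed before the index can help, not a record of what was actually run.

```sql
-- Populate the index declared above; this launches a job that scans userbook.
ALTER INDEX userbook_bookid_idx ON userbook REBUILD;

-- Confirm that the index exists and see which table stores it.
SHOW FORMATTED INDEX ON userbook;
```

    The rebuild has to be repeated after the underlying data changes, because the index is not maintained automatically.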

    (4)  Run the query again after creating the index

    hive (hive)> select * from userbook where book_id = '15999998838';

    Query ID = hadoop_20150627170019_5bb5514a-4c8e-4c47-9347-ed0657e1f2ff

    Total jobs = 1

    Launching Job 1 out of 1

    Number of reduce tasks is set to 0 since there's no reduce operator

    Starting Job = job_1435392961740_0013, Tracking URL = http://gpmaster:8088/proxy/application_1435392961740_0013/

    Kill Command = /home/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1435392961740_0013

    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0

    2015-06-27 17:00:30,429 Stage-1 map = 0%,  reduce = 0%

    2015-06-27 17:00:54,003 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 7.43 sec

    2015-06-27 17:00:56,181 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 9.66 sec

    2015-06-27 17:00:58,417 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.83 sec

    MapReduce Total cumulative CPU time: 10 seconds 830 msec

    Ended Job = job_1435392961740_0013

    MapReduce Jobs Launched:

    Stage-Stage-1: Map: 2   Cumulative CPU: 10.83 sec   HDFS Read: 348356271 HDFS Write: 76 SUCCESS

    Total MapReduce CPU Time Spent: 10 seconds 830 msec

    OK

    userbook.book_id    userbook.book_name    userbook.author      userbook.public_date     userbook.address

    15999998838     uviWfFJ KwCrDOA    2009-12-27  3b74416d-eb69-48e2-9d0d-09275064691b

    Time taken: 40.549 seconds, Fetched: 1 row(s)

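    As you can see, the query ran slightly faster after the index was created (40.549 s versus 45.678 s). Note, however, that HDFS Read is almost identical in the two runs (348355875 and 348356271 bytes) and cumulative CPU time actually rose from 7.37 s to 10.83 s, so the whole table was most likely still scanned, and the small difference may simply be run-to-run variation rather than an effect of the index.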
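
    One way to check whether an index actually takes part in a query is to inspect the plan. The sketch below assumes the index has been rebuilt; the minsize value is an illustrative assumption, chosen because this table reads only about 348 MB, far below the 5 GB default threshold from section 2.

```sql
-- Allow automatic index use, and let inputs smaller than 5 GB qualify.
SET hive.optimize.index.filter=true;
SET hive.optimize.index.filter.compact.minsize=0;

-- If the index is used, the plan is expected to contain an extra stage that
-- reads the index table before the scan of userbook.
EXPLAIN SELECT * FROM userbook WHERE book_id = '15999998838';
```

    Comparing the HDFS Read figure of the real query before and after these settings gives a second, independent signal: a run that uses the index should read noticeably fewer bytes.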

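    In fact, for a simple query like this, Hive can be configured to skip MapReduce entirely and run a fetch task that filters the required rows directly from the HDFS files. This requires the following setting: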

    set hive.fetch.task.conversion=more;

    hive (hive)> select * from userbook where book_id = '15999998838';

    OK

    userbook.book_id    userbook.book_name    userbook.author      userbook.public_date     userbook.address

    15999998838     uviWfFJ KwCrDOA    2009-12-27  3b74416d-eb69-48e2-9d0d-09275064691b

    Time taken: 0.093 seconds, Fetched: 1 row(s)

    As you can see, this is much faster: skipping the launch of a MapReduce job improves execution efficiency considerably.
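
    This setting works independently of indexing. As a hedged sketch, the plan can be used to confirm that the query was converted to a fetch task; the threshold line is an assumption about the Hive version in use, since hive.fetch.task.conversion.threshold exists only in later releases.

```sql
-- minimal: SELECT *, filters on partition columns, and LIMIT only.
-- more:    SELECT, FILTER, and LIMIT, including TABLESAMPLE and virtual columns.
SET hive.fetch.task.conversion=more;

-- Assumption: available in newer Hive versions only. Inputs larger than this
-- many bytes are not converted to a fetch task.
SET hive.fetch.task.conversion.threshold=1073741824;

-- A converted query is expected to show only a fetch stage, with no
-- MapReduce stage in the plan.
EXPLAIN SELECT * FROM userbook WHERE book_id = '15999998838';
```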



    Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing


  • Original article: https://www.cnblogs.com/yxwkf/p/5197802.html