  • A Deep Dive into the Implementation of Hive Indexes

    Indexing is a standard database technique, and Hive has supported indexes since version 0.7. Rather than a 'one size fits all' index implementation, Hive provides a pluggable interface, along with one concrete implementation as a reference. Hive's index interface is shown below:

    public interface HiveIndexHandler extends Configurable {
      /**
       * Determines whether this handler implements indexes by creating an index
       * table.
       * 
       * @return true if index creation implies creation of an index table in Hive;
       *         false if the index representation is not stored in a Hive table
       */
      boolean usesIndexTable();
    
      /**
       * Requests that the handler validate an index definition and fill in
       * additional information about its stored representation.
       *
       * @throws HiveException if the index definition is invalid with respect to
       *         either the base table or the supplied index table definition
       */
      void analyzeIndexDefinition(
          org.apache.hadoop.hive.metastore.api.Table baseTable,
          org.apache.hadoop.hive.metastore.api.Index index,
          org.apache.hadoop.hive.metastore.api.Table indexTable)
          throws HiveException;
    
      /**
       * Requests that the handler generate a plan for building the index; the plan
       * should read the base table and write out the index representation.
       */
      List<Task<?>> generateIndexBuildTaskList(
          org.apache.hadoop.hive.ql.metadata.Table baseTbl,
          org.apache.hadoop.hive.metastore.api.Index index,
          List<Partition> indexTblPartitions, List<Partition> baseTblPartitions,
          org.apache.hadoop.hive.ql.metadata.Table indexTbl,
          Set<ReadEntity> inputs, Set<WriteEntity> outputs)
          throws HiveException;
    
    }

    When an index is created, Hive first calls the handler's usesIndexTable method to determine whether the index is stored as a Hive table (the default implementation stores it in Hive). It then calls analyzeIndexDefinition to check that the index-creation statement is valid; if it is, the index is recorded in the metastore table IDXS, otherwise an exception is thrown. If the statement uses WITH DEFERRED REBUILD, then when ALTER INDEX xxx_index ON xxx REBUILD is executed, Hive calls generateIndexBuildTaskList to obtain the MapReduce tasks that build the index and runs them to populate it with data.
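    The call sequence above can be modeled with a small sketch. This is an illustrative Python stand-in, not Hive's actual Java API: the class and method names merely mirror HiveIndexHandler, and the driver functions are simplified assumptions about when each method fires.

```python
# Toy model of the index-handler flow: usesIndexTable and analyzeIndexDefinition
# run at CREATE INDEX time; generateIndexBuildTaskList runs only at
# ALTER INDEX ... REBUILD when the index was created WITH DEFERRED REBUILD.

class CompactIndexHandlerSketch:
    def uses_index_table(self):
        # The reference CompactIndexHandler stores the index in a Hive table.
        return True

    def analyze_index_definition(self, base_columns, index_columns):
        # Reject indexes over columns the base table does not have; Hive would
        # throw HiveException instead of recording the index in IDXS.
        missing = [c for c in index_columns if c not in base_columns]
        if missing:
            raise ValueError(f"unknown columns in index definition: {missing}")

    def generate_index_build_task_list(self, base_table, index_columns):
        # Hive returns MapReduce Task objects; a description stands in for them.
        return [f"scan {base_table}, emit ({', '.join(index_columns)}, "
                f"_bucketname, _offsets) into the index table"]

def create_index(handler, base_table, base_columns, index_columns):
    stored_as_table = handler.uses_index_table()   # step 1
    handler.analyze_index_definition(base_columns, index_columns)  # step 2
    # WITH DEFERRED REBUILD: nothing is built yet.
    return stored_as_table

def rebuild_index(handler, base_table, index_columns):
    # ALTER INDEX ... REBUILD: fetch the build tasks and (in Hive) run them.
    return handler.generate_index_build_task_list(base_table, index_columns)
```

    An invalid definition fails at step 2, before anything is recorded; a valid one only produces build tasks once the rebuild is requested.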

    Below is an index test example, adapted from someone else's design:

    First, generate the test data:

    #!/bin/bash
    # Generate roughly 350 MB of raw data; redirect stdout into the data file, e.g.
    #   ./generate.sh > /home/hadoop/hive_index_test/dual.txt
    i=0  
    while [ $i -ne 1000000 ]  
    do  
            echo -e "$i\tA decade ago, many were predicting that Cooke, a New York City prodigy, would become a basketball shoe pitchman and would flaunt his wares and skills at All-Star weekends like the recent aerial show in Orlando, Fla. There was a time, however fleeting, when he was more heralded, or perhaps merely hyped, than any other high school player in America."  
            i=$(($i+1))  
    done

    Create the test table:
    hive> create table table01( id int, name string)  
        > ROW FORMAT DELIMITED  
        > FIELDS TERMINATED BY '\t';
    OK
    Time taken: 0.371 seconds
    hive> load data local inpath '/home/hadoop/hive_index_test/dual.txt' overwrite into table table01;
    Copying data from file:/home/hadoop/hive_index_test/dual.txt
    Copying file: file:/home/hadoop/hive_index_test/dual.txt
    Loading data to table default.table01
    Deleted hdfs://localhost:9000/user/hive/warehouse/table01
    OK
    Time taken: 13.492 seconds
    hive> create table table02 as select id,name as text from table01;
    Total MapReduce jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201301221042_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201301221042_0006
    Kill Command = /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201301221042_0006
    2013-01-22 11:21:19,639 Stage-1 map = 0%,  reduce = 0%
    2013-01-22 11:21:25,678 Stage-1 map = 33%,  reduce = 0%
    2013-01-22 11:21:37,754 Stage-1 map = 67%,  reduce = 0%
    2013-01-22 11:21:43,788 Stage-1 map = 100%,  reduce = 0%
    2013-01-22 11:21:46,828 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_201301221042_0006
    Ended Job = -663277165, job is filtered out (removed at runtime).
    Moving data to: hdfs://localhost:9000/tmp/hive-hadoop/hive_2013-01-22_11-21-13_661_2061036951988537032/-ext-10001
    Moving data to: hdfs://localhost:9000/user/hive/warehouse/table02
    1000000 Rows loaded to hdfs://localhost:9000/tmp/hive-hadoop/hive_2013-01-22_11-21-13_661_2061036951988537032/-ext-10000
    OK
    Time taken: 33.904 seconds
    hive> dfs -ls /user/hive/warehouse/table02;
    Found 6 items
    -rw-r--r--   3 hadoop supergroup   67109134 2013-01-22 11:21 /user/hive/warehouse/table02/000000_0
    -rw-r--r--   3 hadoop supergroup   67108860 2013-01-22 11:21 /user/hive/warehouse/table02/000001_0
    -rw-r--r--   3 hadoop supergroup   67108860 2013-01-22 11:21 /user/hive/warehouse/table02/000002_0
    -rw-r--r--   3 hadoop supergroup   67108860 2013-01-22 11:21 /user/hive/warehouse/table02/000003_0
    -rw-r--r--   3 hadoop supergroup   67108860 2013-01-22 11:21 /user/hive/warehouse/table02/000004_0
    -rw-r--r--   3 hadoop supergroup   21344316 2013-01-22 11:21 /user/hive/warehouse/table02/000005_0
    hive> select * from table02 where id=500000;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201301221042_0007, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201301221042_0007
    Kill Command = /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201301221042_0007
    2013-01-22 11:22:26,865 Stage-1 map = 0%,  reduce = 0%
    2013-01-22 11:22:28,884 Stage-1 map = 33%,  reduce = 0%
    2013-01-22 11:22:31,905 Stage-1 map = 67%,  reduce = 0%
    2013-01-22 11:22:34,921 Stage-1 map = 100%,  reduce = 0%
    2013-01-22 11:22:37,943 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_201301221042_0007
    OK
    500000    A decade ago, many were predicting that Cooke, a New York City prodigy, would become a basketball shoe pitchman and would flaunt his wares and skills at All-Star weekends like the recent aerial show in Orlando, Fla. There was a time, however fleeting, when he was more heralded, or perhaps merely hyped, than any other high school player in America.
    Time taken: 18.551 seconds

    Create the index:
    hive> create index table02_index on table table02(id)  
        >     as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'  
        >     with deferred rebuild;
    OK
    Time taken: 0.503 seconds

    Populate the index:
    hive> alter index table02_index on table02 rebuild;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    Starting Job = job_201301221042_0008, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201301221042_0008
    Kill Command = /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201301221042_0008
    2013-01-22 11:23:56,870 Stage-1 map = 0%,  reduce = 0%
    2013-01-22 11:24:02,902 Stage-1 map = 33%,  reduce = 0%
    2013-01-22 11:24:08,929 Stage-1 map = 67%,  reduce = 0%
    2013-01-22 11:24:11,944 Stage-1 map = 67%,  reduce = 11%
    2013-01-22 11:24:14,966 Stage-1 map = 100%,  reduce = 11%
    2013-01-22 11:24:21,007 Stage-1 map = 100%,  reduce = 22%
    2013-01-22 11:24:27,043 Stage-1 map = 100%,  reduce = 67%
    2013-01-22 11:24:30,056 Stage-1 map = 100%,  reduce = 86%
    2013-01-22 11:24:33,089 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_201301221042_0008
    Loading data to table default.default__table02_table02_index__
    Deleted hdfs://localhost:9000/user/hive/warehouse/default__table02_table02_index__
    Table default.default__table02_table02_index__ stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 74701985]
    OK
    Time taken: 61.203 seconds
    hive> dfs -ls /user/hive/warehouse/default*;
    Found 1 items
    -rw-r--r--   3 hadoop supergroup   74701985 2013-01-22 11:24 /user/hive/warehouse/default__table02_table02_index__/000000_0

    We can see the data stored inside the index:
    hive> select * from default__table02_table02_index__ limit 3;
    OK
    0    hdfs://localhost:9000/user/hive/warehouse/table02/000000_0    [0]
    1    hdfs://localhost:9000/user/hive/warehouse/table02/000000_0    [352]
    2    hdfs://localhost:9000/user/hive/warehouse/table02/000000_0    [704]
    Time taken: 0.156 seconds
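    Each row above maps an indexed key to a data file (`_bucketname`) and the byte offsets of its rows (`_offsets`). A self-contained sketch of that idea, using a synthetic stand-in for table02's bucket files: a point query seeks straight to the recorded offsets instead of scanning the whole file.

```python
# Build a compact index (key -> file + byte offsets) over a small bucket file,
# then answer a point query with direct seeks rather than a full scan.
import os
import tempfile

def build_compact_index(path):
    """One sequential pass: key -> (file, [byte offsets of that key's rows])."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            key = line.split(b"\t", 1)[0].decode()
            index.setdefault(key, (path, []))[1].append(offset)
            offset += len(line)
    return index

def lookup(index, key):
    """Fetch matching rows by seeking to their recorded offsets."""
    if key not in index:
        return []
    path, offsets = index[key]
    with open(path, "rb") as f:
        rows = []
        for off in offsets:
            f.seek(off)
            rows.append(f.readline().decode().rstrip("\n"))
        return rows

# Write a small bucket file shaped like the generated test data.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for i in range(1000):
        tmp.write(f"{i}\tA decade ago, many were predicting that Cooke ...\n")
    bucket = tmp.name

idx = build_compact_index(bucket)
entry = idx["500"]         # (bucket file, [offset]) -- shaped like an index-table row
rows = lookup(idx, "500")  # the base-table row itself, fetched with one seek
os.unlink(bucket)
print(entry[1], rows[0][:20])
```

    Building the index still costs one full pass (the ALTER INDEX ... REBUILD job above), but every subsequent point query pays only a handful of seeks.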

    Now build an index file by hand and query through it:
    hive> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    hive> Insert overwrite directory "/tmp/table02_index_data" select `_bucketname`, `_offsets` from   default__table02_table02_index__ where id =500000;  
    Total MapReduce jobs = 2
    Launching Job 1 out of 2
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201301221042_0009, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201301221042_0009
    Kill Command = /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201301221042_0009
    2013-01-22 11:30:23,859 Stage-1 map = 0%,  reduce = 0%
    2013-01-22 11:30:26,872 Stage-1 map = 100%,  reduce = 0%
    2013-01-22 11:30:29,904 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_201301221042_0009
    Ended Job = -489547412, job is filtered out (removed at runtime).
    Launching Job 2 out of 2
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201301221042_0010, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201301221042_0010
    Kill Command = /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201301221042_0010
    2013-01-22 11:30:35,861 Stage-2 map = 0%,  reduce = 0%
    2013-01-22 11:30:38,882 Stage-2 map = 100%,  reduce = 0%
    2013-01-22 11:30:41,907 Stage-2 map = 100%,  reduce = 100%
    Ended Job = job_201301221042_0010
    Moving data to: /tmp/table02_index_data
    1 Rows loaded to /tmp/table02_index_data
    OK
    Time taken: 25.173 seconds
    hive> select * from table02 where id =500000;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201301221042_0011, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201301221042_0011
    Kill Command = /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201301221042_0011
    2013-01-22 11:31:06,055 Stage-1 map = 0%,  reduce = 0%
    2013-01-22 11:31:09,066 Stage-1 map = 33%,  reduce = 0%
    2013-01-22 11:31:12,083 Stage-1 map = 67%,  reduce = 0%
    2013-01-22 11:31:15,102 Stage-1 map = 100%,  reduce = 0%
    2013-01-22 11:31:18,127 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_201301221042_0011
    OK
    500000    A decade ago, many were predicting that Cooke, a New York City prodigy, would become a basketball shoe pitchman and would flaunt his wares and skills at All-Star weekends like the recent aerial show in Orlando, Fla. There was a time, however fleeting, when he was more heralded, or perhaps merely hyped, than any other high school player in America.
    Time taken: 17.533 seconds
    hive> Set hive.index.compact.file=/tmp/table02_index_data;
    hive> Set hive.optimize.index.filter=false;
    hive> Set hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;  
    hive> select * from table02 where id =500000;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Starting Job = job_201301221042_0012, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201301221042_0012
    Kill Command = /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:9001 -kill job_201301221042_0012
    2013-01-22 11:32:14,929 Stage-1 map = 0%,  reduce = 0%
    2013-01-22 11:32:17,942 Stage-1 map = 100%,  reduce = 0%
    2013-01-22 11:32:20,968 Stage-1 map = 100%,  reduce = 100%
    Ended Job = job_201301221042_0012
    OK
    500000    A decade ago, many were predicting that Cooke, a New York City prodigy, would become a basketball shoe pitchman and would flaunt his wares and skills at All-Star weekends like the recent aerial show in Orlando, Fla. There was a time, however fleeting, when he was more heralded, or perhaps merely hyped, than any other high school player in America.
    Time taken: 11.222 seconds
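    The indexed query finishes faster and with fewer map waves because the input format can skip splits that contain no matching rows. The following is a rough sketch of that pruning idea; it is an assumption about the mechanism for illustration, not HiveCompactIndexInputFormat's actual code, and the file sizes are made up.

```python
# Given the (bucket file, offsets) rows dumped to hive.index.compact.file,
# keep only the input splits that contain at least one matching row offset.

def prune_splits(file_length, split_size, matching_offsets):
    """Return the [(start, end)] splits that contain a matching row offset."""
    splits = [(s, min(s + split_size, file_length))
              for s in range(0, file_length, split_size)]
    return [(s, e) for (s, e) in splits
            if any(s <= off < e for off in matching_offsets)]

# One ~335 MB bucket file, 64 MB splits, a single matching row near the middle:
kept = prune_splits(file_length=335_000_000, split_size=64_000_000,
                    matching_offsets=[176_000_000])
print(kept)  # only the one split covering the matching offset survives
```

    With a single matching row, only one split out of six is read, which matches the single fast map wave seen in the job above.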

    Summary: the index table contains three basic columns: 1. the indexed column(s) from the source table; 2. _bucketname, the path of the data file in HDFS; 3. _offsets, the byte offsets of the indexed rows within that file. The principle is that by recording each indexed value's offsets in HDFS, Hive can fetch matching data directly and avoid a full table scan; in this test the point query dropped from roughly 18 seconds to about 11 seconds.

    Reference: http://blog.csdn.net/liwei_1988/article/details/7319030

  • Original post: https://www.cnblogs.com/end/p/2871147.html