zoukankan      html  css  js  c++  java
  • TPC-H数据导入Hive方案

    参考博文:

    https://blog.csdn.net/haoxiaoyan/article/details/53033870    -->在这篇博文中,前面步骤对我有很大帮助,但后面将tpc-h生成的数据导入到hive的步骤,我没用看懂,我会在后面介绍另外一种方案导入数据。

    https://github.com/maomao1994/TPC-H/blob/master/Hive-HiveQL/Hive-HiveQL.md    -->我使用了该作者提供的源代码,实现了往hive中导入tpc-h测试集数据。

    https://www.ibm.com/support/knowledgecenter/STXKQY_BDA_SHR/bl1bda_tpch.htm    -->ibm介绍的关于将tpch数据导入hive方案。

    使用系统:centos 7

    正文:

    1:下载TPC-H源码,用来生成测试数据:

          地址:http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp  -->选择下载tpc-h那栏.zip格式的文件。

    2:将下载好的.zip文件解压到一个文件夹下:

          最好将该解压文件夹放到一个存储空间大的文件夹下,别放在根目录下。我起初放在了一个小文件夹下,最后生成测试集数据后,启动hive报错。

          错误类型如下,最后更改了存放位置,就OK啦。

        Resources are low on NN. Please add or free up more resources then turn off safe mode manually.
    

    3:进入该解压文件夹下,进入dbgen目录:(我的文件夹名字为:tpc-h,解压到了home目录下)

        cd /home/tpc-h/dbgen
    

           找到一个名字叫makefile.suite的文件,用vim打开并修改:

        vi makefile.suite
    

           在makefile.suite文件里找到对应行,更改为如下形式:

         CC      = gcc
    
         # Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)  
    
         #                                  SQLSERVER, SYBASE  
    
         # Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS, 
    
         #                                  SGI, SUN, U2200, VMS, LINUX, WIN32   
    
         # Current values for WORKLOAD are:  TPCH 
    
         DATABASE= SQLSERVER  
    
         MACHINE = LINUX  
    
         WORKLOAD = TPCH
    

          更改完毕,保存文件。在当前目录下,更改makefile.suite文件为makefile.代码如下:(目的是为了后面使用make命令)

         mv  makefile.suite  makefile
    

         检查你的系统是否含有gcc编译器,若没有,则需要安装。代码如下:

         yum install gcc
    

    4:修改dbgen文件夹下的tpcd.h文件:

         vi tpcd.h
    

     在该文件里,找到SQLSERVER段落,修改为下面内容:

         #ifdef  SQLSERVER
         #define GEN_QUERY_PLAN        "EXPLAIN;"
         #define START_TRAN                "START TRANSACTION;
    "
         #define END_TRAN                "COMMIT;
    "
         #define SET_OUTPUT                ""
         #define SET_ROWCOUNT                "limit %d;
    "
         #define SET_DBASE                "use %s;
    "
         #endif
    

     5:执行make命令,键入如下代码:

         make
    

     对,在命令行输入make四个字母就可以了,接着会出现很多.o格式的文件。

    6:键入如下代码,生成测试集数据(要在dbgen目录下执行):

        ./dbgen -s 1
    

     代码解释:-s为数据规模因子,1代表1G,如果改为10就表示10G。看你要多少。执行完毕后你会发现在当前目录下,出现了8个.tbl文件。这就是生成的数据。

    7:将tpc-h数据导入hive:

         7.1 输入如下命令,启动hive:(下面的命令均在hive里进行)   

        hive
    

        7.2 先创建一个数据库:

        CREATE DATABASE tpch;
    

             
              然后使用该数据库:

        USE tpch;
    

        7.3 创建8张数据表,名字和我们之前生成的8个.tbl文件的名字保持相同。代码如下:(代码来自于参考博文之第2篇)

       Create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT,L_SUPPKEY INT,L_LINENUMBER INT,L_QUANTITY DOUBLE,L_EXTENDEDPRICE DOUBLE,L_DISCOUNT DOUBLE,L_TAX DOUBLE,L_RETURNFLAG STRING,L_LINESTATUS STRING,L_SHIPDATE STRING,L_COMMITDATE STRING,L_RECEIPTDATE STRING,L_SHIPINSTRUCT STRING,L_SHIPMODE STRING,L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/lineitem';
    
       Create external table nation (N_NATIONKEY INT,N_NAME STRING,N_REGIONKEY INT,N_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/nation';
    
       Create external table region (R_REGIONKEY INT,R_NAME STRING,R_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/region';
    
       Create external table part (P_PARTKEY INT , P_NAME STRING , P_MFGR STRING, P_BRAND  STRING, P_TYPE STRING , P_SIZE INT , P_CONTAINER  STRING, P_RETAILPRICE DOUBLE, P_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/part';
    
       Create external table supplier (S_SUPPKEY INT , S_NAME STRING ,S_ADDRESS STRING , S_NATIONKEY  INT , S_PHONE  STRING, S_ACCTBAL DOUBLE, S_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/supplier';
    
       Create external table partsupp (PS_PARTKEY INT , PS_SUPPKEY INT , PS_AVAILQTY INT , PS_SUPPLYCOST DOUBLE, PS_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/partsupp';
    
       Create external table customer (C_CUSTKEY INT , C_NAME STRING , C_ADDRESS STRING , C_NATIONKEY  INT , C_PHONE  STRING , C_ACCTBAL DOUBLE , C_MKTSEGMENT STRING , C_COMMENT STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/customer';
    
       Create external table orders (O_ORDERKEY INT , O_CUSTKEY INT ,O_ORDERSTATUS STRING,O_TOTALPRICE DOUBLE, O_ORDERDATE DATE , O_ORDERPRIORITY STRING, O_CLERK  STRING,  O_SHIPPRIORITY  INT ,  O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders';
    

         7.4 将数据导入hive,代码如下:(下面 /home/mao/tpch_workplace/tpc_h_tool,应替换你测试集所在的实际地址)

       LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/region.tbl' INTO TABLE region;
       LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/nation.tbl' INTO TABLE nation;
       LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/part.tbl'   INTO TABLE part;
       LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/supplier.tbl' INTO TABLE supplier;
       LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/partsupp.tbl' INTO TABLE partsupp;
       LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/customer.tbl' INTO TABLE customer;
       LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/orders.tbl' INTO TABLE orders;
       LOAD DATA LOCAL INPATH '/home/mao/tpch_workplace/tpc_h_tool/dbgen/lineitem.tbl' INTO TABLE lineitem;
    

         7.5 完毕,可以使用SQL查询语句,看数据表中是否有数据。导入过程耗时:10G数据大概6分钟。

         7.6 除了这种导入数据的方案,还有其他几种方案,移步文章开头介绍的参考博文:第一篇博文和第三篇博文。

    你好啊
  • 相关阅读:
    关于参数,集合的演示
    Process
    常见的IT 日语
    C#格式化数值
    提高C#编程水平的50个要点
    .net中C#代码与javaScript函数的相互调用问题
    Rundll32.exe
    徐家骏:我在华为工作十年的感悟
    转换人民币大小金额
    29个Ajax Javascript相册资源
  • 原文地址:https://www.cnblogs.com/zyz2085/p/9806555.html
Copyright © 2011-2022 走看看