  • [Repost] Loading Data into HAWQ

    Loading Data into HAWQ

    Before you can use a database you have to load data into it, but how? HAWQ offers several approaches that reach the same result in different ways, so you can pick the loading method that best matches your use case.

    Table Setup
    The following table is used for all of the tests in HAWQ. I created it in a single-node VM running Hortonworks HDP with HAWQ 2.0 installed, using the default Resource Manager.

    CREATE TABLE test_data
    (id int,
     fname text,
     lname text)
     DISTRIBUTED RANDOMLY;
    

    Singleton
    Let’s start with what is probably the worst way. Singleton inserts are occasionally ideal because you have very little data to load, but in most cases you should avoid them. This approach inserts a single tuple per transaction.

    head si_test_data.sql
    insert into test_data (id, fname, lname) values (1, 'jon_00001', 'roberts_00001');
    insert into test_data (id, fname, lname) values (2, 'jon_00002', 'roberts_00002');
    insert into test_data (id, fname, lname) values (3, 'jon_00003', 'roberts_00003');
    insert into test_data (id, fname, lname) values (4, 'jon_00004', 'roberts_00004');
    insert into test_data (id, fname, lname) values (5, 'jon_00005', 'roberts_00005');
    insert into test_data (id, fname, lname) values (6, 'jon_00006', 'roberts_00006');
    insert into test_data (id, fname, lname) values (7, 'jon_00007', 'roberts_00007');
    insert into test_data (id, fname, lname) values (8, 'jon_00008', 'roberts_00008');
    insert into test_data (id, fname, lname) values (9, 'jon_00009', 'roberts_00009');
    insert into test_data (id, fname, lname) values (10, 'jon_00010', 'roberts_00010');
    

    This repeats for 10,000 tuples.
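A file like si_test_data.sql can be produced with a short shell loop. This is a minimal sketch, not the script used for the original test; it assumes a POSIX shell with `seq` available and writes the file to the current directory.

```shell
# Sketch: generate 10,000 single-row INSERT statements in the format shown above.
# Assumption: `seq` is available; the file name matches the one used in this post.
for i in $(seq 1 10000); do
  printf "insert into test_data (id, fname, lname) values (%d, 'jon_%05d', 'roberts_%05d');\n" "$i" "$i" "$i"
done > si_test_data.sql

# Sanity check: show the first few statements
head -n 3 si_test_data.sql
```

Each line is a separate statement, so psql runs each one as its own transaction when the file is executed with -f.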

    time psql -f si_test_data.sql > /dev/null
    real	5m49.527s
    

    As you can see, this is slow and not recommended for loading large amounts of data. Nearly 6 minutes for 10,000 tuples works out to roughly 29 tuples per second.

    COPY
    If you are familiar with PostgreSQL then you will feel right at home with this technique. This time, the data is in a pipe-delimited file named test_data.txt and is not wrapped in INSERT statements.
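A pipe-delimited file in this layout could be generated as follows. This is only a sketch under the same assumptions as the rest of the post (10,000 rows, three columns); it assumes `seq` is available and writes to the current directory rather than /home/gpadmin.

```shell
# Sketch: write 10,000 rows as id|fname|lname, one row per line.
# Assumption: the output goes to the current directory.
for i in $(seq 1 10000); do
  printf '%d|jon_%05d|roberts_%05d\n' "$i" "$i" "$i"
done > test_data.txt

# Sanity check: row count
wc -l < test_data.txt
```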

    head test_data.txt
    1|jon_00001|roberts_00001
    2|jon_00002|roberts_00002
    3|jon_00003|roberts_00003
    4|jon_00004|roberts_00004
    5|jon_00005|roberts_00005
    6|jon_00006|roberts_00006
    7|jon_00007|roberts_00007
    8|jon_00008|roberts_00008
    9|jon_00009|roberts_00009
    10|jon_00010|roberts_00010
    
    COPY test_data FROM '/home/gpadmin/test_data.txt' WITH DELIMITER '|';
    COPY 10000
    Time: 128.580 ms
    

    This method is significantly faster, but it loads the data through the master. That means it doesn’t scale well, because the master becomes the bottleneck. On the other hand, it allows you to load data from a host anywhere on your network, so long as that host can reach the master.

    gpfdist
    gpfdist is a web server that serves POSIX files for the segments to fetch. Segment processes get the data directly from gpfdist and bypass the master when doing so. This lets you scale by adding more gpfdist processes, more segments, or both.

    gpfdist -p 8888 &
    [1] 128836
    [gpadmin@hdb ~]$ Serving HTTP on port 8888, directory /home/gpadmin
    

    Now you’ll need to create a new external table to read the data from gpfdist.

    CREATE EXTERNAL TABLE gpfdist_test_data
    (id int,
     fname text,
     lname text)
    LOCATION ('gpfdist://hdb:8888/test_data.txt')
    FORMAT 'TEXT' (DELIMITER '|');
    

    And to load the data.

    INSERT INTO test_data SELECT * FROM gpfdist_test_data;
    INSERT 0 10000
    Time: 98.362 ms
    

    gpfdist is blazing fast and scales easily. You can list more than one gpfdist location in the external table, use wildcards, use different formats, and much more. The downside is that the file must be on a host that all segments can reach, and you have to start a separate gpfdist process on that host.
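To try wildcards or multiple locations, the input first has to exist as several files. The sketch below splits the sample data into four chunks with `split`; the chunk size and the `test_data_part_` prefix are arbitrary choices for illustration, not something from the original test.

```shell
# Recreate the sample file so this sketch is self-contained
for i in $(seq 1 10000); do
  printf '%d|jon_%05d|roberts_%05d\n' "$i" "$i" "$i"
done > test_data.txt

# Split into 2,500-line chunks: test_data_part_aa ... test_data_part_ad
# (the prefix is a hypothetical name chosen for this example)
split -l 2500 test_data.txt test_data_part_

ls test_data_part_*
```

With these files in the directory served by gpfdist, the external table's LOCATION could then point at a wildcard such as gpfdist://hdb:8888/test_data_part_* (a path built from the hypothetical names above) instead of a single file.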

    gpload
    gpload is a utility that automates the loading process by using gpfdist. Review the documentation for more on this utility. Technically, it does the same thing as gpfdist with external tables; it just automates the commands for you.

    Programmable Extension Framework (PXF)
    PXF allows you to read data from and write data to HDFS using external tables. As with gpfdist, the work is done by each segment, so it scales and executes in parallel.

    For this example, I’ve loaded the test data into HDFS.

    hdfs dfs -cat /test_data/* | head
    1|jon_00001|roberts_00001
    2|jon_00002|roberts_00002
    3|jon_00003|roberts_00003
    4|jon_00004|roberts_00004
    5|jon_00005|roberts_00005
    6|jon_00006|roberts_00006
    7|jon_00007|roberts_00007
    8|jon_00008|roberts_00008
    9|jon_00009|roberts_00009
    10|jon_00010|roberts_00010
    

    The external table definition.

    CREATE EXTERNAL TABLE et_test_data
    (id int,
     fname text,
     lname text)
    LOCATION ('pxf://hdb:51200/test_data?Profile=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER '|');
    

    And now to load it.

    INSERT INTO test_data SELECT * FROM et_test_data;
    INSERT 0 10000
    Time: 227.599 ms
    

    PXF is probably the best way to load data when using the “Data Lake” design. You load your raw data into HDFS and then consume it with a variety of tools in the Hadoop ecosystem. PXF can also read and write other formats.

    Outsourcer and gplink
    Last but not least are software programs I created. Outsourcer automates the table creation and load of data directly to Greenplum or HAWQ using gpfdist. It sources data from SQL Server and Oracle as these are the two most common OLTP databases.

    gplink is another tool that reads external data, and it can connect to any valid JDBC source. It doesn’t automate as many of the steps as Outsourcer does, but it is a convenient way to get data from a JDBC source.

    You might be thinking that sqoop does this, but not exactly. gplink and Outsourcer load data directly into HAWQ and Greenplum tables. They are optimized for these databases and fix the data for you automatically: both remove null and newline characters and escape the escape and delimiter characters. With sqoop, you have to read the data from HDFS using PXF and then fix any errors that may be in the files.
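To make the kind of cleanup described here concrete, the following sketch applies it to a single field value. It is an illustration only, not code from Outsourcer or gplink; the function name `clean_field` is hypothetical, and it assumes a backslash escape character and a | delimiter.

```shell
# clean_field: illustrative helper (hypothetical name), not taken from Outsourcer or gplink.
# Step 1: tr removes NUL and newline characters from the value.
# Step 2: sed escapes the escape character (backslash) first, then the delimiter (|),
#         so the backslashes added for the delimiter are not doubled.
# Note: a shell argument cannot hold NUL, so the NUL removal only matters on piped data.
clean_field() {
  printf '%s' "$1" | tr -d '\000\n' | sed -e 's/\\/\\\\/g' -e 's/|/\\|/g'
}

# A value containing both a delimiter and a backslash
printf '%s\n' "$(clean_field 'a|b\c')"
```

Escaping the backslash before the delimiter matters: doing it in the opposite order would double the backslashes that were just added in front of each |.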

    Both tools are linked above.

    Summary
    This post gives a brief description of the various ways to load data into HAWQ. As you can see, HAWQ is flexible and supports several loading methods, so pick the technique that fits your use case.

    This entry was posted in Hadoop on July 14, 2016.        

  • Original article: https://www.cnblogs.com/dajianshi/p/9759049.html