  • Fast data loading with pg_bulkload

    pg_bulkload

    Git repository: https://github.com/ossc-db/pg_bulkload

    Introduction

      PostgreSQL ships with copy, a convenient data loading command that supports loading data from files into tables and unloading table data back to files. pg_bulkload is a high-speed data loading tool for PostgreSQL, and compared with copy its biggest advantage is speed. It gains that speed by skipping the shared buffer and WAL buffer and writing data files directly; pg_bulkload's direct mode is implemented along exactly these lines. It also includes a recovery feature: if an import fails, the database needs to be recovered.

    Architecture

     pg_bulkload consists of two main modules: a reader and a writer. The reader reads the input file and parses it into tuples; the writer writes the parsed tuples to the output target. The earliest versions of pg_bulkload did nothing but load data; version 3.1 added data filtering.

    (architecture diagram)

    Installation

    unzip pg_bulkload-master.zip
    cd pg_bulkload-master
    make
    make install
    

    After installation, create the extension in the database:

    create extension pg_bulkload;
    

    pg_bulkload options

    [thunisoft@sdbserver1 pg_bulkload-master]$ pg_bulkload --help
    pg_bulkload is a bulk data loading tool for PostgreSQL

    Usage:
    Dataload: pg_bulkload [dataload options] control_file_path
    Recovery: pg_bulkload -r [-D DATADIR]

    Dataload options:
    -i, --input=INPUT INPUT path or function
    -O, --output=OUTPUT OUTPUT path or table
    -l, --logfile=LOGFILE LOGFILE path
    -P, --parse-badfile=* PARSE_BADFILE path
    -u, --duplicate-badfile=* DUPLICATE_BADFILE path
    -o, --option="key=val" additional option

    Recovery options:
    -r, --recovery execute recovery
    -D, --pgdata=DATADIR database directory

    Connection options:
    -d, --dbname=DBNAME database to connect
    -h, --host=HOSTNAME database server host or socket directory
    -p, --port=PORT database server port
    -U, --username=USERNAME user name to connect as
    -w, --no-password never prompt for password
    -W, --password force password prompt

    Generic options:
    -e, --echo echo queries
    -E, --elevel=LEVEL set output message level
    --help show this help, then exit
    --version output version information, then exit

    Read the website for details. http://github.com/ossc-db/pg_bulkload
    Report bugs to http://github.com/ossc-db/pg_bulkload/issues.

    The keys accepted by -o are not listed in the help output; you can see which parameters are available in the test.log file produced by the import below.
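
    Equivalently, the -o key=value pairs can be collected into a control file and passed as control_file_path (per the Usage line in the help output above). A minimal sketch, using keys that appear in the test.log output below; the paths, table, and database names just follow the examples in this post:

```shell
# Sketch: put the -o options into a control file instead of repeating
# them on the command line (keys as seen in test.log below).
cat > test.ctl <<'EOF'
INPUT = /home/thunisoft/test.csv
OUTPUT = public.test
TYPE = CSV
DELIMITER = ,
WRITER = PARALLEL
TRUNCATE = YES
LOGFILE = /home/thunisoft/test.log
EOF

# Then the load would be invoked as:
#   pg_bulkload test.ctl -d db_sqlfx -p 6543
```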

    Using pg_bulkload

    db_sqlfx=# create table test(id int,info text ,crt_time timestamp(0));
    CREATE TABLE
    db_sqlfx=# insert into test select generate_series(1,5000000),md5(random()::text),clock_timestamp();
    INSERT 0 5000000
    db_sqlfx=# create index i_test_id on test(id);
    CREATE INDEX
    db_sqlfx=# copy test to '/home/thunisoft/test.csv' with(format 'csv');
    COPY 5000000
    
    --Import directly, without dropping the index first
    [thunisoft@sdbserver1 pg_bulkload-master]$ pg_bulkload -i /home/thunisoft/test.csv -O test -l /home/thunisoft/test.log -p 6543 -d db_sqlfx -o "TYPE=CSV" -o "WRITER=PARALLEL"  -o "TRUNCATE=YES"
    NOTICE: BULK LOAD START
    NOTICE: BULK LOAD END
    	0 Rows skipped.
    	5000000 Rows successfully loaded.
    	0 Rows not loaded due to parse errors.
    	0 Rows not loaded due to duplicate errors.
    	0 Rows replaced with new rows.
    
    
    

    Note that the copy export must specify with(format 'csv'); otherwise the import will report an error.
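
    The reason, as a quick illustration (not from the original post): copy's default text format is tab-delimited and writes NULL as \N, which does not match TYPE=CSV:

```shell
# The same row in copy's two output formats:
printf '1\tabc\t\\N\n'   # default text format: tab-delimited, NULL as \N
printf '1,abc,\n'        # csv format: comma-delimited, NULL left empty
```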

    pg_bulkload timing

    21.03 sec

    [thunisoft@sdbserver1 pg_bulkload-master]$ cat /home/thunisoft/test.log
    
    pg_bulkload 3.1.18 on 2021-09-22 18:13:26.168808+08
    
    INPUT = /home/thunisoft/test.csv
    PARSE_BADFILE = /home/thunisoft/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.prs.csv
    LOGFILE = /home/thunisoft/test.log
    LIMIT = INFINITE
    PARSE_ERRORS = 0
    CHECK_CONSTRAINTS = NO
    TYPE = CSV
    SKIP = 0
    DELIMITER = ,
    QUOTE = """
    ESCAPE = """
    NULL = 
    OUTPUT = public.test
    MULTI_PROCESS = YES
    VERBOSE = NO
    WRITER = DIRECT
    DUPLICATE_BADFILE = /home/thunisoft/abdata/7.0/abase1/pg_bulkload/20210922181326_db_sqlfx_public_test.dup.csv
    DUPLICATE_ERRORS = 0
    ON_DUPLICATE_KEEP = NEW
    TRUNCATE = YES
    
      0 Rows skipped.
      5000000 Rows successfully loaded.
      0 Rows not loaded due to parse errors.
      0 Rows not loaded due to duplicate errors.
      0 Rows replaced with new rows.
    
    Run began on 2021-09-22 18:13:26.168808+08
    Run ended on 2021-09-22 18:13:47.19649+08
    
    CPU 0.66s/8.85u sec elapsed 21.03 sec
    
    

    copy import timing

    --Export the data
    db_sqlfx=#  copy test to '/home/thunisoft/test.sql';
    COPY 5000000
    
    --Clear the data
    db_sqlfx=# truncate table test;
    TRUNCATE TABLE
    Time: 95.247 ms
    
    --Import
    db_sqlfx=# copy test from '/home/thunisoft/test.csv';
    COPY 5000000
    Time: 38811.062 ms (00:38.811)
    

    With an index in place, pg_bulkload is clearly more efficient than copy.

    How do pg_bulkload and copy compare when no index exists?

    --Drop the index
    drop index i_test_id;
    --copy import: 24 s
    db_sqlfx=# copy test from '/home/thunisoft/test.csv';
    COPY 5000000
    Time: 24227.334 ms (00:24.227)
    --pg_bulkload import: 10 s
    pg_bulkload:10s
    --Create the index: 10 s
    db_sqlfx=# CREATE INDEX i_test_id ON public.test USING btree (id);
    CREATE INDEX
    Time: 10552.171 ms (00:10.552)
    
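
    Putting those timings together as a back-of-the-envelope check (numbers taken from the runs above):

```shell
# Totals in seconds, from the measurements above:
# pg_bulkload without the index plus rebuilding the index afterwards,
# versus plain copy with the index still in place (earlier section).
awk 'BEGIN { printf "bulkload + reindex: %.1f s\n", 10 + 10.5 }'
awk 'BEGIN { printf "copy with index:    %.1f s\n", 38.8 }'
```

    Even with the index rebuild included, the pg_bulkload path stays well under copy's 38.8 s with the index in place.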

    unlogged table

    --Keep the index; switch the table and its index to unlogged (direct catalog update)
    update pg_class set relpersistence='u' where relname='test';  
    update pg_class set relpersistence='u' where relname='i_test_id'; 
    
    --pg_bulkload 17s
    Run began on 2021-09-23 16:22:52.192337+08
    Run ended on 2021-09-23 16:23:09.726125+08
    CPU 0.63s/7.43u sec elapsed 17.53 sec
    
    --copy
    db_sqlfx=# copy test from '/home/thunisoft/test.csv';
    COPY 5000000
    Time: 27020.258 ms (00:27.020)
    

    copy improves markedly in the unlogged case, while pg_bulkload gains little, because it already bypasses shared_buffers.

    Here is a batch loading script:

    -bash-4.1$ cat load.sh
    #!/bin/sh
    
    # $1: data file name
    
    file=$1
    
    if [ ! -f "$file" ]
    then
        echo "File does not exist"
        exit 1
    fi
    
    echo "-----------------------------------------------------------------"
    
    tbname=$( echo $file | cut -d . -f1 )
    echo "Table name is : "$tbname
    
    zcat $file | pg_bulkload -i stdin -O public.$tbname -l $tbname.log -o "TYPE=CSV" -o "WRITER=PARALLEL" -d sgdw
    
    echo "load complete"
    echo "-----------------------------------------------------------------"
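
    One detail worth noting in the script: the table name is derived from the file name by taking everything before the first dot, so a file like orders.csv.gz would load into public.orders (the file name here is just an illustration). A quick check of that cut invocation:

```shell
# How load.sh derives the table name: everything before the first dot.
file="orders.csv.gz"
tbname=$(echo "$file" | cut -d . -f1)
echo "$tbname"   # orders
```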
    

    Summary

    1. With an index in place, pg_bulkload is about twice as fast as copy; without an index, pg_bulkload is still faster.

    2. copy's performance improves on an unlogged table, but it still falls short of pg_bulkload.

    3. pg_bulkload's advantage is substantial, mainly because it bypasses the shared buffer and WAL buffer and writes data files directly. Like an unlogged table, this mode writes no WAL, so in a streaming-replication setup the standby must be rebuilt, and base backups must be regenerated.

    4. Import performance remains very good even for very large tables.

  • Original article: https://www.cnblogs.com/zhangfx01/p/15587578.html