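I used pg_bulkload to import data a long time ago and ran some comparison tests back then. Another project now needs it, so here are my notes:
1. The RPM package is rather old: after downloading it I found it only supports up to pg94, and I am on pg10, so I gave up on it.
2. Download the source and build it: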
git clone https://github.com/ossc-db/pg_bulkload.git
cd pg_bulkload
make && make install
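-- The build reads pg_config to find the PostgreSQL build settings, so make sure the pg_config on PATH belongs to the target PostgreSQL version.
3. In the database where you want to use it, run: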
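Because the build locates PostgreSQL through pg_config, it can help to confirm which pg_config comes first on PATH before running make, especially when several PostgreSQL versions are installed. A minimal sketch; the function name pg_config_status is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: report which pg_config (if any) the build would find on PATH.
pg_config_status() {
    if command -v pg_config >/dev/null 2>&1; then
        # --version and --bindir are standard pg_config options.
        echo "found: $(pg_config --version) in $(pg_config --bindir)"
    else
        echo "missing"
    fi
}

pg_config_status
```

If it reports the wrong version, put the bin directory of the pg10 installation (its location depends on how PostgreSQL was installed) at the front of PATH and run make again.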
create extension pg_bulkload;
4. Import a CSV file:
pg_bulkload -i c_xxx.csv -O c_xxx -l c_xxx_load.log -d xxx -o "TYPE=CSV" -o "WRITER=PARALLEL"
5. Import a compressed file:
zcat c_xxx.gz |pg_bulkload -i stdin -O c_xxx -l c_xxx_load.log -d xxx -o "TYPE=CSV" -o "WRITER=PARALLEL"
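When a directory mixes plain and gzip-compressed files, the same stdin approach works for both if the decompression step is chosen by file extension. A small sketch; stream_file is a hypothetical helper name, and gzip -dc is used as the equivalent of zcat:

```shell
#!/bin/sh
# Hypothetical helper: write a data file to stdout, decompressing .gz files on the fly.
stream_file() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;   # same effect as zcat
        *)    cat "$1" ;;
    esac
}

# Intended use (not run here), with the same options as the command above:
#   stream_file c_xxx.gz | pg_bulkload -i stdin -O c_xxx -l c_xxx_load.log -d xxx -o "TYPE=CSV" -o "WRITER=PARALLEL"
```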
6. The keys accepted by -o are not listed in the help output, but the log written by a load shows which parameters can be configured:
pg_bulkload 3.1.14 on 2018-09-28 11:31:12.641693+08

INPUT = stdin
PARSE_BADFILE = /var/lib/pgsql/pg10/data/pg_bulkload/20180928113112_sgdw_public_c_xxx.prs
LOGFILE = /var/lib/pgsql/sgdw/data/c_xxx_load.log
LIMIT = INFINITE
PARSE_ERRORS = 0
ENCODING = UTF8
CHECK_CONSTRAINTS = NO
TYPE = CSV
SKIP = 0
DELIMITER = ,
QUOTE = """
ESCAPE = """
NULL =
OUTPUT = public.c_xxx
MULTI_PROCESS = YES
VERBOSE = NO
WRITER = DIRECT
DUPLICATE_BADFILE = /var/lib/pgsql/pg10/data/pg_bulkload/20180928113112_sgdw_public_c_xxx.dup.csv
DUPLICATE_ERRORS = 0
ON_DUPLICATE_KEEP = NEW
TRUNCATE = YES

0 Rows skipped.
29423400 Rows successfully loaded.
0 Rows not loaded due to parse errors.
0 Rows not loaded due to duplicate errors.
0 Rows replaced with new rows.

Run began on 2018-09-28 11:31:12.641693+08
Run ended on 2018-09-28 11:39:48.835205+08

CPU 2.63s/399.05u sec elapsed 516.19 sec
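In principle, the parameters in the log header (the KEY = VALUE entries, shown in bold in the original post) can all be configured. For example, to set VERBOSE to YES, append -o "VERBOSE=YES" to the command.
Also: by default the delimiter is a comma, values are quoted with double quotes, and the writer is DIRECT. If you forget the defaults, run one load with default settings and read the log.
Here is a script for batch loading: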
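Reading the parameter names out of a log can be scripted. The sketch below assumes the log header has one NAME = value entry per line, as pg_bulkload writes it; list_params is a hypothetical helper name, and the sample log is a shortened copy of the one above:

```shell
#!/bin/sh
# Hypothetical helper: list the parameter names from the header of a pg_bulkload log.
list_params() {
    # Keep lines of the form "NAME = value" (value may be empty) and print NAME only.
    sed -n 's/^\([A-Z_][A-Z_]*\) =.*$/\1/p' "$1"
}

# Demonstration on a shortened sample log.
sample=$(mktemp)
cat > "$sample" <<'EOF'
pg_bulkload 3.1.14 on 2018-09-28 11:31:12.641693+08

INPUT = stdin
TYPE = CSV
DELIMITER = ,
VERBOSE = NO
WRITER = DIRECT

0 Rows skipped.
EOF
list_params "$sample"
rm -f "$sample"
```

Each printed name is a candidate key for -o "NAME=value".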
-bash-4.1$ cat load.sh
#!/bin/sh

# $1: data file name

file=$1

if [ ! -f "$file" ]
then
    echo "File does not exist"
    exit 1
fi

echo "-----------------------------------------------------------------"

# Table name is the part of the file name before the first dot
tbname=$( echo "$file" | cut -d . -f1 )
echo "Table name is : $tbname"

zcat "$file" | pg_bulkload -i stdin -O "public.$tbname" -l "$tbname.log" -o "TYPE=CSV" -o "WRITER=PARALLEL" -d sgdw

echo "load complete"
echo "-----------------------------------------------------------------"
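When the script is run over many files, it is easy to miss a load that rejected rows. One possible check is to sum the "Rows not loaded" counts in each log after loading. This is only a sketch: rejected_rows is a hypothetical helper, and it assumes the summary lines look like the ones in the log shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: print the total number of rows a load rejected,
# taken from the "N Rows not loaded due to ..." lines of its log.
rejected_rows() {
    awk '/Rows not loaded due to/ { n += $1 } END { print n + 0 }' "$1"
}

# Intended use (not run here; assumes load.sh from above and *.gz files
# in the current directory):
#   for f in *.gz; do
#       sh load.sh "$f"
#       t=${f%%.*}    # same table name that load.sh derives with cut
#       [ "$(rejected_rows "$t.log")" -eq 0 ] || echo "check $t.log"
#   done
```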