Pg::BulkLoad - Bulk Load for Postgres with the ability to skip bad records.
version 2.06
Load comma- and tab-delimited files into Postgres, skipping bad records.
This example will take a table name followed by file names (wildcards allowed) from the command line and load the data.
Shell> split -l 50000 -a 3 -d mydata.csv load
Shell> myloadscript.pl tablename load*
=== myloadscript.pl ===
use strict;
use warnings;
use DBI;
use Pg::BulkLoad;

# $dbname, $username, and $password are assumed to be set already
# (see below for reading them from the command line).
my ( $dbname, $username, $password );

my $pgc = Pg::BulkLoad->new(
    pg => DBI->connect( "dbi:Pg:dbname=$dbname", $username, $password,
        { AutoCommit => 0 } ),
    errorfile  => '/tmp/pgbulk.error',
    errorlimit => 500,
);
my $table = shift @ARGV;
my @files = map { `ls -b $_` } @ARGV;   # let the shell expand wildcards
chomp @files;
for my $file (@files) { $pgc->load( $file, $table, 'csv' ) }
The constructor takes arguments in hash format:

pg => DBD::Pg database handle (mandatory)
errorfile => a file to log errors to (mandatory)
errorlimit => a limit of errors before giving up (optional)
Attempts to load your data. Takes three parameters:

$file - the file you're trying to load.
$table - the table to load into.
$format - either 'text' or 'csv'.
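For example (the file path and table name here are hypothetical):

$pgc->load( '/tmp/load000', 'mytable', 'csv' );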
The Postgres 'COPY FROM' command lacks a mechanism for skipping bad records. Sometimes we need to ingest 'dirty' data and make the best of it.
Pg::BulkLoad attempts to load your file via the COPY FROM command. If that fails, it removes the bad line from its working copy, loads all of the records before the failure, and then tries again with the remaining data.
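In outline, that skip-and-retry logic looks something like the following sketch. It is illustrative only, not the module's implementation: the real module rewrites a working copy of the file and lets COPY read it from disk, while this version uses COPY FROM STDIN so it is self-contained. It assumes $dbh is a DBD::Pg handle with AutoCommit off, that each element of @lines ends with a newline, and that the failing line number can be parsed from the Postgres error message.

use strict;
use warnings;

# Illustrative sketch of skip-and-retry -- not Pg::BulkLoad's actual code.
sub try_load {
    my ( $dbh, $table, @lines ) = @_;
    return unless @lines;
    eval {
        $dbh->do("COPY $table FROM STDIN WITH ( FORMAT csv )");
        $dbh->pg_putcopydata($_) for @lines;
        $dbh->pg_putcopyend;
        $dbh->commit;
    };
    return unless $@;    # whole batch loaded cleanly
    $dbh->rollback;
    # Postgres reports the bad record, e.g. "CONTEXT: COPY mytable, line 7"
    my ($bad) = $@ =~ /line (\d+)/ or die $@;    # give up if unparseable
    try_load( $dbh, $table, @lines[ 0 .. $bad - 2 ] );    # records before the bad line
    try_load( $dbh, $table, @lines[ $bad .. $#lines ] );  # records after it
}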
If your data is clean, the COPY FROM command is quite fast; however, for each failure Pg::BulkLoad has to rewrite the input file. If your data has a lot of bad records, small batches are recommended; for clean data, performance will be better with a larger batch size. To keep this program simple, I've left chunking larger files up to the user. The split program will quickly split larger files, but you can split them in Perl if you prefer (see the sketch below). Pg::BulkLoad holds the entire data file in memory (to improve performance on dirty files), which imposes a practical maximum file size.
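If you'd rather split in Perl than use the split program, something like this works. It is a minimal sketch; the 50,000-line chunk size and 'load' prefix mirror the split example above.

use strict;
use warnings;

# Split mydata.csv into 50,000-line chunks named load000, load001, ...
my $chunk_size = 50_000;
my ( $count, $part, $out ) = ( 0, 0, undef );
open my $in, '<', 'mydata.csv' or die $!;
while ( my $line = <$in> ) {
    if ( $count % $chunk_size == 0 ) {
        close $out if $out;
        open $out, '>', sprintf( 'load%03d', $part++ ) or die $!;
    }
    print {$out} $line;
    $count++;
}
close $out if $out;
close $in;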
Since Pg::BulkLoad passes all of the work to COPY, it is subject to the limitation that the source file must be readable through the file system by the postgres server (usually the postgres user). To avoid permission problems, Pg::BulkLoad copies the file to /tmp for loading (leaving the original intact if it has to evict records). Pg::BulkLoad must be run locally to the server, which means that your connection host will almost always be localhost.
The internal error counting is for the life of an instance, not per data file. If you have 100 source files, an error limit of 500, and 1000 errors in your source, you will likely get about half the data loaded before this module quits. Be prepared to deal with the consequences of a partial load. If you want the limit applied per file, construct a fresh instance for each one, as sketched below.
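A sketch of that per-file approach follows; $dbh, @files, and $table are assumed from the earlier example, and the per-file error log naming is a hypothetical choice.

for my $file (@files) {
    my $loader = Pg::BulkLoad->new(
        pg         => $dbh,
        errorfile  => "/tmp/pgbulk-$file.error",   # hypothetical per-file log
        errorlimit => 500,                         # now applies to this file only
    );
    $loader->load( $file, $table, 'csv' );
}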
The example assumes that you'll set variables for your dbname, user and password. Optionally you could make them command line parameters with this code:
my $dbname = shift @ARGV;
my $username = shift @ARGV;
my $password = shift @ARGV;
my $tablename = shift @ARGV;
Or you could hard-code a value for any of them. The keyword my declares a variable; @ARGV is Perl's array of command line arguments.
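With those four lines at the top of the script, the invocation becomes (the database name, user, and password here are placeholders):

Shell> myloadscript.pl mydb myuser mypass tablename load*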
My first CPAN module was Pg::BulkCopy, written because I had this problem. I found something better written in C, so I deprecated my original module, which needed a rewrite anyway. But sometimes the utility I switched to doesn't want to compile, and I got tired of that; I still had my original problem of ingesting a lot of data from an external source that has a certain amount of errors and is creative in finding new ways to get bad records past my preprocessor. Pg::BulkCopy wanted to be an import/export utility; Pg::BulkLoad only deals with the core issue of getting the good data loaded.
To properly test it, you'll need to set DB_TESTING to a true value in your environment before running the tests. When this variable isn't set, the tests mock a database for a few of the simpler tests and skip the rest.
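For example, with the common prove test runner (run from the distribution directory):

Shell> export DB_TESTING=1
Shell> prove -l t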
John Karr brainbuz@brainbuz.org
This software is Copyright (c) 2018 by John Karr.
This is free software, licensed under:
The GNU General Public License, Version 3, June 2007