Skip to content

sogilis/csv_fast_importer

Repository files navigation

Gem Version Tests status Codacy Badge

CSV Fast Importer

A gem to import CSV files' content into a PostgreSQL or MySQL database. It is respectively based on PostgreSQL COPY and MySQL LOAD DATA INFILE which are designed to be as fast as possible.

Why?

CSV importation is a common task which can be done by more than 6 different gems, but none of them is able to import 1 million of lines in a few seconds (see benchmark below), hence the creation of this gem.

Here is an indicative benchmark to compare available solutions. It represents the duration (ms) to import a 10 000 lines csv file into a local PostgreSQL instance on a laptop running OSX (lower is better):

Benchmark

Like all benchmarks, some tuning can produce different results, yet this chart gives a big picture. See benchmark details.

Requirements

  • Rails (ActiveRecord in fact)
  • PostgreSQL or MySQL

Limitations

  • Usual ActiveRecord process (validations, callbacks, computed fields like created_at...) is bypassed. This is the price for performance
  • Custom enclosing field (ex: ") is not supported yet
  • Custom line separator (ex: \r\n for windows file) is not supported yet
  • MySQL: encoding is not supported yet
  • MySQL: transaction is not supported yet
  • MySQL: row_index is not supported yet
  • MySQL: database must have access to file to import

Note about custom line separator: it might work by opening the file with the universal_newline argument (e.g. file = File.new(path, universal_newline: true)). Unfortunately, we weren't able to reproduce and test it so we don't support it "officialy". You can find more information in this ticket (in French).

Installation

Add the dependency to your Gemfile:

gem 'csv_fast_importer'

Run bundle install.

You can install the gem by yourself too:

$ gem install csv_fast_importer

For MySQL ⚠️ enable local_infile for both client and server. In Rails application, juste add local_infile: true to your database config file databse.yml to configure the database client. See Security Issues with LOAD DATA LOCAL for more details.

Usage

Actually, CSV Fast Importer needs active_record to work. Setup your database configuration as in a usual Rails project. Then, use the CsvFastImporter class:

require 'csv_fast_importer'

file = File.new '/path/to/knights.csv'
imported_lines_count = CsvFastImporter.import(file)

puts imported_lines_count

Under the hood, CSV Fast Importer deletes data from the knights table and imports those from knights.csv by mapping columns' names to table's fields. Note: mapping is case insensitive so database fields' names must be lowercase. For instance, a FIRSTNAME CSV column will be mapped to the firstname field.

Options

Option key  Purpose  Default value
encoding File encoding. PostgreSQL only (see FAQ for more details) 'UTF-8'
col_sep Column separator in file ';'
destination Destination table given base filename (without extension)
mapping Column mapping {}
row_index_column Column name where inserting file row index (not used when nil). PostgreSQL only nil
transaction Execute DELETE and INSERT in same transaction. PostgreSQL only :enabled
deletion Row deletion method (:delete for SQL DELETE, :truncate for SQL TRUNCATE or :none for no deletion before import) :delete

If your CSV file is not encoded with same table than your database, you can specify encoding at the file opening (see FAQ for more details):

file = File.new '/path/to/knights.csv', encoding: 'ISO-8859-1'

You can specify a different separator column with the col_sep option (; by default):

CsvFastImporter.import file, col_sep: '|'

By default, CSV Fast Importer computes the database table's name by taking the basename of the imported file. For instance, considering the imported file /path/to/knights.csv, the table's name will be knights. To bypass this default behaviour, specify the destination option:

file = File.new '/path/to/clients.csv'
CsvFastImporter.import file, destination: 'knights'

Finally, you can precise a custom mapping between CSV file's columns and database fields with the mapping option.

Considering the following knights.csv file:

NAME;KNIGHT_EMAIL
Perceval;perceval@logre.cel
Lancelot;lancelot@logre.cel

To map the KNIGHT_EMAIL column to the email database field:

CsvFastImporter.import file, mapping: { knight_email: :email }

Need help?

See FAQ.

How to contribute?

You can fork and submit new pull request (with tests and explanations). First of all, you need to initialize your environment :

$ brew install postgresql # in macOS
$ apt-get install libpq-dev # in Linux
$ bundle install

Then, start your PostgreSQL database (ex: Postgres.app for the Mac) and setup database environment:

$ bundle exec rake test:db:create

This will connect to localhost PostgreSQL database without user (see config/database.postgres.yml) and create a new database dedicated to tests.

Warning: database instance have to allow database creation with UTF-8 encoding.

Finally, you can run all tests with RSpec like this:

$ bundle exec rspec

By default, PostgreSQL is used. You can set another database with environment variables like this for MySQL:

$ DB_TYPE=mysql DB_ROOT_PASSWORD=password DB_USERNAME=username bundle exec rake test:db:create
$ DB_TYPE=mysql DB_USERNAME=username bundle exec rspec

This will connect to mysql with root user (with password as password) and create database for user username. Use DB_TYPE=mysql DB_USERNAME= (with empty username) for anonymous account.

Warning: Mysql tests require your local database permits LOCAL works. Check your Mysql instance with following command: SHOW GLOBAL VARIABLES LIKE 'local_infile' (should be ON).

Versioning

master is the development branch and releases are published as tags.

We follow the Semantic Versioning 2.0.0 for our gem releases.

In few words:

Given a version number MAJOR.MINOR.PATCH, increment the:

  1. MAJOR version when you make incompatible API changes,
  2. MINOR version when you add functionality in a backwards-compatible manner, and
  3. PATCH version when you make backwards-compatible bug fixes.

Backlog (unordered)

  • Support any column and table case
  • Support custom enclosing field (ex: ")
  • Support custom line serparator (ex: \r\n for windows file)
  • Support custom type convertion
  • MySQL: support encoding parameter. See https://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html
  • MySQL: support transaction parameter
  • MySQL: support row_index_column parameter
  • MySQL: run multiple SQL queries in single statement
  • Refactor tests (with should-> must / should -> expect / subject...)
  • Reduce technical debt on db connection (test & benchmark)
  • SQLite support
  • Add link to activerecord-copy
  • Ease local tests on multiple databases with testcontainers
  • Accept csv header which contains column separator

How to release new version?

Setup rubygems.org account:

curl -u {your_gem_account_name} https://rubygems.org/api/v1/api_key.yaml > ~/.gem/credentials
chmod 0600 ~/.gem/credentials

Make sure you are in master branch and run:

bundle exec rake "release:make[major|minor|patch|x.y.z]"

Example: bundle exec rake "release:make[minor]"

Then, follow instructions.