High data-rate SQL-to-JSON extraction from SQL Server, PostgreSQL, or MySQL.
Written by Dave Templin. Maintained by Igor Starostenko.
Bigboy is a SQL-to-JSON tool that extracts data from SQL Server, PostgreSQL, or MySQL databases. It is designed to handle extremely high extraction rates (millions of rows, gigabytes of data), which it achieves by running multiple concurrent extraction queries from a configurable thread pool. The tool provides a simple configuration model for managing any number of data extractions, and exposes a minimal command-line interface (CLI) that works well for ad hoc or batch/cron use cases.
- Extract data from SQL Server, PostgreSQL, or MySQL
- Perform simple SQL-to-JSON transformations
- Nest rows to form complex hierarchical (or document-oriented) data
- Define configuration parameters to manage dynamic queries
- Run queries in parallel from a configurable thread pool for high data rates
- Combine data from multiple database sources
- Apply a timezone to dates stored without one
- Download the latest release of bigboy for your operating system
- Create a `bigboy.json` with the list of `connections` (see configuration details for more information; a minimal example follows this list)
- Create a target directory with a `target.json` file naming the `connection` to use (see target for more information)
- Create an `extract.sql` file in the newly created target directory with the SQL query used to extract the data (see the prefetch section to extract data in parallel)
- Consider using available transforms to modify the data on the fly
- Extract the data from the target: `./bigboy ${target}` (you may need to make the file executable by running `chmod +x bigboy`)
- The data is extracted in JSONL (newline-delimited JSON) format into the `out/${target}/` directory by default, in a file named with the current date
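As a minimal sketch, assume a hypothetical connection named `warehouse` and a target directory named `orders` (the `name` key on the connection is assumed here, since targets reference connections by name):

`bigboy.json`:

```json
{
  "connections": [
    {
      "name": "warehouse",
      "driver": "postgres",
      "server": "db.example.com",
      "port": 5432,
      "database": "analytics",
      "user": "readonly",
      "password": "secret"
    }
  ]
}
```

`orders/target.json`:

```json
{
  "connection": "warehouse"
}
```

`orders/extract.sql`:

```sql
SELECT id, customer_id, total, created_at
FROM orders;
```

Running `./bigboy orders` would then write the extracted rows as JSONL under `out/orders/`.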
Authentication credentials and connection configuration for all database sources can be set in a single config file (see the configuration schema).
A target is a directory that contains the required files (instructions) describing how to extract data from a source database. A target may contain the following files (a sample layout follows the list):

- `target.json` - REQUIRED - core configuration file of a target that specifies which connection to use as well as other parameters (see the target.json schema)
- `extract.sql` - REQUIRED - SQL query used to extract data from the source database for a given connection. If prefetch is set up, it needs to apply string interpolation (for example `WHERE id IN (%s)`)
- `prefetch.sql` - OPTIONAL - if `target.json` has `prefetch` set to `true`, `prefetch.sql` loads the ids to be passed into `extract.sql` and run in parallel
- `nest.sql` - OPTIONAL - if `target.json` has a `nest` array of objects configured, `nest.sql` loads the rows to be added to the main output as a new column for each object
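For orientation, a hypothetical target directory might look like this:

```
orders/
├── target.json    # required: connection name and other parameters
├── extract.sql    # required: main extraction query
├── prefetch.sql   # optional: id query for parallel extraction
└── nest.sql       # optional: child rows for nested arrays
```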
To extract large datasets it is recommended to use the prefetch feature, which splits the extract into multiple parallel processes. When a target has `prefetch` set to `true`, a `prefetch.sql` file must be provided containing a `SELECT` statement that returns a single integer column. The results of the prefetch query are passed to multiple processes, each executing `extract.sql` with a subset of the prefetched numbers. The extract query needs to apply string interpolation (for example `WHERE id IN (%s)`).
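As an illustration, a hypothetical pair of queries for extracting orders in parallel:

```sql
-- prefetch.sql: returns a single integer column;
-- each worker receives a subset of these ids
SELECT id FROM orders;
```

```sql
-- extract.sql: %s is interpolated with a comma-separated
-- subset of the prefetched ids
SELECT id, customer_id, total, created_at
FROM orders
WHERE id IN (%s);
```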
Queries can accept parameters to customize the request from outside the target at runtime. Params can be configured in the target with a given type and default value. At runtime the params are passed into `prefetch.sql` or `extract.sql` using syntax like `WHERE date >= ?1`, where `1` is the index of the param (starting from `1`).
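As a sketch, a hypothetical `log` target filtering by a start date might configure a param in `target.json` (shown here as an array, since a target can take several params):

```json
{
  "connection": "warehouse",
  "params": [
    { "name": "start_date", "type": "date", "default": "2017-07-21" }
  ]
}
```

and reference it by index in `extract.sql`:

```sql
SELECT id, message, logged_at
FROM log
WHERE logged_at >= ?1;
```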
Allows the output JSON records to have nested arrays. To enable this, the target has to have `nest` params configured and a `nest.sql` defined.
The nest query runs before the main extract, and its results are inserted in array form into the column defined by the target's `childKey` param. For records to be inserted into the result of the main extract query, `nest.sql` needs to have a column named `_parent` whose value matches the value of the `extract.sql` column named by the target's `parentKey` param.
If `nest.sql` has a `_value` column, only the values are inserted into the array. Otherwise the entire record object (except the `_parent` column) is added inside the array.
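A hypothetical sketch, nesting order items under each order:

`target.json` (fragment):

```json
{
  "nest": [
    { "childKey": "items", "parentKey": "id" }
  ]
}
```

`nest.sql`:

```sql
-- _parent matches the id column returned by extract.sql;
-- each matching row is appended to the "items" array of its parent record
SELECT order_id AS _parent, sku, quantity, price
FROM order_items;
```

Each output record would then look roughly like:

```json
{"id": 42, "customer_id": 7, "total": 99.5, "items": [{"sku": "A1", "quantity": 2, "price": 49.75}]}
```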
Not yet implemented.
Produces multiple files instead of one. To enable this, the target has to have the `split` param set. When configured, the output argument set with the `-o` flag has to be a directory path without an extension.
Currently only splitting by `date` is supported: every output file contains the records for one day returned from the extract query. Use the `layout` param for MySQL or when the date column is stored as a STRING.
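Based on the split schema below, the configuration would presumably look like this hypothetical `target.json` fragment:

```json
{
  "split": { "by": "date", "value": "created_at" }
}
```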
All dates are assumed to be in GMT unless a timezone is specified. If a time is not specified then midnight GMT is assumed. Examples below illustrate various scenarios of specifying a date or date-range.
The following examples assume there is a target named `log` with a single parameter of type `date` representing the start date for the extraction.
Example | Comments
---|---
`bigboy log 2017-07-21` | 7/21/2017 at midnight GMT
`bigboy log "2017-07-21 15:00:00"` | 7/21/2017 at 3pm GMT
`bigboy log today` | Midnight GMT of the current day
`bigboy log yesterday` | Midnight GMT of the previous day
The following examples assume there is a target named `sales` with two parameters of type `date` representing a date range for the extraction.
Example | Comments
---|---
`bigboy sales 2017-07-21 2017-07-23` | From 7/21/2017 to 7/23/2017, midnight-to-midnight GMT
`bigboy sales 2017-07-21 2d` | Two days starting at midnight GMT on 7/21/2017 (through 7/23/2017)
Note: IANA timezone support relies on Go's `time.LoadLocation`. From the Go documentation: the time-zone database needed by `LoadLocation` may not be present on all systems, especially non-Unix systems. `LoadLocation` looks in the directory or uncompressed zip file named by the `ZONEINFO` environment variable, if any, then looks in known installation locations on Unix systems, and finally looks in `$GOROOT/lib/time/zoneinfo.zip`.
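For example, a connection (or target) can set `timezone` so that dates stored without a timezone are interpreted in that zone; a fragment with hypothetical values:

```json
{
  "driver": "postgres",
  "server": "db.example.com",
  "database": "analytics",
  "timezone": "America/Los_Angeles"
}
```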
- `-c` Bigboy config file path (default="bigboy.json")
- `-e` Maximum overall number of errors before aborting (default=100)
- `-n` Include nulls in output (default=false)
- `-o` Output directory (creates "out" directory if not specified)
- `-p` Number of rows per page extracted (default=1000)
- `-q` Suppress informational output (default=false)
- `-r` Number of consecutive errors before aborting (default=3)
- `-v` Print version info about bigboy and exit
- `-w` Number of background workers (default=4)
The flags above take priority over the configuration in the `bigboy.json` file.
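For example, a hypothetical invocation combining flags with the `sales` target and a date range (flag placement before the target is assumed here):

```sh
./bigboy -w 8 -p 5000 -o /data/exports sales 2017-07-21 2017-07-23
```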
This section describes the format of the `bigboy.json` file (the default, unless the `-c` flag is used).
Name | Type | Required | Description
---|---|---|---
`connections` | connection[] | + | Array of connection configurations for each database source
`errors` | INTEGER | - | Total number of errors to ignore before aborting
`nulls` | BOOLEAN | - | Whether nulls are allowed to be included in the output
`page` | INTEGER | - | Number of rows per page
`quiet` | BOOLEAN | - | Whether the terminal output should be suppressed
`retries` | INTEGER | - | Number of retries to perform in case of an error
`workers` | INTEGER | - | Number of workers to run extracts in parallel
Each connection has the following fields:

Name | Type | Required | Description
---|---|---|---
`driver` | STRING | + | Database driver (`mysql`, `mssql`, `postgres`)
`server` | STRING | + | Database connection string
`database` | STRING | + | Database name
`dsn` | STRING | - | DB DSN
`port` | INTEGER | - | DB port
`user` | STRING | - | DB username for authentication
`password` | STRING | - | DB password for authentication
`max` | INTEGER | - | Maximum number of open database connections
`timezone` | STRING | - | `UTC`, `Local`, or an IANA Time Zone database name (for example `America/Los_Angeles`)
This section describes the format of the `target.json` file.
Name | Type | Required | Description
---|---|---|---
`connection` | STRING | + | Connection name
`fetch` | STRING | - | File name of the main extract SQL query. Default `extract.sql`
`params` | param[] | - | Params to pass into a query, for example to filter the data
`prefetch` | BOOLEAN | - | Whether `prefetch.sql` should be used for parallel extraction
`nest` | nest[] | - | Array of columns to be added for each record
`script` | STRING | - | Not yet implemented
`split` | split | - | Produces multiple files instead of one
`timezone` | STRING | - | Defaults to the connection timezone
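Putting the pieces together, a fuller `target.json` might look like this sketch (all values hypothetical):

```json
{
  "connection": "warehouse",
  "prefetch": true,
  "params": [
    { "name": "start_date", "type": "date", "default": "2017-07-21" }
  ],
  "nest": [
    { "childKey": "items", "parentKey": "id" }
  ],
  "timezone": "UTC"
}
```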
Each `nest` object has the following fields:

Name | Type | Required | Description
---|---|---|---
`connection` | STRING | - | Allows using a different database source. Uses the target connection by default
`childKey` | STRING | + | New array property that will contain all matched records
`parentKey` | STRING | + | Name of the `extract.sql` column matched against `_parent`. Has to be an integer
`fetch` | STRING | - | File name of the nest SQL query. Default `nest.sql` for the first nest column
`timezone` | STRING | - | Defaults to the connection timezone
Each param has the following fields:

Name | Type | Required | Description
---|---|---|---
`name` | STRING | - | Name of the param. Used only for logging and readability
`type` | STRING | - | Converts the text argument to the given type. One of `integer`, `float`, `string`, `boolean`, or `date`. Required unless the param is null
`default` | STRING | - | Default value used if no command-line params are passed
The `split` object has the following fields:

Name | Type | Required | Description
---|---|---|---
`by` | STRING | + | Has to be set to `date`
`layout` | STRING | - | Converts a string to a date for MySQL if the date column is a STRING (see the golang layout format)
`value` | STRING | + | Name of the column containing the value by which to split the files
Install golang.
Clone the repo and `cd` into it.
```sh
$ go build
$ go test .
```

On Windows:

```sh
$ build windows
$ build linux
$ build darwin
```

On unix:

```sh
$ bash build.sh
```
Uploading build executables is automated using the `release` workflow in GitHub Actions. It is triggered by pushing a new git tag to origin.
Make sure to keep CHANGELOG.md updated and the version incremented.
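For example, assuming a hypothetical version number:

```sh
$ git tag v1.2.3
$ git push origin v1.2.3
```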
There are lots of ways to approach ETL, and lots of vendors that want to sell you a solution! Here are some additional references that may be helpful...