High data-rate SQL-to-JSON extraction from SQL Server, PostgreSQL, or MySQL.
Written by Dave Templin. Maintained by Igor Starostenko.
Bigboy is a SQL-to-JSON tool that extracts data from SQL Server, PostgreSQL, or MySQL databases. It is designed to handle extremely high extraction rates (millions of rows, gigabytes of data), which it achieves by running multiple concurrent extraction queries from a configurable thread pool. The tool provides a simple configuration model for managing any number of data extractions, and exposes a minimal command-line interface (CLI) that works well for ad hoc or batch/cron use cases.
- Extract data from SQL Server, PostgreSQL, or MySQL
- Perform simple SQL-to-JSON transformations
- Nest rows to form complex hierarchical (or document-oriented) data
- Define configuration parameters to manage dynamic queries
- Run queries in parallel from a configurable thread pool for high data rates
- Combine data from multiple database sources
- Apply a timezone to dates stored without one
- Download the latest release of bigboy for your operating system
- Create a `bigboy.json` with the list of `connections` (see configuration details for more information; a minimal example follows this list)
- Create a target directory with a `target.json` file naming the `connection` to use (see target for more information)
- Create an `extract.sql` file in the newly created target directory with the SQL query used to extract the data (see the prefetch section to extract data in parallel)
- Consider using available transforms to modify the data on the fly
- Extract the data from the target: `./bigboy ${target}` (you may need to make the file executable by running `chmod +x bigboy`)
- The data is extracted in JSONL (newline-delimited JSON) format into the `out/${target}/` directory by default, in a file named with the current date
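As a minimal sketch, assume a hypothetical connection named `warehouse` and a target directory named `orders` (the `name` key on the connection is assumed here, since targets reference connections by name):

`bigboy.json`:

```json
{
  "connections": [
    {
      "name": "warehouse",
      "driver": "postgres",
      "server": "db.example.com",
      "port": 5432,
      "database": "analytics",
      "user": "readonly",
      "password": "secret"
    }
  ]
}
```

`orders/target.json`:

```json
{
  "connection": "warehouse"
}
```

`orders/extract.sql`:

```sql
SELECT id, customer_id, total, created_at
FROM orders;
```

Running `./bigboy orders` would then write the extracted rows as JSONL under `out/orders/`.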
Authentication credentials and connection configuration for all database sources can be set in a single config file (see the configuration schema).
A target is a directory that contains the required files (instructions) describing how to extract data from a source database. A target may contain the following files (a sample layout follows the list):

- `target.json` - REQUIRED - core configuration file of a target that specifies which connection to use as well as other parameters (see the target.json schema)
- `extract.sql` - REQUIRED - SQL query used to extract data from the source database for a given connection. If prefetch is set up, it needs to apply string interpolation (for example `WHERE id IN (%s)`)
- `prefetch.sql` - OPTIONAL - if `target.json` has `prefetch` set to `true`, `prefetch.sql` loads the ids to be passed into `extract.sql` and run in parallel
- `nest.sql` - OPTIONAL - if `target.json` has a `nest` array of objects configured, `nest.sql` loads the rows to be added to the main output as a new column for each object
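For orientation, a hypothetical target directory might look like this:

```
orders/
├── target.json    # required: connection name and other parameters
├── extract.sql    # required: main extraction query
├── prefetch.sql   # optional: id query for parallel extraction
└── nest.sql       # optional: child rows for nested arrays
```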
To extract large datasets it is recommended to use the prefetch feature, which splits the extract into multiple parallel processes. When a target has `prefetch` set to `true`, a `prefetch.sql` file must be provided containing a `SELECT` statement that returns a single integer column. The results of the prefetch query are passed to multiple processes, each executing `extract.sql` with a subset of the prefetched numbers. The extract query needs to apply string interpolation (for example `WHERE id IN (%s)`).
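As an illustration, a hypothetical pair of queries for extracting orders in parallel:

```sql
-- prefetch.sql: returns a single integer column;
-- each worker receives a subset of these ids
SELECT id FROM orders;
```

```sql
-- extract.sql: %s is interpolated with a comma-separated
-- subset of the prefetched ids
SELECT id, customer_id, total, created_at
FROM orders
WHERE id IN (%s);
```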
Queries can accept parameters to customize the request from outside the target at runtime. Params can be configured in the target with a given type and default value. At runtime the params are passed into `prefetch.sql` or `extract.sql` using syntax like `WHERE date >= ?1`, where `1` is the index of the param (starting from `1`).
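As a sketch, a hypothetical `log` target filtering by a start date might configure a param in `target.json` (shown here as an array, since a target can take several params):

```json
{
  "connection": "warehouse",
  "params": [
    { "name": "start_date", "type": "date", "default": "2017-07-21" }
  ]
}
```

and reference it by index in `extract.sql`:

```sql
SELECT id, message, logged_at
FROM log
WHERE logged_at >= ?1;
```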
Allows the output JSON records to have nested arrays. To enable this, the target has to have `nest` params configured and a `nest.sql` defined.
The nest query runs before the main extract, and its results are inserted in array form into the column defined by the target's `childKey` param. For records to be inserted into the result of the main extract query, `nest.sql` needs to have a column named `_parent` whose value matches the value of the `extract.sql` column named by the target's `parentKey` param.
If `nest.sql` has a `_value` column, only the values are inserted into the array. Otherwise the entire record object (except the `_parent` column) is added inside the array.
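A hypothetical sketch, nesting order items under each order:

`target.json` (fragment):

```json
{
  "nest": [
    { "childKey": "items", "parentKey": "id" }
  ]
}
```

`nest.sql`:

```sql
-- _parent matches the id column returned by extract.sql;
-- each matching row is appended to the "items" array of its parent record
SELECT order_id AS _parent, sku, quantity, price
FROM order_items;
```

Each output record would then look roughly like:

```json
{"id": 42, "customer_id": 7, "total": 99.5, "items": [{"sku": "A1", "quantity": 2, "price": 49.75}]}
```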
Not yet implemented.
Produces multiple files instead of one. To enable this, the target has to have the `split` param set. When configured, the output argument set with the `-o` flag has to be a directory path without an extension.
Currently only splitting by `date` is supported: every output file contains the records for one day returned from the extract query. Use the `layout` param for MySQL or when the date column is stored as a STRING.
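Based on the split schema below, the configuration would presumably look like this hypothetical `target.json` fragment:

```json
{
  "split": { "by": "date", "value": "created_at" }
}
```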
All dates are assumed to be in GMT unless a timezone is specified. If a time is not specified then midnight GMT is assumed. Examples below illustrate various scenarios of specifying a date or date-range.
The following examples assume there is a target named `log` with a single parameter of type `date` representing the start date for the extraction.
Example | Comments
---|---
`bigboy log 2017-07-21` | 7/21/2017 at midnight GMT
`bigboy log "2017-07-21 15:00:00"` | 7/21/2017 at 3pm GMT
`bigboy log today` | Midnight GMT of the current day
`bigboy log yesterday` | Midnight GMT of the previous day
The following examples assume there is a target named `sales` with two parameters of type `date` representing a date range for the extraction.
Example | Comments
---|---
`bigboy sales 2017-07-21 2017-07-23` | From 7/21/2017 to 7/23/2017, midnight-to-midnight GMT
`bigboy sales 2017-07-21 2d` | Two days starting at midnight GMT on 7/21/2017 (through 7/23/2017)
Note: IANA timezone support relies on Go's `time.LoadLocation`. From the Go documentation: the time-zone database needed by `LoadLocation` may not be present on all systems, especially non-Unix systems. `LoadLocation` looks in the directory or uncompressed zip file named by the `ZONEINFO` environment variable, if any, then looks in known installation locations on Unix systems, and finally looks in `$GOROOT/lib/time/zoneinfo.zip`.
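For example, a connection (or target) can set `timezone` so that dates stored without a timezone are interpreted in that zone; a fragment with hypothetical values:

```json
{
  "driver": "postgres",
  "server": "db.example.com",
  "database": "analytics",
  "timezone": "America/Los_Angeles"
}
```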
- `-c` Bigboy config file path (default="bigboy.json")
- `-e` Maximum overall number of errors before aborting (default=100)
- `-n` Include nulls in output (default=false)
- `-o` Output directory (creates "out" directory if not specified)
- `-p` Number of rows per page extracted (default=1000)
- `-q` Suppress informational output (default=false)
- `-r` Number of consecutive errors before aborting (default=3)
- `-v` Print version info about bigboy and exit
- `-w` Number of background workers (default=4)
The flags above take priority over the configuration in the `bigboy.json` file.
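For example, a hypothetical invocation combining flags with the `sales` target and a date range (flag placement before the target is assumed here):

```sh
./bigboy -w 8 -p 5000 -o /data/exports sales 2017-07-21 2017-07-23
```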
This section describes the format of the `bigboy.json` file (the default, unless the `-c` flag is used).
Name | Type | Required | Description
---|---|---|---
`connections` | connection[] | + | Array of connection configurations for each database source
`errors` | INTEGER | - | Total number of errors to ignore before aborting
`nulls` | BOOLEAN | - | Whether nulls are allowed to be included in the output
`page` | INTEGER | - | Number of rows per page
`quiet` | BOOLEAN | - | Whether the terminal output should be suppressed
`retries` | INTEGER | - | Number of retries to perform in case of an error
`workers` | INTEGER | - | Number of workers to run extracts in parallel
Each connection has the following fields:

Name | Type | Required | Description
---|---|---|---
`driver` | STRING | + | Database driver (`mysql`, `mssql`, `postgres`)
`server` | STRING | + | Database connection string
`database` | STRING | + | Database name
`dsn` | STRING | - | DB DSN
`port` | INTEGER | - | DB port
`user` | STRING | - | DB username for authentication
`password` | STRING | - | DB password for authentication
`max` | INTEGER | - | Maximum number of open database connections
`timezone` | STRING | - | `UTC`, `Local`, or an IANA Time Zone database name (for example `America/Los_Angeles`)
This section describes the format of the `target.json` file.
Name | Type | Required | Description
---|---|---|---
`connection` | STRING | + | Connection name
`fetch` | STRING | - | File name of the main extract SQL query. Default `extract.sql`
`params` | param[] | - | Params to pass into a query, for example to filter the data
`prefetch` | BOOLEAN | - | Whether `prefetch.sql` should be used for parallel extraction
`nest` | nest[] | - | Array of columns to be added for each record
`script` | STRING | - | Not yet implemented
`split` | split | - | Produces multiple files instead of one
`timezone` | STRING | - | Defaults to the connection timezone
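Putting the pieces together, a fuller `target.json` might look like this sketch (all values hypothetical):

```json
{
  "connection": "warehouse",
  "prefetch": true,
  "params": [
    { "name": "start_date", "type": "date", "default": "2017-07-21" }
  ],
  "nest": [
    { "childKey": "items", "parentKey": "id" }
  ],
  "timezone": "UTC"
}
```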
Each `nest` object has the following fields:

Name | Type | Required | Description
---|---|---|---
`connection` | STRING | - | Allows using a different database source. Uses the target connection by default
`childKey` | STRING | + | New array property that will contain all matched records
`parentKey` | STRING | + | Name of the `extract.sql` column matched against `_parent`. Has to be an integer
`fetch` | STRING | - | File name of the nest SQL query. Default `nest.sql` for the first nest column
`timezone` | STRING | - | Defaults to the connection timezone
Each param has the following fields:

Name | Type | Required | Description
---|---|---|---
`name` | STRING | - | Name of the param. Used only for logging and readability
`type` | STRING | - | Converts the text argument to the given type. One of `integer`, `float`, `string`, `boolean`, or `date`. Required unless the param is null
`default` | STRING | - | Default value used if no command-line params are passed
The `split` object has the following fields:

Name | Type | Required | Description
---|---|---|---
`by` | STRING | + | Has to be set to `date`
`layout` | STRING | - | Converts a string to a date for MySQL if the date column is a STRING (see the golang layout format)
`value` | STRING | + | Name of the column containing the value by which to split the files
Install golang.
Clone the repo and `cd` into it.
```sh
$ go build
$ go test .
```

On Windows:

```sh
$ build windows
$ build linux
$ build darwin
```

On unix:

```sh
$ bash build.sh
```
Uploading build executables is automated using the `release` workflow in GitHub Actions. It is triggered by pushing a new git tag to origin.
Make sure to keep CHANGELOG.md updated and the version incremented.
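For example, assuming a hypothetical version number:

```sh
$ git tag v1.2.3
$ git push origin v1.2.3
```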
There are lots of ways to approach ETL, and lots of vendors that want to sell you a solution! Here are some additional references that may be helpful...