
Data Mover

Data Mover enables your team to move data around without wasting time.

Understanding Data Mover

Data Mover 1.0 moves data from any source to any destination that Spark SQL supports:

  • Get data from Amazon S3 and load it into a database
  • Dump a defined number of rows from a database table
  • Load a CSV file into a database or S3

Getting started

Data Mover is published to Docker Hub as techindicium/spark-datamover:v0.1. For instructions and help, run:

docker run techindicium/spark-datamover:v0.1 --help

Examples

  • Loading a CSV file from AWS S3 into a PostgreSQL database
  1. AWS credentials are passed as environment variables via the -e argument.

  2. CSV options such as header, sep, etc. can be set in the --read-options argument.

docker run -e AWS_SECRET_ACCESS_KEY=MY_SECRET_KEY -e AWS_ACCESS_KEY_ID=MY_ACCESS_KEY techindicium/spark-datamover:v0.1 -s s3://folder/subfolder/subfolder/file.csv --filetype csv --read-options header=true --destination "jdbc:postgresql://localhost:PORT/DATABASE?user=USERNAME&password=PASSWD" --destination-table MY_DEST_TABLE
  • Loading a CSV file into a PostgreSQL database at localhost
docker run --network host techindicium/spark-datamover:v0.1 -s /home/path/your_file.csv --filetype csv --destination "jdbc:postgresql://localhost:PORT/DATABASE?user=USERNAME&password=PASSWD" --destination-table MY_DEST_TABLE
  • Notes
  1. If your database runs on localhost, you must pass --network host to docker run.

  2. The destination URL must be enclosed in quotes; otherwise the shell would interpret the & in the connection string.

  • Exporting a table from PostgreSQL to a CSV file
docker run --network host -v /home/user:/home/user techindicium/spark-datamover:v0.1 -s "jdbc:postgresql://localhost:PORT/DATABASE?user=USERNAME&password=PASSWD" --limit 100 --destination /home/path/my_table.csv --tables my_table --destination-filetype csv

  • Notes
  1. To write the CSV file to your machine, the -v argument to docker run is required; it maps a directory on your host into the container.

Sampling

You can define how many rows to read from the specified source with the --limit argument.
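
For instance, a sketch (paths, port, and credentials are placeholders) that copies only the first 500 rows of a local CSV file into a database, reusing the flags from the examples above:

docker run --network host techindicium/spark-datamover:v0.1 -s /home/path/your_file.csv --filetype csv --limit 500 --destination "jdbc:postgresql://localhost:PORT/DATABASE?user=USERNAME&password=PASSWD" --destination-table MY_DEST_TABLE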

JDBC specifics:

Write and read options can be specified with --write-options k=v,k1=v1 and --read-options k=v,k1=v1.
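
For example, a sketch of a JDBC-to-JDBC copy passing several options at once (connection details are placeholders; fetchsize and batchsize are standard Spark JDBC options from the docs linked below, while save-mode is specific to this tool):

docker run --network host techindicium/spark-datamover:v0.1 -s "jdbc:postgresql://localhost:PORT/DATABASE?user=USERNAME&password=PASSWD" --tables my_table --read-options fetchsize=1000 --destination "jdbc:postgresql://localhost:PORT/OTHER_DATABASE?user=USERNAME&password=PASSWD" --destination-table MY_DEST_TABLE --write-options batchsize=5000,save-mode=append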

Besides the options that the Spark data sources allow for reading and writing (you can check them in the respective docs: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html), you can also specify save-mode as append or overwrite.

e.g.:

docker run --network host techindicium/spark-datamover:v0.1 -s /home/path/your_file.csv --filetype csv --destination "jdbc:postgresql://localhost:PORT/DATABASE?user=USERNAME&password=PASSWD" --destination-table MY_DEST_TABLE --write-options save-mode=append

Oracle limitations:

  • Table names and column names are truncated to 30 characters

  • String columns are truncated to 255 characters

Reporting bugs and contributing code

  • Want to report a bug or request a feature? Open an issue.

Change log

0.2.0:

    - add ind_extracion_date column with the timestamp of the extraction
