Uses EMR clusters to export and import DynamoDB tables to/from S3. This uses the same routines as Data Pipeline, but it runs everything through a single cluster for all tables rather than a cluster per table.
- Clone this repo to a folder called /usr/local/dynamodb-emr
- Install Python:
    apt-get install python
- Install the Python dependencies:
    pip install -r requirements.txt
- Configure at least one boto profile
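  As an illustration (the profile name, keys, and region below are placeholders), a profile can be added to the standard AWS config files, which recent boto versions read; older boto versions may use ~/.boto instead:

```
# Example only: add a boto/AWS profile named "backup" (placeholder name and keys).
cat >> ~/.aws/credentials <<'EOF'
[backup]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = EXAMPLESECRETKEY
EOF

cat >> ~/.aws/config <<'EOF'
[profile backup]
region = us-east-1
EOF
```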
- Create a new IAM role called dynamodb_emr_backup_restore using the IAM policy contained in dynamodb_emr_backup_restore.IAMPOLICY.json
The role name can be changed by editing common-json/ec2-attributes.json
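  A rough sketch of creating the role with the AWS CLI follows; the trust policy file name is an assumption (it must allow ec2.amazonaws.com to assume the role, since EMR instances use the role via an instance profile):

```
# Sketch only: create the role, attach the bundled policy, and expose it to
# EMR's EC2 instances via an instance profile. ec2-trust-policy.json is a
# placeholder file name.
aws iam create-role \
  --role-name dynamodb_emr_backup_restore \
  --assume-role-policy-document file://ec2-trust-policy.json

aws iam put-role-policy \
  --role-name dynamodb_emr_backup_restore \
  --policy-name dynamodb_emr_backup_restore \
  --policy-document file://dynamodb_emr_backup_restore.IAMPOLICY.json

aws iam create-instance-profile \
  --instance-profile-name dynamodb_emr_backup_restore
aws iam add-role-to-instance-profile \
  --instance-profile-name dynamodb_emr_backup_restore \
  --role-name dynamodb_emr_backup_restore
```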
- Configure the size of your EMR cluster
Edit the common-json/instance-groups.json file to set the number of masters and workers (typically, a single master and worker is fine)
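  For reference, a minimal definition in the standard `aws emr create-cluster --instance-groups` JSON format looks roughly like this (the instance types and counts are examples, and the keys in the bundled file may differ slightly):

```
# Sketch only: one master and one core (worker) node.
cat > common-json/instance-groups.json <<'EOF'
[
  { "Name": "Master",  "InstanceGroupType": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1 },
  { "Name": "Workers", "InstanceGroupType": "CORE",   "InstanceType": "m3.xlarge", "InstanceCount": 1 }
]
EOF
```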
- Run the invokeEMR.sh script as follows
./invokeEMR.sh app_name emr_cluster_name boto_profile_name table_filter read_throughput_percentage json_output_directory S3_location
Where
- app_name is a 'friendly name' for the DynamoDB table set you wish to export
- emr_cluster_name is a name to give to the EMR cluster
- boto_profile_name is a valid boto profile name containing your keys and a region
- table_filter is a filter for which table names to export (e.g. MYAPP_PROD will export all tables starting with MYAPP_PROD)
- read_throughput_percentage is the proportion of provisioned read throughput to use (e.g. 0.45 will use 45% of the provisioned read throughput)
- json_output_directory is a folder in which to output the JSON files used to configure the EMR cluster for the export
- S3_location is a base S3 location to store the exports and all logs (e.g. s3://mybucket/myfolder)
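For example (all values below are placeholders):

```
# Example only: export every table whose name starts with MYAPP_PROD, using 45%
# of the provisioned read throughput and the "backup" boto profile.
./invokeEMR.sh MYAPP my-emr-export backup MYAPP_PROD 0.45 /tmp/emr-json s3://mybucket/dynamodb-backups
```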
When the export runs, it also generates the configuration needed to execute an import. You can find the import configuration file (importSteps.json) within the json output directory you used for the export; it is also copied to the S3 bucket at the completion of the export.
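If you are restoring from a machine other than the one that ran the export, the file can be pulled back from S3 first (the exact key under your S3 base location may differ from the example below):

```
# Example only: copy the generated import configuration from S3 to a local folder.
aws s3 cp s3://mybucket/dynamodb-backups/importSteps.json /tmp/emr-json/importSteps.json
```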
Before running the import, you need to perform two tasks:
- The tables you are importing data into MUST already exist with the same key structure in the region you wish to import into
- Edit the restoreEMR.sh script to set the region you wish to restore the data into (the CLUSTER_REGION variable at the top of the script)
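  For example, assuming CLUSTER_REGION is a plain shell assignment near the top of the script, it could be updated by hand or with something like:

```
# Example only: point the restore at eu-west-1 (assumes a CLUSTER_REGION=... assignment; GNU sed).
sed -i 's/^CLUSTER_REGION=.*/CLUSTER_REGION=eu-west-1/' restoreEMR.sh
```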
Once these are done, you can invoke the restore like so
./restoreEMR.sh app_name emr_cluster_name boto_profile_name local_json_files_path s3_path_for_logs
Where
- app_name is a 'friendly name' for the DynamoDB table set you wish to import
- emr_cluster_name is a name to give to the EMR cluster
- boto_profile_name is a valid boto profile name containing your keys and a region
- local_json_files_path is a folder containing the json files produced by the export
- s3_path_for_logs is a base S3 location to store logs from EMR related to the import
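For example, continuing the placeholder values from the export above:

```
# Example only: restore using the JSON files produced by the export.
./restoreEMR.sh MYAPP my-emr-import backup /tmp/emr-json s3://mybucket/dynamodb-restore-logs
```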
NOTE: The write throughput to use for the DynamoDB tables is actually defined in the script that runs at export time, because that is when it is written into the importSteps.json file. If you wish to increase it, edit the generated importSteps.json file before running the restore.
The basic mechanics of the process are as follows:
- Check whether there are any EMR clusters already running for 'this' app. If so, exit; otherwise, carry on
- Set up the common configuration for the cluster
- Call the python script to generate the steps (tasks) for EMR for each table. This essentially lists all the tables in the region, applies the provided filter, and then generates the JSON that can be passed to EMR to export the tables
- Once the steps JSON is present, create a new cluster with the AWS CLI. Cluster creation can fail, so failed attempts are retried (a rough sketch of the equivalent CLI calls follows this list)
- Submit the tasks to the cluster and poll the cluster until it completes. Any step error results in a failure being logged
- Once we know everything was successful, write the export and import steps files to S3 in case this machine has issues. We also write flag files to S3 indicating the progress of the export (in progress, complete, error, etc.) so that any other process that needs to ingest this data can poll these status files
- Create a new EMR cluster with the import steps file as the tasks to perform
- Poll the cluster to ensure success
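For orientation, the cluster creation and polling performed by the scripts boils down to AWS CLI calls roughly like the following (the cluster name, profile, release label, and steps file name are placeholders, and the scripts' exact flags may differ):

```
# Sketch only: roughly what the export script does under the hood.
CLUSTER_ID=$(aws emr create-cluster \
  --profile backup \
  --name "my-emr-export" \
  --release-label emr-4.7.2 \
  --log-uri s3://mybucket/dynamodb-backups/logs/ \
  --instance-groups file://common-json/instance-groups.json \
  --ec2-attributes file://common-json/ec2-attributes.json \
  --steps file:///tmp/emr-json/exportSteps.json \
  --auto-terminate \
  --query 'ClusterId' --output text)

# Poll until the cluster terminates, then report any failed steps.
while true; do
  STATE=$(aws emr describe-cluster --profile backup --cluster-id "$CLUSTER_ID" \
            --query 'Cluster.Status.State' --output text)
  case "$STATE" in
    TERMINATED|TERMINATED_WITH_ERRORS) break ;;
  esac
  sleep 60
done
aws emr list-steps --profile backup --cluster-id "$CLUSTER_ID" \
  --query 'Steps[?Status.State==`FAILED`].Name' --output text
```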