TL;DR: This framework allows you to get started with a Data Pipeline on AWS using native services and ETL tools. The example in this framework uses AWS Glue and Amazon Athena for schema generation and data querying.
Data Toolkit is part of Firemind's Modern Data Strategy tools
- AWS Step Functions
- AWS Glue
- Amazon Athena
- Amazon EventBridge
- AWS Identity and Access Management (IAM) Roles
- Amazon Simple Storage Service (Amazon S3) Buckets
- AWS Systems Manager Parameter Store (SSM) Parameters
- Amazon Simple Notification Service (Amazon SNS)
Ensure your CLI has correct credentials to access the AWS account you want this framework deployed to.
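As a quick sanity check before deploying, the active credentials can be inspected with the STS GetCallerIdentity call. The sketch below assumes boto3 is installed; the helper name `describe_caller` is hypothetical and not part of this framework.

```python
def describe_caller(sts_client):
    """Return (account_id, arn) for the credentials behind an STS client."""
    identity = sts_client.get_caller_identity()
    return identity["Account"], identity["Arn"]


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   account_id, arn = describe_caller(boto3.client("sts"))
#   print(f"Deploying to account {account_id} as {arn}")
```

If the printed account ID is not the one you intend to deploy to, switch profiles or credentials before continuing.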
To use this framework, create an empty remote repo in your organisation in GitHub, clone a copy of this repo and push to your remote.
Navigate to the github-oidc-federation-template-infra.yml file and add a default value for:
- GitHubOrg: This should be the name of the organisation where your repo exists.
- FullRepoName: The name of the repo which has a copy of this infrastructure.
Add the following to your remote repository secrets:
- AWS_REGION: <e.g. eu-west-1>.
- S3_TERRAFORM_STATE_REGION: <e.g. eu-west-1>.
- S3_TERRAFORM_STATE_BUCKET: ml-core-<account_id>-state-bucket.
- ACTION_IAM_ROLE: arn:aws:iam::<account_id>:role/GithubActionsDeployInfra.
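The secret values listed above all follow from the account ID and region, so they can be generated rather than typed by hand. This is a minimal sketch: the function name `expected_secrets` is hypothetical, and using the same region for both region secrets is an assumption based on the examples above.

```python
def expected_secrets(account_id, region="eu-west-1"):
    """Build the four repository secrets from an account ID and a region.

    Assumption: the Terraform state bucket lives in the same region as
    the deployment. Pass different values by hand if that is not the case.
    """
    return {
        "AWS_REGION": region,
        "S3_TERRAFORM_STATE_REGION": region,
        "S3_TERRAFORM_STATE_BUCKET": f"ml-core-{account_id}-state-bucket",
        "ACTION_IAM_ROLE": f"arn:aws:iam::{account_id}:role/GithubActionsDeployInfra",
    }


# Usage: print the name/value pairs to copy into the repository settings.
#   for name, value in expected_secrets("123456789012").items():
#       print(f"{name}={value}")
```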
The first step is to deploy a GitHub Actions Role and GitHub OIDC identity provider in the account that allows you to run GitHub actions for the infrastructure.
Note: This only needs to be run once per AWS account. Details on this can be found here: https://github.com/marketplace/actions/configure-aws-credentials-action-for-github-actions
- Important Note: Before deploying, always check whether an identity provider already exists for your project. Existing identity providers are listed in the AWS IAM console.
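The identity provider check can also be done programmatically. The sketch below assumes boto3 is installed and that the template registers GitHub's standard OIDC issuer, token.actions.githubusercontent.com; the helper name `has_github_oidc_provider` is hypothetical.

```python
# Assumption: the provider is registered under GitHub's standard issuer host.
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def has_github_oidc_provider(iam_client):
    """Return True if the account already has a GitHub OIDC identity provider."""
    response = iam_client.list_open_id_connect_providers()
    return any(
        provider["Arn"].endswith(f"oidc-provider/{GITHUB_OIDC_HOST}")
        for provider in response.get("OpenIDConnectProviderList", [])
    )


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   if has_github_oidc_provider(boto3.client("iam")):
#       print("A GitHub OIDC provider already exists in this account.")
```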
Run the following command in the terminal. You can change the stack name and region:
aws cloudformation deploy --template-file github-oidc-federation-template-infra.yml --stack-name app-authorisation-infra-github-authentication --region {{ eu-west-1 }} --capabilities CAPABILITY_IAM --capabilities CAPABILITY_NAMED_IAM
GitHub Actions is used to deploy the infrastructure. The config for this can be found in the .github/workflows directory.
We send through a variety of different environment variables:
- BUILD_STAGE - We get this from the branch names.
- S3_TERRAFORM_STATE_BUCKET - Get this from GitHub secrets.
- S3_TERRAFORM_STATE_REGION - Get this from GitHub secrets.
- AWS_REGION - Get this from GitHub secrets.
- SERVICE - Has default but can be set by user in the .github/workflows files.
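When debugging a failed workflow run, it can help to confirm that every variable listed above actually reached the job. The following is a minimal sketch of such a check; the names `REQUIRED_VARIABLES` and `missing_variables` are hypothetical and not part of the framework's workflows.

```python
# The five variables the workflows are described as passing through.
REQUIRED_VARIABLES = (
    "BUILD_STAGE",
    "S3_TERRAFORM_STATE_BUCKET",
    "S3_TERRAFORM_STATE_REGION",
    "AWS_REGION",
    "SERVICE",
)


def missing_variables(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


# Usage inside a workflow step or local shell:
#   import os
#   missing = missing_variables(os.environ)
#   if missing:
#       raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```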
For quick setup follow these instructions:
- Create an empty repo within your GitHub account.
- Check out the development branch of this repository to your local drive and push it to your remote repo.
- Assuming the GitHub actions have been set up correctly, the deployment will begin.
If you are having any issues please report a bug via the repo.
- Once the infrastructure has been deployed, navigate to S3 and find the bucket created by the framework: data-core-[stage]-[account_id]-asset-bucket.
- Navigate to the input_data/ folder and upload the sample data found in sample-data/AC2021_AnnualisedEntryExit.csv.
- This triggers an Amazon EventBridge Rule that targets the Data Pipeline Step Function on Object Creation to S3.
- Navigate to the AWS Step Functions service and notice the workflow running.
  - The first state starts a Glue Crawler that generates a data schema based on the uploaded data.
  - This schema is stored in a Glue Data Catalog.
  - Once the Glue Crawler has finished running, a map of SQL queries is executed in parallel through Amazon Athena.
  - The results of the queries are saved back to S3 under the query_results/ prefix.
  - Finally, an SNS message is sent to the configured SNS Topic. **Note**: There are no subscribers to this topic but this can be configured.
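The upload step and the trigger described above can be sketched in code. The bucket name and input_data/ folder come from the walkthrough; the event pattern is an assumption about how a rule matching S3 Object Created events is typically written (including the key prefix filter), not a copy of the rule this framework deploys. All function names are hypothetical.

```python
def asset_bucket_name(stage, account_id):
    """Name of the asset bucket created by the framework."""
    return f"data-core-{stage}-{account_id}-asset-bucket"


def input_key(filename):
    """S3 key that places a file in the input_data/ folder."""
    return f"input_data/{filename}"


def object_created_pattern(bucket):
    """EventBridge event pattern for objects created under input_data/.

    Assumption: the deployed rule may filter differently, for example
    without the key prefix.
    """
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": "input_data/"}]},
        },
    }


# Usage (assumes boto3 is installed; stage and account ID are placeholders):
#   import boto3
#   bucket = asset_bucket_name("<stage>", "<account_id>")
#   boto3.client("s3").upload_file(
#       "sample-data/AC2021_AnnualisedEntryExit.csv",
#       bucket,
#       input_key("AC2021_AnnualisedEntryExit.csv"),
#   )
```

Uploading through the API has the same effect as the console upload: the new object under input_data/ is what starts the Step Functions workflow.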
Configure your AWS credentials in the CLI with permissions to deploy to your account.
Deploy
bash deployment-scripts/quick-deploy.sh
Destroy
bash deployment-scripts/quick-destroy.sh