AWS scripts and configs are in `aws/`.
AWS Glue scripts require this repo's `features` module, which contains all Spark jobs for feature engineering.
To deploy a Glue job, assume the appropriate role and then use one of:
- the web console
- the AWS CLI (`awscli`)
- the Python SDK (`boto3`)
- Zip the Python requirements: `zip -r churn-packages.zip features venv/lib/python2.7/site-packages/dateutil`
- Upload `churn-packages.zip` to S3, using the appropriate role.
- If the Glue job (i.e. the script in the `aws/glue/` directory) has changed, upload it to S3 using the appropriate role. However, it is unlikely that the Glue job will change, as it is primarily boilerplate.
- If a job with this name and Glue script does not exist, create a new one via the above methods (see the boto3 sketch below). Set the following options (CLI flag | Python keyword argument):
  - job name (`--name` | `Name`)
  - role (`--role` | `Role`)
  - command (`--command Name=glueetl,ScriptLocation=s3://script-location` | `Command={'Name': 'glueetl', 'ScriptLocation': 's3://script-location'}`); the command's `Name` must be `glueetl`
  - default arguments (`--default-arguments '{"--extra-py-files": "{S3 LOCATION}/churn-packages.zip"}'` | `DefaultArguments={'--extra-py-files': '{S3 LOCATION}/churn-packages.zip'}`); `--extra-py-files` is a special Glue parameter. You might also add any other job arguments to the default arguments.
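For the SDK route, a minimal `boto3` sketch of the upload-and-create steps might look like the following. The bucket, keys, job name, role, and script filename are placeholders, not values from this repo; the `glueetl` command name and `--extra-py-files` default argument are as described above.

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Placeholders -- substitute your own bucket, prefixes, role, and script name.
bucket = "my-churn-bucket"
packages_key = "glue/churn-packages.zip"
script_key = "glue/etl_job.py"

# Upload the zipped dependencies and (only if it changed) the Glue script.
s3.upload_file("churn-packages.zip", bucket, packages_key)
s3.upload_file("aws/glue/etl_job.py", bucket, script_key)

# Create the job if one with this name does not already exist.
glue.create_job(
    Name="churn-etl",                 # placeholder job name
    Role="MyGlueServiceRole",         # placeholder role
    Command={
        "Name": "glueetl",            # must be glueetl
        "ScriptLocation": "s3://{}/{}".format(bucket, script_key),
    },
    DefaultArguments={
        "--extra-py-files": "s3://{}/{}".format(bucket, packages_key),
    },
)
```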
To actually run jobs you have a choice of:
- `start-job-run`
- `create-trigger`

Pass all expected arguments (defined in the Glue script) to the job.
To see a job's status:
- `get-job-run`
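As a sketch, the `boto3` equivalents of those two calls look like this; the job name and the argument names/values are placeholders (the real argument names are whatever the Glue script defines):

```python
import boto3

glue = boto3.client("glue")

# Start a run, passing all arguments the Glue script expects.
run = glue.start_job_run(
    JobName="churn-etl",  # placeholder job name
    Arguments={
        "--input_path": "s3://my-churn-bucket/raw/",          # hypothetical argument
        "--output_path": "s3://my-churn-bucket/features/",    # hypothetical argument
    },
)

# Poll the run's status.
status = glue.get_job_run(JobName="churn-etl", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED
```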
Fitting a model automatically creates a SageMaker model; it's not necessary to deploy the model to an endpoint to use Batch Transform.
< TODO: FILL IN WHEN NOTEBOOKS ARE IN THIS REPO >
For Batch Transform via `boto3` or the CLI you'll need `CreateTransformJob`, plus `DescribeTransformJob` to track its progress. In our pipeline, these are called via `boto3` from Lambda.
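A hedged `boto3` sketch of those two calls (the transform job name, model name, S3 paths, content type, and instance type are placeholders, not this pipeline's actual values):

```python
import boto3

sm = boto3.client("sagemaker")

# Kick off a batch transform job against an existing model.
sm.create_transform_job(
    TransformJobName="churn-transform-2019-01-01",  # placeholder
    ModelName="churn-model",                        # the model created by fitting
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-churn-bucket/features/",
            }
        },
        "ContentType": "text/csv",
    },
    TransformOutput={"S3OutputPath": "s3://my-churn-bucket/predictions/"},
    TransformResources={"InstanceType": "ml.m4.xlarge", "InstanceCount": 1},
)

# Track its progress.
desc = sm.describe_transform_job(TransformJobName="churn-transform-2019-01-01")
print(desc["TransformJobStatus"])  # InProgress, Completed, Failed, ...
```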
Lambda functions are in `aws/lambda/`. The Python files are named exactly after the CLI functions. On AWS they are prefixed with `datascience-{SERVICE}-{NAME}`, e.g. `datascience-sagemaker-describe-transform-job`. Inputs referenced in a Lambda function are passed via the Step Function's input.
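As an illustration only (a hypothetical handler, not the exact file in `aws/lambda/`), a describe-transform-job wrapper might look like this; the `TransformJobName` event key is an assumption about the Step Function's input shape:

```python
import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # The transform job name arrives via the Step Function's input (the event).
    job_name = event["TransformJobName"]  # assumed key name
    desc = sm.describe_transform_job(TransformJobName=job_name)
    # Return the status so the state machine can branch or wait on it.
    return {
        "TransformJobName": job_name,
        "TransformJobStatus": desc["TransformJobStatus"],
    }
```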
Step functions are in `aws/step_functions/`:
- `{type}_flow.json` has the actual JSON that defines the state machine for Step Functions
- `{type}_input.json` has sample generic input to pass to cloud events.
TODO: There are 2 flows and inputs here, for SageMaker Batch Transform (transform) and Glue (etl), but there should in fact be ONE complex flow encompassing Glue, SageMaker Training, and SageMaker Batch Transform, with Lambda wrappers where necessary.
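To kick off a flow manually, one option is `boto3` with one of the sample inputs; the state machine ARN below is a placeholder, and the input filename assumes the transform flow's sample:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Load a sample input, e.g. the batch transform flow's generic input.
with open("aws/step_functions/transform_input.json") as f:
    sample_input = json.load(f)

# Start an execution of the deployed state machine (ARN is a placeholder).
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:churn-transform",
    input=json.dumps(sample_input),
)
```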
You'll need a machine with Spark installed.
On a laptop, you likely need these in your bash profile:
- `export SPARK_HOME={YOUR SPARK HOME}`
- `export JAVA_HOME={YOUR JAVA HOME}`
- `export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH`
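A quick sanity check that `SPARK_HOME` and `PYTHONPATH` are wired up (assuming `pyspark`, and its `py4j` dependency, are importable in your environment):

```python
from pyspark.sql import SparkSession

# Start a throwaway local session and run a trivial job.
spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()
print(spark.range(10).count())  # should print 10
spark.stop()
```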
Create a virtual environment in this directory: `virtualenv venv`. The `venv` folder is in the `.gitignore`.
Activate the virtual environment: `source venv/bin/activate`
Install requirements: `pip install -r requirements.txt`
- On an Amazon Linux box, you might also need to do:
  `pip install --target={ABSOLUTE PATH TO YOUR VENV}/venv/lib/python2.7/dist-packages/ scandir pandas --upgrade`
To run tests: `python -m pytest tests/`