Previous projects:
- NEWS API on Google Cloud Platform
- Jupyter workflows using Docker containers

Next project:
- Quick implementation of an OCR application with AWS Lambda (in progress)
This is individual project 3 for my course, Data Analysis At Scale in Cloud. In this project, I implement a Spark workflow on AWS EMR that mirrors real-world applications: create the cluster, submit a job, add another step, and terminate the cluster. In addition to the descriptions below, you can refer to the screencast demo.
- Create an S3 bucket and upload `pyspark_job.py` and `emr_bootstrap.sh` to it.
- (Optional) Create a key pair for the EC2 instances. Both setup steps are sketched with the AWS CLI below.
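A minimal sketch of these setup steps using the AWS CLI. The bucket name is a placeholder, and the key pair name `emr-key` matches the one referenced in the cluster command further down; skip the key pair commands if you created it in the console.

```bash
#!/bin/bash
# Create the bucket and upload the job script and bootstrap script.
aws s3 mb s3://your-bucket-name
aws s3 cp pyspark_job.py s3://your-bucket-name/
aws s3 cp emr_bootstrap.sh s3://your-bucket-name/

# (Optional) Create the EC2 key pair referenced by the cluster and save the private key.
aws ec2 create-key-pair --key-name emr-key \
    --query 'KeyMaterial' --output text > emr-key.pem
chmod 400 emr-key.pem
```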
- In the terminal, run the following. These commands create the cluster on AWS EMR and submit a Spark job to it. You can check the cluster status in the AWS EMR console; it generally takes about 5 minutes for the cluster to be ready and run the job. After the job completes, you can go to the S3 bucket and check the output.
```bash
#!/bin/bash
aws emr create-cluster --name "your-cluster-name" \
    --release-label emr-5.29.0 \
    --applications Name=Spark \
    --log-uri s3://your-bucket-name/logs/ \
    --ec2-attributes KeyName=emr-key \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://your-bucket-name/emr_bootstrap.sh \
    --steps Type=Spark,Name="Your-Spark-job-name",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://your-bucket-name/pyspark_job.py] \
    --use-default-roles # --auto-terminate
```
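While the cluster is spinning up, you can also poll its state from the terminal instead of the console. A minimal sketch, assuming the AWS CLI is configured; `j-1AXXXXXXPSXX` is a placeholder cluster id:

```bash
#!/bin/bash
# List clusters that are currently starting, bootstrapping, running, or waiting.
aws emr list-clusters --active

# Check one cluster's state (STARTING -> BOOTSTRAPPING -> RUNNING/WAITING).
aws emr describe-cluster --cluster-id j-1AXXXXXXPSXX \
    --query 'Cluster.Status.State' --output text
```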
- Change the variable `clusterid` to your cluster id and run the following. This command submits the job one more time to the running cluster.
```bash
#!/bin/bash
clusterid=j-1AXXXXXXPSXX
aws emr add-steps --cluster-id $clusterid \
    --steps Type=Spark,Name="Spark job",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://your-bucket-name/pyspark_job.py]
```
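To confirm the new step was accepted and watch it move from PENDING to RUNNING to COMPLETED, you can list the cluster's steps. A small sketch, using the same placeholder `clusterid` as above:

```bash
#!/bin/bash
clusterid=j-1AXXXXXXPSXX
# Show every step on the cluster together with its current state.
aws emr list-steps --cluster-id $clusterid \
    --query 'Steps[].{Name:Name,State:Status.State}' --output table
```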
- Change the variable `clusterid` to your cluster id and run the following. This terminates the cluster so you stop being charged for it.
```bash
#!/bin/bash
clusterid=j-1AXXXXXXPSXX
aws emr modify-cluster-attributes --cluster-id $clusterid --no-termination-protected
aws emr terminate-clusters --cluster-ids $clusterid
```
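To double-check that nothing is left running (and billing), you can re-run the describe call; the state should move to TERMINATING and then TERMINATED. A sketch:

```bash
#!/bin/bash
clusterid=j-1AXXXXXXPSXX
# Should print TERMINATING or TERMINATED once shutdown has gone through.
aws emr describe-cluster --cluster-id $clusterid \
    --query 'Cluster.Status.State' --output text
```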
- `pyspark_job.py`: The main Spark job submitted to the cluster. In this demo, the job reads the Amazon book review data, applies filtering and wrangling, and writes the results to the S3 bucket.
- `emr_bootstrap.sh`: Dependencies you want preinstalled in the cluster environment. This is useful if you want to use notebook instances.
- `submit_job.sh`: Creates the cluster and submits a job to it.
- `addstep.sh`: Adds another step (job) to the running cluster.
- `terminate.sh`: Terminates the cluster.
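For reference, a bootstrap action of this kind is usually just a short shell script that installs packages on every node before the applications start. A minimal sketch; the package list here is an illustrative assumption, not the exact contents of `emr_bootstrap.sh`:

```bash
#!/bin/bash
# Hypothetical bootstrap action: preinstall Python packages on every node
# before Spark starts. Adjust the package list to what pyspark_job.py needs.
sudo python3 -m pip install --upgrade pip
sudo python3 -m pip install pandas boto3
```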