An Example of Running Map Reduce Jobs on AWS Batch with Docker
- Parallelization example using map reduce on AWS Batch to count words in a large body of text, e.g. Moby Dick or the KDD Cup 99 dataset. You can learn more about computing with AWS Batch using the link here
- The Docker container can be set up using the commands in the file here. Learn more about using Docker with Python here
- The example leverages Python's multiprocessing module - you can learn more about it here
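The word-count pattern referenced above can be sketched with `multiprocessing.Pool`. This is a minimal illustration, not the repository's actual code: the chunking and merge helpers (`map_count`, `reduce_counts`, `word_count`) are assumed names for this sketch.

```python
# Minimal map-reduce word count using Python's multiprocessing module.
# The chunking strategy and helper names here are illustrative assumptions.
from collections import Counter
from multiprocessing import Pool, cpu_count


def map_count(chunk):
    """Map step: count words in one chunk of text."""
    return Counter(chunk.split())


def reduce_counts(counters):
    """Reduce step: merge the per-chunk counters into one total."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total


def word_count(text, workers=None):
    workers = workers or cpu_count()
    # Split on whitespace and cut into one chunk per worker, so that
    # words are never split across chunk boundaries.
    words = text.split()
    size = max(1, len(words) // workers)
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    with Pool(workers) as pool:
        return reduce_counts(pool.map(map_count, chunks))
```

The same map/reduce split is what lets each Batch job fan work out across all cores of its container.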
- By default, parallelism spreads the S3 input data across the cores exposed to the container. AWS Batch dynamically scales a set of EC2 instances to run the parallel jobs. When setting up the AWS Batch compute environment, set minimum vCPUs to zero so that on-demand EC2 instances do not sit idle when there are no jobs.
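The minimum-vCPUs setting mentioned above corresponds to the `minvCpus` field of a managed compute environment. A hedged sketch of the JSON you might pass to `aws batch create-compute-environment` follows; the environment name, subnet ID, and role names are placeholders, not values from this repository.

```json
{
  "computeEnvironmentName": "mapreduce-env",
  "type": "MANAGED",
  "computeResources": {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 64,
    "desiredvCpus": 0,
    "instanceTypes": ["optimal"],
    "subnets": ["subnet-xxxxxxxx"],
    "instanceRole": "ecsInstanceRole"
  },
  "serviceRole": "AWSBatchServiceRole"
}
```

With `minvCpus` at 0, Batch scales the environment down to no instances when the job queue is empty.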
- Create an S3 bucket with folders for input and output data
- Upload mobydick.txt or the KDD dataset to the input folder of the S3 bucket
- Note the S3 bucket URL for input data and output folder e.g. s3://mybucket/input/mobydick.txt & s3://mybucket/result
- Set up your AWS Batch environment and ECR repository for Docker images - look here for AWS CLI commands
- Specify the environment variables below and run the job
- name: s3_input_dir, value: s3://mybucket/input/mobydick.txt
- name: s3_output_dir, value: s3://mybucket/result
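Inside the container, the job entrypoint would read these two environment variables. The variable names come from the list above, but the parsing helper itself (`get_job_config`) is a hypothetical sketch of how the S3 URL might be split into a bucket and key for use with a client library such as boto3.

```python
# Hypothetical sketch: read the job's S3 locations from the environment
# variables set in the Batch job definition.
import os


def get_job_config():
    s3_input = os.environ["s3_input_dir"]    # e.g. s3://mybucket/input/mobydick.txt
    s3_output = os.environ["s3_output_dir"]  # e.g. s3://mybucket/result
    # Split "s3://bucket/key" into (bucket, key).
    bucket, _, key = s3_input.removeprefix("s3://").partition("/")
    return bucket, key, s3_output
```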
- Run map-reduce patterned jobs for large-scale text mining, log analysis (possibly with Elasticsearch), etc.
- Run AWS Batch to preprocess S3 input data for machine learning jobs on Amazon SageMaker and other services