An Example of Running Map Reduce Jobs on AWS Batch with Docker
- Parallelization example using map reduce on AWS Batch to count words in a large body of text, e.g. Moby Dick or the KDD Cup 99 dataset. You can learn more about computing with AWS Batch using the link here
- The Docker container can be set up using the commands in the file here. Learn more about using Docker with Python here
- The example leverages Python's multiprocessing module - you can learn more about it here
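The word-count pattern referenced above can be sketched with `multiprocessing.Pool`. This is a minimal illustration, not the repository's actual code: the chunking and merge helpers (`map_count`, `reduce_counts`, `word_count`) are assumed names for this sketch.

```python
# Minimal map-reduce word count using Python's multiprocessing module.
# The chunking strategy and helper names here are illustrative assumptions.
from collections import Counter
from multiprocessing import Pool, cpu_count


def map_count(chunk):
    """Map step: count words in one chunk of text."""
    return Counter(chunk.split())


def reduce_counts(counters):
    """Reduce step: merge the per-chunk counters into one total."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total


def word_count(text, workers=None):
    workers = workers or cpu_count()
    # Split on whitespace and cut into one chunk per worker, so that
    # words are never split across chunk boundaries.
    words = text.split()
    size = max(1, len(words) // workers)
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    with Pool(workers) as pool:
        return reduce_counts(pool.map(map_count, chunks))
```

The same map/reduce split is what lets each Batch job fan work out across all cores of its container.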
- By default, parallelism spreads the S3 input data across the cores exposed to the container. AWS Batch dynamically scales a set of EC2 instances to run the parallel jobs. When setting up the AWS Batch compute environment, set minimum vCPUs to zero so that on-demand EC2 instances do not sit idle when there are no jobs.
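The minimum-vCPUs setting mentioned above corresponds to the `minvCpus` field of a managed compute environment. A hedged sketch of the JSON you might pass to `aws batch create-compute-environment` follows; the environment name, subnet ID, and role names are placeholders, not values from this repository.

```json
{
  "computeEnvironmentName": "mapreduce-env",
  "type": "MANAGED",
  "computeResources": {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 64,
    "desiredvCpus": 0,
    "instanceTypes": ["optimal"],
    "subnets": ["subnet-xxxxxxxx"],
    "instanceRole": "ecsInstanceRole"
  },
  "serviceRole": "AWSBatchServiceRole"
}
```

With `minvCpus` at 0, Batch scales the environment down to no instances when the job queue is empty.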
- Create an S3 bucket with folders for input and output data
- Upload mobydick.txt or the KDD dataset to the input folder of the S3 bucket
- Note the S3 bucket URL for input data and output folder e.g. s3://mybucket/input/mobydick.txt & s3://mybucket/result
- Set up your AWS Batch environment and ECR repository for Docker images - look here for AWS CLI commands
- Specify the environment variables below and run the job
- name: s3_input_dir, value: s3://mybucket/input/mobydick.txt
- name: s3_output_dir, value: s3://mybucket/result
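Inside the container, the job entrypoint would read these two environment variables. The variable names come from the list above, but the parsing helper itself (`get_job_config`) is a hypothetical sketch of how the S3 URL might be split into a bucket and key for use with a client library such as boto3.

```python
# Hypothetical sketch: read the job's S3 locations from the environment
# variables set in the Batch job definition.
import os


def get_job_config():
    s3_input = os.environ["s3_input_dir"]    # e.g. s3://mybucket/input/mobydick.txt
    s3_output = os.environ["s3_output_dir"]  # e.g. s3://mybucket/result
    # Split "s3://bucket/key" into (bucket, key).
    bucket, _, key = s3_input.removeprefix("s3://").partition("/")
    return bucket, key, s3_output
```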
- Run map-reduce patterned jobs for large-scale text mining, log analysis (possibly with Elasticsearch), etc.
- Run AWS Batch to preprocess S3 input data for machine learning jobs on Amazon SageMaker and other services