
aws_batch

An Example of Running Map Reduce Jobs on AWS Batch with Docker

[Figure: map_reduce_word_count]

  • Parallelization example using map reduce on AWS Batch to count words in a large body of text, e.g. Moby Dick or the KDD Cup 99 dataset. You can learn more about computing with AWS Batch here

  • The Docker container can be set up using the commands in the file here. Learn more about using Docker with Python here

  • The example leverages Python's multiprocessing module - you can learn more about it here. A minimal word-count sketch follows this list

  • By default, the job parallelizes the S3 input data across the cores exposed to the container. AWS Batch dynamically scales a set of EC2 instances to run jobs in parallel. When setting up your AWS Batch compute environment, set the minimum vCPUs to zero so on-demand EC2 instances don't sit idle when there are no jobs
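
As a rough illustration of the map-reduce pattern above, here is a minimal word-count sketch using Python's multiprocessing module. It assumes the S3 object has already been downloaded to a local file; the function and file names are illustrative placeholders, not this repo's actual code.

```python
# Minimal map-reduce word count with Python's multiprocessing module.
# Illustrative sketch: names are placeholders, and the input is assumed
# to have been downloaded from S3 to local disk already.
import multiprocessing
from collections import Counter

def count_words(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def map_reduce_word_count(local_path):
    with open(local_path) as f:
        lines = f.readlines()
    # Split the input into one chunk of lines per available core.
    cores = multiprocessing.cpu_count()
    size = max(1, len(lines) // cores)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    # Map: count words in each chunk on a separate worker process.
    with multiprocessing.Pool(cores) as pool:
        partials = pool.map(count_words, chunks)
    # Reduce: merge the per-chunk counters into a single result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    print(map_reduce_word_count("mobydick.txt").most_common(10))
```

Inside the container, the same pattern scales to however many vCPUs the Batch job definition exposes.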

Quick Steps

  1. Create an S3 bucket with folders for input and output data
  2. Upload mobydick.txt or the KDD dataset to the input folder of the S3 bucket
  3. Note the S3 URLs for the input data and the output folder, e.g. s3://mybucket/input/mobydick.txt and s3://mybucket/result
  4. Set up your AWS Batch environment and ECR repository for Docker images - look here for AWS CLI commands
  5. Specify the environment variables below and submit the job (a submission sketch follows this list)
  • name: s3_input_dir, value: s3://mybucket/input/mobydick.txt
  • name: s3_output_dir, value: s3://mybucket/result
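
As a rough sketch of step 5, the job can be submitted with boto3, passing the two environment variables through containerOverrides. The job queue and job definition names below are placeholders, not values from this repo.

```python
# Submit the word-count job to AWS Batch with boto3.
# Illustrative sketch: queue and job-definition names are placeholders.
import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="map-reduce-word-count",
    jobQueue="my-job-queue",            # placeholder: your Batch job queue
    jobDefinition="my-job-definition",  # placeholder: your job definition
    containerOverrides={
        "environment": [
            {"name": "s3_input_dir", "value": "s3://mybucket/input/mobydick.txt"},
            {"name": "s3_output_dir", "value": "s3://mybucket/result"},
        ]
    },
)
print(response["jobId"])
```

The same environment variables can also be set directly in the AWS Batch console when submitting the job by hand.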

Use Cases

  • Run map-reduce-patterned jobs for large-scale text mining, log analysis (possibly with Elasticsearch), etc.
  • Run AWS Batch to preprocess S3 input data for machine learning jobs on Amazon SageMaker and other services

