EMR - Elastic MapReduce

MapReduce 101

  • Is a framework designed to allow processing huge amounts of data in a parallel, distributed way
  • Data Analysis Architecture: huge scale, parallel processing
  • MapReduce has two main phases: map and reduce
  • It also has two optional phases: combine and partition
  • At a high level, the MapReduce process is the following (see the sketch after this list):
    • Data is separated into splits
    • Each split can be assigned to a mapper
    • Each mapper performs the operation at scale
    • The data is recombined after the operation is completed
  • HDFS (Hadoop Distributed File System):
    • Data is traditionally stored across multiple data nodes
    • Highly fault-tolerant - data is replicated between nodes
    • Name Nodes: provide the namespace for the file system and control access to HDFS
    • Block: a segment of data in HDFS, generally 64 MB
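
A minimal, framework-free sketch of the map, shuffle and reduce steps (a hypothetical word count; the splits and function names are illustrative, not Hadoop/EMR APIs):

```python
from collections import defaultdict

# Hypothetical input "splits" - on a real cluster these would be HDFS blocks
splits = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map phase: each mapper emits (key, value) pairs for its split
def map_phase(split):
    return [(word, 1) for word in split.split()]

# Shuffle: group mapper output values by key across all splits
grouped = defaultdict(list)
for split in splits:
    for key, value in map_phase(split):
        grouped[key].append(value)

# Reduce phase: aggregate the values collected for each key
def reduce_phase(key, values):
    return key, sum(values)

result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```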

Amazon EMR Architecture

  • Is a managed implementation of Apache Hadoop, which is a framework for handling big data workloads
  • EMR includes other elements such as Spark, HBase, Presto, Flink, Hive, Pig
  • EMR can be operated long term, or we can provision ad-hoc (transient) clusters for short term workloads (see the provisioning sketch after this list)
  • EMR runs in one AZ only within a VPC using EC2 for compute
  • It can use spot instances, instance fleets, reserved and on-demand instances as well
  • EMR is used for big data processing, manipulation, analytics, indexing, transformation, etc.
  • EMR architecture:
    • Each cluster requires at least one master node. This node manages the cluster, distributes workloads and acts as the Name Node for HDFS (we SSH into this node if necessary)
    • Historically we could have only one master node, nowadays we can have 3 master nodes
    • Core nodes: a cluster can have many core nodes. They run tasks and host data, so we don't want to destroy these nodes
    • Core nodes also manage the HDFS storage for the cluster. The lifetime of HDFS is linked to the lifetime of the core nodes/cluster
    • Task nodes: used only to run tasks. If they are terminated, the HDFS storage is not affected. Ideally we use spot instances for task nodes
    • EMRFS: file system backed by S3, can persist beyond the lifetime of the cluster. Offers lower performance than HDFS, which is based on local volumes
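
A minimal boto3 sketch of provisioning a transient cluster with master, core and spot task instance groups (the cluster name, region, release label, instance types, subnet, roles and log bucket are placeholder assumptions, not values from these notes):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

response = emr.run_job_flow(
    Name="transient-analytics-cluster",              # placeholder name
    ReleaseLabel="emr-6.15.0",                       # assumed EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes on spot - losing them does not affect HDFS
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,        # transient: terminate when steps finish
        "Ec2SubnetId": "subnet-0123456789abcdef0",   # placeholder subnet (single AZ)
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs/",                      # placeholder log bucket
)
print(response["JobFlowId"])
```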

Amazon EMR Serverless

  • It is a deployment option for Amazon EMR that provides a serverless runtime environment
  • With EMR Serverless we don't have to configure, optimize, secure or operate clusters to run applications with frameworks such as Spark and Hive
  • Job run:
    • Is a request submitted to an EMR Serverless application, which the application asynchronously executes and tracks until completion
    • When a job is submitted we must specify an IAM role that will provide required access for the job to other services such as S3
    • We can submit different jobs at the same time with different runtime roles
  • Workers:
    • EMR Serverless internally uses workers to execute our workloads
    • The default size of these workers depends on the application type and the EMR release version
    • When we submit a job, EMR Serverless computes the resources that the application needs for the job and schedules workers
    • EMR Serverless automatically scales workers up or down based on the workload and parallelism required at every stage of the job
  • Pre-initialized capacity:
    • Used to keep workers initialized and ready to respond in seconds
    • Effectively creates a warm pool of workers for an application (see the sketch after this list)
  • EMR Studio:
    • It is the user console where we manage our EMR Serverless applications
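
A minimal boto3 sketch of an EMR Serverless application with pre-initialized capacity and an asynchronous Spark job run (the application name, region, release label, role ARN, worker sizes and S3 paths are placeholder assumptions):

```python
import boto3

serverless = boto3.client("emr-serverless", region_name="us-east-1")  # region assumed

# Create an application with pre-initialized (warm) workers
app = serverless.create_application(
    name="notes-spark-app",                 # placeholder name
    releaseLabel="emr-6.15.0",              # assumed release
    type="SPARK",
    initialCapacity={
        "DRIVER":   {"workerCount": 1,
                     "workerConfiguration": {"cpu": "2 vCPU", "memory": "4 GB"}},
        "EXECUTOR": {"workerCount": 2,
                     "workerConfiguration": {"cpu": "4 vCPU", "memory": "8 GB"}},
    },
)

# In practice the application should reach the CREATED state before jobs are
# submitted; auto-start then launches it when the first job run arrives.
job = serverless.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",  # placeholder role
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/etl_job.py",  # placeholder script
            "sparkSubmitParameters": "--conf spark.executor.cores=2",
        }
    },
)
print(job["jobRunId"])
```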