- Is a framework designed to allow processing huge amounts of data in a parallel, distributed way
- Data Analysis Architecture: huge scale, parallel processing
- MapReduce has two main phases: map and reduce
- It also has two optional phases: combine and partition
- At a high level, the MapReduce process is the following:
- Data is separated into splits
- Each split can be assigned to a mapper
- The mapper performs the operation at scale
- The data is recombined after the operation is completed
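The steps above can be sketched with a toy word count in plain Python (a minimal illustration, not how Hadoop actually distributes work; the input strings are made up):

```python
from collections import defaultdict

# Data is separated into splits; each split could go to a different mapper
splits = ["big data big", "data big"]

def mapper(split):
    # Map phase: emit a (word, 1) pair for every word in the split
    return [(word, 1) for word in split.split()]

def reducer(mapped):
    # Reduce phase: recombine by summing the counts for each key
    totals = defaultdict(int)
    for key, value in mapped:
        totals[key] += value
    return dict(totals)

# Each split is mapped independently, then the results are recombined
mapped = [pair for split in splits for pair in mapper(split)]
print(reducer(mapped))  # {'big': 3, 'data': 2}
```

In a real cluster the mappers run on different nodes and a shuffle step groups pairs by key before the reducers run; here everything happens in one process for clarity.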
- HDFS (Hadoop Distributed File System):
- Data is traditionally stored across multiple data nodes
- Highly fault-tolerant - data is replicated between nodes
- Name Nodes: provide the namespace for the file system and control access to HDFS
- Block: a segment of data in HDFS, generally 64 MB (128 MB by default in Hadoop 2 and later)
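The block and replication numbers above imply some simple storage arithmetic; a quick sketch with example figures (file size and replication factor chosen for illustration):

```python
import math

# Hypothetical file stored in HDFS
file_size_mb = 200       # size of the file to store
block_size_mb = 64       # HDFS block size, as noted above
replication_factor = 3   # common HDFS default

# The file is cut into fixed-size blocks (the last one may be partial)
blocks = math.ceil(file_size_mb / block_size_mb)   # -> 4 blocks

# Each block is replicated across data nodes for fault tolerance
replicas = blocks * replication_factor             # -> 12 block copies

print(blocks, replicas)  # 4 12
```

This is why losing a single data node does not lose data: the other replicas of each block survive on different nodes.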
- Is a managed implementation of Apache Hadoop, which is a framework for handling big data workloads
- EMR includes other elements such as Spark, HBase, Presto, Flink, Hive, Pig
- EMR can be operated long term, or we can provision ad-hoc (transient) clusters for short term workloads
- EMR runs in one AZ only within a VPC using EC2 for compute
- It can use spot, reserved, and on-demand instances, as well as instance fleets
- EMR is used for big data processing, manipulation, analytics, indexing, transformation, etc.
- EMR architecture:
- Each cluster requires at least one master node. This manages the cluster, distributes workloads, and acts as the HDFS NameNode (we SSH into this if necessary)
- Historically we could have only one master node, nowadays we can have 3 master nodes
- Core nodes: a cluster can have many core nodes. They are used for running and tracking tasks; we don't want to destroy these nodes
- Core nodes also manage the HDFS storage for the cluster. The lifetime of HDFS is linked to the lifetime of the core nodes/cluster
- Task nodes: used to only run tasks. If they are terminated, the HDFS storage is not affected. Ideally we use spot instances for task nodes
- EMRFS: file system backed by S3, can persist beyond the lifetime of the cluster. Offers lower performance than HDFS, which is based on local volumes
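The architecture above maps fairly directly onto the request used to launch a transient cluster with boto3's `run_job_flow`. This is a hedged sketch: the cluster name, instance types, counts, and release label are all example values, not recommendations.

```python
# Sketch of a transient EMR cluster request (values are assumptions)
cluster_request = {
    "Name": "adhoc-analytics",            # hypothetical cluster name
    "ReleaseLabel": "emr-6.15.0",         # example EMR release
    "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
    "Instances": {
        "InstanceGroups": [
            # master node: manages the cluster
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            # core nodes: run tasks and hold HDFS, so keep them stable
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2},
            # task nodes: no HDFS, so spot instances are a good fit
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT"},
        ],
        # transient cluster: terminate when the submitted steps finish
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# In a real environment this would be submitted with:
# import boto3
# boto3.client("emr").run_job_flow(**cluster_request)
```

Note how only the task group uses `"Market": "SPOT"`, matching the advice above that spot interruptions must not take HDFS with them.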
- It is a deployment option for Amazon EMR that provides a serverless runtime environment
- With EMR Serverless we don't have to configure, optimize, secure or operate clusters to run applications with frameworks such as Spark and Hive
- Job run:
- Is a request submitted to an EMR Serverless application, which the application asynchronously executes and tracks until completion
- When a job is submitted, we must specify an IAM role that provides the job with the required access to other services, such as S3
- We can submit different jobs at the same time with different runtime roles
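A job run submission with boto3's `start_job_run` might look like the sketch below; the application ID, role ARN, and S3 paths are placeholders, not real resources:

```python
# Sketch of an EMR Serverless Spark job submission (all values are placeholders)
job_request = {
    "applicationId": "00example1234",  # placeholder application ID
    # the runtime role granting the job access to services like S3
    "executionRoleArn": "arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    "jobDriver": {
        "sparkSubmit": {
            # placeholder script location in S3
            "entryPoint": "s3://example-bucket/scripts/etl.py",
            "sparkSubmitParameters": "--conf spark.executor.cores=2",
        }
    },
}

# In a real environment this would be submitted with:
# import boto3
# boto3.client("emr-serverless").start_job_run(**job_request)
```

Because the role is passed per job run, submitting two jobs concurrently with different `executionRoleArn` values gives each job its own permissions.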
- Workers:
- EMR Serverless internally uses workers to execute our workloads
- The default size of these workers depends on the application type and the EMR release version
- When we submit a job, EMR Serverless computes the resources that the application needs for the job and schedules workers
- EMR Serverless automatically scales workers up or down based on the workload and parallelism required at every stage of the job
- Pre-initialized capacity:
- Used to keep workers initialized and ready to respond in seconds
- Effectively creates a warm pool of workers for an application
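Pre-initialized capacity is requested when the application is created, via the `initialCapacity` parameter of boto3's `create_application`. A hedged sketch; the application name, worker counts, and sizes are illustrative assumptions:

```python
# Sketch of creating an EMR Serverless application with a warm pool
# of pre-initialized workers (all sizes and names are examples)
app_request = {
    "name": "spark-warm-pool",     # hypothetical application name
    "releaseLabel": "emr-6.15.0",  # example EMR release
    "type": "SPARK",
    "initialCapacity": {
        # workers kept initialized and ready to respond in seconds
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {"cpu": "2 vCPU", "memory": "4 GB"},
        },
        "EXECUTOR": {
            "workerCount": 4,
            "workerConfiguration": {"cpu": "4 vCPU", "memory": "8 GB"},
        },
    },
}

# In a real environment this would be submitted with:
# import boto3
# boto3.client("emr-serverless").create_application(**app_request)
```

The trade-off is cost: pre-initialized workers are billed while they sit warm, so the pool size should match the latency the workload actually needs.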
- EMR Studio:
- It is the user console where we manage our EMR Serverless applications