Master, core and task nodes, each in instance groups. Up to 50 instance groups.
-
Core node = slave node, runs tasks, stores data on HDFS or EMRFS.
-
Task node = ephemeral, optional, no hdfs
-
Master node keeps the registry of blocks on HDFS, including the replication factor
-
64MB block size is default for hdfs
-
Block sizes and replication factor can be set per file
-
Replication factor default 3, can be set in the hdfs-site.xml file, can also be done real time
-
EMRFS implements HDFS on s3
-
EMRFS consistent view
Data in HDFS is automatically split into chunks by Hadoop. If the data is on S3, Hadoop splits it by reading files with multiple HTTP range requests. EBS volumes attached to an EMR cluster do not persist after the cluster terminates.
File Sizes
- Gzip files are not splittable; keep them under 2 GB
- Avoid lots of small files (< 100 MB); fewer, larger files are better
- S3DistCp can be used to combine small files into larger files.
- An extension of DistCp
- S3DistCp can be used to copy between S3 buckets, or between HDFS and S3
- Can be added as a step on the cluster
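A minimal sketch of what "added as a step" could look like, assuming the step is submitted through `command-runner.jar` with the `s3-dist-cp` command. The bucket name, paths and `--groupBy` regex are hypothetical; the helper only builds the step definition and makes no AWS calls.

```python
def build_s3distcp_step(src, dest, group_by=None, target_size_mib=None):
    """Build an EMR step definition that runs s3-dist-cp via command-runner.jar."""
    args = ["s3-dist-cp", "--src", src, "--dest", dest]
    if group_by is not None:
        # Files matching the regex are concatenated; the capture group
        # determines which files end up in the same output file.
        args += ["--groupBy", group_by]
    if target_size_mib is not None:
        # Approximate size (in MiB) of each combined output file.
        args += ["--targetSize", str(target_size_mib)]
    return {
        "Name": "Combine small files with S3DistCp",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


step = build_s3distcp_step(
    src="s3://example-bucket/raw-logs/",   # hypothetical bucket and prefix
    dest="hdfs:///combined-logs/",         # hypothetical HDFS path
    group_by=r".*/(\w+)-\d+\.log",         # hypothetical file-name pattern
    target_size_mib=128,
)
```

With boto3, a dict like `step` would typically be passed as `Steps=[step]` to the EMR client's `add_job_flow_steps` call, together with the cluster's `JobFlowId`.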
Supported File Formats
- Text - csv, etc
- Parquet - columnar-oriented file format
- ORC - Optimized row columnar file format
- Sequence - flat files containing key/value pairs
- Avro - json based data serialization framework
Modules: Hadoop common, HDFS, Yarn (yet another resource negotiator), MapReduce
- Map Reduce tasks:
- Input
- Split
- Map
- Shuffle
- Reduce
- Result
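The phases above can be illustrated with a word count in plain Python. This is only a single-process sketch of the data flow (input, split, map, shuffle, reduce, result), not Hadoop's API; all function names are made up for illustration.

```python
from collections import defaultdict


def split_input(text, n_splits):
    """Split: divide the input lines into roughly equal chunks."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in the split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped):
    """Shuffle: group all values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(text, n_splits=2):
    splits = split_input(text, n_splits)          # Input -> Split
    mapped = [map_phase(s) for s in splits]       # Map (one call per split)
    return reduce_phase(shuffle(mapped))          # Shuffle -> Reduce -> Result


result = word_count("the quick fox\nthe lazy dog\nthe end")
```

On a real cluster each split is mapped on a different node and the shuffle moves data over the network; here everything runs in one process to keep the flow visible.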
Releases - Amazon and MapR
Basic options are: Core Hadoop, HBase, Presto and Spark. Deploys to default vpc and security groups.
Advanced options: Older releases are available, and you can cherry-pick applications. Can edit config. Can add steps now or later. Can select VPC / AZ. If launching into a private subnet, you’ll need a NAT or a VPC endpoint so the cluster can reach S3. Can request spot and set up autoscaling. Always add tags through EMR; tags applied directly to the EC2 instances do not propagate back to the EMR cluster. Can also set encryption and IAM / key options. Security groups get set for master and core/task.
- Batch oriented
- M3 or M4 instance types
- Scale horizontally
-
Master node
- clusters < 50 nodes use m3.xlarge or m4.xlarge
- clusters > 50 nodes use m3.2xlarge or m4.2xlarge
-
Core nodes
-
Default replication factors:
- Cluster of 10 or more nodes: 3
- Cluster of 4-9 nodes: 2
- Cluster of 3 or fewer nodes: 1
-
Can be overridden in hdfs-site.xml
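The default replication tiers can be written as a small lookup. This is a sketch that assumes the top tier starts at 10 nodes (as the EMR documentation describes it); the function name is hypothetical.

```python
def default_replication_factor(core_nodes):
    """Return the default HDFS replication factor for a cluster size.

    Assumption: 10 or more nodes -> 3, 4 to 9 nodes -> 2, 3 or fewer -> 1.
    """
    if core_nodes < 1:
        raise ValueError("a cluster needs at least one core node")
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1


factors = [default_replication_factor(n) for n in (1, 3, 4, 9, 10, 50)]
```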
-
Machine Learning P2, C3 or C4 instance types
-
HDFS capacity calculation: total storage / replication factor = hdfs capacity
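As a worked example of the formula, a sketch assuming every core node contributes the same amount of storage (the node count and per-node size below are made-up numbers):

```python
def hdfs_capacity_gb(core_nodes, storage_per_node_gb, replication_factor):
    """Usable HDFS capacity = total raw storage / replication factor."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    total_storage_gb = core_nodes * storage_per_node_gb
    return total_storage_gb / replication_factor


# Hypothetical cluster: 10 core nodes with 800 GB each, replication factor 3.
capacity = hdfs_capacity_gb(core_nodes=10, storage_per_node_gb=800, replication_factor=3)
```

Here 8,000 GB of raw storage yields roughly 2,667 GB of usable HDFS capacity, because every block is stored three times.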
- Use cloudwatch for events
- Metrics are updated every 5 mins, which is not configurable
- Metrics are archived for 2 weeks, no charge for them
- Can use native application web interfaces
- Use ganglia, open source monitoring tool, runs on master node
-
Manually:
- done with api, cli or console
- Scale-down can be triggered at the instance-hour boundary or at task completion
- When resizing core nodes, you cannot size down below the number of nodes required by the replication factor. You can change the replication factor in hdfs-site.xml and restart the NameNode daemon.
-
Autoscaling:
- Can be based on metrics
- Must add autoscaling role on cluster creation or you cannot use autoscaling
-
IAM policies - allow or deny permissions for IAM users and groups to perform actions. Policies can be combined with tagging to control access on a cluster-by-cluster basis.
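A sketch of how tagging and IAM policies might be combined for cluster-by-cluster access, assuming the `elasticmapreduce:ResourceTag/<key>` condition key. The tag key, tag value and the particular list of allowed actions are illustrative choices, not a recommended policy.

```python
import json


def tag_scoped_emr_policy(tag_key, tag_value):
    """Build a policy document that allows a few EMR actions only on
    clusters carrying the given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:ListSteps",
                    "elasticmapreduce:AddJobFlowSteps",
                ],
                "Resource": "*",
                # Only clusters tagged tag_key=tag_value match this statement.
                "Condition": {
                    "StringEquals": {
                        f"elasticmapreduce:ResourceTag/{tag_key}": tag_value
                    }
                },
            }
        ],
    }


# Hypothetical tag: team=analytics
policy_json = json.dumps(tag_scoped_emr_policy("team", "analytics"), indent=2)
```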
-
IAM roles for EMRFS requests to Amazon S3 - You can control whether cluster users can access files from within Amazon EMR based on user, group, or the location of EMRFS data in Amazon S3.
-
IAM roles - The Amazon EMR service role, instance profile, and service-linked role control how Amazon EMR is able to access other AWS services. For more information, see Configure IAM Roles for Amazon EMR Permissions to AWS Services.
-
Kerberos - You can set up Kerberos to provide strong authentication through secret-key cryptography.
-
SSH - provides a secure way for users to connect to the command line on cluster instances. It also provides tunneling to view web interfaces that applications host on the master node. Clients can authenticate using Kerberos or an Amazon EC2 key pair.
-
Data encryption - You can implement data encryption to help protect data at rest in Amazon S3 and in cluster instance storage, and data in transit. For more information, see Encrypt Data in Transit and At Rest. You can also use a custom Amazon Linux AMI to encrypt the EBS root device volume of cluster instances.
-
Security groups - you can use your own to isolate EMR clusters. Can also add additional groups to allow ssh, etc. IAM Roles - can use custom roles
- Long Running
- keeps data on core nodes
- Auto termination disabled (this is default)
- Termination protection is enabled
- Transient Cluster
- input/output and code stored on s3
- Hive metastore can be stored in rds
Single AZ
-
3 instance groups: Master, Core and Task
-
Can paste config override or load json from s3
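A sketch of what such a config override could contain, assuming the EMR configuration-classification JSON format with the `hdfs-site` classification. The helper name and the chosen values are hypothetical.

```python
import json


def hdfs_site_override(replication, block_size_bytes=None):
    """Build a configuration override list for the hdfs-site classification.

    Property values are passed as strings.
    """
    properties = {"dfs.replication": str(replication)}
    if block_size_bytes is not None:
        properties["dfs.blocksize"] = str(block_size_bytes)
    return [{"Classification": "hdfs-site", "Properties": properties}]


# Hypothetical override: replication factor 2, 128 MiB blocks.
override_json = json.dumps(hdfs_site_override(2, 128 * 1024 * 1024), indent=2)
```

The resulting JSON could be pasted into the configuration field at cluster creation or saved to S3 and loaded from there.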
-
Can run core and master nodes on same instance
-
Single Master node
- Manages resources of cluster
- Coordinates the running of tasks
- Manages HDFS
- Runs the ResourceManager
- monitors health of core and tasks nodes
-
Core Node
- slave node
- run tasks
- HDFS or EMRFS
- Datanode daemon
- NodeManager
- ApplicationMaster
-
Task Node
- Same as Slave node
- Optional
- No HDFS
- Added and removed from Running Clusters
- Extra Capacity
-
A distributed file system
-
Allows simultaneous access to data
-
Stores data in blocks
-
Single Global namespace maintained by the Master node
-
Block size
- 64 mb is default
- 64 - 256 MB
- large files use bigger block
- smaller files use smaller block
- can be set per file
- Resizing is possible
- Cloudwatch is used to monitor the cluster
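To make the block-size notes concrete, here is a sketch that counts how many HDFS blocks a file occupies for a given block size. It assumes the 64 MB default from these notes; the function name is hypothetical.

```python
MIB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MIB):
    """Number of HDFS blocks needed to hold a file (the last block may be partial)."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    return -(-file_size_bytes // block_size_bytes)  # ceiling division


one_gib = 1024 * MIB
blocks_default = block_count(one_gib)                 # 64 MiB blocks
blocks_large = block_count(one_gib, 256 * MIB)        # 256 MiB blocks
```

A larger block size means fewer blocks for the master node to track for the same file, which is why large files tend to use bigger blocks.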
Data warehouse built on Hadoop; uses a SQL-like interface (HiveQL) to summarize, query and analyze very large data sets. Provides a high-level language interface to MapReduce. Differences between Apache Hive and Hive on EMR: Hive on EMR has native S3 and DynamoDB integration, and Kinesis streams can be used by pushing to S3 first. Supports partitioning on S3.
You can use the EMR DynamoDB connector to:
- Join hive and Dynamo tables
- Query data in Dynamo
- Copy data to/from Dynamo and HDFS
- Copy data to/from Dynamo and s3
- Hadoop user experience
- Open source web interface for hadoop and non hadoop applications
- Browser based
- Write Spark SQL queries
- Makes using the EMR cluster easy
- S3 browser for data
- HDFS browser
- Hive editor
- PIG editor
- Access the metastore data
- Hbase access
- job browser
- Logs are accessible and searchable
- User management
- LDAP
- PAM
- SPNEGO
- OpenID
- Oauth
- SAML2
Open source NoSQL database
Massively scalable distributed big data store. Non-relational, versioned database which runs on top of S3 (using EMRFS) or HDFS. Built for random, strictly consistent real-time access to tables with billions of rows and millions of columns. Integrates with Hadoop, Hive and Phoenix.
Use cases:
- adtech, content, FINRA
- Large amounts of data: 100s of GBs to PBs.
- high write and update i/o.
- Can use NoSQL and need flexible schema.
- Fast access to random and real-time.
- Fault-tolerance in non-relational environment
Limitations
- No transactions
- Not relational
- Not suited to small amounts of data
Can connect hive to hbase via zookeeper.
-
Open source in memory distributed fast SQL query engine
-
Run interactive analytic queries against a variety of data sources with sizes ranging from GB to PB
-
Query different types of data sources, from relational databases, NoSQL databases and frameworks like Hive to stream processing platforms like Kafka
-
Connectors
- cassandra
- hive
- kafka
- mongodb
- mysql
- postgresql
- redis
-
High concurrency, run 1000s of queries per day
-
In memory processing helps avoid unnecessary IO leading to low latency
-
Does not need separate processing layer like hive does
Fast in memory query engine
Use Cases
- Interactive analytics
- Faster than hive
- Flexible in terms of language: Scala, Python, etc.
- run queries against live streams
- stream processing
- disparate data sources
- data in small sizes
- Machine learning
- repeated queries at scale against data set
- time machine learning alg
- MLlib
- Data integration
- ETL
- Reduce time and cost
When not to use Spark
- Not a database
- not designed for OLTP
- Not for batch processing
- Multiuser requests
Run ETL in Spark and move the data to a typical reporting database
Components
- Spark Core - General execution engine
- SQL
- Streaming
- MLlib
- GraphX
- Standalone Scheduler
- YARN
- Mesos
Data APIs:
- Resilient Distributed Datasets (RDDs): logical collection of data partitioned across machines
- DataSet: distributed collection of data
- DataFrame: DataSet organized into named columns
Streaming
- DStreams (Discretized Streams)
- Abstraction of continuous streaming data
- Created from input data sources like kinesis streams
- Are a collection of RDDs
- Transformations are applied to RDDs
- Published to HDFS, databases or dashboards
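The DStream idea (a continuous stream handled as a sequence of small batches, each transformed and then published) can be simulated in plain Python. This is not the Spark Streaming API: real DStreams cut batches by time interval and each batch is an RDD, whereas this sketch cuts by record count and uses lists. All names are hypothetical.

```python
from itertools import islice


def discretize(records, batch_size):
    """Cut a (possibly unbounded) iterable into successive small batches."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(records, batch_size, transform, sink):
    """Apply a transformation to every batch and publish the result."""
    for batch in discretize(records, batch_size):
        sink(transform(batch))


outputs = []  # stands in for HDFS, a database or a dashboard
process_stream(
    records=["a b", "b c", "c", "a"],
    batch_size=2,
    # Per-batch transformation: the distinct words seen in that batch.
    transform=lambda batch: sorted({w for line in batch for w in line.split()}),
    sink=outputs.append,
)
```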
Tools to access Hive metastore tables.
Managed ETL (spark) service to categorize, clean and enrich data. Can move between data stores, can be used as the metadata catalog. Serverless. Can discover and correlate data across multiple datastores.
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
https://aws.amazon.com/blogs/big-data/using-spark-sql-for-etl/
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html#emr-awskms-keys
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-configure-sqs-cw.html
Name of interface | URI |
---|---|
YARN ResourceManager | http://master-public-dns-name:8088/ |
YARN NodeManager | http://coretask-public-dns-name:8042/ |
Hadoop HDFS NameNode | http://master-public-dns-name:50070/ |
Hadoop HDFS DataNode | http://coretask-public-dns-name:50075/ |
Spark HistoryServer | http://master-public-dns-name:18080/ |
Zeppelin | http://master-public-dns-name:8890/ |
Hue | http://master-public-dns-name:8888/ |
Ganglia | http://master-public-dns-name/ganglia/ |
HBase | http://master-public-dns-name:16010/ |
JupyterHub | https://master-public-dns-name:9443/ |
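The table can be turned into a small lookup that fills in the DNS names for a given cluster. This is a convenience sketch; the function name and the example host names are hypothetical.

```python
URI_TEMPLATES = {
    "YARN ResourceManager": "http://{master}:8088/",
    "YARN NodeManager": "http://{coretask}:8042/",
    "Hadoop HDFS NameNode": "http://{master}:50070/",
    "Hadoop HDFS DataNode": "http://{coretask}:50075/",
    "Spark HistoryServer": "http://{master}:18080/",
    "Zeppelin": "http://{master}:8890/",
    "Hue": "http://{master}:8888/",
    "Ganglia": "http://{master}/ganglia/",
    "HBase": "http://{master}:16010/",
    "JupyterHub": "https://{master}:9443/",
}


def interface_uri(name, master, coretask=""):
    """Return the web interface URI for a named application.

    `master` is the master node's public DNS name; `coretask` is only
    needed for interfaces that run on core/task nodes.
    """
    return URI_TEMPLATES[name].format(master=master, coretask=coretask)


# Hypothetical master DNS name.
hue_uri = interface_uri("Hue", master="ec2-203-0-113-10.compute-1.amazonaws.com")
```

These interfaces are usually reached through an SSH tunnel to the master node rather than opened to the internet.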
AWS Labs