This guide helps you quickly explore the main features of Delta Lake.
It provides code snippets that show how to read from and write to Delta tables with Amazon EMR.
For more details, see the video "Incremental Data Processing using Delta Lake with EMR".
1. Create an S3 bucket for Delta Lake (e.g., `learn-deltalake-2022`).
2. Create an EMR cluster using AWS CDK (check details in the instructions).
3. Create an EMR Studio using AWS CDK (check details in the instructions).
4. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/
5. Open the EMR Studio and create an EMR Studio Workspace.
6. Launch the EMR Studio Workspace.
7. Attach the EMR cluster to a Jupyter Notebook by following this quick guide on the EMR Studio Workspace web console:
   - Step 1. Create a new workspace without attaching the EMR cluster.
   - Step 2. Stop the workspace.
   - Step 3. Select the stopped workspace and restart it with **Launch with options**.

   ℹ️ More information can be found here.
8. Upload `deltalake-with-emr-demo.ipynb` into the Jupyter Notebook.
9. Set the kernel to PySpark, and run each cell.
10. For running Amazon Athena queries on Delta Lake, check this.
Amazon EMR Applications
- Hadoop
- Hive
- JupyterHub
- JupyterEnterpriseGateway
- Livy
- Apache Spark (>= 3.0)
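For reference, this application set maps onto the `Applications` property of an `AWS::EMR::Cluster` resource; a sketch of how it might look in a CloudFormation/CDK cluster definition (the rest of the cluster definition is omitted):

```json
{
  "Applications": [
    { "Name": "Hadoop" },
    { "Name": "Hive" },
    { "Name": "JupyterHub" },
    { "Name": "JupyterEnterpriseGateway" },
    { "Name": "Livy" },
    { "Name": "Spark" }
  ]
}
```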
Apache Spark (PySpark)

- For `emr-6.7.0`:

  ```json
  {
    "conf": {
      "spark.jars.packages": "io.delta:delta-core_2.12:1.2.1",
      "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
      "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    }
  }
  ```

- For `>= emr-6.9.0` and `< emr-7.0.0`:

  ```json
  {
    "conf": {
      "spark.jars.packages": "io.delta:delta-core_2.13:2.1.0",
      "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
      "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
      "spark.sql.catalog.spark_catalog.lf.managed": "true"
    }
  }
  ```

- For `emr-7.x.x`:

  ```json
  {
    "conf": {
      "spark.jars.packages": "io.delta:delta-spark_2.13:3.1.0",
      "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
      "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
      "spark.sql.catalog.spark_catalog.lf.managed": "true"
    }
  }
  ```

⚠️ YOU NEED to configure `spark.jars.packages` with the Delta Lake version that matches your Spark version.

ℹ️ For more details on `spark.jars.packages`, see Apache Spark Configuration - Runtime Environment.

See also [1] Set up Apache Spark with Delta Lake, [2] Use a Delta Lake cluster with Spark.
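Picking the wrong `spark.jars.packages` coordinate for your release is an easy mistake. The version switch above can be sketched in plain Python (the helper names are illustrative, and the mapping simply mirrors the example configs in this guide):

```python
# Sketch: choose the io.delta Maven coordinate for spark.jars.packages
# from an EMR release label, mirroring the example configs above.
# Extend the mapping for other releases as needed.

def delta_package_for(emr_release: str) -> str:
    """Return the io.delta Maven coordinate for an EMR release label like 'emr-6.7.0'."""
    major, minor, _patch = (int(p) for p in emr_release.removeprefix("emr-").split("."))
    if major >= 7:
        return "io.delta:delta-spark_2.13:3.1.0"
    if (major, minor) >= (6, 9):
        return "io.delta:delta-core_2.13:2.1.0"
    return "io.delta:delta-core_2.12:1.2.1"

def configure_cell(emr_release: str) -> dict:
    """Build the JSON body for the notebook's %%configure magic.

    Note: for releases >= emr-6.9.0 this guide also sets
    "spark.sql.catalog.spark_catalog.lf.managed": "true" (omitted here for brevity).
    """
    return {
        "conf": {
            "spark.jars.packages": delta_package_for(emr_release),
            "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
            "spark.sql.catalog.spark_catalog":
                "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        }
    }

print(configure_cell("emr-6.7.0")["conf"]["spark.jars.packages"])
# -> io.delta:delta-core_2.12:1.2.1
```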
ℹ️ The following compatibility table was last updated on 3 Aug 2024.
| Delta Lake version | Apache Spark version |
|---|---|
| 3.2.x | 3.5.x |
| 3.1.x | 3.5.x |
| 3.0.x | 3.5.x |
| 2.4.x | 3.4.x |
| 2.3.x | 3.3.x |
| 2.2.x | 3.3.x |
| 2.1.x | 3.3.x |
| 2.0.x | 3.2.x |
| 1.2.x | 3.2.x |
| 1.1.x | 3.2.x |
| 1.0.x | 3.1.x |
| 0.7.x and 0.8.x | 3.0.x |
| Below 0.7.x | 2.4.2 - 2.4.<latest> |
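When scripting cluster setup, it can help to keep this compatibility table next to your build code as data. A minimal sketch, with the version pairs copied from the table above (names are illustrative):

```python
# Sketch: Delta Lake minor line -> compatible Apache Spark minor line,
# copied from the compatibility table above (as of 3 Aug 2024).
DELTA_TO_SPARK = {
    "3.2": "3.5",
    "3.1": "3.5",
    "3.0": "3.5",
    "2.4": "3.4",
    "2.3": "3.3",
    "2.2": "3.3",
    "2.1": "3.3",
    "2.0": "3.2",
    "1.2": "3.2",
    "1.1": "3.2",
    "1.0": "3.1",
}

def spark_minor_for(delta_version: str) -> str:
    """Look up the Spark minor line for a full Delta version like '2.1.1'."""
    minor = ".".join(delta_version.split(".")[:2])
    return DELTA_TO_SPARK[minor]

print(spark_minor_for("2.1.1"))  # -> 3.3
```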
- More information at: Delta Lake releases
- (video) Incremental Data Processing using Delta Lake with EMR
- (video) DBT + Spark/EMR + Delta Lake/S3
- An Introduction to Modern Data Lake Storage Layers (2022-02-22)
- Compatibility with Apache Spark
- Amazon EMR Releases
- Delta Lake releases
- `io.delta` Maven Repository
- Apache Spark Configuration - Runtime Environment
- Set up Apache Spark with Delta Lake
- Presto and Athena to Delta Lake integration
- Redshift Spectrum to Delta Lake integration
- Support for automatic and incremental Presto/Athena manifest generation (#453)
- Amazon EMR - Attach a compute to an EMR Studio Workspace
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.