An open-source storage layer that brings scalable, ACID transactions to Apache Spark™ and big data workloads.

adrian-ionescu/delta

Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.

See the Delta Lake Documentation for more details on how to get started in Scala, Java or Python.

Latest Binaries

Delta Lake is published to Maven Central Repository and can be used by adding a dependency in your POM file.

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.11</artifactId>
  <version>0.1.0</version>
</dependency>

Compatibility

Compatibility with Apache Spark Versions

Delta Lake currently requires Apache Spark 2.4.2. Earlier versions are missing SPARK-27453, which breaks the partitionBy clause of the DataFrameWriter.

API Compatibility

The only stable, public APIs currently provided by Delta Lake are through the DataFrameReader/Writer (i.e. spark.read, df.write, spark.readStream and df.writeStream). Options to these APIs will remain stable within a major release of Delta Lake (e.g. 1.x.x).

All other interfaces in this library are considered internal, and are subject to change across minor / patch releases.
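
As an illustration of the stable entry points listed above, the following minimal PySpark sketch writes a partitioned table and reads it back. It assumes the caller passes in a SparkSession that has the Delta Lake package on its classpath; the function name `write_then_read`, the column names, and the path are hypothetical.

```python
def write_then_read(spark, path):
    """Write a small partitioned DataFrame in Delta format, then load it back.

    `spark` is assumed to be a SparkSession with Delta Lake on its classpath.
    """
    # Hypothetical sample data: ids 0..4 plus a parity column to partition on.
    df = spark.range(0, 5).selectExpr("id", "id % 2 AS parity")

    # df.write is the DataFrameWriter; "delta" selects the Delta Lake format.
    df.write.format("delta").partitionBy("parity").save(path)

    # spark.read is the DataFrameReader.
    return spark.read.format("delta").load(path)
```

Only the reader/writer calls are used here, so a sketch like this stays within the API surface that is described as stable.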

Data Storage Compatibility

Delta Lake guarantees backward compatibility for all Delta Lake tables (i.e. newer versions of Delta Lake will always be able to read tables written by older versions of Delta Lake). However, we reserve the right to break forward compatibility as new features are introduced to the transaction protocol (i.e. an older version of Delta Lake may not be able to read a table produced by a newer version).

Breaking changes in the protocol are indicated by incrementing the minimum reader/writer version in the Protocol action.
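
The compatibility rule above can be pictured as a simple version check. This is a simplified illustration, not Delta Lake's implementation: the JSON field names `minReaderVersion` and `minWriterVersion`, and the constants for what this client supports, are assumptions made for the example.

```python
import json

# Hypothetical: the highest protocol versions this client understands.
SUPPORTED_READER_VERSION = 1
SUPPORTED_WRITER_VERSION = 2


def check_protocol(action_json):
    """Return (can_read, can_write) for a table's protocol action."""
    protocol = json.loads(action_json)["protocol"]
    can_read = protocol["minReaderVersion"] <= SUPPORTED_READER_VERSION
    can_write = protocol["minWriterVersion"] <= SUPPORTED_WRITER_VERSION
    return can_read, can_write


# A table that requires a newer writer: still readable, but not writable.
newer = '{"protocol": {"minReaderVersion": 1, "minWriterVersion": 3}}'
print(check_protocol(newer))
```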

Building

Delta Lake Core is compiled using SBT.

To compile, run

build/sbt compile

To generate artifacts, run

build/sbt package

To execute tests, run

build/sbt test

Refer to SBT docs for more commands.

Transaction Protocol

Delta Lake works by storing a transaction log alongside the actual data files in a table. Entries in the log, called delta files, are stored as atomic collections of actions in the _delta_log directory, at the root of a table. Entries in the log are encoded using JSON and are named as zero-padded contiguous integers.

/table/_delta_log/00000000000000000000.json
/table/_delta_log/00000000000000000001.json
/table/_delta_log/00000000000000000002.json
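
The naming scheme shown in these paths can be sketched as follows. The helper names are hypothetical, and the width of 20 digits is read off the example paths rather than taken from a specification.

```python
def delta_file_name(version):
    """Format a table version as a zero-padded log file name."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"


def version_of(file_name):
    """Recover the integer version from a log file name."""
    return int(file_name.split(".")[0])
```

Zero-padding means that sorting the file names lexicographically also sorts them by version.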

To avoid needing to read the entire transaction log every time a table is loaded, Delta Lake will also occasionally create a checkpoint, which contains the entire state of the table at the given version. Checkpoints are encoded using Parquet and must only be written after the accompanying delta files have been written.
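
One way to picture how a checkpoint shortens loading: a reader starts from the newest checkpoint at or below the target version and then applies only the later delta files. The sketch below is a simplified assumption about that selection logic, with hypothetical function and parameter names; it is not the actual Delta Lake code.

```python
def files_to_load(delta_versions, checkpoint_versions, target=None):
    """Pick the checkpoint and delta versions needed to rebuild a table state.

    Returns (checkpoint_version_or_None, [delta versions to replay, in order]).
    """
    if target is None:
        target = max(delta_versions)
    # Only checkpoints at or below the target version are usable.
    usable = [v for v in checkpoint_versions if v <= target]
    start = max(usable) if usable else None
    # Without a checkpoint, replay from the first delta file.
    first = 0 if start is None else start + 1
    replay = sorted(v for v in delta_versions if first <= v <= target)
    return start, replay
```

With delta files for versions 0 to 12 and a checkpoint at version 10, only the checkpoint and versions 11 and 12 need to be read.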

Requirements for Underlying Storage Systems

Delta Lake's ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, we require the storage system to provide the following.

  1. Atomic visibility: There must be a way for a file to be visible in its entirety or not visible at all.
  2. Mutual exclusion: Only one writer must be able to create (or rename) a file at the final destination.
  3. Consistent listing: Once a file has been written in a directory, all future listings for that directory must return that file.

Open source Delta Lake currently supports all these guarantees only on HDFS. It is possible to make it work with other storage systems by plugging in custom implementations of the LogStore API.
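
To make the first two requirements concrete, here is a minimal sketch of how a local POSIX file system could satisfy them: the data is written to a temporary file first, then published under its final name with a hard link, which fails if that name already exists. The function `write_exclusive` is hypothetical and is not the LogStore API.

```python
import os
import tempfile


def write_exclusive(path, data):
    """Publish `data` at `path` atomically; fail if `path` already exists."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # flush to disk before the file becomes visible
        # Atomic visibility and mutual exclusion: link() either creates the
        # final name pointing at complete content or raises FileExistsError.
        os.link(tmp, path)
    finally:
        # The temporary name is no longer needed, whether or not link() succeeded.
        os.remove(tmp)
```

If two writers race to create the same log entry, exactly one call succeeds and the other raises `FileExistsError`, so the losing writer knows it has to retry against a newer version.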

As an optimization, storage systems can also allow partial listing of a directory, given a start marker. Delta can use this ability to efficiently discover the latest version of a table, without listing all of the files in the transaction log.
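
A minimal sketch of that optimization, assuming a hypothetical `list_from` primitive that returns directory entries at or after a start marker in lexicographic order (simulated here with an in-memory sorted list of names):

```python
import bisect


def list_from(sorted_names, start_marker):
    """Simulate a partial listing: entries >= start_marker, in order."""
    return sorted_names[bisect.bisect_left(sorted_names, start_marker):]


def latest_version(sorted_names, known_version):
    """Find the newest version, listing only from a version already known."""
    marker = f"{known_version:020d}.json"
    tail = list_from(sorted_names, marker)
    return int(tail[-1].split(".")[0]) if tail else known_version
```

Because the file names are zero-padded, a listing that starts at a recently seen version returns only the handful of newer entries instead of the whole log.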

Reporting issues

We use GitHub Issues to track community-reported issues. You can also contact the community to get answers.

Contributing

We welcome contributions to Delta Lake. We use GitHub Pull Requests for accepting changes. You will be prompted to sign a contributor license agreement before your change can be accepted.

Community

There are two mediums of communication within the Delta Lake community.
