
Allow concurrent writes to partitions that don't interact with each other #9

Closed
calvinlfer opened this issue Apr 25, 2019 · 16 comments
Labels: enhancement (New feature or request)


calvinlfer commented Apr 25, 2019

I have a use case where I would like to update multiple partitions of data at the same time, but the partitions are totally separate and do not interact with each other.

For example, I would like to run these queries concurrently (which would be very useful in the case of backfills):

spark.read.parquet("s3://jim/dataset/v1/valid-records")
  .filter(col("event_date") === lit("2019-01-01"))
  .write
  .partitionBy("event_date")
  .format("delta")
  .mode(SaveMode.Overwrite)
  .option("replaceWhere", "event_date == '2019-01-01'")
  .save("s3://calvin/delta-experiments")
spark.read.parquet("s3://jim/dataset/v1/valid-records")
  .filter(col("event_date") === lit("2019-01-02"))
  .write
  .partitionBy("event_date")
  .format("delta")
  .mode(SaveMode.Overwrite)
  .option("replaceWhere", "event_date == '2019-01-02'")
  .save("s3://calvin/delta-experiments")

So the data above, written as Delta, belongs to two separate partitions that do not interact with each other. Yet, according to the Delta documentation and in my own experience, this fails with a com.databricks.sql.transaction.tahoe.ProtocolChangedException: The protocol version of the Delta table has been changed by a concurrent update. Please try the operation again.

Would you support this use case, where partitions that do not interact with each other can be updated concurrently?

Parquet seems to allow this just fine (without any corruption) if you turn on dynamic partition overwrite with spark.sql.sources.partitionOverwriteMode=dynamic. This is a very valid use case if you adhere to Maxime Beauchemin's technique of immutable table partitions; a quick sketch of that Parquet setup is below.
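
For reference, a minimal sketch of the Parquet behavior being described (df and the output path are placeholders, not from the original post):

// Replace only the partitions present in df instead of the whole table
// when writing with SaveMode.Overwrite and partitionBy.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.write
  .partitionBy("event_date")
  .mode(SaveMode.Overwrite)
  .parquet("s3://calvin/parquet-experiments")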

@wernerdaehn

I would assume this limitation exists because Delta supports ACID within a table. If you have two sessions writing into different partitions, that would require different transaction handling than a single writer writing into all partitions. It might be harder to implement than it looks at first sight.

Having said that, I would love to have such an option as well. There will be situations where you need mass data loads and would be okay with relaxed transaction guarantees, and there will be situations where transaction guarantees matter more than bulk-load performance.

my2cents


hackmad commented Apr 25, 2019

I can appreciate the challenges in designing something like this. However, it basically means that existing processes that simultaneously load partitions in Parquet format cannot be converted to Delta; it essentially serializes all stages that could otherwise run in parallel. It might be worth having an option to load all the data first and then update the partition information in the metastore, similar to what you have to do in Athena. For a new pipeline, we would have to work around this by first loading the secondary-partition-level data into its own unpartitioned S3 locations and only later organizing it. That still carries the overhead of additional storage (which could be mitigated with S3 retention policies) but, more importantly, more than doubles the compute cost by processing the data a second time.


tdas commented Apr 25, 2019

First of all, thank you very much for trying out Delta Lake! The current version (0.1.0) has a very restrictive conflict-detection check in order to be absolutely safe. In future releases we will gradually relax the conflict criteria to allow more concurrency while still ensuring the ACID guarantees. Hopefully we will be able to make workloads like this easier.

tdas added the enhancement label on Apr 25, 2019
@ligao101

Hello, I am looking into how Delta would work for an S3-based data lake. What are the current limitations of running the current Delta Lake on S3? This issue appears to be one of the potential limitations we could hit. Thanks!


tdas commented Apr 25, 2019

Delta Lake currently works with full guarantees only on HDFS, because HDFS provides the file system operation guarantees that give Delta Lake its consistency guarantee. The S3 file system does not provide those guarantees yet, primarily because S3 does not offer list-after-write consistency.
Details on the required guarantees: https://github.com/delta-io/delta#transaction-protocol

PS: This is a completely different issue than the original one in this thread. Please open a separate issue for it.


liwensun commented Jun 19, 2019

Thanks for sharing your use cases and the great discussion. I have created a concurrency-support tracking issue that references this issue, so people can see the use cases and discussion here.

liwensun reopened this on Jun 19, 2019
@koertkuipers

@calvinlfer I agree that concurrent writes to totally separate partitions would be great.

However, I was surprised to hear you say Parquet supports this just fine. We have run into issues with this using dynamic partitionOverwriteMode and partitionBy, because both writers try to create temporary files in baseDir/_temporary, leading to weird errors when one job finishes and deletes _temporary while the other is still running. Just FYI.

@koertkuipers

I am not sure it is straightforward to safely allow concurrent writes that replace partitions. The optimistic transaction seems to know which files were added or deleted, but that is not the same as knowing what the intent of the transaction was.

For example, a transaction might have had the intent to replace everything where, say, a=1, so replaceWhere("a=1"); suppose there was nothing to delete and it only wrote out a=1/b=1/part.snappy. If another transaction ran at the same time, also with replaceWhere("a=1"), and also deleted nothing but created the file a=1/b=2/part.snappy, then just by looking at the file actions they do not seem to be in conflict, but they are. A sketch of this is below.
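
A minimal sketch of that point, using a hypothetical simplified transaction model (the Txn case class and both checks are illustrative, not Delta's actual conflict-detection code): a file-level check sees no overlap between the two transactions above, while an intent-level check on the replaceWhere predicates does.

// Hypothetical simplified model of the two transactions above.
case class Txn(replaceWhere: String, added: Set[String], deleted: Set[String])

val t1 = Txn("a=1", added = Set("a=1/b=1/part.snappy"), deleted = Set.empty)
val t2 = Txn("a=1", added = Set("a=1/b=2/part.snappy"), deleted = Set.empty)

// File-level check: is any file touched by both transactions?
def fileConflict(x: Txn, y: Txn): Boolean =
  (x.added ++ x.deleted).intersect(y.added ++ y.deleted).nonEmpty

// Intent-level check: equality here is a crude stand-in for real
// predicate-overlap checking.
def intentConflict(x: Txn, y: Txn): Boolean =
  x.replaceWhere == y.replaceWhere

assert(!fileConflict(t1, t2))  // looks safe from file actions alone
assert(intentConflict(t1, t2)) // but both claimed replaceWhere("a=1")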

@koertkuipers
Copy link
Contributor

koertkuipers commented Aug 29, 2019

Note that for the example of dynamic partition overwrite (which is not in Delta, but which we added on our own branch), it is easy to reason about: the files deleted are always in exactly the same partitions as the files added, so you only need to check for conflicts with respect to added files (e.g. verify that the transactions did not write to the exact same partitions). A sketch of that check follows.
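
Sketching that partition-level check under the same hypothetical model (illustrative only; partitionsOf naively treats everything up to the file name as the partition path):

// For dynamic partition overwrite, the partitions a transaction replaces
// are exactly the partitions of the files it adds.
def partitionsOf(added: Set[String]): Set[String] =
  added.map(_.split('/').dropRight(1).mkString("/"))

def overwriteConflict(a: Set[String], b: Set[String]): Boolean =
  partitionsOf(a).intersect(partitionsOf(b)).nonEmpty

// Two jobs adding files to different event_date partitions do not conflict...
assert(!overwriteConflict(
  Set("event_date=2019-01-01/part-00000.snappy.parquet"),
  Set("event_date=2019-01-02/part-00000.snappy.parquet")))
// ...while two jobs rewriting the same partition do.
assert(overwriteConflict(
  Set("event_date=2019-01-01/part-00000.snappy.parquet"),
  Set("event_date=2019-01-01/part-00001.snappy.parquet")))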

@calvinlfer

Hey @koertkuipers, what version and flavor of Spark are you using? @hackmad and I have seen this work at scale on the Databricks platform with Spark 2.4.1.

@koertkuipers

@calvinlfer we use Spark 2.4.1 on Hadoop 2.7.
However, I am a little uncertain whether that is the version we observed the issue with, or whether it was an earlier version and we have simply avoided the situation ever since. I remember the errors being HDFS lease exceptions, because one job would delete the _temporary directory while the other was still using it.

@koertkuipers

@calvinlfer maybe things have changed for the better. I now see, when I run two jobs writing to different partitions using partitionBy and dynamic partitionOverwriteMode:

drwxr-xr-x   - koert koert          0 2019-08-29 18:18 out/.spark-staging-be40030e-8eef-4680-85ac-b55e6519df60/partition=2
Found 1 items
drwxr-xr-x   - koert koert          0 2019-08-29 18:18 out/.spark-staging-d25f16d3-8f2d-4cf4-89bd-09256469b5e5/partition=1
Found 1 items
drwxr-xr-x   - koert koert          0 2019-08-29 18:17 out/_temporary/0

So it seems each job has its own .spark-staging directory, and _temporary isn't really used? Not sure...

@calvinlfer

Sorry, I should have mentioned this more explicitly earlier, but we used S3 instead of HDFS, so I believe the underlying implementation is quite different and allows concurrent writes to non-conflicting partitions.


hospadar commented Sep 6, 2019

Wanted to add to this: this would be a blocker for us as well in switching from Parquet to Delta.

Right now we store our underlying data in S3 as Parquet and manage partitions fairly manually to keep tables in a happy state. We always write new (or replacement) partitions to a new folder, then swap the location of the partition in the metastore so the change looks like an atomic update to anyone querying the data downstream (this also theoretically lets us roll back an update, although that is impractical for us and requires log spelunking to find the old paths).

We often do big backfill/reprocessing jobs where we process tons of dates in parallel to keep the cluster over-committed. If we could only write one partition at a time, our throughput on jobs like this would slow down quite a bit.

I'd love to switch to Delta; it would make it MUCH easier to revert data to earlier states (and a variety of other things would become more convenient for us), but this issue is probably a blocker.

Our logic goes something like:

// First thread: write the partition data to a fresh, unique location...
val path = s"s3://warehouse/${java.util.UUID.randomUUID()}"
dataframe.write.parquet(path)
// ...then atomically repoint the partition in the metastore.
spark.sql(s"ALTER TABLE target PARTITION (dt='2019-01-01') SET LOCATION '$path'")

// Another thread doing the same thing, but for a different date
val path2 = s"s3://warehouse/${java.util.UUID.randomUUID()}"
dataframe.write.parquet(path2)
// Register the second dataframe to a different partition
spark.sql(s"ALTER TABLE target PARTITION (dt='2019-01-02') SET LOCATION '$path2'")

@koertkuipers

@calvinlfer I did some more checking, and the issue of writers conflicting with each other when writing to the same baseDir with dynamic partition overwrite does still exist in Spark 2.4 and Spark master, for all file sources. I cannot say anything about writing to S3; that could be very different.
For more info, please see (and vote for):
https://issues.apache.org/jira/browse/SPARK-28945


tdas commented Dec 11, 2019

We have improved our concurrency control in commit f328300.

This allows operations on disjoint partitions to be written concurrently.
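
With the relaxed check, the backfill pattern from the original post could be driven concurrently from a single application. A sketch, assuming the paths and dates from that post and that the new conflict detection treats these replaceWhere predicates as disjoint (untested, illustrative only):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, lit}

// One overwrite per disjoint event_date partition; with partition-level
// conflict detection these writes should no longer abort each other.
val dates = Seq("2019-01-01", "2019-01-02")
val writes = dates.map { d =>
  Future {
    spark.read.parquet("s3://jim/dataset/v1/valid-records")
      .filter(col("event_date") === lit(d))
      .write
      .partitionBy("event_date")
      .format("delta")
      .mode(SaveMode.Overwrite)
      .option("replaceWhere", s"event_date == '$d'")
      .save("s3://calvin/delta-experiments")
  }
}
writes.foreach(Await.result(_, Duration.Inf))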

tdas modified the milestones: Future Roadmap → 0.5.0 on Dec 11, 2019
tdas closed this as completed on Dec 11, 2019