{MDS,Joint}Writer should be more tolerant to transient failures in remote storage #451

Closed
thempatel opened this issue Sep 27, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@thempatel

thempatel commented Sep 27, 2023

Environment

  • Python 3.10
  • rev: 764a73014ca882aba372aad9c454dca73f07199d
  • OS: apache/beam_python3.10_sdk:2.49.0
  • n2d-standard, etc. (GCP)

It is a bit of a challenge to reproduce this behavior, so I'll do my best to explain what I believe to be happening.

We've been following the documentation here to create a massive dataset using the MDS format! I noticed that the writer (the Joint and Base classes) asynchronously uploads index and shard data here and here. Additionally, because we are on GCP, we use GCS. I've noticed that the GCSUploader uses the Google Cloud Storage client to perform the upload, which is great!

Our problem can be described as follows: there is a rare situation in which the index file is written to remote storage while a shard referenced by that index fails to upload. The resulting index is therefore incorrect or corrupt.

I believe it plays out as follows:

  1. In the call that finishes the writer, both the in-flight data and the index are flushed and enqueued to be written to remote storage
  2. There is no guarantee on the order in which these uploads happen
  3. The GCS uploader mentioned above is not configured to retry on failure
  4. The index upload succeeds
  5. The data shard upload fails
  6. The program won't terminate with an error, because the future isn't explicitly handled in the main thread; the error is simply logged (see the sketch after this list)
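
For illustration, here is a minimal, self-contained sketch of the fire-and-forget pattern in step 6. This is not the actual streaming code; `upload_shard` and the callback are hypothetical stand-ins. The background upload raises, the exception is only logged from a done-callback, and the main thread still exits cleanly:

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def upload_shard(path: str) -> None:
    # Stand-in for the real remote upload; pretend it hits a transient error.
    raise ConnectionError(f'transient failure while uploading {path}')

def log_failure(future) -> None:
    # The error is observed here, logged, and then swallowed.
    exc = future.exception()
    if exc is not None:
        logger.error('Upload failed: %s', exc)

executor = ThreadPoolExecutor(max_workers=4)
future = executor.submit(upload_shard, 'shard.00000.mds')
future.add_done_callback(log_failure)
executor.shutdown(wait=True)
# Reaches here with exit code 0 even though the shard upload failed.
print('writer finished without surfacing the failed upload')
```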

If folks agree with this characterization, then there are a few possible solutions (a rough sketch combining them follows this list):

  1. Enqueue the index upload only after all in-flight shard upload requests have completed. If any of them failed, the error event will already be set and the index file will, correctly, not be uploaded.
  2. Configure the GCS uploader to retry on failures.
  3. Explicitly handle the upload futures and hard-fail the main thread when one of them raises.
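
To make those three options concrete, here is a rough sketch. It is my own code, not the library's; the bucket name, paths, and helper names are made up. It retries each GCS upload using the retry support in google-cloud-storage, re-raises any upload failure in the main thread, and only uploads the index after every shard upload has succeeded:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY

def upload_file(bucket_name: str, local_path: str, remote_path: str) -> None:
    """Upload a single file, retrying transient GCS errors (solution 2)."""
    # A fresh client per call keeps the sketch simple and sidesteps
    # any thread-safety questions.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(remote_path)
    blob.upload_from_filename(local_path, retry=DEFAULT_RETRY)

def upload_shards_then_index(bucket_name: str, shard_paths: list, index_path: str) -> None:
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(upload_file, bucket_name, p, p) for p in shard_paths]
        for fut in as_completed(futures):
            fut.result()  # re-raises in the main thread (solution 3)
    # Only reached if every shard uploaded, so the index can no longer
    # reference a shard that silently failed to upload (solution 1).
    upload_file(bucket_name, index_path, index_path)
```
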
@thempatel thempatel added the bug Something isn't working label Sep 27, 2023
@snarayan21
Collaborator

Hey @thempatel, thanks for raising this issue. Will discuss with the team and get back to you!

@snarayan21
Collaborator

Hey @thempatel, the team's aware of the issue and we're currently working on implementing better retry logic for both downloaders and uploaders. This will address the issue you're seeing. Thanks for bringing this up with us!

@karan6181
Collaborator

Hi @thempatel, can you please review PR #448 and see whether it solves the issues you mentioned above? Thanks!

@thempatel
Author

@snarayan21 @karan6181 thanks team! the attached change looks 👌🏽 !

@viyjy

viyjy commented Sep 29, 2023

@karan6181 Hi, thanks for fixing this issue. Since there is no released version that includes this fix yet, how can I use it?

@karan6181
Collaborator

Hi @viyjy, we are working on a new patch release. In the meantime, can you use the main branch?
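
Installing directly from the main branch with pip typically looks like `pip install git+https://github.com/mosaicml/streaming.git@main` (assuming the package is built from the mosaicml/streaming repository).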

@viyjy

viyjy commented Sep 29, 2023

> Hi @viyjy, we are working on a new patch release. In the meantime, can you use the main branch?

Thanks. I have already launched multiple jobs using the previous version, and they are running now. I can see files being saved to my S3 bucket. What if my current jobs fail at some point? Should I delete all the generated files on S3 and then use the main branch to regenerate the data from scratch? Since my dataset is very large, I am trying to avoid rewriting files that are already saved to S3. Thanks.

@viyjy

viyjy commented Oct 1, 2023

> Hi @viyjy, we are working on a new patch release. In the meantime, can you use the main branch?

Update: all of my jobs failed due to this error: #453
