{MDS,Joint}Writer should be more tolerant to transient failures in remote storage #451

Closed
thempatel opened this issue Sep 27, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@thempatel

thempatel commented Sep 27, 2023

Environment

  • Python 3.10
  • rev: 764a73014ca882aba372aad9c454dca73f07199d
  • OS: apache/beam_python3.10_sdk:2.49.0
  • n2d-standard, etc. (GCP)

It is a bit of a challenge to reproduce this behavior, so I'll do my best to explain what I believe to be happening.

We've been following the documentation here to create a massive dataset using the MDS format! I noticed that the writer (the Joint and Base classes) asynchronously uploads index and shard data here and here. Additionally, because we are on GCP, we use GCS. I've noticed that the GCSUploader uses the Google Cloud Storage client to perform the upload, which is great!

Our problem can be described as follows: there is a rare situation in which the index file is written to remote storage while a shard referenced by that index fails to upload. The resulting index is therefore incorrect or corrupt.

I believe it plays out as follows:

  1. In the call that finishes the writer, both the in-flight data and the index are flushed and enqueued to be written to remote storage
  2. There is no guarantee on the order in which these uploads happen
  3. The GCS uploader mentioned above is not configured to retry on failure
  4. The index upload succeeds
  5. The data shard upload fails
  6. The program won't terminate with an error, because the future isn't explicitly handled in the main thread; the error is simply logged (see the sketch after this list)
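
For illustration, here is a minimal, self-contained sketch of the fire-and-forget pattern in step 6. This is not the actual streaming code; `upload_shard` and the callback are hypothetical stand-ins. The background upload raises, the exception is only logged from a done-callback, and the main thread still exits cleanly:

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def upload_shard(path: str) -> None:
    # Stand-in for the real remote upload; pretend it hits a transient error.
    raise ConnectionError(f'transient failure while uploading {path}')

def log_failure(future) -> None:
    # The error is observed here, logged, and then swallowed.
    exc = future.exception()
    if exc is not None:
        logger.error('Upload failed: %s', exc)

executor = ThreadPoolExecutor(max_workers=4)
future = executor.submit(upload_shard, 'shard.00000.mds')
future.add_done_callback(log_failure)
executor.shutdown(wait=True)
# Reaches here with exit code 0 even though the shard upload failed.
print('writer finished without surfacing the failed upload')
```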

If folks agree with this characterization, then there are a few possible solutions (a rough sketch combining them follows this list):

  1. Enqueue the index upload only after all in-flight shard upload requests have completed. If any of them failed, the error event will already be set and the index file will, correctly, not be uploaded.
  2. Configure the GCS uploader to retry on failures.
  3. Explicitly handle the upload futures and hard-fail the main thread when one of them raises.
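
To make those three options concrete, here is a rough sketch. It is my own code, not the library's; the bucket name, paths, and helper names are made up. It retries each GCS upload using the retry support in google-cloud-storage, re-raises any upload failure in the main thread, and only uploads the index after every shard upload has succeeded:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY

def upload_file(bucket_name: str, local_path: str, remote_path: str) -> None:
    """Upload a single file, retrying transient GCS errors (solution 2)."""
    # A fresh client per call keeps the sketch simple and sidesteps
    # any thread-safety questions.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(remote_path)
    blob.upload_from_filename(local_path, retry=DEFAULT_RETRY)

def upload_shards_then_index(bucket_name: str, shard_paths: list, index_path: str) -> None:
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(upload_file, bucket_name, p, p) for p in shard_paths]
        for fut in as_completed(futures):
            fut.result()  # re-raises in the main thread (solution 3)
    # Only reached if every shard uploaded, so the index can no longer
    # reference a shard that silently failed to upload (solution 1).
    upload_file(bucket_name, index_path, index_path)
```
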
@thempatel thempatel added the bug Something isn't working label Sep 27, 2023
@snarayan21
Collaborator

Hey @thempatel, thanks for raising this issue. Will discuss with the team and get back to you!

@snarayan21
Collaborator

Hey @thempatel, the team's aware of the issue and we're currently working on implementing better retry logic for both downloaders and uploaders. This will address the issue you're seeing. Thanks for bringing this up with us!

@karan6181
Collaborator

Hi @thempatel, can you please review PR #448 and see whether it solves the issues you mentioned above? Thanks!

@thempatel
Author

@snarayan21 @karan6181 thanks team! the attached change looks 👌🏽 !

@viyjy

viyjy commented Sep 29, 2023

@karan6181 Hi, thanks for fixing this issue. Since there is no released version that includes this fix yet, how can I use it?

@karan6181
Collaborator

Hi @viyjy, we are working on a new patch release. In the meantime, can you use the main branch?
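
Installing directly from the main branch with pip typically looks like `pip install git+https://github.com/mosaicml/streaming.git@main` (assuming the package is built from the mosaicml/streaming repository).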

@viyjy

viyjy commented Sep 29, 2023

> Hi @viyjy, we are working on a new patch release. In the meantime, can you use the main branch?

Thanks. I have already launched multiple jobs using the previous version, and they are running now. I can see files being saved to my S3 bucket. What if my current jobs fail at some point? Should I delete all the generated files on S3 and then use the main branch to regenerate the data from scratch? Since my dataset is very large, I am trying to avoid rewriting files that are already saved to S3. Thanks.

@viyjy

viyjy commented Oct 1, 2023

> Hi @viyjy, we are working on a new patch release. In the meantime, can you use the main branch?

Update: all of my jobs failed due to this error: #453
