{MDS,Joint}Writer should be more tolerant to transient failures in remote storage #451

Comments
Hey @thempatel, thanks for raising this issue. Will discuss with the team and get back to you!

Hey @thempatel, the team's aware of the issue and we're currently working on implementing better retry logic for both downloaders and uploaders. This will address the issue you're seeing. Thanks for bringing this up with us!
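For illustration only, retry logic of the kind mentioned here might look roughly like the sketch below; the `upload_file` callable and the retry parameters are hypothetical placeholders, not the library's actual API:

```python
import random
import time


def upload_with_retries(upload_file, local_path, remote_path,
                        max_attempts=5, base_delay=1.0):
    """Retry a flaky upload with exponential backoff and jitter.

    `upload_file` is a hypothetical callable that raises on failure;
    the real streaming uploaders may expose a different interface.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            upload_file(local_path, remote_path)
            return
        except Exception:  # ideally narrow this to transient error types
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter before the next attempt.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            time.sleep(delay)
```

Narrowing the except clause to the storage client's transient error types (timeouts, 5xx responses) would avoid retrying failures that can never succeed.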
Hi @thempatel, can you please review PR #448 and see whether it solves the issues you mentioned above? Thanks!

@snarayan21 @karan6181 thanks team! the attached change looks 👌🏽!

@karan6181 Hi, thanks for fixing this issue. Since there is no released version with this new feature yet, how can I use it?

Hi @viyjy, we are working on a new patch release. In the meantime, can you use

Thanks. I have already launched multiple jobs using the previous version, and they are running now. I can see files being saved to my S3 bucket. What if my current jobs fail at some point? Should I delete all the generated files on S3 and then use
Environment

- 764a73014ca882aba372aad9c454dca73f07199d
- apache/beam_python3.10_sdk:2.49.0
It is a bit of a challenge to reproduce this behavior, so I'll do my best to explain what I believe to be happening.
We've been following the documentation here to create a massive dataset using the MDS format! I noticed that the writer (Joint and the Base class) asynchronously uploads index and shard data here and here. Additionally, because we are on GCP, we use GCS. I've noticed that the GCSUploader uses the Google storage client to perform the upload, which is great!

Our problem can be described as follows: there is a rare situation where the index file is written to remote storage while a shard referenced by that index fails to upload, leaving the index incorrect or corrupt.
I believe it precipitates as follows: when you finish the writer, both the in-flight data and the index are flushed and enqueued to be written to remote storage. Because the shard uploads and the index upload proceed asynchronously and independently, a transient failure on a shard upload does not stop the index from being uploaded, which leaves the index referencing a shard that never reached the bucket.

If folks agree with this characterization, then there are some possible solutions.
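As a rough illustration of the failure mode described above (and not part of the streaming library itself), a post-run check could compare the shards listed in index.json against what actually landed in the bucket. The bucket/prefix arguments, the google-cloud-storage usage, and the raw_data/zip_data key lookups below are assumptions for this sketch; adjust them if your index schema differs:

```python
import json

from google.cloud import storage  # assumption: GCS, as in the report above


def find_missing_shards(bucket_name, prefix):
    """Report shard files that index.json references but that are absent in GCS.

    Assumes the MDS layout where index.json sits next to the shard files and
    each shard entry carries its file name under raw_data["basename"] (and
    optionally zip_data["basename"]); this is an illustrative check only.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Download and parse the index that the writer uploaded.
    index_blob = bucket.blob(f"{prefix}/index.json")
    index = json.loads(index_blob.download_as_bytes())

    missing = []
    for shard in index.get("shards", []):
        for key in ("raw_data", "zip_data"):
            info = shard.get(key)
            if info and not bucket.blob(f"{prefix}/{info['basename']}").exists():
                missing.append(info["basename"])
    return missing
```

Running a check like this after the writer finishes would at least surface an inconsistent index before a training job tries to consume the dataset.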