HfFileSystem's `transaction` is working counterintuitively #1733
Comments
Hi @TwoAbove, thanks for describing your issue. I assume this is more a feature request than a bug. We don't have any transaction-specific implementation in `HfFileSystem`. As you said, implementing proper transactions is something that requires work both client-side and server-side, and it will most likely not happen anytime soon. However, a solution that has been introduced in v0.18.0 is to upload files one by one and make a deferred commit only once - which is the closest thing we have to transactions 😄. To do this, you'll need to use `preupload_lfs_files` together with a single `create_commit` call.
Hope this will help you. Please let me know if you have any questions.
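The "upload one by one, commit once" idea maps onto `HfApi.preupload_lfs_files` plus one `HfApi.create_commit` in huggingface_hub v0.18.0+. Below is a network-free sketch of that flow, with the two API calls replaced by counting stubs - everything here is illustrative, not the library's actual code, and the repo id is just an example:

```python
# Illustrative sketch of "upload files one by one, then one deferred commit".
# In huggingface_hub >= 0.18 the real calls would be
# HfApi.preupload_lfs_files(...) per file and a single HfApi.create_commit(...).
from dataclasses import dataclass, field


@dataclass
class FakeHubApi:
    """Stand-in for HfApi that counts requests instead of hitting the Hub."""
    preuploads: int = 0
    commits: int = 0
    committed_ops: list = field(default_factory=list)

    def preupload_lfs_files(self, repo_id, additions):
        # The real call uploads the file content but creates NO commit yet.
        self.preuploads += len(additions)

    def create_commit(self, repo_id, operations, commit_message):
        # One commit covering all accumulated operations.
        self.commits += 1
        self.committed_ops.extend(operations)


api = FakeHubApi()
files = ["chunk-00.parquet", "chunk-01.parquet", "chunk-02.parquet"]

operations = []
for path in files:
    op = ("add", path)  # stand-in for CommitOperationAdd(path, ...)
    api.preupload_lfs_files("some-org/some-dataset", additions=[op])
    operations.append(op)

# Deferred commit: all three files land in ONE commit instead of three.
api.create_commit(
    "some-org/some-dataset",
    operations=operations,
    commit_message="Add three chunks in a single commit",
)

print(api.preuploads, api.commits)
```

The point is the request count: N uploads but only one commit, instead of N separate commits that eat into API rate limits.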
Raising an explicit exception when `transaction` is used would make the current behavior clearer.
Opened a PR for that: #1736. @TwoAbove please let me know if you need further help; otherwise I think we can close this issue. We will not implement transactions, so I think explicitly raising an exception is fine for now.
That makes sense. Thanks!
Also @Wauplin, thanks for the suggestion to upload files one by one and make a deferred commit only once. That pretty much solves the main issue we were anticipating!
When renaming, you don't need the "pre-upload" step since nothing is re-uploaded. So yes, you can have both in the same commit.
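One way to read this: a rename on the Hub can be expressed at commit time as a copy plus a delete, so it can sit in the same operations list as the newly written file. A local sketch, with tuples standing in for huggingface_hub's `CommitOperationCopy`, `CommitOperationDelete`, and `CommitOperationAdd` (the function and paths below are hypothetical):

```python
# Sketch: a rename (copy + delete) and a new file grouped into ONE commit.
# Tuples stand in for huggingface_hub's CommitOperationCopy,
# CommitOperationDelete, and CommitOperationAdd.

def rename_ops(src, dest):
    """A Hub-side rename needs no re-upload: copy the blob, delete the old path."""
    return [("copy", src, dest), ("delete", src)]


operations = []
operations += rename_ops("data.parquet", "data-old.parquet")
# The replacement content was pre-uploaded separately; only its add op goes here.
operations.append(("add", "data.parquet"))

# All three operations would then be passed to a single
# create_commit(..., operations=operations) call - one commit total.
print(len(operations))
```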
Sounds good. Thanks again!
Describe the bug
Hey!
I'm trying to optimize some code that updates an HF dataset.
In short, here's the pseudo-code that best describes what I'm doing:
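A hypothetical reconstruction of that pattern (pseudo-code, illustrative paths only):

```
fs = HfFileSystem()
with fs.transaction:
    # expectation: these are grouped into ONE commit at transaction end
    fs.rename("datasets/org/repo/data.parquet", "datasets/org/repo/data-old.parquet")
    with fs.open("datasets/org/repo/data.parquet", "wb") as f:
        f.write(new_chunk)
```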
What I expect `with fs.transaction` to do is to group renaming and writing actions until the transaction ends, and then commit everything to the HF dataset in one commit. The issue is that, currently, it does not group the changes, and they are committed separately. We can very quickly hit API request limits because of this.
We do these chunked updates because the GitHub runner that we're using can't handle downloading the HF dataset and updating it in memory, so this is the solution we came up with. We could use larger machines, but that's not sustainable in the long run - this will happen again eventually.
I've looked at the implementations in `hf_api` and `hf_file_system`, and I couldn't find a way to implement this pooling - I guess it needs server-side support. Is this something that's possible to do? Am I missing anything?
Maybe someone can propose some other method to push new rows to a HF Dataset?
Thanks!
P.S.
Here's the PR that I proposed in our repo to solve the OOM issue we were seeing: https://github.com/LAION-AI/Discord-Scrapers/pull/2/files
And also some discussions about this in the Dataset itself: https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/3 https://huggingface.co/datasets/laion/dalle-3-dataset/discussions/4
Reproduction
No response
Logs
No response
System info