Add chunking to ds_tool #97
Conversation
Loved the DatasetChunkProcessor abstraction. It cleanly separates the concerns without increasing complexity in other areas of the code.
I left too many nit comments, feel free to ignore most, but 2-3 comments are very important:
- I feel like _upload would rewrite the whole set, which is not desired.
- I didn't fully see the "recursive" logic that I thought you mentioned in TGIF. I expected a different approach there.
- Make sure dataset.py can handle the Error classes gracefully.
Thanks Pat.
Left some comments, but approved to unblock you.
Adding chunking to ds_tool so it is more resilient to processing/upload failures for larger datasets.
Dynamically chunks the dataset for processing and upload. Failed chunks are split into smaller chunks and recursively go through the same process; this repeats until chunk_split_threshold is reached, at which point the chunk is uploaded as is.
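A minimal sketch of that recursive split-and-retry flow (the function names, the halving strategy, and the default threshold here are illustrative assumptions, not the actual ds_tool code):

```python
from typing import Callable

from datasets import Dataset


def process_and_upload(
    ds: Dataset,
    process: Callable[[Dataset], Dataset],
    upload: Callable[[Dataset], None],
    chunk_split_threshold: int = 1000,  # assumed default, purely illustrative
) -> None:
    """Process and upload a chunk; on failure, split it and recurse."""
    try:
        upload(process(ds))
    except Exception:
        if len(ds) <= chunk_split_threshold:
            # Chunk is already at/below the split threshold: upload it as is.
            upload(ds)
            return
        # Split the failed chunk in half and push each half through the
        # same process recursively.
        mid = len(ds) // 2
        process_and_upload(ds.select(range(mid)), process, upload, chunk_split_threshold)
        process_and_upload(ds.select(range(mid, len(ds))), process, upload, chunk_split_threshold)
```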
Note:
push_to_hub automatically shards chunks into roughly 500 MB files, so on Hugging Face the file names will carry both a chunk and a shard identifier. We can just use a wildcard to set the path for our splits in the README, i.e. subset_name/split_name/**
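For illustration, the dataset card YAML could reference the uploaded files with that wildcard like the snippet below (subset/split names are placeholders, not the real config):

```yaml
configs:
  - config_name: subset_name
    data_files:
      - split: split_name
        # Wildcard matches every chunk/shard file under this split's directory.
        path: subset_name/split_name/**
```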