Push to hub capabilities for Dataset and DatasetDict #3098
This looks super cool (and clean)! 😎
I have a few minor comments/suggestions:
src/datasets/arrow_dataset.py (outdated diff):

```python
def push_to_hub(
    self,
    repo_id: str,
    split_name: str,
```
Would rename this arg to `split` (for consistency with the `from_*` methods) and make it optional: if None, it is set to `self.split`. Also, with this change, you don't have to explicitly pass the DatasetDict key as a `split` in `DatasetDict.push_to_hub`.
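A quick sketch of the suggested change (hypothetical, not the final API):

```python
from typing import Optional

def push_to_hub(self, repo_id: str, split: Optional[str] = None):
    # Fall back to the dataset's own split when the caller doesn't
    # pass one, mirroring the from_* methods' naming.
    if split is None:
        split = str(self.split)
    ...
```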
src/datasets/arrow_dataset.py (outdated diff):

```python
    self,
    repo_id: str,
    split_name: str,
    private: Optional[bool] = None,
```
I understand this is related to the Hub API design, but IMO it makes more sense to have this arg set to `False` by default (if I'm not mistaken, private repos are only available to paid accounts). So I'm just wondering what the reasoning behind this decision is from the API design standpoint (since the docs here don't mention it)?
Omission on my part, thank you!
```python
    branch: Optional[str] = None,
    shard_size: Optional[int] = 500 << 20,
```
I think the defaults should be explained in the docstring (same for the `DatasetDict.push_to_hub` docstring):

branch -> main
shard_size -> 500MB

A nit: maybe rename `shard_size` to `chunksize` for consistency with the JSON and TEXT writers.
Actually `shard_size` is more appropriate IMO, since the dataset actually ends up being sharded into several files, while `chunksize` refers to chunking parts of a dataset to load them into memory one by one.
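For intuition, a small sketch of how a byte-based `shard_size` maps to a number of shards (the helper is illustrative, not the PR's actual code):

```python
import math

def number_of_shards(dataset_nbytes: int, shard_size: int = 500 << 20) -> int:
    # 500 << 20 == 500 * 2**20 bytes, i.e. the 500MB default discussed above.
    return max(math.ceil(dataset_nbytes / shard_size), 1)

print(number_of_shards(2 * 1024**3))  # a 2GiB dataset -> 5 shards
```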
src/datasets/arrow_dataset.py (outdated diff):

```python
identifier = repo_id.split("/")
if len(identifier) == 2:
    organization, dataset_name = identifier
else:
    dataset_name = identifier[0]
    organization = None
```
Maybe raise a ValueError if `len(identifier) > 2` to avoid later confusion on the user side.
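A minimal sketch of the suggested guard (message wording is hypothetical):

```python
identifier = repo_id.split("/")
if len(identifier) > 2:
    raise ValueError(
        f"Invalid repo_id: {repo_id}. It should have the form "
        "'dataset_name' or 'organization/dataset_name'."
    )
```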
Thank you for your reviews! I should have addressed all of your comments, and I added a test to ensure that [...]. As blockers for this PR:

```python
ds = Dataset.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
local_ds = DatasetDict({"random": ds})
local_ds['random'].split  # returns None
```

In order to remove the [...]
Currently it looks like it only saves the last split.

```
>>> dataset.push_to_hub("lhoestq/squad_titles", shard_size=50<<10)
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub: 100%|█| 31/31 [00:22<00:00, 1.38
Pushing split validation to the Hub.
The repository already exists: the `private` keyword argument will be ignored.
Deleting unused files from dataset repository: 100%|█| 31/31 [00:14<00:00,
Pushing dataset shards to the dataset hub: 100%|█| 4/4 [00:03<00:00, 1.18it
```

Note the "Deleting" part.
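One way the cleanup step could be scoped (a hedged sketch assuming shard files are prefixed with the split name; the actual fix may differ): only delete stale files belonging to the split being pushed.

```python
def files_to_delete(remote_files, split, uploaded_shards):
    # Only consider files of the split being pushed, so that pushing
    # "validation" no longer deletes the "train" shards.
    keep = set(uploaded_shards)
    return [f for f in remote_files if f.startswith(f"{split}-") and f not in keep]

print(files_to_delete(
    ["train-00000-of-00031.parquet", "validation-00000-of-00004.parquet"],
    split="validation",
    uploaded_shards=[],
))  # -> ['validation-00000-of-00004.parquet']
```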
I think this PR should fix #3035, so feel free to link it.
Thank you for your comments! I have rebased on [...]

@lhoestq, I have fixed the issue with splits and added a corresponding test. @mariosasko I have not updated the [...]

Only remaining issues before merging: I need to understand how to build a [...]
Cool, thanks! And indeed this won't solve #3035 yet.

You can use the key in the DatasetDict instead of the [...]
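A minimal sketch of that idea, assuming `DatasetDict.push_to_hub` delegates per split (names are illustrative, not the PR's exact code):

```python
class DatasetDict(dict):
    def push_to_hub(self, repo_id: str, **kwargs):
        # Use each dict key ("train", "validation", ...) as the split
        # name, instead of relying on dataset.split (which can be None).
        for split, dataset in self.items():
            dataset.push_to_hub(repo_id, split=split, **kwargs)
```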
@LysandreJik Sorry, I misread the linked PR.

What do you think about bumping the minimum version of pyarrow to 3.0.0? This is the minimum required version to write Parquet files, which is needed for push_to_hub; that's why our pyarrow 1 CI is failing. I think it's fine since it's been available for a long time (January 2021) and it's also the version that is installed on Google Colab.
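For reference, a minimal guard of the kind such a bump implies (illustrative; not necessarily how datasets enforces it):

```python
import pyarrow
from packaging import version

# Parquet writing, which push_to_hub relies on, requires pyarrow >= 3.0.0.
if version.parse(pyarrow.__version__) < version.parse("3.0.0"):
    raise ImportError(
        f"push_to_hub requires pyarrow>=3.0.0, found {pyarrow.__version__}"
    )
```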
Bumping pyarrow to 3.0.0 is fine for me. I don't think we need to keep a lot of backward support for pyarrow.
This is awesome, thank you so much :)

Now let's add some docs about it - I'm creating a PR now!
Hi. [...]
Hi! Let me respond here as well in case other people have the same issues and come here:

Old version of [...]
This PR implements a `push_to_hub` method on `Dataset` and `DatasetDict`. This does not currently work in `IterableDatasetDict` nor `IterableDataset` as those are simple dicts, and I would like your opinion on how you would like to implement this before going ahead and doing it.

This implementation needs to be used with the following `huggingface_hub` branch in order to work correctly: huggingface/huggingface_hub#415

Implementation
The `push_to_hub` API is entirely based on HTTP requests rather than a git-based workflow:

- [...] `push_to_hub` method.
- [...] `Repository` helper of the `huggingface_hub` to be used instead of the `push_to_hub` method, which will always be, by design, limiting in that regard (even if based on a git workflow instead of HTTP requests).

In order to overcome the limit of 5GB files set by the HTTP requests, dataset sharding is used.
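As a usage sketch of the API described above (the repo id is a placeholder):

```python
from datasets import Dataset

ds = Dataset.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})

# Shards are capped at shard_size bytes (500MB by default), keeping
# every uploaded file well under the 5GB HTTP limit.
ds.push_to_hub("username/my-dataset", shard_size=500 << 20)
```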
Testing

The test suite implemented here makes use of the moon-staging instead of the production setup. As several repositories are created and deleted, it is better to use the staging.

It does not require setting an environment variable or any kind of special attention, but introduces a new decorator `with_staging_testing` which patches global variables to use the staging endpoint instead of the production endpoint.
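For illustration, a rough sketch of what such a decorator could look like (the patched attribute path is an assumption; the PR's actual implementation may differ):

```python
from unittest.mock import patch

STAGING_ENDPOINT = "https://moon-staging.huggingface.co"

def with_staging_testing(func):
    # Point all Hub API calls made during the test at the staging
    # endpoint, so repo creation/deletion never touches production.
    return patch("datasets.config.HF_ENDPOINT", STAGING_ENDPOINT)(func)
```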
Examples

The tests cover a lot of examples and behavior.