Push to hub capabilities for Dataset and DatasetDict #3098

Merged (17 commits) on Nov 24, 2021

Conversation

@LysandreJik (Member) commented Oct 17, 2021

This PR implements a push_to_hub method on Dataset and DatasetDict. It does not currently work for IterableDatasetDict or IterableDataset, as those are simple dicts, and I would like your opinion on how you would like this implemented before going ahead and doing it.

This implementation needs to be used with the following huggingface_hub branch in order to work correctly: huggingface/huggingface_hub#415

Implementation

The push_to_hub API is entirely based on HTTP requests rather than a git-based workflow:

  • It allows pushing changes without first cloning the repository, which roughly halves the time taken by the push_to_hub method.
  • Collaboration, as well as the system of branches/merges/rebases, is IMO less straightforward for datasets than for models and Spaces. In situations where such collaboration is needed, I would heavily advocate using the Repository helper of huggingface_hub instead of the push_to_hub method, which will always be limiting in that regard by design (even if it were based on a git workflow instead of HTTP requests).

To stay under the 5GB per-file limit imposed by the HTTP requests, the dataset is sharded into several Parquet files before being uploaded (a rough sketch of the idea follows).
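
A minimal sketch of what such sharding could look like, assuming the size of the in-memory Arrow table is a reasonable proxy for the serialized size; the ds.data.nbytes and shard.data.table attribute accesses are assumptions for illustration, not necessarily what the PR does internally:

import math

import pyarrow.parquet as pq
from datasets import Dataset

def shard_and_write(ds: Dataset, shard_size: int = 500 << 20) -> None:
    # Pick a number of shards so that each Parquet file stays under shard_size bytes.
    num_shards = max(1, math.ceil(ds.data.nbytes / shard_size))
    for index in range(num_shards):
        shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
        # shard.data.table is assumed to expose the shard's underlying pyarrow.Table.
        pq.write_table(shard.data.table, f"data-{index:05d}-of-{num_shards:05d}.parquet")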

Testing

The test suite implemented here makes use of moon-staging instead of the production setup. As several repositories are created and deleted, it is better to use the staging environment.

It does not require setting an environment variable or any other special attention, but it introduces a new decorator, with_staging_testing, which patches global variables to use the staging endpoint instead of the production endpoint.
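
A rough sketch of what such a decorator could look like; the staging URL and the attribute being patched are assumptions for illustration, not the PR's actual code:

from functools import wraps
from unittest.mock import patch

ENDPOINT_STAGING = "https://hub-ci.huggingface.co"  # assumed staging endpoint

def with_staging_testing(func):
    # Redirect Hub calls to the staging endpoint for the duration of the test.
    @wraps(func)
    def wrapper(*args, **kwargs):
        with patch("huggingface_hub.hf_api.ENDPOINT", ENDPOINT_STAGING):
            return func(*args, **kwargs)
    return wrapper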

Examples

The tests cover a wide range of examples and behaviors.
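
For illustration, a minimal usage sketch (the repository name is made up and the glue/mrpc dataset is just an example; this is not taken from the PR's tests):

from datasets import load_dataset

dataset = load_dataset("glue", "mrpc")         # a DatasetDict with train/validation/test splits
dataset.push_to_hub("my-username/mrpc-copy")   # pushes every split to the Hub over HTTP

# The pushed dataset can then be reloaded directly from the Hub:
reloaded = load_dataset("my-username/mrpc-copy")
print(reloaded)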

@mariosasko (Collaborator) left a comment

This looks super cool (and clean)! 😎

I have a few minor comments/suggestions:

def push_to_hub(
self,
repo_id: str,
split_name: str,
Collaborator:

Would rename this arg to split (for consistency with the from_* methods) and make it optional: if None, it is set to self.split. Also, with this change, you don't have to explicitly pass the DatasetDict key as a split in DatasetDict.push_to_hub.
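
A sketch of the suggested fallback, for illustration only (not the PR's code):

from typing import Optional

def push_to_hub(self, repo_id: str, split: Optional[str] = None) -> None:
    # When no split is passed explicitly, fall back to the dataset's own split attribute.
    if split is None:
        split = self.split
    ...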

self,
repo_id: str,
split_name: str,
private: Optional[bool] = None,
Collaborator:

I understand this is related to the Hub API design, but IMO it makes more sense to have this arg set to False by default (if I'm not mistaken, private repos are only available to paid accounts). So I'm just wondering: what's the reasoning behind this decision from the API design standpoint (the docs here don't mention it)?

Member Author:

An omission on my part, thank you!

Comment on lines +3339 to +3321
branch: Optional[str] = None,
shard_size: Optional[int] = 500 << 20,
Collaborator:

I think the defaults should be explained in the docstring (same for the DatasetDict.push_to_hub docstring): branch defaults to main, and shard_size defaults to 500MB. A possible wording is sketched after this comment.

A nit: maybe rename shard_size to chunksize for consistency with the JSON and TEXT writers.
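
A possible docstring wording, sketched for illustration only (keeping the shard_size name, which the reply below argues for; this is not the actual docstring):

def push_to_hub(self, repo_id, split=None, private=False, branch=None, shard_size=500 << 20):
    """
    Push the dataset to the Hub as sharded Parquet files (docstring sketch only).

    Args:
        branch (str, optional): Git branch to push the files to.
            Defaults to the repository's default branch ("main").
        shard_size (int, optional): Maximum size in bytes of each uploaded shard.
            Defaults to 500 << 20 (500MB): the dataset is split into as many
            Parquet files as needed so that each stays under this limit.
    """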

Member:

Actually, shard_size is more appropriate IMO, since the dataset actually ends up being sharded into several files, while chunksize refers to chunking parts of a dataset to load them into memory one by one.

Comment on lines 3373 to 3371
identifier = repo_id.split("/")
if len(identifier) == 2:
organization, dataset_name = identifier
else:
dataset_name = identifier[0]
organization = None
Collaborator:

Maybe raise a ValueError if len(identifier) > 2 to avoid later confusion on the user side.
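
A sketch of the suggested guard (illustrative; the function wrapper is only there to make the snippet self-contained):

def parse_repo_id(repo_id: str):
    identifier = repo_id.split("/")
    if len(identifier) > 2:
        # Fail early on malformed ids such as "a/b/c".
        raise ValueError(
            f"Invalid repo_id: {repo_id}. Expected 'dataset_name' or 'organization/dataset_name'."
        )
    if len(identifier) == 2:
        organization, dataset_name = identifier
    else:
        organization, dataset_name = None, identifier[0]
    return organization, dataset_name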

(Several additional review comments on src/datasets/arrow_dataset.py and tests/test_upstream_hub.py were marked as outdated/resolved.)
@LysandreJik (Member Author)

Thank you for your reviews! I believe I have addressed all of your comments, and I added a test to ensure that private datasets work correctly too. I have merged the changes in huggingface_hub, so its main branch can be installed now, and I will release v0.1.0 soon.

As blockers for this PR:

  • It's still waiting for Resolve data_files by split name #3027 to be addressed, as the folder name will dictate the split name.
  • The self.split name is set to None when the dataset dict is instantiated as follows:
ds = Dataset.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
local_ds = DatasetDict({"random": ds})

local_ds['random'].split  # returns None

In order to remove the split=key, I would need a different way to test this, as the test relies on the above as a surefire way of constructing a DatasetDict.

  • Finally, the threading parameter is flaky on moon-staging, which results in many errors server-side. I propose to leave it as an argument instead of having it set to True, so that users may toggle it as they wish (a rough sketch of such a toggle follows this list).
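
A rough sketch of the toggle; upload_fn is a placeholder for whatever performs a single HTTP upload, and none of this is the PR's actual code:

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def push_files(files: Iterable[str], upload_fn: Callable[[str], None], threaded: bool = False) -> None:
    # threaded=False uploads shards one by one, which is safer on a flaky endpoint
    # such as moon-staging; threaded=True parallelizes the uploads.
    if threaded:
        with ThreadPoolExecutor() as pool:
            list(pool.map(upload_fn, files))
    else:
        for path in files:
            upload_fn(path)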

@lhoestq (Member) commented Nov 3, 2021

Currently it looks like it only saves the last split. Indeed, when writing the data of one split, it deletes all the files from the other splits:

>>> dataset.push_to_hub("lhoestq/squad_titles", shard_size=50<<10)           
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub: 100%|| 31/31 [00:22<00:00,  1.38
Pushing split validation to the Hub.
The repository already exists: the `private` keyword argument will be ignored.
Deleting unused files from dataset repository: 100%|| 31/31 [00:14<00:00,  
Pushing dataset shards to the dataset hub: 100%|| 4/4 [00:03<00:00,  1.18it

Note the "Deleting" part.

@mariosasko (Collaborator) commented Nov 3, 2021

I think this PR should fix #3035, so feel free to link it.

@LysandreJik (Member Author)

Thank you for your comments! I have rebased on master to include PR #3221, and I've updated all tests to reflect the - instead of the _ in the filenames.

@lhoestq, I have fixed the issue with splits and added a corresponding test.

@mariosasko I have not updated the load_dataset method to work differently, so I don't expect #3035 to be resolved with push_to_hub.

Only remaining issues before merging:

  • Take a good look at the threading and decide whether that's something we want to keep.
  • As mentioned above:

The self.split name is set to None when the dataset dict is instantiated as follows:

ds = Dataset.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
local_ds = DatasetDict({"random": ds})

local_ds['random'].split  # returns None

I need to understand how to build a DatasetDict from some Dataset objects to be able to leverage the split parameter in DatasetDict.push_to_hub.

@lhoestq (Member) commented Nov 8, 2021

Cool, thanks! And indeed this won't solve #3035 yet.

I need to understand how to build a DatasetDict from some Dataset objects to be able to leverage the split parameter in DatasetDict.push_to_hub

You can use the key in the DatasetDict instead of the split attribute, for example:
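
A sketch of the idea, assuming Dataset.push_to_hub accepts a split argument as discussed above (the repository name is a placeholder):

from datasets import Dataset, DatasetDict

def push_dataset_dict(dset_dict: DatasetDict, repo_id: str) -> None:
    # Illustrative only: the dict key is used as the split name, so Dataset.split
    # never needs to be populated.
    for key, dataset in dset_dict.items():
        dataset.push_to_hub(repo_id, split=key)

ds = Dataset.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
push_dataset_dict(DatasetDict({"train": ds, "validation": ds}), "my-username/my-dataset")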

@mariosasko (Collaborator) left a comment

@LysandreJik Sorry, I misread the linked PR.

(A further review comment on src/datasets/arrow_dataset.py was marked as outdated/resolved.)
@lhoestq (Member) commented Nov 12, 2021

What do you think about bumping the minimum version of pyarrow to 3.0.0? This is the minimum required version to write Parquet files, which is needed for push_to_hub. That's why our pyarrow 1 CI is failing.

I think it's fine since it's been available for a long time (January 2021) and it's also the version that is installed on Google Colab.
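
For context, a minimal example of the kind of Parquet round-trip that push_to_hub relies on, which needs a reasonably recent pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
pq.write_table(table, "data.parquet")          # serialize a shard as Parquet
print(pq.read_table("data.parquet").num_rows)  # 3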

LysandreJik and others added 12 commits November 22, 2021 07:01
Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Mario Šaško <mario@huggingface.co>
@thomwolf (Member)

Bumping pyarrow to 3.0.0 is fine for me. I don’t think we need to keep a lot of backward support for pyarrow.

@lhoestq (Member) left a comment

This is awesome, thank you so much :)

Now let's add some docs about it; I'm creating a PR now!

@lhoestq lhoestq merged commit 46d5b2f into huggingface:master Nov 24, 2021
@lhoestq lhoestq mentioned this pull request Nov 24, 2021
@piegu commented Dec 7, 2021

Hi.
I posted in the forum about my experience with DatasetDict.push_to_hub(): here is my post.
On my side, there is a problem: my train and validation Datasets are concatenated when I do a load_dataset() from the DatasetDict I pushed to the HF datasets hub.

@lhoestq (Member) commented Dec 8, 2021

Hi! Let me respond here as well in case other people have the same issue and come here:

push_to_hub was introduced in datasets 1.16, and to be able to properly load a dataset with separate splits, you need datasets>=1.16.0 as well.

Old versions of datasets used to concatenate everything into the train split.
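
For example, after upgrading on the loading side (the repository name below is just a placeholder):

# pip install -U "datasets>=1.16.0"
from datasets import load_dataset

dataset = load_dataset("username/my-pushed-dataset")
print(dataset)  # shows the separate train/validation splits instead of a single train split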
