Push to hub capabilities for Dataset and DatasetDict #3098

Merged (17 commits) on Nov 24, 2021

Conversation

@LysandreJik (Member) commented Oct 17, 2021

This PR implements a push_to_hub method on Dataset and DatasetDict. It does not currently work for IterableDatasetDict or IterableDataset, as those are simple dicts, and I would like your opinion on how you would like this implemented before going ahead and doing it.

This implementation needs to be used with the following huggingface_hub branch in order to work correctly: huggingface/huggingface_hub#415

Implementation

The push_to_hub API is entirely based on HTTP requests rather than a git-based workflow:

  • It allows pushing changes without first cloning the repository, which roughly halves the time taken by the push_to_hub method.
  • Collaboration, as well as the system of branches/merges/rebases, is IMO less straightforward for datasets than for models and Spaces. In situations where such collaboration is needed, I would heavily advocate using the Repository helper of huggingface_hub instead of the push_to_hub method, which will always be limiting in that regard by design (even if it were based on a git workflow instead of HTTP requests).

To stay under the 5GB per-file limit imposed by the HTTP requests, the dataset is sharded into several Parquet files before being uploaded (a rough sketch of the idea follows).
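
A minimal sketch of what such sharding could look like, assuming the size of the in-memory Arrow table is a reasonable proxy for the serialized size; the ds.data.nbytes and shard.data.table attribute accesses are assumptions for illustration, not necessarily what the PR does internally:

import math

import pyarrow.parquet as pq
from datasets import Dataset

def shard_and_write(ds: Dataset, shard_size: int = 500 << 20) -> None:
    # Pick a number of shards so that each Parquet file stays under shard_size bytes.
    num_shards = max(1, math.ceil(ds.data.nbytes / shard_size))
    for index in range(num_shards):
        shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
        # shard.data.table is assumed to expose the shard's underlying pyarrow.Table.
        pq.write_table(shard.data.table, f"data-{index:05d}-of-{num_shards:05d}.parquet")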

Testing

The test suite implemented here makes use of moon-staging instead of the production setup. As several repositories are created and deleted, it is better to use the staging environment.

It does not require setting an environment variable or any other special attention, but it introduces a new decorator, with_staging_testing, which patches global variables to use the staging endpoint instead of the production endpoint.
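
A rough sketch of what such a decorator could look like; the staging URL and the attribute being patched are assumptions for illustration, not the PR's actual code:

from functools import wraps
from unittest.mock import patch

ENDPOINT_STAGING = "https://hub-ci.huggingface.co"  # assumed staging endpoint

def with_staging_testing(func):
    # Redirect Hub calls to the staging endpoint for the duration of the test.
    @wraps(func)
    def wrapper(*args, **kwargs):
        with patch("huggingface_hub.hf_api.ENDPOINT", ENDPOINT_STAGING):
            return func(*args, **kwargs)
    return wrapper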

Examples

The tests cover a wide range of examples and behaviors.
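
For illustration, a minimal usage sketch (the repository name is made up and the glue/mrpc dataset is just an example; this is not taken from the PR's tests):

from datasets import load_dataset

dataset = load_dataset("glue", "mrpc")         # a DatasetDict with train/validation/test splits
dataset.push_to_hub("my-username/mrpc-copy")   # pushes every split to the Hub over HTTP

# The pushed dataset can then be reloaded directly from the Hub:
reloaded = load_dataset("my-username/mrpc-copy")
print(reloaded)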

@mariosasko (Collaborator) left a comment

This looks super cool (and clean)! 😎

I have a few minor comments/suggestions:

def push_to_hub(
self,
repo_id: str,
split_name: str,
Collaborator:

Would rename this arg to split (for consistency with the from_* methods) and make it optional: if None, it is set to self.split. Also, with this change, you don't have to explicitly pass the DatasetDict key as a split in DatasetDict.push_to_hub.
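
A sketch of the suggested fallback, for illustration only (not the PR's code):

from typing import Optional

def push_to_hub(self, repo_id: str, split: Optional[str] = None) -> None:
    # When no split is passed explicitly, fall back to the dataset's own split attribute.
    if split is None:
        split = self.split
    ...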

self,
repo_id: str,
split_name: str,
private: Optional[bool] = None,
Collaborator:

I understand this is related to the Hub API design, but IMO it makes more sense to have this arg set to False by default (if I'm not mistaken, private repos are only available to paid accounts). So I'm just wondering: what's the reasoning behind this decision from the API design standpoint (the docs here don't mention it)?

Member Author:

An omission on my part, thank you!

Comment on lines +3339 to +3321
branch: Optional[str] = None,
shard_size: Optional[int] = 500 << 20,
Collaborator:

I think the defaults should be explained in the docstring (same for the DatasetDict.push_to_hub docstring): branch defaults to main, and shard_size defaults to 500MB. A possible wording is sketched after this comment.

A nit: maybe rename shard_size to chunksize for consistency with the JSON and TEXT writers.
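
A possible docstring wording, sketched for illustration only (keeping the shard_size name, which the reply below argues for; this is not the actual docstring):

def push_to_hub(self, repo_id, split=None, private=False, branch=None, shard_size=500 << 20):
    """
    Push the dataset to the Hub as sharded Parquet files (docstring sketch only).

    Args:
        branch (str, optional): Git branch to push the files to.
            Defaults to the repository's default branch ("main").
        shard_size (int, optional): Maximum size in bytes of each uploaded shard.
            Defaults to 500 << 20 (500MB): the dataset is split into as many
            Parquet files as needed so that each stays under this limit.
    """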

Member:

Actually, shard_size is more appropriate IMO, since the dataset actually ends up being sharded into several files, while chunksize refers to chunking parts of a dataset to load them into memory one by one.

Comment on lines 3373 to 3371
identifier = repo_id.split("/")
if len(identifier) == 2:
organization, dataset_name = identifier
else:
dataset_name = identifier[0]
organization = None
Collaborator:

Maybe raise a ValueError if len(identifier) > 2 to avoid later confusion on the user side.
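
A sketch of the suggested guard (illustrative; the function wrapper is only there to make the snippet self-contained):

def parse_repo_id(repo_id: str):
    identifier = repo_id.split("/")
    if len(identifier) > 2:
        # Fail early on malformed ids such as "a/b/c".
        raise ValueError(
            f"Invalid repo_id: {repo_id}. Expected 'dataset_name' or 'organization/dataset_name'."
        )
    if len(identifier) == 2:
        organization, dataset_name = identifier
    else:
        organization, dataset_name = None, identifier[0]
    return organization, dataset_name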

(Several additional review comments on src/datasets/arrow_dataset.py and tests/test_upstream_hub.py were marked as outdated/resolved.)
@LysandreJik (Member Author)

Thank you for your reviews! I believe I have addressed all of your comments, and I added a test to ensure that private datasets work correctly too. I have merged the changes in huggingface_hub, so its main branch can be installed now, and I will release v0.1.0 soon.

As blockers for this PR:

  • It's still waiting for Resolve data_files by split name #3027 to be addressed, as the folder name will dictate the split name.
  • The self.split name is set to None when the dataset dict is instantiated as follows:
ds = Dataset.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
local_ds = DatasetDict({"random": ds})

local_ds['random'].split  # returns None

In order to remove the split=key, I would need a different way to test this, as the test relies on the above as a surefire way of constructing a DatasetDict.

  • Finally, the threading parameter is flaky on moon-staging, which results in many errors server-side. I propose to leave it as an argument instead of having it set to True, so that users may toggle it as they wish (a rough sketch of such a toggle follows this list).
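
A rough sketch of the toggle; upload_fn is a placeholder for whatever performs a single HTTP upload, and none of this is the PR's actual code:

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def push_files(files: Iterable[str], upload_fn: Callable[[str], None], threaded: bool = False) -> None:
    # threaded=False uploads shards one by one, which is safer on a flaky endpoint
    # such as moon-staging; threaded=True parallelizes the uploads.
    if threaded:
        with ThreadPoolExecutor() as pool:
            list(pool.map(upload_fn, files))
    else:
        for path in files:
            upload_fn(path)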

@lhoestq (Member) commented Nov 3, 2021

Currently it looks like it only saves the last split. Indeed, when writing the data of one split, it deletes all the files from the other splits:

>>> dataset.push_to_hub("lhoestq/squad_titles", shard_size=50<<10)           
Pushing split train to the Hub.
Pushing dataset shards to the dataset hub: 100%|| 31/31 [00:22<00:00,  1.38
Pushing split validation to the Hub.
The repository already exists: the `private` keyword argument will be ignored.
Deleting unused files from dataset repository: 100%|| 31/31 [00:14<00:00,  
Pushing dataset shards to the dataset hub: 100%|| 4/4 [00:03<00:00,  1.18it

Note the "Deleting" part.

@mariosasko (Collaborator) commented Nov 3, 2021

I think this PR should fix #3035, so feel free to link it.

@LysandreJik (Member Author)

Thank you for your comments! I have rebased on master to include PR #3221, and I've updated all tests to reflect the - instead of the _ in the filenames.

@lhoestq, I have fixed the issue with splits and added a corresponding test.

@mariosasko I have not updated the load_dataset method to work differently, so I don't expect #3035 to be resolved with push_to_hub.

Only remaining issues before merging:

  • Take a good look at the threading and decide whether that's something we want to keep.
  • As mentioned above:

The self.split name is set to None when the dataset dict is instantiated as follows:

ds = Dataset.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
local_ds = DatasetDict({"random": ds})

local_ds['random'].split  # returns None

I need to understand how to build a DatasetDict from some Dataset objects to be able to leverage the split parameter in DatasetDict.push_to_hub.

@lhoestq (Member) commented Nov 8, 2021

Cool, thanks! And indeed this won't solve #3035 yet.

I need to understand how to build a DatasetDict from some Dataset objects to be able to leverage the split parameter in DatasetDict.push_to_hub

You can use the key in the DatasetDict instead of the split attribute, for example:
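
A sketch of the idea, assuming Dataset.push_to_hub accepts a split argument as discussed above (the repository name is a placeholder):

from datasets import Dataset, DatasetDict

def push_dataset_dict(dset_dict: DatasetDict, repo_id: str) -> None:
    # Illustrative only: the dict key is used as the split name, so Dataset.split
    # never needs to be populated.
    for key, dataset in dset_dict.items():
        dataset.push_to_hub(repo_id, split=key)

ds = Dataset.from_dict({"x": [1, 2, 3], "y": [4, 5, 6]})
push_dataset_dict(DatasetDict({"train": ds, "validation": ds}), "my-username/my-dataset")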

@mariosasko (Collaborator) left a comment

@LysandreJik Sorry, I misread the linked PR.

(A further review comment on src/datasets/arrow_dataset.py was marked as outdated/resolved.)
@lhoestq (Member) commented Nov 12, 2021

What do you think about bumping the minimum version of pyarrow to 3.0.0? This is the minimum required version to write Parquet files, which is needed for push_to_hub. That's why our pyarrow 1 CI is failing.

I think it's fine since it's been available for a long time (January 2021) and it's also the version that is installed on Google Colab.
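
For context, a minimal example of the kind of Parquet round-trip that push_to_hub relies on, which needs a reasonably recent pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3], "y": [4, 5, 6]})
pq.write_table(table, "data.parquet")          # serialize a shard as Parquet
print(pq.read_table("data.parquet").num_rows)  # 3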

LysandreJik and others added 12 commits November 22, 2021 07:01
Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Mario Šaško <mario@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Mario Šaško <mario@huggingface.co>
@thomwolf (Member)

Bumping pyarrow to 3.0.0 is fine for me. I don’t think we need to keep a lot of backward support for pyarrow.

@lhoestq (Member) left a comment

This is awesome, thank you so much :)

Now let's add some docs about it; I'm creating a PR now!

@lhoestq lhoestq merged commit 46d5b2f into huggingface:master Nov 24, 2021
@lhoestq lhoestq mentioned this pull request Nov 24, 2021
@piegu commented Dec 7, 2021

Hi.
I posted in the forum about my experience with DatasetDict.push_to_hub(): here is my post.
On my side, there is a problem: my train and validation Datasets are concatenated when I do a load_dataset() from the DatasetDict I pushed to the HF datasets hub.

@lhoestq (Member) commented Dec 8, 2021

Hi! Let me respond here as well in case other people have the same issue and come here:

push_to_hub was introduced in datasets 1.16, and to be able to properly load a dataset with separate splits, you need datasets>=1.16.0 as well.

Old versions of datasets used to concatenate everything into the train split.
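
For example, after upgrading on the loading side (the repository name below is just a placeholder):

# pip install -U "datasets>=1.16.0"
from datasets import load_dataset

dataset = load_dataset("username/my-pushed-dataset")
print(dataset)  # shows the separate train/validation splits instead of a single train split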
