
Pushing dataset to hub crash #5672

Closed

tzvc opened this issue Mar 26, 2023 · 3 comments
tzvc commented Mar 26, 2023

Describe the bug

Uploading a dataset with push_to_hub() fails with a bare StopIteration and no description of what went wrong.

Steps to reproduce the bug

Hey there,

I've built an image dataset of 100k image + text pairs as described here: https://huggingface.co/docs/datasets/image_dataset#imagefolder
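
The data directory follows the ImageFolder layout from that guide, roughly like this (file names here are just illustrative):

data/
├── metadata.csv    # columns: file_name, text
├── 0001.png
├── 0002.png
└── ...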

Now I'm trying to push it to the Hub, but I'm running into issues. First, I tried doing it via git directly: I added all the files with git lfs and pushed, but I got hit with an error saying Hugging Face only accepts up to 10k files per folder.

So I'm now trying the push_to_hub() function, as follows:

from datasets import load_dataset
import os

dataset = load_dataset("imagefolder", data_dir="./data", split="train")
dataset.push_to_hub("tzvc/organization-logos", token=os.environ.get('HF_TOKEN'))

But again, this produces an error:

Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 100212/100212 [00:00<00:00, 439108.61it/s]
Downloading and preparing dataset imagefolder/default to /home/contact_theochampion/.cache/huggingface/datasets/imagefolder/default-20567ffc703aa314/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f...
Downloading data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 100211/100211 [00:00<00:00, 149323.73it/s]
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 15947.92it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2245.34it/s]
Dataset imagefolder downloaded and prepared to /home/contact_theochampion/.cache/huggingface/datasets/imagefolder/default-20567ffc703aa314/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f. Subsequent calls will reuse this data.
Resuming upload of the dataset shards.                                                                                                                                        
Pushing dataset shards to the dataset hub: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:31<00:00,  2.24s/it]
Downloading metadata: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 118/118 [00:00<00:00, 225kB/s]
Traceback (most recent call last):
  File "/home/contact_theochampion/organization-logos/push_to_hub.py", line 5, in <module>
    dataset.push_to_hub("tzvc/organization-logos", token=os.environ.get('HF_TOKEN'))
  File "/home/contact_theochampion/.local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 5245, in push_to_hub
    repo_info = dataset_infos[next(iter(dataset_infos))]
StopIteration

What could be happening here?

Expected behavior

The dataset is pushed to the Hub.

Environment info

  • datasets version: 2.10.1
  • Platform: Linux-5.10.0-21-cloud-amd64-x86_64-with-glibc2.31
  • Python version: 3.9.2
  • PyArrow version: 11.0.0
  • Pandas version: 1.5.3
lhoestq (Member) commented Mar 27, 2023

Hi! It's been fixed by #5598. We're doing a new release tomorrow with the fix, and you'll be able to push your 100k images ;)

Basically, push_to_hub() used to fail if the remote repository already existed and had a README.md without dataset_info in its YAML tags.
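
In other words (a minimal sketch of the failing pattern, not the actual library code): with no dataset_info in the existing README.md, the metadata push_to_hub reads back is effectively an empty mapping, so the line shown in the traceback has nothing to iterate over:

# Illustrative only: reproduces the StopIteration from the traceback above.
dataset_infos = {}  # assumes the existing README.md yields no dataset_info entries
repo_info = dataset_infos[next(iter(dataset_infos))]  # next() on an empty iterator raises StopIteration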

In the meantime, you can install datasets from source.
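
For example (the exact command isn't given here; this is the standard way to install the development version from GitHub):

pip install git+https://github.com/huggingface/datasets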

bukosabino commented:

Hi @lhoestq,

Which version of the datasets library fixes this case? I am using the latest, v2.10.1, and I get the same error.

lhoestq (Member) commented Mar 29, 2023

We just released 2.11 which includes a fix :)
