Tips to upload large models/datasets #1565

Merged
merged 11 commits into from
Aug 16, 2023

Conversation

Wauplin
Contributor

@Wauplin Wauplin commented Jul 20, 2023

We've talked several times about having a section in the docs with tips for users who want to upload a large amount of data. I tried to sum everything up under two distinct aspects:

  • technical limitations (max size per file, max files per commit, max files per folder, max files per repo)
  • practical tips (start small, use hf_transfer, expect failures)

I also reorganized the Upload guide a bit. It's becoming quite a long page, but I still think it makes sense to have everything in one place.

@Wauplin Wauplin added the documentation Improvements or additions to documentation label Jul 20, 2023
@Wauplin
Contributor Author

Wauplin commented Jul 20, 2023

cc @Pierrci @coyotte508 can you check that the info is accurate? I took it from #995 (comment) but would prefer to double-check that it's not outdated (it's already 1y old 😄)

cc @lhoestq might be interested for datasets users as well

Member

@lhoestq lhoestq left a comment

Awesome !

Member

@stevhliu stevhliu left a comment

Awesome! Once merged, it may be nice to link to this from the Datasets docs as well 😄

@Wauplin
Contributor Author

Wauplin commented Jul 20, 2023

Thanks for the review and valuable feedback @stevhliu @lhoestq! I'll update the PR to make the short list more explicit about what is a hard limit and what is a recommendation. I'll also move it up in the section :)

Contributor

@osanseviero osanseviero left a comment

Cool stuff! 🔥

@@ -371,11 +376,89 @@ In addition to [`upload_file`] and [`upload_folder`], the following functions al

For more detailed information, take a look at the [`HfApi`] reference.

-## Push files with Git LFS
+## Tips and tricks for large uploads
Contributor

Let's be careful to not break this url

Contributor Author

I would not be too worried about it, to be honest. I've checked and #upload-files-with-git-lfs is referenced nowhere in our internal docs (both hfh and hf_docs). That doesn't mean such a URL doesn't exist in the wild, but I would expect the probability to be quite low. And even if it does, users will still land on the correct page, just not on the correct section. Since Git LFS upload is mostly deprecated, it should be fine.

Member

(btw it's quite easy to add backward compat for a particular url anchor if needed)

Wauplin and others added 3 commits July 21, 2023 11:44
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
@Wauplin
Contributor Author

Wauplin commented Jul 21, 2023

I have addressed all the comments:

  • merged the PR suggestions
  • suggested contacting us via Discord or datasets@huggingface.co (I expect repos with TBs of data to be 99% datasets)
  • compiled the technical recommendations in a table, as suggested by @stevhliu:

[image: table of technical recommendations]

@stevhliu Could you take a second look at the docs and check that everything looks good? Thanks in advance!

@@ -371,11 +377,95 @@ In addition to [`upload_file`] and [`upload_folder`], the following functions al

For more detailed information, take a look at the [`HfApi`] reference.

-## Push files with Git LFS
+## Tips and tricks for large uploads
Member

are we sure we don't want to put that content on its own doc page? (no strong opinion)

Contributor Author

No strong opinion either. I thought the content was a bit light to get its own page

Member

Actually, I was initially looking for this PR in hub-docs, so yeah it could make sense haha, but it seems to me that some of the advice and limits are specific to huggingface_hub (or at least to uploads through the HTTP API), so we can keep it here, at least for now

@julien-c julien-c requested a review from Pierrci July 21, 2023 14:35
Member

@julien-c julien-c left a comment

let's wait for @Pierrci's quick review before merging please 🙏

Wauplin and others added 3 commits July 21, 2023 17:29
Co-authored-by: Julien Chaumond <julien@huggingface.co>
…e/huggingface_hub into tutorial-upload-large-dataset
Member

@stevhliu stevhliu left a comment

Nice, love the additional columns you added to the table! 🤗

@Wauplin
Contributor Author

Wauplin commented Jul 21, 2023

Good, thanks @stevhliu! I made the small change.

Final review from @Pierrci and we are good to go! 🎉

Member

@Pierrci Pierrci left a comment

I was finally able to take a look 😄

I don't think we should advertise hard limits, if you reach them it means your repo is going to be very hard to manage (both for the user and for us); we really want to steer people as much as we can toward numbers lower than those. We also want to keep some leeway and be able to change those limits at our discretion to protect our infra if needed.

So I pushed suggestions that advertise recommendations instead - and as for any recommendations, you can choose to ignore them if you're a player, but sometimes it's not a good idea, particularly if you go way above them :)

Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
@Wauplin
Contributor Author

Wauplin commented Aug 16, 2023

Thanks for the review @Pierrci! I completely understand your point about not committing too much to hard limits. I have applied your recommendations. Could you have a final look so we can merge? :)

@Pierrci
Member

Pierrci commented Aug 16, 2023

Thanks @Wauplin! I did a Grammarly pass and pushed additional changes (the VSCode extension is really nice for that!), everything LGTM on my end!

@Wauplin
Contributor Author

Wauplin commented Aug 16, 2023

Perfect! Thanks for the final pass, I'm finally merging this PR 😄

@Wauplin Wauplin merged commit 9d418ae into main Aug 16, 2023
4 checks passed
@Wauplin Wauplin deleted the tutorial-upload-large-dataset branch August 16, 2023 15:38
For example, JSON files can be merged into a single JSONL file, or large datasets can be exported as Parquet files.
- The maximum number of files per folder cannot exceed 10k. A simple solution is to
create a repository structure that uses subdirectories. For example, a repo with 1k folders from `000/` to `999/`, each containing at most 1000 files, is already enough.
- **File size**: In the case of uploading large files (e.g. model weights), we strongly recommend splitting them **into chunks of around 5GB each**.
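
Two of the restructuring ideas in the excerpt above (merging JSON files into one JSONL file, and spreading files over numbered subfolders) could look roughly like this. It is a minimal sketch using only the standard library; `merge_json_to_jsonl` and `sharded_path` are hypothetical helper names, not part of `huggingface_hub`, and the sketch assumes each JSON file holds a single record.

```python
import json
from pathlib import Path


def merge_json_to_jsonl(src_dir, out_path):
    """Write every *.json file in `src_dir` as one line of a single JSONL file."""
    files = sorted(Path(src_dir).glob("*.json"))
    with open(out_path, "w", encoding="utf-8") as out:
        for path in files:
            record = json.loads(path.read_text(encoding="utf-8"))
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(files)


def sharded_path(index, filename, files_per_folder=1000):
    """Map the index-th file to a numbered subfolder such as `000/` or `999/`."""
    return f"{index // files_per_folder:03d}/{filename}"
```

With the default of 1000 files per folder, indices 0 to 999,999 map onto the folders `000/` to `999/`, which matches the layout described in the excerpt.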
Member

late to the party but i would have done larger, e.g. 20GB or at least 10GB (Cloudfront caches up to 30GB if i'm not mistaken)

@Pierrci @huggingface/moon-landing-back

Member

yes, but I don't really see the interest; as mentioned just below in the doc splitting into small chunks is better for uploading/downloading and retries, while I'm not sure there are a lot of advantages in doing chunks of 30GB?
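
To make the chunk-size discussion concrete, here is a minimal sketch of splitting one large file into parts of roughly 5GB. It is an illustration only: `split_file` is a hypothetical helper, byte-level splitting is an assumption made for the example (model weights are usually sharded by the framework that saves them), and 5GB is taken as decimal gigabytes.

```python
from pathlib import Path


def split_file(path, chunk_size=5 * 1000**3, buffer_size=64 * 1024**2):
    """Split `path` into numbered parts of at most `chunk_size` bytes each."""
    path = Path(path)
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            part_path = path.with_name(f"{path.name}.part{index:04d}")
            written = 0
            with open(part_path, "wb") as dst:
                # Copy in small blocks so a whole chunk never sits in memory.
                while written < chunk_size:
                    block = src.read(min(buffer_size, chunk_size - written))
                    if not block:
                        break
                    dst.write(block)
                    written += len(block)
            if written == 0:
                part_path.unlink()  # nothing left to write: drop the empty part
                break
            parts.append(part_path)
            index += 1
    return parts
```

Smaller parts mean that a failed transfer only costs one part's worth of retry, which is the trade-off raised in the comment above.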

@severo
Collaborator

severo commented Aug 24, 2023

@julien-c about https://discuss.huggingface.co/t/is-there-a-size-limit-for-dataset-hosting/14861/13?u=severo, we had the information in a previous version of this PR (see the table above: #1565 (comment)).

Should we add it again?

@julien-c
Member

i think we didn't want to do a table with "recommended limits" and "hard limits", so maybe just add a sentence like:

in all cases no single LFS file will be able to be >50GB. I.e. 50GB is the hard limit for single file size.
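
A pre-upload check against the 50GB figure quoted above could be sketched as follows. This is an assumption-laden illustration, not a Hub API: `classify_files` is a hypothetical helper, both thresholds are taken from this thread, and gigabytes are assumed to be decimal.

```python
from pathlib import Path

HARD_LIMIT_BYTES = 50 * 1000**3  # 50GB hard limit quoted above (decimal GB assumed)
RECOMMENDED_BYTES = 5 * 1000**3  # ~5GB chunks recommended in the guide


def classify_files(folder, hard_limit=HARD_LIMIT_BYTES, recommended=RECOMMENDED_BYTES):
    """Return (too_large, worth_splitting) lists of file paths under `folder`."""
    too_large, worth_splitting = [], []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        size = path.stat().st_size
        if size > hard_limit:
            too_large.append(path)  # over the hard limit
        elif size > recommended:
            worth_splitting.append(path)  # allowed, but above the recommendation
    return too_large, worth_splitting
```

Running such a check before starting a long upload surfaces oversized files early instead of partway through the transfer.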

@severo
Collaborator

severo commented Aug 25, 2023

OK: #1624

Labels
documentation Improvements or additions to documentation
9 participants