grammarly pass
Pierrci committed Aug 16, 2023
1 parent 81847ff commit 4c2bb66
Showing 1 changed file with 21 additions and 19 deletions: docs/source/guides/upload.md
For more detailed information, take a look at the [`HfApi`] reference.
## Tips and tricks for large uploads
There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data,
having an upload/push fail at the end of the process or hitting a degraded experience, be it on hf.co or when working locally, can be very frustrating.
We gathered a list of tips and recommendations for structuring your repo.
| Characteristic     | Recommended | Tips                                                    |
| ------------------ | ----------- | ------------------------------------------------------- |
| Repo size          | -           | contact us for large repos (TBs of data)                |
| Files per repo     | <100k       | merge data into fewer files                             |
| Entries per folder | <10k        | use subdirectories in repo                              |
| File size          | <5GB        | split data into chunked files                           |
| Commit size        | <100 files* | upload files in multiple commits                        |
| Commits per repo   | -           | upload multiple files per commit and/or squash history  |
_* Not relevant when using `git` CLI directly_
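As a quick illustration of the HTTP-based workflow these tips apply to, here is a minimal sketch using [`HfApi`]. It assumes you are already authenticated (e.g. with `huggingface-cli login`); the repo ID and folder path are placeholders:

```py
from huggingface_hub import HfApi

api = HfApi()

# Upload a whole local folder over HTTP (no local git clone needed).
# Both `repo_id` and `folder_path` below are placeholders for your own values.
api.upload_folder(
    repo_id="username/my-dataset",
    repo_type="dataset",
    folder_path="path/to/local/folder",
)
```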
Please read the next section to better understand those limits and how to deal with them.
### Hub repository size limitations
Under the hood, the Hub uses Git to version the data, which has structural implications for what you can do in your repo.
If your repo crosses some of the numbers mentioned in the previous section, **we strongly encourage you to check out [`git-sizer`](https://github.com/github/git-sizer)**,
which has very detailed documentation about the different factors that will impact your experience. Here is a TL;DR of factors to consider:
- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on a Hub repository size. However, if you plan to upload hundreds of GBs or even TBs of data, we would appreciate it if you could let us know in advance so we can better help you if you have any questions during the process. You can contact us at datasets@huggingface.co or on [our Discord](http://hf.co/join/discord).
- **Number of files**:
- For an optimal experience, we recommend keeping the total number of files under 100k. Try merging the data into fewer files if you have more.
For example, JSON files can be merged into a single JSONL file, or large datasets can be exported as Parquet files.
- A folder cannot contain more than 10k files. A simple solution is to
create a repository structure that uses subdirectories. For example, a repo with 1k folders from `000/` to `999/`, each containing at most 1000 files, is already enough.
- **File size**: In the case of uploading large files (e.g. model weights), we strongly recommend splitting them **into chunks of around 5GB each**.
There are a few reasons for this:
- Uploading and downloading smaller files is much easier both for you and the other users. Connection issues can always
happen when streaming data, and smaller files let you avoid restarting from the beginning after an error.
- Files are served to the users using CloudFront. From our experience, huge files are not cached by this service,
leading to slower download speeds.
- **Number of commits**: There is no hard limit for the total number of commits on your repo history. However, from
our experience, the user experience on the Hub starts to degrade after a few thousand commits. We are constantly working to
improve the service, but one must always remember that a git repository is not meant to work as a database with a lot of
writes. If your repo's history gets very large, it is always possible to squash all the commits to get a
fresh start.
- **Number of operations per commit**: Once again, there is no hard limit here. When a commit is uploaded on the Hub, each
git operation (addition or deletion) is checked by the server. When a hundred LFS files are committed at once,
each file is checked individually to ensure it's been correctly uploaded. When pushing data through HTTP with `huggingface_hub`,
a timeout of 60s is set on the request, meaning that if the process takes more time, an error is raised
client-side. However, it can happen (in rare cases) that even if the timeout is raised client-side, the process is still
completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend
adding around 50-100 files per commit (see the sketch just below).
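To make the last point concrete, here is one possible way, sketched under the assumption that your data lives in local Parquet files, to batch additions into commits of about 100 operations each with [`create_commit`]; the repo ID and paths are placeholders:

```py
from pathlib import Path

from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()
repo_id = "username/my-dataset"  # placeholder: your repo on the Hub
files = sorted(Path("data").rglob("*.parquet"))  # placeholder: your local files

# Commit in batches of ~100 operations so each request stays well below the 60s timeout.
batch_size = 100
for start in range(0, len(files), batch_size):
    operations = [
        CommitOperationAdd(path_in_repo=path.as_posix(), path_or_fileobj=str(path))
        for path in files[start : start + batch_size]
    ]
    api.create_commit(
        repo_id=repo_id,
        repo_type="dataset",
        operations=operations,
        commit_message=f"Upload batch {start // batch_size}",
    )
```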
### Practical tips
Now that we've seen the technical aspects you must consider when structuring your repository, let's go over some practical
tips to make your upload process as smooth as possible.
- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate
on a script when failures cost only a little time.
- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always
best to assume that something will fail at least once, whether it's due to your machine, your connection, or our
servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you
have already uploaded before uploading the next batch (see the sketch after this list). An LFS file that is already
committed is guaranteed never to be re-uploaded, but checking client-side can still save some time.
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up
uploads on machines with very high bandwidth. To use it, you must install it (`pip install hf_transfer`) and enable it
by setting `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable. You can then use `huggingface_hub` normally.
Disclaimer: this is a power user tool. It is tested and production-ready but lacks user-friendly features like progress
bars or advanced error handling.
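As an illustration of the **Expect failures** tip above, here is a minimal sketch (not an official helper) that records each successfully uploaded file in a local JSON manifest so an interrupted run can resume where it left off; the repo ID and paths are placeholders:

```py
import json
from pathlib import Path

from huggingface_hub import HfApi

api = HfApi()
repo_id = "username/my-dataset"   # placeholder: your repo on the Hub
manifest = Path("uploaded.json")  # local record of files already uploaded
done = set(json.loads(manifest.read_text())) if manifest.exists() else set()

for path in sorted(Path("data").rglob("*")):
    if not path.is_file() or str(path) in done:
        continue  # skip directories and files uploaded in a previous run
    api.upload_file(
        repo_id=repo_id,
        repo_type="dataset",
        path_or_fileobj=str(path),
        path_in_repo=path.as_posix(),
    )
    done.add(str(path))
    manifest.write_text(json.dumps(sorted(done)))  # persist progress after each file
```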
## (legacy) Upload files with Git LFS
All the methods described above use the Hub's API to upload files. This is the recommended way to upload files to the Hub.
However, we also provide [`Repository`], a wrapper around the git tool to manage a local repository.
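For reference, here is a minimal sketch of this legacy git-based flow; the repo name and local directory are placeholders, and the repo is assumed to already exist on the Hub:

```py
from huggingface_hub import Repository

# Clone the repo (or reuse an existing checkout) into a local directory.
repo = Repository(local_dir="my-model", clone_from="username/my-model")

# Copy your files into `my-model/`, then add, commit, and push them via git/git-lfs.
repo.push_to_hub(commit_message="Add model weights")
```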
<Tip warning={true}>
