🚩 Scan cache tool: ability to free up space #1025

Closed
Wauplin opened this issue Sep 1, 2022 · 32 comments
Labels: enhancement (New feature or request)

@Wauplin
Contributor

Wauplin commented Sep 1, 2022

Originally from @stas00 in slack (internal link):

for me the main query / need is usually to free up some disk space, so I'd look at the top entries of these 3 groups:

  1. nuking the largest entries (obvious)
  2. nuking the entries not accessed for a long time (sorted by access time)
  3. maybe also nuking the oldest entries (probably not using them anymore), i.e. sorted by file creation time - but most likely 2. would have already caught this category

In general, this is a feature request already discussed in #990. The "pruning" approach we are aiming for is to give the user information plus a tool to delete a specific revision, and then make it as easy as possible for the user to define their own strategy.
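As a concrete illustration of such a user-defined strategy, a minimal Python sketch on top of the existing scan-cache report could look like this (the last_accessed/last_modified timestamps are assumptions - the scan report may not expose them):

from huggingface_hub import scan_cache_dir

report = scan_cache_dir()
repos = list(report.repos)

# 1. largest entries first
largest = sorted(repos, key=lambda r: r.size_on_disk, reverse=True)
# 2. entries not accessed for the longest time first (assumed field)
stale = sorted(repos, key=lambda r: r.last_accessed)
# 3. oldest entries first (assumed field)
oldest = sorted(repos, key=lambda r: r.last_modified)

for repo in largest[:5]:
    print(f"{repo.size_on_disk / 1e9:6.1f}GB  {repo.repo_type}/{repo.repo_id}")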

@Wauplin Wauplin added the enhancement New feature or request label Sep 1, 2022
@Wauplin
Contributor Author

Wauplin commented Sep 1, 2022

Related to #1013. However, for now we lean more towards a "let the user decide" approach than towards automatically deleting things.

@julien-c
Member

julien-c commented Sep 1, 2022

maybe a combination of --limit 5 and --delete in the CLI to actually rm the corresponding files?

@Wauplin Wauplin added this to the v0.10 milestone Sep 1, 2022
@stas00
Contributor

stas00 commented Sep 1, 2022

I think a very useful approach would be to be able to sort and dump ready-to-run delete commands, as in:

rm -rf ~/.cache/huggingface/hub/models--valhalla--t5-base-squad
rm -rf ~/.cache/huggingface/hub/models--nielsr--deformable-detr
rm -rf ~/.cache/huggingface/hub/models--gpt2-large
[...]

That way it's trivial for a user to copy-and-paste, and also to edit the list if they don't want to delete something.

or perhaps this format, so that the decision-making data is close by?

# 12.2GB
rm -rf ~/.cache/huggingface/hub/models--valhalla--t5-base-squad
# 10GB
rm -rf ~/.cache/huggingface/hub/models--nielsr--deformable-detr
# 5GB
rm -rf ~/.cache/huggingface/hub/models--gpt2-large
[...]

or, even better, more detailed but still shell-friendly:

# 12.2GB / last used 2 months ago
rm -rf ~/.cache/huggingface/hub/models--valhalla--t5-base-squad
# 10GB / last used 2 days ago
rm -rf ~/.cache/huggingface/hub/models--nielsr--deformable-detr
# 5GB / last used 1 year
rm -rf ~/.cache/huggingface/hub/models--gpt2-large
[...]

Since it would already be sorted by size or access time, I could dump the output into an editor, massage it to remove what I want to keep, and then run it, thus cleaning the cache in one swoop.
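A rough sketch of how such an annotated command list could be generated from the existing scan report (repo_path and size_on_disk follow the scan-cache tool; the last_accessed timestamp is an assumption):

import time
from huggingface_hub import scan_cache_dir

report = scan_cache_dir()
for repo in sorted(report.repos, key=lambda r: r.size_on_disk, reverse=True):
    days = (time.time() - repo.last_accessed) / 86400
    print(f"# {repo.size_on_disk / 1e9:.1f}GB / last used {days:.0f} days ago")
    print(f"rm -rf {repo.repo_path}")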

@Wauplin
Contributor Author

Wauplin commented Sep 2, 2022

@stas00 I really like the idea of outputting a list of commands that the user can edit manually before deleting.

However, I'm not sure we can do that with a list of rm -rf ... commands, as deleting specific revisions is a bit more complex than deleting an entire folder. The rm -rf ... approach works perfectly fine if someone wants to delete entire repos ("I don't need gpt2-large", for example) but I think a big use case we have is people wanting to delete all past revisions except main.

To delete a revision, we need to delete the corresponding snapshot folder containing the symlinks, but also the blobs that were referenced only by this revision (and not by the others). So doing it with rm -rf ... commands would require outputting a potentially giant command with all blob files listed one by one, which is impractical for a dataset repo with thousands of files.

We can still keep the idea of printing a list to edit, but maybe it could be a list of commit hashes:

# 12.2GB
huggingface_cli scan-cache --delete e2983b237dccf3ab4937c97fa717319a9ca1a96d 6c0e6080953db56375760c0471a8c5f2929baf11
# 10GB
huggingface_cli scan-cache --delete 0e4ab15e8865b843b3a348275facfd198cf22f03
# 5GB
huggingface_cli scan-cache --delete 60b8d3fe22aebb024b573f1cca224db3126d10f3
[...]

@Wauplin
Contributor Author

Wauplin commented Sep 2, 2022

About the last access time, I am not sure how reliable it is across OSes. Download time can be a more reliable indicator but is less interesting (cf. @julien-c's example: gpt2 is a few years old in his cache but still useful to keep). Also, access time can "easily" be determined for a full repo but is much harder to determine for specific revisions since they share the same blobs.

I'm not 100% against the idea, I'm just concerned about how we can make this information useful to users.

@stas00
Contributor

stas00 commented Sep 2, 2022

Oh, right! I didn't think of the revisions!

This is super important!

So actually, a lot of the time the user will want to wipe out all outdated revisions except the current one.

Should we provide an API for deletion then?

huggingface_cli delete-cache entry-name <revision name | all>

or something like that?

delete all:

huggingface_cli delete-cache models--nielsr--deformable-detr

delete all revisions but last one:

huggingface_cli delete-cache --keep-last models--nielsr--deformable-detr 

delete specific revision:

huggingface_cli delete-cache models--nielsr--deformable-detr 0e4ab15e8865b843b3a348275facfd198cf22f03

and this also means that the scan should print out a breakdown of the space each revision is taking, no?

@stas00
Contributor

stas00 commented Sep 2, 2022

re: access time

We could start with just full repo access time - and combined with --keep-last that should work really well.

So if someone asked me to test some code with their repo and I don't normally use it, I could easily wipe it out.

I'd do it in 2 steps:

  1. delete all revisions that aren't current - across the whole cache:
huggingface_cli delete-cache --all --keep-last

note: I added another flag, --all, which I didn't propose in my comment above.

  2. now run scan-cache and sort by last accessed

now I can delete repos I don't really use

How is that?

@Wauplin
Contributor Author

Wauplin commented Sep 5, 2022

Thanks for all the ideas :)

I'm starting to think that we should keep your idea of generating editable commands, to keep it as simple and flexible as possible. What I'm afraid of with all the --all, --keep-last, --sort, --filter, --limit, etc. arguments is that we will end up with quite a complex CLI that still doesn't cover absolutely all cases (and therefore we would get feature requests for specific cases).

A realistic possibility I see is to add a --dry-run mode to generate the list of revisions that would be deleted. We can split the work into different parts:

1. --dry-run to output revisions to delete

The --dry-run mode outputs a human-readable/human-editable list of revisions. Every repo is listed with its download time, last access time and size. For each repo, we list all the revision hashes and their corresponding refs (or (detached)). Revisions that are planned to be kept are commented out, while revisions to delete are not.

In a --keep-last example, the revision associated with main is commented out, followed by all revisions that will be deleted:

huggingface_cli delete-cache --keep-last --dry-run

## Model valhalla/t5-base-squad
## Downloaded 1 year ago. Last used 2 months ago
## Total size: 12.2G. Would delete 7.4G.
# 3f4b5290db15d7d3821c190acef6fa7c40069da1 # main, 0.4.1
e2983b237dccf3ab4937c97fa717319a9ca1a96d # refs/pr/1, 0.2.0
6c0e6080953db56375760c0471a8c5f2929baf11 # (detached)

## Model nielsr/deformable-detr
## Downloaded 2 days ago. Last used 2 days ago
## Total size: 10G. Would delete 0.
# 0e4ab15e8865b843b3a348275facfd198cf22f03  # main

## Dataset SLPL/naab
## Downloaded 1 year ago. Last used 5 days ago
## Total size: 7.2G. Would delete 5G.
# 60b8d3fe22aebb024b573f1cca224db3126d10f3 # main
a5ae8a094021fa4094b270166440d4a165939446 # (detached)

[...]
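For what it's worth, generating such a listing from the scan report could look roughly like this (a sketch only: scan_cache_dir() is the existing scanner, the refs/commit_hash/size_on_disk fields are assumed, and "keep-last" is simplified here to "keep whatever main points to"):

from huggingface_hub import scan_cache_dir

report = scan_cache_dir()
for repo in report.repos:
    print(f"## {repo.repo_type.capitalize()} {repo.repo_id}")
    print(f"## Total size: {repo.size_on_disk / 1e9:.1f}G")
    for rev in repo.revisions:
        refs = ", ".join(sorted(rev.refs)) or "(detached)"
        # revisions kept by --keep-last (pointed to by `main`) are commented out
        prefix = "# " if "main" in rev.refs else ""
        print(f"{prefix}{rev.commit_hash} # {refs}")
    print()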

2. Pass a list of revisions to delete-cache

The above list can be manually edited and then passed back to the delete-cache command. The command should be able to parse the list of revisions to delete by splitting on the "#" correctly.

eval("huggingface_cli delete-cache --keep-last --dry-run") > revisions.txt
# or "huggingface_cli delete-cache --keep-last --dry-run --export=revisions.txt" ?

(...) # edit revisions.txt manually

cat revisions.txt | huggingface_cli delete-cache
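Parsing the edited file back is then mostly a matter of stripping comments; a minimal sketch, assuming the format drafted above:

def parse_revisions(path):
    """Return the revision hashes left uncommented in the edited dry-run file."""
    hashes = []
    with open(path) as f:
        for line in f:
            # drop full-line comments and the inline "# refs" annotations
            line = line.split("#", 1)[0].strip()
            if line:
                hashes.append(line)
    return hashes

# hashes = parse_revisions("revisions.txt")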

2.b. Optionally, pass revision hashes manually

# Single revision
huggingface_cli delete-cache a5ae8a094021fa4094b270166440d4a165939446

# Or multiple revisions
huggingface_cli delete-cache e2983b237dccf3ab4937c97fa717319a9ca1a96d 6c0e6080953db56375760c0471a8c5f2929baf11

3. Bonus: more options

I put these options as "bonus" since they don't need to exist for the previous steps to work. I think we should start with --dry-run and --keep-last and then discuss more complex options in separate issues/PRs.

  • --keep-last to keep only main revision
huggingface_cli delete-cache --keep-last
  • --filter="repo_id_pattern" to filter repos
huggingface_cli delete-cache --filter="allenai/*"
  • --sort=key to sort repos and --limit=5 to truncate the output
huggingface_cli delete-cache --sort="total_size" --limit=5

--filter, --sort and --limit are typically arguments that we want to be common to both scan-cache and delete-cache.

What do you think about this proposal?


Note:

  • Revision hashes are unique across repos, so there is no need to provide "repo_type + repo_id + revision hash" to describe a revision; the revision hash is enough.
  • We would need docs for Windows users as well. Absolutely no idea how pipes work there.

@stas00
Contributor

stas00 commented Sep 6, 2022

That would work, thank you for the super-detailed preview of how it might work, @Wauplin

I think I'd rather have --dry-run generate the actual delete commands, so the user can either edit down the output and run it directly or simply copy-and-paste the specific non-dry deletes they want - more versatility, no?

So:

huggingface_cli delete-cache --keep-last --dry-run > cmds.txt
# edit cmds.txt and when happy execute to purge
sh cmds.txt
# or copy-n-paste from cmds.txt and execute just specific purge commands

--keep-last would of course not generate any delete commands if the cache entry only has the latest revision, correct?

And of course, eventually it'd be great to have a GUI tool where one can quickly select what's wanted/unwanted and delete it in one click - something like https://en.wikipedia.org/wiki/Disk_Usage_Analyzer, which also visually shows which entries are bigger. But of course, this can happen much later (and the difficulty would be making it cross-platform).

eval("huggingface_cli delete-cache --keep-last --dry-run") > revisions.txt

what is the eval for? won't just huggingface_cli delete-cache --keep-last --dry-run > revisions.txt work?

We would need to have doc for windows users as well. Absolutely no idea how pipes are working there.

on Windows it'd depend on the shell they use - some work like bash, others are more limited.

and my proposal includes no pipes.

@Wauplin
Contributor Author

Wauplin commented Sep 7, 2022

I think I'd rather have --dry-run generate the actual delete commands, so the user can either edit down the output and run it directly or simply copy-and-paste the specific non-dry deletes they want - more versatility, no?

I see the idea, but I am not sure what the cmds.txt content would look like. Do you see it as a list of huggingface_cli delete-cache commands like this:

## Model valhalla/t5-base-squad
## Downloaded 1 year ago. Last used 2 months ago
## Total size: 12.2G. Would delete 7.4G.
# huggingface_cli delete-cache 3f4b5290db15d7d3821c190acef6fa7c40069da1 # main, 0.4.1
huggingface_cli delete-cache e2983b237dccf3ab4937c97fa717319a9ca1a96d # refs/pr/1, 0.2.0
huggingface_cli delete-cache 6c0e6080953db56375760c0471a8c5f2929baf11 # (detached)

## Model nielsr/deformable-detr
## Downloaded 2 days ago. Last used 2 days ago
## Total size: 10G. Would delete 0.
# huggingface_cli delete-cache 0e4ab15e8865b843b3a348275facfd198cf22f03  # main

## Dataset SLPL/naab
## Downloaded 1 year ago. Last used 5 days ago
## Total size: 7.2G. Would delete 5G.
# huggingface_cli delete-cache 60b8d3fe22aebb024b573f1cca224db3126d10f3 # main
huggingface_cli delete-cache a5ae8a094021fa4094b270166440d4a165939446 # (detached)

?

If yes, I'm just worried that it would be sub-optimal as each delete-cache command first requires scanning the cache folder, so potentially we scan 100 times to delete 100 revisions (instead of once for the dry-run and once for the execution). With the previous snippet from #1025 (comment), the user would have to build their own huggingface_cli delete-cache a5ae8a... 094021... commands for manual deletion, so maybe we can add a small "instructions" part at the beginning of the generated file/output to explain how to use it?

(also, I find cmds.txt very verbose compared to just hashes)

eval("huggingface_cli delete-cache --keep-last --dry-run") > revisions.txt

My bad, I was sure the "> revisions.txt" would be parsed by Python's argparse, which we don't want, but that's not the case.

and my proposal includes no pipes

True, that's an advantage of generating a file of commands. Maybe we could have an option to take a file as input instead of using a pipe, e.g.:

huggingface_cli delete-cache --keep-last --dry-run > revisions.txt

(...)

huggingface_cli delete-cache --from-dry-run=revisions.txt

@stas00
Contributor

stas00 commented Sep 7, 2022

I'm just worried that it would be sub-optimal as each delete-cache command first requires scanning the cache folder, so potentially we scan 100 times to delete 100 revisions

why do you need to re-scan the cache if you are already given the exact hash to delete?

perhaps this is simply a case of a missing indexing component? i.e. adding an index that always gets updated on adding/deleting revisions and never requires a manual rescan if the entry point is provided by the user - a hash in this case?

of course, such an index should be able to invalidate entries if a user deleted them manually and not through the API.

(also I find it very verbose in cmds.txt compared to just hashes)

but you can't act on hashes directly, you have to manually add them to a command that isn't there.

but why not have both ways and let the user choose? --dump-hash-only if you prefer the terse version?

@Wauplin
Contributor Author

Wauplin commented Sep 8, 2022

1.

why do you need to re-scan the cache if you are already given the exact hash to delete?

Even if I know the hash of the revision I want to delete I need to make sure the blob files that I delete are not referenced by another revision in the repo.

repo_A
    blobs
        aaaaaa
        bbbbbb
        cccccc
    snapshots
        123456
             config.json -> ./repo_A/blobs/aaaaaa
             some_data.txt -> ./repo_A/blobs/bbbbbb
        789101
             config.json -> ./repo_A/blobs/aaaaaa
             some_data.txt -> ./repo_A/blobs/cccccc

Let's imagine a huggingface-cli delete-cache 789101 command. I want to delete blob cccccc but not blob aaaaaa. To know that, I need to scan both revisions in order to check which blobs are still referenced and which are not. The revision id to delete is the only information I need from the user, but it's not enough to know which blobs to delete, hence the scan.
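In pseudo-Python, the check looks roughly like this (a sketch on top of the raw folder layout above, not the actual implementation):

from pathlib import Path

def blobs_to_delete(repo_path: Path, revision: str) -> set:
    """Blobs referenced by `revision` and by no other snapshot of the same repo."""
    snapshots = repo_path / "snapshots"

    def blobs_of(rev_dir):
        # resolve every symlink of the snapshot to its target blob
        return {f.resolve() for f in rev_dir.rglob("*") if f.is_symlink()}

    candidates = blobs_of(snapshots / revision)
    for other in snapshots.iterdir():
        if other.name != revision:
            candidates -= blobs_of(other)  # keep only blobs no other revision uses
    return candidates

# In the example above, blobs_to_delete(repo_A, "789101") returns {blobs/cccccc}:
# aaaaaa is still referenced by snapshot 123456 and must be kept.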

Just to mention it, it could be possible to count the number of links referencing a blob on some platforms (Linux for instance), but I'd prefer to avoid that.

2.

perhaps this is simply a case of a missing indexing component? i.e. adding an index that always gets updated on adding/deleting revisions and never requires a manual rescan if the entry point is provided by the user - a hash in this case?

of course, such an index should be able to invalidate entries if a user deleted them manually and not through the API.

Yes, a proper index at the root of the cache would solve this problem. However, I'm reluctant to invest time building an index component that would need to be robust to files being inserted/deleted, only to get a faster cache scan. If scanning the cache directory really starts to be a burden for users then we could think about it, but here we just want to ship a "free up some space" feature.

A very rough benchmark on my local "5k-files" cache takes ~300 ms. For a user with 1M cached files, a linear extrapolation puts the scan at ~1 min (300 ms / 5k files ≈ 0.06 ms per file), which shouldn't be too much of a hassle for the vast majority of our users, right?

3.

but you can't act on hashes directly, you have to manually add them to a command that isn't there.

but why not have both ways and let the user choose? --dump-hash-only if you prefer the terse version?

Yes, it's true that it's not as convenient/optimal as having a cache index and a cmds.txt file with ready-to-use commands. However, let's remember that it will only be used by "power users" who will run a --dry-run, analyse the file line by line and then execute only parts of it.

I expect most users to use the huggingface_cli delete-cache --keep-last command and that's it. If you don't see a major inconvenience, I think I'll start moving forward on this feature. Other options and additions will still be possible later :)

@Wauplin
Contributor Author

Wauplin commented Sep 8, 2022

Overall, thanks a lot for the time you took to give feedback on this feature. It really helped a lot!

@stas00
Contributor

stas00 commented Sep 8, 2022

Even if I know the hash of the revision I want to delete I need to make sure the blob files that I delete are not referenced by another revision in the repo.

OK, so you don't need to rescan the whole cache - just the specific repo - and on average you'd have just 2 or a few more revisions there, so it shouldn't really be a big overhead, no? Repos with 1 revision won't even show up in this use case.

I agree that having a proper index can be a nice-to-have one-day feature.

I expect most users to use the huggingface_cli delete-cache --keep-last command and that's it. If you don't see a major inconvenience, I think I'll start moving forward on this feature. Other options and additions will still be possible later :)

Indeed, but this is only one need. And it'd be the most used purging use case (which also makes me think that it should be the default and not require --keep-last? Not sure, just thinking aloud here). Perhaps we can have delete-cache and prune-cache (the latter would just delete the old stuff - similar to felling a tree vs. pruning a tree: the former is destructive, the latter is remedial).

So I propose prune-cache == delete-cache --keep-last

The other 2 important needs are to be able to identify and purge whole repos that either:
(a) haven't been used in a long time - dead weight
(b) are too large - if a user urgently needs disk space, they can choose to delete a few repos and restore them later; it's easier to nuke the largest repos in this case, though a simple du -s will also do the trick. In fact, I was doing exactly that yesterday when I needed to make space urgently.

@Wauplin
Contributor Author

Wauplin commented Sep 8, 2022

OK, so you don't need to rescan the whole cache - just the specific repo - and on average you'd have just 2 or a few more revisions there, so it shouldn't really be a big overhead, no? Repos with 1 revision won't even show up in this use case.

Yes, that makes sense. Big, time-consuming scans will only be needed on large datasets with a lot of revisions, and for those we can't do much anyway.

So I propose prune-cache == delete-cache --keep-last

About prune-cache, delete-cache and the options, let's keep the idea in mind since it really makes sense. Once the core part is implemented, those will be cosmetic adjustments to the CLI.

I think we are good to go! 🔥

@Wauplin
Contributor Author

Wauplin commented Sep 9, 2022

Another idea popped up when I started working on it. Couldn't we just use an external library to make a nice multi-select view in the terminal?

The user can:

  • select revisions to delete 1 by 1
  • select an entire repo to delete

From a process point of view:

  • We scan the cache once (we cannot get rid of this part)
  • Given the --keep-last, --filter, --sort=size, --limit,... options we preselect a set of revisions/repos
  • The user can then choose which ones to delete
  • Each time the user selects a new entry, we compute how much space would be saved (fast to compute since no re-scan)
  • Once selected, we prompt an "Are you sure? [y/N]" and that's it.

@stas00 @julien-c What do you think about that? No need for a dry-run mode, no need for an external file/output, no need to copy-paste commands, ... We can even have a --skip-confirm flag to disable the selection menu.

This adds an external dependency, but it doesn't seem too problematic to me as it is only CLI-related (let's make it optional). I found this package (simple-term-menu) which can easily do the job. I made a small demo that can be improved in terms of wording/information displayed ⬇️
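Roughly, the interaction could look like this (a sketch assuming simple-term-menu's multi-select API and the scan-cache report fields; the entry formatting and the final deletion step are illustrative only):

from huggingface_hub import scan_cache_dir
from simple_term_menu import TerminalMenu

report = scan_cache_dir()
entries, revisions = [], []
for repo in report.repos:
    for rev in repo.revisions:
        refs = ", ".join(sorted(rev.refs)) or "(detached)"
        entries.append(f"{repo.repo_id} {rev.commit_hash[:8]} ({refs}) {rev.size_on_disk / 1e9:.1f}G")
        revisions.append(rev)

menu = TerminalMenu(entries, multi_select=True, show_multi_select_hint=True)
selected = menu.show() or ()
freed = sum(revisions[i].size_on_disk for i in selected)
if selected and input(f"Delete {len(selected)} revision(s) and free ~{freed / 1e9:.1f}G? [y/N] ").lower() == "y":
    ...  # perform the actual deletion (snapshots + unshared blobs)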

Screencast.from.09-09-2022.12.46.48.webm

@stas00
Contributor

stas00 commented Sep 9, 2022

I think that's a great idea, @Wauplin, as long as you don't put yourself in a situation where you're committed to figuring out how to make it work cross-platform.

We can of course state that it will only work in some environments, and in the other cases fall back to the manual approach?

If we go with your proposal I'd just suggest

  1. adding the size on disk and perhaps a last-used timestamp next to each selectable option.
  2. figuring out what to do if someone selects the main repo entry and also one of its revisions (do we delete the whole thing or just the selected revisions?)

@Wauplin
Contributor Author

Wauplin commented Sep 12, 2022

Good idea @stas00. Let's make this an optional beta feature. It seems there are some minor issues on Windows, but yeah, I definitely don't want to debug that. The previous message was just a demo; I'll do more careful research to pick a package that is stable enough and does the job.

  1. Yes, definitely. This was a simple example but more information would be better. I also thought that I could display only the first X characters instead of the full revision hash. In the background I would still keep the long version, but the user only needs a short id I think (e.g. 60b8d3fe instead of 60b8d3fe22aebb024b573f1cca224db3126d10f3).

2. What I wanted to do is:
- when all revisions of a repo are selected, automatically select the repo as well
- when all revisions and the repo are selected and the user deselects a revision, the repo is automatically de-selected
- when a user selects a repo, all revisions are automatically selected
- when a user de-selects a repo, all revisions are automatically de-selected
=> implemented differently (see #1025 (comment))

By doing this, we ensure repo/revision selections are always consistent and understandable for the user. I think it's possible to automatically select/deselect items when the user selects/deselects one, with a callback somewhere.

@Wauplin
Contributor Author

Wauplin commented Sep 12, 2022

@stas00 I worked on it today and made this PR #1046.

This is a demo of how it looks. As said in the PR description, it is still a version where no options are implemented. Once it's merged, adding --keep-last, --filter, --sort, ... will be done in separate issues.

figuring out what to do if someone selects the main repo entry and also one of its revisions (do we delete the whole thing or just the selected revisions?)

In the end, I think it's easier if the user can only select revisions. If all revisions of a repo are selected, the full repo is deleted. It's less logic to implement and to explain in the documentation.
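As a sketch, that rule only takes a few lines (delete_revision is a hypothetical per-revision helper; the real logic lives in #1046):

import shutil

def delete_selection(repo, selected_hashes):
    """Delete the whole repo folder if every revision was selected, else only the selected ones."""
    all_hashes = {rev.commit_hash for rev in repo.revisions}
    if set(selected_hashes) >= all_hashes:
        shutil.rmtree(repo.repo_path)  # whole repo gone, blobs included
    else:
        for commit_hash in selected_hashes:
            delete_revision(repo, commit_hash)  # hypothetical: snapshot + blobs not shared with kept revisions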

@stas00
Contributor

stas00 commented Sep 12, 2022

All sounds great, @Wauplin

that I could display only the first X characters instead of the full revision hash. In the background

Yes, that's exactly how git operates: you only need the first X unique characters of the hash - usually 8 is sufficient. Great idea to have the cache support short ids.
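For reference, computing a short-id length that stays unambiguous across the cached hashes is trivial (a sketch; 8 characters is already plenty in practice):

def shortest_unique_prefix(hashes, minimum=8):
    """Smallest prefix length that keeps all hashes distinct (git-style short ids)."""
    length = minimum
    while len({h[:length] for h in hashes}) < len(set(hashes)):
        length += 1
    return length

# the full hashes are kept internally; only hash[:length] is displayed to the user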

Also, let's make sure that there is a manual version in addition to the GUI; that way, if the GUI doesn't work for a user they always have a fallback.

and the demo looks great.

Thank you for working on this.

@Wauplin
Contributor Author

Wauplin commented Sep 13, 2022

Also, let's make sure that there is a manual version in addition to the GUI; that way, if the GUI doesn't work for a user they always have a fallback.

@stas00 yes indeed. That's now done with the --disable-tui flag. I tried to keep the process as close as possible between the TUI mode and the no-TUI mode. Without the TUI, the user has to manually edit a generated temporary file.

See documentation for both modes here.

@stas00
Contributor

stas00 commented Sep 13, 2022

that's perfect, thank you, @Wauplin

@stas00
Contributor

stas00 commented Sep 15, 2022

one more addition I have just thought of - should this feature also wipe the downloads dir clean? It often has all kinds of leftovers, and it's not always obvious that it can be safely wiped out.

Thank you!

@Wauplin
Contributor Author

Wauplin commented Sep 15, 2022

What is the downloads directory for you? Can you run tree on it so that I can see what we are talking about?

@stas00
Contributor

stas00 commented Sep 15, 2022

$ find ~/.cache/huggingface/ -type d -name "downloads"
/home/stas/.cache/huggingface/evaluate/downloads
/home/stas/.cache/huggingface/datasets/downloads

$ du -s /home/stas/.cache/huggingface/datasets/downloads
141G    /home/stas/.cache/huggingface/datasets/downloads

$ tree /home/stas/.cache/huggingface/datasets/downloads
├── 078890d95f804daa5ec1175957649d92c3006dde04e9b584454c42bd78ebc241.991ca2ccf495f2a8f6a30a8acf52a1c41174aaf5054cdab0e373a855b6d0a747.py
├── 078890d95f804daa5ec1175957649d92c3006dde04e9b584454c42bd78ebc241.991ca2ccf495f2a8f6a30a8acf52a1c41174aaf5054cdab0e373a855b6d0a747.py.json
├── 078890d95f804daa5ec1175957649d92c3006dde04e9b584454c42bd78ebc241.991ca2ccf495f2a8f6a30a8acf52a1c41174aaf5054cdab0e373a855b6d0a747.py.lock
├── 11ba2d79cea2e93248b0441726361c9bb586ed8c2b366d3eca0e2a09b09560cf
├── 11ba2d79cea2e93248b0441726361c9bb586ed8c2b366d3eca0e2a09b09560cf.json
├── 11ba2d79cea2e93248b0441726361c9bb586ed8c2b366d3eca0e2a09b09560cf.lock
├── 22cc09b66984393aefb3ad73a828ca57357f3a8401195388305da809d3418088
├── 22cc09b66984393aefb3ad73a828ca57357f3a8401195388305da809d3418088.json
├── 22cc09b66984393aefb3ad73a828ca57357f3a8401195388305da809d3418088.lock
[...]
├── extracted
│   ├── 08b49d88c65e1b539a4c3f5ceed84425b29c5f4b41c8679730a03158c7eb99e3
│   │   └── wikitext-2
│   │       ├── wiki.test.tokens
│   │       ├── wiki.train.tokens
│   │       └── wiki.valid.tokens
[...]
│   ├── ab2462bd81d79dc0ec989ab6c93d61decf0d806ef975a44bddd8a6d7e69c798a
│   │   ├── legal
│   │   │   └── Legal notices.txt
│   │   ├── rapid2019.de-en.de
│   │   └── rapid2019.de-en.en
│   ├── cd100a34a529fcef7abec93d9ba199b79b639c1b908bd7d1686c9e6c2b0e0932
│   │   └── wikitext-103
│   │       ├── wiki.test.tokens
│   │       ├── wiki.train.tokens
│   │       └── wiki.valid.tokens
[...]

as you can see, it's full of things that were downloaded from the Hub and also things extracted from what was downloaded - a ton of stuff that was only used temporarily.

@julien-c
Member

that's datasets (or evaluate) specific though, so out of scope for this repo IMO

@stas00
Contributor

stas00 commented Sep 15, 2022

oh, right! I didn't realize they are not part of the models - I just thought it was all the Hub's cache.

So to manage the datasets caches we should open a similar request on the datasets side? This is another huge disk-space sink in addition to models.

@Wauplin
Contributor Author

Wauplin commented Sep 16, 2022

@stas00 I am really not sure how datasets uses this downloads folder, but that's definitely something to handle separately, yes. I guess the best would be to start a discussion on how datasets could use the default huggingface_hub cache directory, if that is even possible. It seems it is used to unzip files, which at some point might be useful in huggingface_hub for broader re-usability.

Also, #1046 is finally merged! :)

@LysandreJik
Member

I believe the datasets team is also in the process of (or at least thinking about) integrating the updated cache within their caching system, so let's revisit this once that's taken care of.

cc @lhoestq

@julien-c
Member

(and @Wauplin don't hesitate to help the datasets team unify/upgrade their Hub integration 🎉 )

@Wauplin
Contributor Author

Wauplin commented Sep 19, 2022

(yes, I am in contact with @lhoestq about that. I'm planning to work on it mid-week to get a good integration :) )

@Wauplin
Contributor Author

Wauplin commented Sep 20, 2022

Closing this issue as a first version of huggingface-cli delete-cache is implemented :)
I created a new issue, #1065, to keep track of the ideas we mentioned.
