🚩 Scan cache tool: ability to free up space #1025
Related to #1013. However for now we are more in an approach of "let the user decide" than automatically deleting stuff. |
maybe a combination of |
I think a very useful approach would be to be able to sort plus dump ready to delete commands, as in:
That way now it's really trivial for a user to copy-n-paste and also edit the list if they don't want to delete something. or perhaps this format? so that the decision making data is close by
or better even more detailed, but still friendly to be run in shell
so as it'd be already sorted either by size or access I could then dump the output in the editor, massage it to remove what I want to keep and then run it, thus cleaning the cache in one swoop. |
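As a rough illustration of the "sort, dump, then edit" workflow described above, here is a minimal sketch that lists the directories under a cache root sorted by size, one line per directory, so the output can be pasted into an editor and trimmed. The helper names `dir_size` and `sorted_listing` and the `size<TAB>path` output format are assumptions made for illustration; they are not the command format proposed in this thread, nor anything provided by `huggingface_hub`.

```python
import os
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            # Skip symlinks so a blob is counted once, not once per link to it.
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def sorted_listing(cache_root: Path) -> list[str]:
    """One 'size<TAB>path' line per top-level directory, largest first."""
    entries = [(dir_size(p), p) for p in cache_root.iterdir() if p.is_dir()]
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return [f"{size}\t{path}" for size, path in entries]
```

Keeping the size on the same line as the path puts the decision-making data next to each entry, which is the point made above.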
@stas00 I really like the idea of outputting a list of commands that the user can edit manually before deleting. However I'm not sure we can do that with a list of To delete a revision, we need to delete the corresponding snapshot folder containing the symlinks but also the blobs that were referred only by this revision (and not the others). So doing it with We can still keep the idea of printing a list to edit but maybe it could be a list of commit_hash:
|
About the last access time, I am not sure how reliable it is on different OS. Download time can be a more reliable indicator but less interesting (cc @julien-c 's example: gpt2 is a few years old in his cache but still useful to keep). Also, access time can "easily" be determined for a full repo but a lot more difficult for specific revisions as they share the same blobs. I'm not 100% against the idea, I'm just concerned how we can make this information useful to the users. |
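To make the access-time concern concrete, here is a minimal sketch of how a repo-level "last accessed" value could be derived: take the most recent `st_atime` among the regular files in the repo folder. This is an assumption about one possible approach, not a description of how the library computes it, and `repo_last_accessed` is a hypothetical helper. Whether `st_atime` is actually updated on read depends on the OS and mount options (for example `noatime` or `relatime` on Linux), which is the reliability concern raised above.

```python
import os
from pathlib import Path
from typing import Optional


def repo_last_accessed(repo_dir: Path) -> Optional[float]:
    """Most recent access time (seconds since epoch) among files under `repo_dir`.

    Returns None if the folder contains no regular files.
    """
    latest = None
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            file_path = Path(root) / name
            if file_path.is_symlink():
                continue  # only inspect the real files, not links to them
            atime = file_path.stat().st_atime
            if latest is None or atime > latest:
                latest = atime
    return latest
```

Because revisions share blobs, a value computed this way is only meaningful for the repo as a whole, not for an individual revision.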
Oh, right! I didn't think of the revisions! This is super important! So actually a lot of the time the user will want to wipe out all outdated revisions but the current one. Should we provide an API for deletion then?
or something like that? delete all:
delete all revisions but last one:
delete specific revision:
and this also means that the scan should print out the breakdown of each revision's space it's taking, no? |
re: access time We could start with just full repo access time - and combined with So if someone asked me to test some code with its repo and I don't normally use it I could easily wipe it out. It'd do it in 2 steps
note: I added another flag
now I can delete repos I don't really use How is that? |
Thanks for all the ideas :) I start to think that we should keep your idea of generating editable commands to keep it as simple and flexible as possible. What I'm afraid of with all the A realistic possibility I see is to add a 1.
|
That would work, thank you for the super-detailed preview of how it might work, @Wauplin I think I'd rather So:
and of course eventually it'd be great to have a gui tool where one can quickly select what's wanted/unwanted and delete in one click. something like https://en.wikipedia.org/wiki/Disk_Usage_Analyzer, which also visually shows which is bigger. But of course, this can happen much later (and the difficulty would be to make it cross-platform).
what is the eval for? won't just
on windows it'd depend on the shell they use - some work like bash, others are more limited. and my proposal includes no pipes. |
I see the idea but I am not sure how the
? If yes I'm just worried that it would be sub-optimal as each (also I find it very verbose in
My bad, I was sure the "> revisions.txt" was parsed by python argparse which we don't want but that's not the case.
True, that's an advantage to generate a file of commands. Maybe we could have an option to take as input a file instead of using a pipe e.g.:
|
why do you need to re-scan the cache if you are already given the exact hash to delete? perhaps this is simply the case of a missing indexing component? i.e. adding an index that always gets updated on adding/deleting revisions and never require a manual rescan if the entrypoint is provided by the user - a hash in this case? of course such index should be able to invalidate entries if a user deleted them manually and not through the API.
but you can't act on hashes directly, you have to manually add them to a command that isn't there. but why not have both ways and let the user choose? |
1.
Even if I know the hash of the revision I want to delete I need to make sure the blob files that I delete are not referenced by another revision in the repo.
Let's imagine a Just to mention it, it could be somehow possible to count the number of symlinks referencing a blob on some platforms (Linux for instance) but I prefer to avoid that. 2.
Yes, a proper index at the root of the cache would solve this problem. However I feel a bit lazy to invest time building an index component that would need to be robust enough when inserting/deleting files only to get a faster cache scan. If it really starts to be a burden for the users to scan the cache directory then we could think about it but here we just want to ship a "free up some space" feature. A very rough benchmark on my local "5k-files"-cache takes ~300ms. For a user with 1M cached files we can estimate the scan at ~1min, which shouldn't be too much of a hassle for the vast majority of our users, right? 3.
Yes that's true that it's not as convenient/optimal as having a cache index and a I expect most users to use the |
Overall, thanks a lot for the time you took to give feedback on this feature. It really helped a lot ! |
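The constraint from point 1 above can be sketched as a set computation: a blob is safe to delete only if every revision that references it is also being deleted. The mapping from revision hash to blob names used here is a hypothetical in-memory structure for illustration; in practice it would have to be built by scanning the repo's snapshot folders, which is why a scan is still needed even when the hash is known.

```python
def blobs_safe_to_delete(
    revision_blobs: dict[str, set[str]],
    revisions_to_delete: set[str],
) -> set[str]:
    """Blobs referenced by the deleted revisions and by no revision that is kept."""
    unknown = revisions_to_delete - set(revision_blobs)
    if unknown:
        raise KeyError(f"Unknown revisions: {sorted(unknown)}")

    deleted_blobs: set[str] = set()
    kept_blobs: set[str] = set()
    for revision, blobs in revision_blobs.items():
        if revision in revisions_to_delete:
            deleted_blobs.update(blobs)
        else:
            kept_blobs.update(blobs)
    # A blob still referenced by a kept revision must survive.
    return deleted_blobs - kept_blobs


# Hypothetical repo with two revisions sharing one blob.
example = {
    "aaa111": {"blob_config", "blob_weights_v1"},
    "bbb222": {"blob_config", "blob_weights_v2"},
}
only_old = blobs_safe_to_delete(example, {"aaa111"})
```

In this example, deleting only `aaa111` frees `blob_weights_v1` but leaves `blob_config` in place, since `bbb222` still points at it.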
OK, so you don't need to rescan the whole cache - just the specific repo - and so on average you'd have just 2 or few more revisions there - shouldn't really be a big overhead, no? Repos with 1 revision won't even show up in this use case. I agree that having a proper index can be a nice-to-have one-day feature.
Indeed, but this is only one need. And it'd be the most used purging use-case. (which also makes me think that it should be the default and not require So I propose The other 2 important needs are to be able to identify and purge whole repos that either: |
Yes makes sense. Big and time-consuming scans will be needed only on large datasets with a lot of revisions. For this we can't do much anyway.
About prune-cache, delete-cache and options, let's keep the idea here since it really makes sense. Once the core-part is implemented, those will be cosmetic adjustments of the CLI. I think we are good to go ! 🔥 |
Another idea popped-up when I started to work on it. Couldn't we just use an external library to make a nice multi-select view in the terminal ? The user can:
From a process point of view:
@stas00 @julien-c What do you think about that ? No need for a dry-run mode, no need for an external file/output, no need to copy-paste commands,... We can even have a This adds an external dependency but it doesn't seem too problematic to me as it is only cli-related (let's make it optional). I found this package (simple-term-menu) which can easily do the job. I made a small demo that can be improved in terms of wording/information displayed ⬇️ Screencast.from.09-09-2022.12.46.48.webm |
I think that's a great idea, @Wauplin, as long as you don't put yourself into a situation of then committing to figure out how to make it work cross-platform. We can of course make a statement that it'd only work in some cases but not always, and in that case use the manual approach? If we go with your proposal I'd just suggest
|
Good idea @stas00. Let's make this a beta-feature that is optional. It seems there are some minor issues on Windows but yeah I definitely don't want to debug that. Previous message was just a demo but I'll do more careful research to use a package that seems stable enough and that does the job.
By doing this, we ensure repo/revisions selections are always consistent and the user can understand it. I think it's possible to automatically select/deselect items when the user selects/deselects one with a callback somewhere. |
@stas00 I worked on it today and made this PR #1046. This is a demo of what it looks like. As said in the PR description, this is still a version where no options are implemented. Once it's merged, adding
In the end I think it's easier if the user can only select revisions. If all revisions of a repo are selected, the full repo is deleted. It's less logic to implement and explain in documentation. |
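The rule stated here (selecting every revision of a repo amounts to deleting the whole repo) can be sketched as follows. The `plan_deletion` helper and its input mapping are hypothetical and only illustrate the decision; they are not taken from the PR.

```python
def plan_deletion(
    repo_revisions: dict[str, set[str]],
    selected: set[str],
) -> tuple[set[str], set[str]]:
    """Split a selection into repos to delete entirely and revisions to delete one by one."""
    full_repos: set[str] = set()
    partial_revisions: set[str] = set()
    for repo, revisions in repo_revisions.items():
        chosen = revisions & selected
        if revisions and chosen == revisions:
            # Every revision of this repo was selected: remove the repo as a whole.
            full_repos.add(repo)
        else:
            partial_revisions.update(chosen)
    return full_repos, partial_revisions


# Hypothetical cache content: two repos, three revisions in total.
repos = {"gpt2": {"a1", "b2"}, "t5-small": {"c3"}}
full, partial = plan_deletion(repos, {"a1", "c3"})
```

Here `t5-small` would be removed entirely, while only revision `a1` of `gpt2` would be deleted.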
All sounds great, @Wauplin
Yes, that's exactly how git operates, you only need the first X unique characters of the cache - usually 8 is sufficient - great idea on having the caching support short strings. Also let's make sure that there is the manual version in addition to the GUI, that way if gui doesn't work for a user they always have the fallback. and the demo looks great. Thank you for working on this. |
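The git-style short-hash idea mentioned above comes down to resolving a prefix to exactly one full hash and refusing ambiguous or unknown prefixes. Below is a minimal sketch under that assumption; `resolve_short_hash` is a hypothetical helper, and whether or how the actual tool accepts abbreviated hashes is not stated in this thread.

```python
def resolve_short_hash(prefix: str, known_hashes: list[str]) -> str:
    """Return the single full hash starting with `prefix`, or raise ValueError."""
    matches = [full_hash for full_hash in known_hashes if full_hash.startswith(prefix)]
    if not matches:
        raise ValueError(f"No revision matches prefix {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"Prefix {prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]


# Made-up hashes for illustration only.
hashes = ["abc123def4567890", "abc999000111222", "ffe0a1b2c3d4e5f6"]
resolved = resolve_short_hash("ffe0", hashes)
```

Rejecting ambiguous prefixes rather than picking the first match matters here, since the result is used to delete data.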
@stas00 yes indeed. That's now done with the See documentation for both modes here. |
that's perfect, thank you, @Wauplin |
one more addition I have just thought of - should this feature also wipe out clean the Thank you! |
What is the |
as you can see it's full of things that were downloaded from the hub and also things extracted from what was downloaded - a ton of stuff that was temporarily used. |
that's |
oh, right! I didn't realize they are not part of the models - I just thought it was all hub's cache. So to manage datasets caches we should start a similar query at |
@stas00 I am really not sure how Also to mention that #1046 is finally merged ! :) |
I believe the datasets team is also in the process/thinking about integrating the updated cache within their caching system, so let's revisit this once that's taken care of cc @lhoestq |
(and @Wauplin don't hesitate to help the datasets team unify/upgrade their Hub integration 🎉 ) |
(yes I am in contact with @lhoestq about that. I'm planning to work on that mid-week for having a good integration :) ) |
Closing this issue as a first version of |
Originally from @stas00 in slack (internal link):
for me the main query / need is usually to free up some disk space and so I'd look at the top entries of these 3 groups:
In general, this is a feature request already discussed in #990. The approach to "pruning" we are aiming for is to provide the user with information and a tool to delete a specific revision, and then make it as easy as possible for users to define their own strategy.