Replies: 6 comments
-
Hello, has anybody had an opportunity to look into this feature proposal?
-
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
-
Hi, thank you for your well-elaborated proposal. This is an idea we have had for some time, but it is not quick to implement since we support quite a few different models. Whether or not to convert the single-file format is not something we can decide arbitrarily: people like that format because it's easier to manage for ordinary users who don't know how to use the console or git. Even if they have to download a 25GB file each time, they will still do it; I've seen this with SD3 and Flux. So that part should be done by the user beforehand. But I do think that a system where users won't have to download the same file again (from the Hub) if they already have it is a really good idea. cc: @yiyixuxu
-
Thanks for the issue @wbclark! We work with huggingface_hub and love it, so it makes more sense for us to keep using it and improve upon it; we probably don't want to reinvent a different system just for diffusers. Given that context, as I read this proposal: can you help me understand how it would help if we added the converted model into the cache system?
-
Hey there, sorry for the late response. For now we don't want to do cross-repo deduplication, as it could introduce other kinds of security issues (typically, a corrupted file from one repo ending up corrupting another one). However, we are in the process of revamping our backend infrastructure to make file storage and downloading more efficient. This will affect how files are cached locally and might result in cross-repo deduplication. In the meantime, we don't want to change the existing logic.
-
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
-
Is your feature request related to a problem? Please describe.
A major drawback of using single-file models is their inefficient use of disk storage, as a user who has downloaded several models in the single-file format is likely storing many redundant copies of individual model components that were re-used across models.
It's common for models to merge or fine-tune only the UNet, for example, leading to a situation where the user is potentially storing many redundant copies of identical VAEs and text encoders, eating up a non-trivial amount of their disk space.
However, even when converting single-file models to the diffusers-multifolder format using the scripts provided in this repository, each model's components (e.g., UNet, VAE, text encoder) are currently stored under that model's own folder, so storage is still redundant whenever multiple models share identical components.
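For example (hypothetical paths, assuming two checkpoints that share a VAE were each converted to the multifolder layout), the redundancy can be confirmed by hashing the component weight files:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large weights never need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical output dirs of two converted checkpoints where only the
# UNet was fine-tuned; the VAE weights are byte-identical.
vae_a = Path("model_a/vae/diffusion_pytorch_model.safetensors")
vae_b = Path("model_b/vae/diffusion_pytorch_model.safetensors")
print(sha256_of(vae_a) == sha256_of(vae_b))  # True -> same bytes stored twice
```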
Describe the solution you'd like.
I propose a feature that facilitates converting downloaded single-file models to the diffusers-multifolder layout in a storage-efficient manner. The core idea is to identify and eliminate duplicate model components across multiple models, so that identical components are stored only once; a rough sketch follows below.
This could be supported by providing additional operations, such as deduplicating components of models that were converted previously.
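As a rough, non-authoritative sketch of what I have in mind (none of the helpers or paths below exist in diffusers; it assumes content-hash-based deduplication and that the blob store and model folders live on the same filesystem, since it uses hard links):

```python
import hashlib
import os
from pathlib import Path

STORE = Path.home() / ".cache" / "diffusers-dedup" / "blobs"  # hypothetical blob store

def _sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe_file(path: Path) -> None:
    """Replace `path` with a hard link into a content-addressed store,
    so identical component files share a single copy on disk."""
    STORE.mkdir(parents=True, exist_ok=True)
    blob = STORE / _sha256(path)
    if blob.exists():
        path.unlink()        # duplicate: drop it and link to the existing blob
    else:
        path.rename(blob)    # first occurrence becomes the canonical copy
    os.link(blob, path)      # hard link keeps the multifolder layout intact

# Deduplicate every component weight file under a hypothetical models/ tree.
for weights in Path("models").glob("*/*/*.safetensors"):
    dedupe_file(weights)
```

With hard links, deleting one model never breaks another: the underlying blob is only reclaimed when its last link is removed.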
Describe alternatives you've considered.
If I understand correctly (please correct me if I'm wrong), the huggingface_hub caching system currently performs some de-duplication, but only within a single repo: identical files in different repos are still stored as separate copies (see the cache-inspection sketch below).
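That behavior is visible with huggingface_hub's existing `scan_cache_dir()` API: cached revisions symlink into a per-repo `blobs/` folder, and large (LFS) blobs are named by their sha256. The grouping logic below is just an illustration written for this proposal, not an existing command:

```python
from collections import defaultdict
from huggingface_hub import scan_cache_dir

# Blobs live under each repo's own blobs/ folder; for LFS files the blob
# name is the file's sha256, so identical content in two repos shows up
# as two blobs with the same name in different folders.
copies = defaultdict(set)
for repo in scan_cache_dir().repos:
    for revision in repo.revisions:
        for file in revision.files:
            copies[file.blob_path.name].add(file.blob_path)

# Within one repo, revisions that share a file point at the same blob
# (deduplicated); distinct paths with the same name are cross-repo copies.
for digest, paths in copies.items():
    if len(paths) > 1:
        print(f"{digest[:12]} stored {len(paths)} times:")
        for path in sorted(paths):
            print("  ", path)
```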
Additional context.
Thank you for your consideration and feedback!