
StableDiffusionXLInstructPix2PixPipeline doesn't work with cosxl_edit #7621

Closed
apolinario opened this issue Apr 9, 2024 · 14 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

@apolinario
Collaborator

Describe the bug

CosXL Edit is an InstructPix2Pix model (https://huggingface.co/stabilityai/cosxl) released together with CosXL; however, trying to load it gives a size mismatch error.

Reproduction

import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "cosxl_edit.safetensors"
)

Logs

tokenizer_config.json: 100%
 905/905 [00:00<00:00, 13.7kB/s]
vocab.json: 100%
 961k/961k [00:00<00:00, 10.2MB/s]
merges.txt: 100%
 525k/525k [00:00<00:00, 17.4MB/s]
special_tokens_map.json: 100%
 389/389 [00:00<00:00, 20.1kB/s]
tokenizer.json: 100%
 2.22M/2.22M [00:00<00:00, 16.0MB/s]
config.json: 100%
 4.52k/4.52k [00:00<00:00, 250kB/s]
tokenizer_config.json: 100%
 904/904 [00:00<00:00, 50.1kB/s]
vocab.json: 100%
 862k/862k [00:00<00:00, 34.1MB/s]
merges.txt: 100%
 525k/525k [00:00<00:00, 22.2MB/s]
special_tokens_map.json: 100%
 389/389 [00:00<00:00, 21.6kB/s]
tokenizer.json: 100%
 2.22M/2.22M [00:00<00:00, 16.5MB/s]
config.json: 100%
 4.88k/4.88k [00:00<00:00, 253kB/s]
Some weights of the model checkpoint were not used when initializing CLIPTextModelWithProjection: 
 ['text_model.embeddings.position_ids']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-01c040bbaf7e> in <cell line: 5>()
      3 from diffusers.utils import load_image
      4 
----> 5 pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
      6     file, torch_dtype=torch.float16
      7 )

4 frames
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
    116             kwargs = smoothly_deprecate_use_auth_token(fn_name=fn.__name__, has_token=has_token, kwargs=kwargs)
    117 
--> 118         return fn(*args, **kwargs)
    119 
    120     return _inner_fn  # type: ignore

/usr/local/lib/python3.10/dist-packages/diffusers/loaders/single_file.py in from_single_file(cls, pretrained_model_link_or_path, **kwargs)
    287                 init_kwargs[name] = passed_class_obj[name]
    288             else:
--> 289                 components = build_sub_model_components(
    290                     init_kwargs,
    291                     class_name,

/usr/local/lib/python3.10/dist-packages/diffusers/loaders/single_file.py in build_sub_model_components(pipeline_components, pipeline_class_name, component_name, original_config, checkpoint, local_files_only, load_safety_checker, model_type, image_size, torch_dtype, **kwargs)
     59         upcast_attention = kwargs.pop("upcast_attention", None)
     60 
---> 61         unet_components = create_diffusers_unet_model_from_ldm(
     62             pipeline_class_name,
     63             original_config,

/usr/local/lib/python3.10/dist-packages/diffusers/loaders/single_file_utils.py in create_diffusers_unet_model_from_ldm(pipeline_class_name, original_config, checkpoint, num_in_channels, upcast_attention, extract_ema, image_size, torch_dtype, model_type)
   1320         from ..models.modeling_utils import load_model_dict_into_meta
   1321 
-> 1322         unexpected_keys = load_model_dict_into_meta(unet, diffusers_format_unet_checkpoint, dtype=torch_dtype)
   1323         if unet._keys_to_ignore_on_load_unexpected is not None:
   1324             for pat in unet._keys_to_ignore_on_load_unexpected:

/usr/local/lib/python3.10/dist-packages/diffusers/models/modeling_utils.py in load_model_dict_into_meta(model, state_dict, device, dtype, model_name_or_path)
    150         if empty_state_dict[param_name].shape != param.shape:
    151             model_name_or_path_str = f"{model_name_or_path} " if model_name_or_path is not None else ""
--> 152             raise ValueError(
    153                 f"Cannot load {model_name_or_path_str}because {param_name} expected shape {empty_state_dict[param_name]}, but got {param.shape}. If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example."
    154             )

ValueError: Cannot load because conv_in.weight expected shape tensor(..., device='meta', size=(320, 4, 3, 3)), but got torch.Size([320, 8, 3, 3]). If you want to instead overwrite randomly initialized weights, please make sure to pass both `low_cpu_mem_usage=False` and `ignore_mismatched_sizes=True`. For more information, see also: https://github.com/huggingface/diffusers/issues/1619#issuecomment-1345604389 as an example.


### System Info

diffusers==0.27.2

### Who can help?

@sayakpaul , @yiyixuxu 
apolinario added the bug (Something isn't working) label on Apr 9, 2024
@yiyixuxu
Collaborator

yiyixuxu commented Apr 9, 2024

You should be able to load the checkpoint with num_in_channels=8 (the error log shows the checkpoint's conv_in expects 8 input channels rather than 4):

import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "https://huggingface.co/stabilityai/cosxl/blob/main/cosxl.safetensors", num_in_channels=8,
)

@yiyixuxu
Collaborator

yiyixuxu commented Apr 9, 2024

cc @DN6 here
let's make sure to support SDXL InstructPix2Pix out of the box in #7496

we should support every model listed here: https://github.com/comfyanonymous/ComfyUI/blob/4201181b35402e0a992b861f8d2f0e0b267f52fa/comfy/supported_models.py#L479

@apolinario
Collaborator Author

apolinario commented Apr 9, 2024

This worked with num_in_channels=8 (as in: it didn't error). However, perceptually it isn't behaving as it should.

Edit image:
[input image]

Edit prompt "Turn sky into a cloudy one":
[edited output image]

import torch
from diffusers import StableDiffusionXLInstructPix2PixPipeline, EDMEulerScheduler
from diffusers.utils import load_image

inst_file = "cosxl_edit.safetensors"

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    inst_file, num_in_channels=8,
).to("cuda")

pipe.scheduler = EDMEulerScheduler(sigma_min=0.002, sigma_max=120.0, sigma_data=1.0, prediction_type="v_prediction")

resolution = 1024
image = load_image(
    "https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png"
).resize((resolution, resolution))

edit_instruction = "Turn sky into a cloudy one"
edited_image = pipe(
    prompt=edit_instruction,
    image=image,
    height=resolution,
    width=resolution,
    #guidance_scale=3.0,
    #image_guidance_scale=1.5,
    num_inference_steps=20,
).images[0]

@sayakpaul
Member

sayakpaul commented Apr 10, 2024

Not sure if it's the exact guidance formulation that we have in the InstructPix2Pix pipeline though. That would matter a lot.

If it's possible, could you try to initialize the StableDiffusionXLInstructPix2PixPipeline with each component initialized separately?

unet = ...
text_encoder = ...
text_encoder_2 = ...
vae = ...
scheduler = ...

pipeline = ...
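
For illustration, a minimal sketch of what that could look like, assuming the non-UNet components come from the SDXL base repo (the repo id and dtype are placeholders here, not a verified recipe for CosXL):

import torch
from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer
from diffusers import (
    AutoencoderKL,
    EDMEulerScheduler,
    StableDiffusionXLInstructPix2PixPipeline,
    UNet2DConditionModel,
)

base = "stabilityai/stable-diffusion-xl-base-1.0"  # placeholder source for the shared components

vae = AutoencoderKL.from_pretrained(base, subfolder="vae", torch_dtype=torch.float16)
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder", torch_dtype=torch.float16)
text_encoder_2 = CLIPTextModelWithProjection.from_pretrained(base, subfolder="text_encoder_2", torch_dtype=torch.float16)
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
tokenizer_2 = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer_2")

# The real UNet would need the CosXL edit weights with 8 input channels;
# loading the base SDXL UNet here is only a stand-in to show the wiring.
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet", torch_dtype=torch.float16)

scheduler = EDMEulerScheduler(sigma_min=0.002, sigma_max=120.0, sigma_data=1.0, prediction_type="v_prediction")

pipeline = StableDiffusionXLInstructPix2PixPipeline(
    vae=vae,
    text_encoder=text_encoder,
    text_encoder_2=text_encoder_2,
    tokenizer=tokenizer,
    tokenizer_2=tokenizer_2,
    unet=unet,
    scheduler=scheduler,
)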

@apolinario
Collaborator Author

Not sure if it's the exact guidance formulation that we have in the InstructPix2Pix pipeline though. That would matter a lot.

ComfyUI uses the same InstructPix2PixConditioning node for it that it uses for InstructPix2Pix itself. Overall, this is how Comfy added support for the CosXL models; once that was in, the nodes for supporting it look similar to vanilla InstructPix2Pix.
comfyanonymous/ComfyUI@1088d18

These are the nodes for the ComfyUI official edit workflow:
[workflow screenshot]

If it's possible, could you try to initialize the StableDiffusionXLInstructPix2PixPipeline with each component initialized separately?

As I'm using from_single_file, I don't think UNet2DConditionModel etc. support that method, afaik. How do you think that would help with debugging/making it work?

@yiyixuxu
Collaborator

yiyixuxu commented Apr 10, 2024

@apolinario

you just have to scale the image_latents

adding this to the pipeline:

        # 6. Prepare Image latents
        image_latents = self.prepare_image_latents(
            image,
            batch_size,
            num_images_per_prompt,
            prompt_embeds.dtype,
            device,
            do_classifier_free_guidance,
        )
        image_latents = image_latents * self.vae.config.scaling_factor

edited

@sayakpaul
Member

Nice finding. However, the SD Pix2Pix doesn't have it :o

@apolinario
Collaborator Author

Awesome! What's the best way to proceed here? Modify the pipeline to detect whether scaling is needed, or create a new one?

@sayakpaul
Member

sayakpaul commented Apr 10, 2024

I think the following could work:

  • after introducing the sigma scheduling changes to the EDM schedulers (as discussed internally with Suraj), we serialise the pipeline in the diffusers format. This gives us the scheduler with all the right configurations.
  • in the pipeline code, we check if the scheduler is of the EDM type and, if so, we scale the latents (a rough sketch follows below).

WDYT? @yiyixuxu would love your thoughts too.
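
A rough sketch of the check in the second bullet, assuming the presence of EDMEulerScheduler is used as the signal (whether scheduler type is the right signal is exactly what's debated below):

from diffusers import EDMEulerScheduler

# inside the pipeline's __call__, right after image_latents are prepared:
# scale only for EDM-style (CosXL) checkpoints
if isinstance(self.scheduler, EDMEulerScheduler):
    image_latents = image_latents * self.vae.config.scaling_factor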

@yiyixuxu
Collaborator

yiyixuxu commented Apr 11, 2024

I think we should modify the pipeline to detect if scaling is needed

based on my understanding, how we scale the latents is not dependent on the scheduler type but is specific to how this model was trained, i.e. in most of our pipelines, the image_latents are scaled regardless of which scheduler you use:

image_latents = self.vae.config.scaling_factor * image_latents

so I think we should add a pipeline config, e.g. something like is_cosxl, that the user can pass to from_single_file() (a hypothetical sketch follows below)

cc @DN6 here
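
For illustration only, a hedged sketch of what that could look like at the call site (is_cosxl is the hypothetical flag name from this comment, not a shipped argument):

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "cosxl_edit.safetensors",
    num_in_channels=8,
    is_cosxl=True,  # hypothetical flag: scale image_latents and map to the EDM scheduler config
)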

@sayakpaul
Member

so I think we should add a pipeline config, e.g. something like is_cosxl, that the user can pass to from_single_file(); with this flag, we can map it to the correct scheduler config too in from_single_file

If we introduce that only for from_single_file(), won't that introduce a discrepancy between from_pretrained() and from_single_file() methods of InstructPix2Pix then? I thought we were trying to reduce these kinds of discrepancies with Dhruv's refactor.

@DN6
Collaborator

DN6 commented Apr 12, 2024

If the argument is added to the pipeline and is only a pipeline argument, then that wouldn't be a discrepancy. What we want is to avoid configuring models via pipeline invocations.

@sayakpaul
Member

What we want is to avoid configuring models via pipeline invocations

Like this?

pipe = StableDiffusionXLInstructPix2PixPipeline.from_single_file(
    "https://huggingface.co/stabilityai/cosxl/blob/main/cosxl.safetensors", num_in_channels=8,
)


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale (Issues that haven't received updates) label on May 10, 2024