
Feat/inversion pp #104

Merged: 9 commits merged into develop on Jan 5, 2023
Conversation

@cloneofsimo (Owner) commented Dec 30, 2022:

A few utilities, including CLIP evaluation, CLIP evaluation preparation, and random initialization with sigma.

I've implemented CLIP text alignment and image alignment in this PR; see #67 (comment).

[image: CLIP text alignment vs. image alignment figure]

Expect to see results like the figure above, from Custom Diffusion:
https://arxiv.org/abs/2212.04488
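
As a rough illustration of the "random initialization with sigma" utility mentioned above, here is a minimal sketch; the function name, shapes, and default sigma are assumptions, not taken from the PR:

    # Hypothetical sketch: new placeholder-token embeddings drawn from
    # N(0, sigma^2) instead of being copied from an existing token.
    import torch

    def init_token_embeddings(num_vectors: int, dim: int, sigma: float = 0.01) -> torch.Tensor:
        # One row per placeholder token (e.g. <s1>, <s2>, ...).
        return torch.randn(num_vectors, dim) * sigma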

@cloneofsimo (Owner, Author):

Bit more commits coming...

@brian6091 (Collaborator):

Will start to look at this today. Just to clarify, the target_images input:

    def evaluate_pipe(
        pipe,
        target_images: List[Image.Image],
        ...
    ):

should be a list of images matching the example prompts. I'm guessing these could be generated by the same pipe if necessary?

@cloneofsimo (Owner, Author):

> should be a list of images matching the example prompts. I'm guessing these could be generated by the same pipe if necessary?

Target images are the reference images, so they are like ground-truth images.

@brian6091 (Collaborator):

> Target images are the reference images, so they are like ground-truth images.

Right, got it. But I imagine when we actually use this, we won't have ground-truth images for all the prompts, so we can generate "ground truth" using the pipe (probably better to use the original model rather than the trained one).

@cloneofsimo (Owner, Author):

These seemed to work very well; I'll add them with an updated example runfile + example dataset.

@cloneofsimo (Owner, Author) commented Jan 5, 2023:

> Right, got it. But I imagine when we actually use this, we won't have ground-truth images for all the prompts, so we can generate "ground truth" using the pipe.

So I was thinking: we have a target subject X to train on, and we test on prompt Y to see how well the model creates an image. The generated image Z should be faithful to both prompt Y and subject X. Those measures, sim(Z, Y) and sim(Z, X), are what we are trying to compute here.

X: <Custom Yellow Clock>
Y: "photo of clock on the tree"
Z: (photo of the custom yellow clock on a tree)

So our only source (ground-truth) images are X, since Y is text and Z is generated with SD. At least that's what I've understood from the textual inversion and CLIP score papers... it's not explained clearly there either, so please correct me if I'm wrong!!
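
To make the two measures concrete, here is a minimal sketch of computing sim(Z, Y) and sim(Z, X) with the Hugging Face transformers CLIP API; the model checkpoint and the averaging over reference images are assumptions, not necessarily what evaluate_pipe does internally:

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    @torch.no_grad()
    def clip_scores(generated_img, reference_imgs, prompt):
        # Embed the generated image Z, the reference images X, and the prompt Y.
        z = model.get_image_features(**processor(images=generated_img, return_tensors="pt"))
        x = model.get_image_features(**processor(images=reference_imgs, return_tensors="pt"))
        y = model.get_text_features(**processor(text=[prompt], return_tensors="pt", padding=True))

        # Cosine similarity on L2-normalized embeddings.
        z = z / z.norm(dim=-1, keepdim=True)
        x = x / x.norm(dim=-1, keepdim=True)
        y = y / y.norm(dim=-1, keepdim=True)

        sim_zy = (z @ y.T).item()           # text alignment: sim(Z, Y)
        sim_zx = (z @ x.T).mean().item()    # image alignment: sim(Z, X), averaged over references
        return sim_zy, sim_zx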

@cloneofsimo (Owner, Author):

I am seeing incredible quality improvements from combining all the latest tricks. In particular, giving a high norm prior yielded much more editability than before.

I'm so proud of this 🤣

[image: output samples]

@cloneofsimo (Owner, Author):

I'll merge this, I guess.
cloneofsimo merged commit b379afd into develop on Jan 5, 2023.
@brian6091 (Collaborator):

> So our only source (ground-truth) images are X, since Y is text and Z is generated with SD.

Ok, this makes sense now. Thanks!

@brian6091 (Collaborator):

> I am seeing incredible quality improvements from combining all the latest tricks.

Amazing! But you can't just drop the image without telling us what tricks you used! And what is the high norm prior???

cloneofsimo deleted the feat/inversion_pp branch on January 5, 2023, 20:20.
@cloneofsimo (Owner, Author) commented Jan 5, 2023:

In this PR I made 5 changes to get it to work:

  1. Multivector initialization, so it's an extended latent. Quite surprisingly, this isn't yet implemented in the HF textual inversion.
  2. Gradient accumulation.
  3. Face-conditioned loss; but for textual inversion we set a high blur amount so that other features are recognized as well.
  4. Norm prior: we place a Gaussian prior on the norm centered at 0.4, so if the norm is too large we project it closer to 0.4 (see the sketch after this list).
  5. Full precision for textual inversion: high precision is needed during inversion. I don't know why, but it seems to be the case.
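
A hedged sketch of how the norm prior in item 4 might be applied; the threshold test and the interpolation strength are assumptions, and the exact rule in the PR may differ:

    import torch

    @torch.no_grad()
    def project_norm(emb: torch.Tensor, target: float = 0.4, strength: float = 0.5) -> torch.Tensor:
        # emb is (num_vectors, dim); compute one norm per embedding vector.
        norms = emb.norm(dim=-1, keepdim=True)
        # Only shrink vectors whose norm exceeds the target, moving them
        # partway ("closer") toward 0.4 rather than clamping exactly.
        new_norms = torch.where(norms > target, norms + strength * (target - norms), norms)
        return emb * (new_norms / norms)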

@brian6091 (Collaborator):

Thanks for the secret sauce. Very clever, the multivector initialization. Does that mean your prompts include all the tokens together?

@hafriedlander (Collaborator):

This has gotten bigger since the last time I looked :). I haven't had time to understand all the changes, but the results speak for themselves. Great work!

@cloneofsimo (Owner, Author):

They have <krk> in the prompts, and it is substituted with the extended tokens (<s1><s2><s3> in my case).
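
For illustration, a minimal sketch of that substitution; the helper name is hypothetical, while the placeholder and token names come from the comment above:

    def expand_placeholder(prompt: str, placeholder: str = "<krk>", num_vectors: int = 3) -> str:
        # Replace the single placeholder with the multivector tokens.
        extended = "".join(f"<s{i + 1}>" for i in range(num_vectors))
        return prompt.replace(placeholder, extended)

    # expand_placeholder("photo of <krk> on the tree")
    # -> "photo of <s1><s2><s3> on the tree"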
