[WIP] Make compel work with SD-XL #41
Conversation
thanks @patrickvonplaten. because people are going to ask - how does this play with >75 tokens? especially considering the need to run
@patrickvonplaten i had a closer look and there are a couple of issues. first of all, this simply won't work:
this would pass all of the Compel syntax markers straight into CLIP, which is not good. instead i think this should be designed as follows:
doing number 2 at least will solve the problem of having to make a separate call to get the pooled outputs, and you'll get the same syntax handling on the pooled outputs as on the non-pooled outputs.
1.) For the pooled embedding vector I don't think it's a problem that all syntax markers are passed, since it's a pooled vector, not a sequential hidden-states vector. E.g. for an input sentence:
where pooled = pooled_vector_of("A cat AND a ball"), compared to pooled = torch.mean([pooled_vector_of("A cat"), pooled_vector_of("A ball")]) or some other "merging" operation. Pooled vectors only need to capture the "general" gist of the input text; they don't need contextualized embeddings for every token in the input. The way I understood it: What do you think?

2.) Regarding sequences that run over the max limit, I'd just truncate / cut them for now. We could revisit later if we think it makes sense, but the original implementation also truncates for now (see here). RE:
I don't fully understand. Do you maybe just want to take over the PR? Feel free to open a new one if this one is too far away from what you were thinking.
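To make the contrast concrete, here is a minimal sketch of the two merging strategies discussed above. `pooled_vector_of` is a hypothetical stand-in for a single CLIP text-encoder pass that returns its pooled output, and the 1280-dim shape is purely illustrative.

```python
import torch

def pooled_vector_of(text: str) -> torch.Tensor:
    # Hypothetical stand-in: in practice this would be one CLIP text-encoder
    # forward pass returning its pooled output for the given text.
    torch.manual_seed(len(text))  # deterministic dummy values, illustration only
    return torch.randn(1, 1280)

# Option A: pool the whole prompt in one pass, syntax markers included;
# the pooled vector only needs to capture the general gist of the text.
pooled_a = pooled_vector_of("A cat AND a ball")

# Option B: pool each sub-prompt separately, then merge, e.g. by averaging.
pooled_b = torch.mean(
    torch.stack([pooled_vector_of("A cat"), pooled_vector_of("A ball")]),
    dim=0,
)

print(pooled_a.shape, pooled_b.shape)  # both remain a fixed [1, 1280]
```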
Note that we're just talking about "long prompt parsing" of the pooled vector, where it's much less clear how we can average multiple pooled vectors. The cross-attention vectors are "long prompt parsed" just like before. By definition we cannot provide a pooled prompt vector of arbitrary length (contrary to the cross-attention vectors).
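A schematic sketch of that distinction: per-token hidden states can be produced window by window and concatenated along the sequence axis, while each encoder pass yields exactly one fixed-size pooled vector, so there is no sequence axis to extend. The 77-token window, the shapes, and `encode_chunk` are illustrative assumptions, not this PR's implementation.

```python
import torch

MAX_TOKENS = 77  # CLIP processes at most this many tokens per forward pass

def encode_chunk(token_ids: torch.Tensor) -> torch.Tensor:
    # Stand-in for one text-encoder pass returning per-token hidden states.
    return torch.randn(1, token_ids.shape[-1], 2048)

# Cross-attention embeddings: split a long prompt into windows, encode each,
# and concatenate the hidden states along the sequence dimension.
long_ids = torch.randint(0, 49408, (1, 200))
chunks = long_ids.split(MAX_TOKENS, dim=-1)
cross_attention = torch.cat([encode_chunk(c) for c in chunks], dim=1)
print(cross_attention.shape)  # (1, 200, 2048) - grows with the prompt

# The pooled vector, by contrast, is a single fixed-size vector per pass;
# "long prompt parsing" it would mean merging several pooled vectors
# (e.g. averaging), which is much less well defined.
```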
This PR allows the Compel library to be used with SD-XL. This is a first proposal.
The reason I went for this design is that the two text encoders even use different tokenizers (they don't match 100% - the pad token ID is different).
The following seems to work (I have not tested it thoroughly):
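The snippet referred to above is not reproduced here. As a rough illustration of the kind of wiring being described, a sketch follows; the list-valued `tokenizer`/`text_encoder` arguments, the `requires_pooled` flag, and the model ID are assumptions about the proposed interface, not confirmed details of this PR.

```python
# Sketch only: argument names and the model ID are assumptions.
import torch
from compel import Compel
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# SD-XL has two tokenizers / text encoders (with different pad token IDs),
# so both are handed to Compel.
compel = Compel(
    tokenizer=[pipe.tokenizer, pipe.tokenizer_2],
    text_encoder=[pipe.text_encoder, pipe.text_encoder_2],
    requires_pooled=[False, True],  # only the second encoder yields the pooled vector
)

conditioning, pooled = compel("a cat playing with a ball++ in the forest")
image = pipe(
    prompt_embeds=conditioning,
    pooled_prompt_embeds=pooled,
    num_inference_steps=30,
).images[0]
```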
For "ball++" I'm getting:

For "ball--" I'm getting: