Issue with number of tokens for CoCa #458

vturrisi opened this issue Mar 6, 2023 · 16 comments

Comments

@vturrisi

vturrisi commented Mar 6, 2023

Hey,
When calling _encode_image from CoCa, it should return two tensors: the image-level features (cls token / global average) and the individual token features, i.e. (image_size / 14) ** 2 tokens, right? However, it only returns 255 tokens, so it seems like a token is missing. I've attached a minimal example below.

import open_clip
import timm
from PIL import Image

(
    model,
    _,
    val_transform,
) = open_clip.create_model_and_transforms(
    "coca_ViT-L-14",
    pretrained="laion2B-s13B-b90k",
    precision="amp",
)
model.to("cuda")
model.eval()

img = Image.new("RGB", (224, 224))
img = val_transform(img).unsqueeze(0).to("cuda")

latent, tokens = model._encode_image(img)
print(latent.shape)  # torch.Size([1, 768])
print(tokens.shape)  # torch.Size([1, 255, 768])

model = timm.create_model("vit_large_patch14_224_clip_laion2b", pretrained=True)
model.to("cuda")
model.eval()
tokens = model.forward_features(img)
print(tokens.shape)  # torch.Size([1, 257, 1024]): cls token + 256 patch tokens (224/14 * 224/14)
@gpucce
Contributor

gpucce commented Mar 7, 2023

Hi @vturrisi, there are two things here. The output of _encode_image in CoCa has already been passed through attentional pooling, so in principle it can differ from the equivalent output in CLIP; the pooler is defined in the CoCa config.

On the other hand, it might be that to match the CoCa paper this should be 256 and I made a small mistake in the config; I have to check. Even then, matching the number of image patches would be a coincidence.

@rwightman
Collaborator

@gpucce @vturrisi the sequence length of the attention pooler's output is determined by the size of the latent query, which is 256 for CoCa in the current config. The result is technically no longer a 'class' token plus 'spatial' tokens, but it is still split that way: one of the 256 tokens is treated as the pooled embedding and the remaining 255 as the token embeds. Not sure if this was the intention, but you no longer have the original sequence length after the pooler.
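
For illustration, here is a minimal sketch (not the open_clip implementation; module and variable names are made up) of how a single attentional pooler with 256 learnable queries produces the shapes reported above: the output length equals the number of latent queries, and the current code then treats query 0 as the pooled token and queries 1..255 as the token embeds.

import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    def __init__(self, dim=768, n_queries=256, n_heads=8):
        super().__init__()
        # learnable latent queries; their count fixes the output sequence length
        self.query = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x):  # x: (B, n_image_tokens, dim)
        q = self.query.unsqueeze(0).expand(x.shape[0], -1, -1)
        out, _ = self.attn(q, x, x)  # cross-attend queries over image tokens
        return out  # (B, n_queries, dim), independent of n_image_tokens

pooler = AttentionalPooler()
image_tokens = torch.randn(1, 257, 768)  # e.g. cls + 16*16 patch tokens
pooled_seq = pooler(image_tokens)        # (1, 256, 768)
latent, tokens = pooled_seq[:, 0], pooled_seq[:, 1:]
print(latent.shape, tokens.shape)        # (1, 768) and (1, 255, 768)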

@gpucce
Contributor

gpucce commented Mar 7, 2023

@rwightman @vturrisi this was the intention, because the embedding that makes sense to use for contrastive downstream tasks with CoCa is the one output by the pooler.

The only detail I am not 100% sure about is whether that should have been 257 to match the CoCa paper exactly.

@rwightman
Collaborator

@gpucce looking at the paper, matching it is not a matter of making it 257; it's a matter of needing two separate poolers, one for the contrastive token with n=1 and the other for captioning with n=256. Regardless of the length, the behaviour differs between one pooler and two (the paper uses two). And in any case, there is still a disconnect between the original sequence length, as determined by the image tokens, and the output length after the pooler…
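
Reusing the hypothetical AttentionalPooler sketch from the previous comment, the paper's layout would look roughly like this: two independent poolers over the same image tokens, one with a single query for the contrastive embedding and one with 256 queries for the caption decoder (names are illustrative, not open_clip module names).

contrastive_pool = AttentionalPooler(dim=768, n_queries=1)
caption_pool = AttentionalPooler(dim=768, n_queries=256)

image_tokens = torch.randn(1, 257, 768)
latent = contrastive_pool(image_tokens)[:, 0]  # (1, 768) contrastive embedding
tokens = caption_pool(image_tokens)            # (1, 256, 768) fed to the decoder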

@gpucce
Contributor

gpucce commented Mar 7, 2023

@rwightman ah, now I see. I made a mistake reading the paper; I thought it worked the way I wrote it.

@rwightman
Collaborator

@gpucce hmm, yeah, that's a pickle. It obviously still works fairly well (not abnormal in the wonderful world of DL / AI). We should add support for two poolers (n=1 and n=256), keep the current single pooler as an option for weight compatibility for now, and do a comparison. I think if anything it'd improve the contrastive behaviour more than the captioning…

@gpucce
Contributor

gpucce commented Mar 7, 2023

@rwightman ok, indeed I was thinking about it. I believe two poolers and one pooler with one extra query are the same, except for the shared output linear layer inside MultiheadAttention.

@rwightman
Collaborator

rwightman commented Mar 7, 2023 via email

@gpucce
Contributor

gpucce commented Mar 7, 2023

> I don't see how they'd be equivalent with the softmax there...

@rwightman maybe I am just in denial. However, each row of the attention matrix is one query's dot product with all keys; softmax is taken over each row, and each output vector is then a weighted sum of all values based on one attention row.

Since keys and values would be the same in both poolers, this should make one pooler with one extra query the same as two poolers. However, the linear layer after attention, which multiplies all the outputs, could be messing things up, so we need to do as you said anyway.
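
A quick numerical check of the point above, using plain single-head attention without any projections: because softmax is taken per query row, running one extra query together with the other 256 gives the same outputs as running them through two separate attention calls with the same keys and values. The difference between one pooler and two therefore comes down to the shared vs. separate projection layers.

import torch

torch.manual_seed(0)
dim = 64
kv = torch.randn(1, 257, dim)     # image tokens (keys and values)
q_all = torch.randn(1, 257, dim)  # 1 "contrastive" query + 256 "caption" queries

def attend(q, k, v):
    scores = q @ k.transpose(-2, -1) / dim ** 0.5
    return scores.softmax(dim=-1) @ v  # softmax over keys, independently per query

joint = attend(q_all, kv, kv)               # one pooler, 257 queries at once
split_first = attend(q_all[:, :1], kv, kv)  # "contrastive pooler", 1 query
split_rest = attend(q_all[:, 1:], kv, kv)   # "caption pooler", 256 queries

print(torch.allclose(joint[:, :1], split_first, atol=1e-6))  # True
print(torch.allclose(joint[:, 1:], split_rest, atol=1e-6))   # True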

@vturrisi
Author

vturrisi commented Mar 7, 2023

@gpucce @rwightman thanks for the fast response. I didn't realize the paper did this attentional pooling; I wanted to play around with the individual visual tokens. Even though the image tokens are not used directly, I think they can still be useful, no? When the fix comes, is there also a way to expose these tokens to the user?
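
In the meantime, one possible workaround (not an official API) is to grab the visual tokens before the pooler with a forward hook, continuing from the example in the first comment. This assumes the pooler is exposed as model.visual.attn_pool; check print(model.visual) for the actual module name in your open_clip version.

import torch

captured = {}

def grab_pre_pool_tokens(module, inputs, output):
    # inputs[0] is the token sequence fed into the attentional pooler
    captured["tokens"] = inputs[0].detach()

# assumption: the pooler lives at model.visual.attn_pool
handle = model.visual.attn_pool.register_forward_hook(grab_pre_pool_tokens)
with torch.no_grad():
    model._encode_image(img)
handle.remove()

# roughly (1, 257, vit_width); exact shape depends on the open_clip version
print(captured["tokens"].shape)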

@rwightman
Collaborator

@gpucce right, yeah, chunking q is fine and is done in other situations, so this should be as well… I mixed up my dim for the softmax. But yeah, the projections being shared is definitely a difference between this and the paper; the question is how much it matters, might not be that significant…

@gpucce
Contributor

gpucce commented Mar 8, 2023

@vturrisi I am planning a PR that should improve the Hugging Face integration along with a few other changes. I will add this in as soon as I start working on it and will tag you.

@vturrisi
Author

vturrisi commented Mar 8, 2023

Sounds good!

@MrPetrichor

> > I don't see how they'd be equivalent with the softmax there...
>
> @rwightman maybe I am just in denial. However, each row of the attention matrix is one query's dot product with all keys; softmax is taken over each row, and each output vector is then a weighted sum of all values based on one attention row.
>
> Since keys and values would be the same in both poolers, this should make one pooler with one extra query the same as two poolers. However, the linear layer after attention, which multiplies all the outputs, could be messing things up, so we need to do as you said anyway.

Hi gpucce,

I think there might be a misunderstanding regarding the dual poolers. When there are two poolers involved, it implies that both poolers contain a MultiheadAttention component, each with its own set of projection layers. Therefore, the parameters of these projection layers in the two poolers are different and serve distinct purposes. The Contrastive Pooler is designed with a learnable token to extract information from the keys and values for contrastive learning. On the other hand, the Caption Pooler is equipped with 256 learnable tokens to handle captioning tasks. Given their entirely different objectives, their parameters are expected to vary significantly.

Currently, if the same pooler with 256 learnable tokens is used, with one token for contrastive learning and the rest for captioning, this setup might lead to suboptimal results or perhaps have no impact at all; it's hard to say for certain without further testing. This is my understanding of the paper. If you have time, you might want to experiment with this setup. Thank you for your contribution!

Warm regards,

@MrPetrichor

> @gpucce right, yeah, chunking q is fine and is done in other situations, so this should be as well… I mixed up my dim for the softmax. But yeah, the projections being shared is definitely a difference between this and the paper; the question is how much it matters, might not be that significant…

Hi rwightman,

I've been following your discussion and wanted to share my thoughts as well. I believe that having two poolers might result in a noticeable change in performance metrics. I'm looking forward to the new version of CoCa!

Best regards,

@Arsiuuu

Arsiuuu commented Sep 30, 2024

I wonder how to extract 768-dim patch/local features from CoCa for downstream tasks. Should I use the attn_pool (for captioning) to get (256, 768)?
