Issue with number of tokens for CoCa #458
Hi @vturrisi, so there are two things: the output from `_encode_image` in CoCa is already passed through attentional pooling, so in principle it might be different from the corresponding output in CLIP; it is defined in the coca config. On the other hand, in the CoCa paper it might be that this should be 256 and I made a small mistake in the config, I have to check; however, that would be a coincidence.
@gpucce @vturrisi the seq len of the output of the attention pooler is determined by the size of the latent query, which is 256 for CoCa in the current config. Technically there is no longer a 'class' token and 'spatial' tokens, but the output is still split like that, so one of the 256 tokens is treated as the pooled token and the remaining as the embeds. Not sure if this was the intention, but you no longer have the original sequence length after the pooler.
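For anyone following along, here is a minimal sketch of the kind of attentional pooler being described. It is not the open_clip implementation, and the class and argument names are made up for illustration; the point is only that the output length is fixed by the number of learned latent queries, not by the input sequence length.

```python
import torch
import torch.nn as nn

class ToyAttentionalPooler(nn.Module):
    """Toy pooler: output length == number of learned latent queries."""
    def __init__(self, d_model=768, n_queries=256, n_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):  # x: (batch, seq_len, d_model) image tokens
        q = self.query.unsqueeze(0).expand(x.shape[0], -1, -1)
        out, _ = self.attn(q, x, x)
        return out  # (batch, n_queries, d_model)

tokens = torch.randn(2, 50, 768)   # e.g. ViT-B/32 @ 224: 49 patches + cls token
out = ToyAttentionalPooler()(tokens)
print(out.shape)                   # torch.Size([2, 256, 768])
# current single-pooler split: token 0 -> contrastive "pooled",
# tokens 1..255 -> embeds; the original 50-token length is gone
pooled, embeds = out[:, 0], out[:, 1:]
```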
@rwightman @vturrisi this was the intention, because the embedding that makes sense to use for contrastive downstream tasks with CoCa is the one output by the pooler. The only detail I am not 100% sure about is whether that should have been 257 to match the CoCa paper exactly.
@gpucce to match the paper, looking at it, it's not a matter of making it 257; it's a matter of needing two separate poolers, one for the contrastive token with n=1 and the other for captioning with n=256. Regardless of the length, the behaviour is different if you use one pooler vs two (the paper uses two). And in any case, there is still a disconnect between the original sequence length, as determined by the image tokens, and the output length after the pooler.
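A rough structural sketch of the two options, reusing the toy pooler class from the sketch above; the module names are hypothetical and this is not the proposed open_clip code.

```python
import torch.nn as nn

class TwoPoolerHead(nn.Module):
    """Paper-style: separate poolers with their own queries and projections."""
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.contrastive_pool = ToyAttentionalPooler(d_model, n_queries=1, n_heads=n_heads)
        self.caption_pool = ToyAttentionalPooler(d_model, n_queries=256, n_heads=n_heads)

    def forward(self, x):
        pooled = self.contrastive_pool(x)[:, 0]  # (batch, d_model) for the contrastive loss
        embeds = self.caption_pool(x)            # (batch, 256, d_model) for captioning
        return pooled, embeds

class OnePoolerHead(nn.Module):
    """Current behaviour: one pooler whose first output doubles as the pooled token."""
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.pool = ToyAttentionalPooler(d_model, n_queries=256, n_heads=n_heads)

    def forward(self, x):
        out = self.pool(x)
        return out[:, 0], out[:, 1:]             # 1 pooled token + 255 embeds
```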
@rwightman ah, now I see, I made a mistake reading the paper; I thought it worked the way I wrote it.
@gpucce hmm, yeah, that's a pickle. It obviously still works fairly well (not abnormal in the wonderful world of DL / AI). We should add support for two poolers (n=1 and n=256), keeping the current single one as an option for weight compat for now, and do a comparison. I think if anything it'd improve the contrastive behaviour more than the captioning.
@rwightman ok, indeed I was thinking about it. I believe that two poolers and one pooler with one extra query are the same, except for the shared linear layer inside MultiheadAttention.
I don't see how they'd be equivalent with the softmax there...
@rwightman maybe I am just in denial; however, each row of the attention is one query's dot product with all keys, the softmax is taken over each row, and each output vector is the weighted sum of all values based on one attention row. Since the keys and values would be the same in both poolers, this should make one pooler with one extra query the same as two poolers. However, the linear layer after attention that multiplies all the outputs could be messing things up, so we need to do as you said anyway.
@gpucce @rwightman thanks for the fast response. I didn't realize that in the paper they did this attention pooling, and I wanted to play around with the individual visual tokens. Even though the image tokens are not used directly, I think they can still be useful, no? When the fix comes, is there also a way to expose these tokens to the user?
@gpucce right, yeah, chunking q is fine and done in other situations, so this should be as well … I mixed up my dim for the softmax. But yeah, the projections being shared is definitely a difference between this and the paper; the question is how much it matters, it might not be that significant…
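The row-independence of the softmax can be checked directly: with a single shared `nn.MultiheadAttention`, pooling with 257 queries in one call matches pooling with 1 and 256 queries in two calls, so the only real difference from the paper's two poolers is that the projection weights are shared rather than separate. A small self-contained check with arbitrary shapes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

kv = torch.randn(2, 50, 768)    # image tokens used as keys/values
q = torch.randn(2, 257, 768)    # 1 "contrastive" query + 256 "caption" queries

with torch.no_grad():
    joint, _ = attn(q, kv, kv)             # one pooler, 257 queries
    part_a, _ = attn(q[:, :1], kv, kv)     # split: contrastive query only
    part_b, _ = attn(q[:, 1:], kv, kv)     # split: caption queries only
    split = torch.cat([part_a, part_b], dim=1)

# softmax is taken per query row, so chunking the queries changes nothing
print(torch.allclose(joint, split, atol=1e-5))  # True (up to float error)
```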
@vturrisi I am planning a PR that should improve the huggingface integration, plus a few other changes. I will add this in as soon as I start working on it and will tag you.
Sounds good!
Hi @gpucce, I think there might be a misunderstanding regarding the dual poolers. When there are two poolers, each contains its own MultiheadAttention component with its own set of projection layers, so the parameters of those projection layers are different and serve distinct purposes. The contrastive pooler has a single learnable token to extract information from the keys and values for contrastive learning, while the caption pooler has 256 learnable tokens to handle the captioning task. Given their entirely different objectives, their parameters are expected to vary significantly. Currently, a single pooler with 256 learnable tokens is used, where one token is taken for contrastive learning and the rest for captioning; this setup might lead to suboptimal results, or perhaps have no impact at all, it's hard to say for certain without further testing. This is my understanding of the paper. If you have time, you might want to experiment with this setup. Thank you for your contribution! Warm regards,
Hi @rwightman, I've been following your discussion and wanted to share my thoughts as well. I believe that having two poolers might result in a noticeable change in performance metrics. I'm looking forward to the new version of CoCa! Best regards,
I wonder how to extract 768-dim patch/local features from CoCa for downstream tasks. Should I use the attn_pool (for captioning) to get a (256, 768) output?
Hey,
When calling `_encode_image` from CoCa, it should return two tensors: the image-level features (cls token / global avg) and the individual token features, i.e. `(image_size / 14) ** 2` tokens, right? However, it's only returning 255 tokens, so it seems like there's a token missing. I've attached a minimal example below.
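The attached example is not reproduced above; the following is a rough sketch of the kind of check being described, assuming the `coca_ViT-L-14` config and that `_encode_image` returns the pooled features together with the token embeddings, as discussed in this thread.

```python
import torch
import open_clip

# hypothetical choice of CoCa config; no pretrained weights needed just to see the shapes
model, _, transform = open_clip.create_model_and_transforms("coca_ViT-L-14")
model.eval()

image = torch.randn(1, 3, 224, 224)  # dummy tensor in place of a real preprocessed image

with torch.no_grad():
    image_features, image_tokens = model._encode_image(image)

# With 14-pixel patches and a 224x224 input one might expect
# (224 / 14) ** 2 = 256 token features, but the attentional pooler fixes the
# output length to its number of queries, and one of those outputs is split
# off as the pooled token, leaving 255 token embeddings.
print(image_features.shape, image_tokens.shape)
```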