Issue with number of tokens for CoCa #458
Hi @vturrisi, so there are two things: the output from `_encode_image` in CoCa is already passed through attentional pooling, so in principle it might be different from the corresponding output in CLIP; it is defined in the coca config. On the other hand, in the CoCa paper it might be that this should be 256 and I made a small mistake in the config, I have to check; however, that would be a coincidence.
@gpucce @vturrisi the seq len of the output of the attention pooler is determined by the size of the latent query, which is 256 for CoCa in the current config. Technically there is no longer a 'class' token and 'spatial' tokens, but the output is still split like that, so one of the 256 tokens is treated as the pooled token and the remaining as the embeds. Not sure if this was the intention, but you no longer have the original sequence length after the pooler.
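For anyone following along, here is a minimal sketch of the kind of attentional pooler being described. It is not the open_clip implementation, and the class and argument names are made up for illustration; the point is only that the output length is fixed by the number of learned latent queries, not by the input sequence length.

```python
import torch
import torch.nn as nn

class ToyAttentionalPooler(nn.Module):
    """Toy pooler: output length == number of learned latent queries."""
    def __init__(self, d_model=768, n_queries=256, n_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):  # x: (batch, seq_len, d_model) image tokens
        q = self.query.unsqueeze(0).expand(x.shape[0], -1, -1)
        out, _ = self.attn(q, x, x)
        return out  # (batch, n_queries, d_model)

tokens = torch.randn(2, 50, 768)   # e.g. ViT-B/32 @ 224: 49 patches + cls token
out = ToyAttentionalPooler()(tokens)
print(out.shape)                   # torch.Size([2, 256, 768])
# current single-pooler split: token 0 -> contrastive "pooled",
# tokens 1..255 -> embeds; the original 50-token length is gone
pooled, embeds = out[:, 0], out[:, 1:]
```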
@rwightman @vturrisi this was the intention, because the embedding that makes sense to use for contrastive downstream tasks with CoCa is the one output by the pooler. The only detail I am not 100% sure about is whether that should have been 257 to match the CoCa paper exactly.
@gpucce to match the paper, looking at it, it's not a matter of making it 257; it's a matter of needing two separate poolers, one for the contrastive token with n=1 and the other for captioning with n=256. Regardless of the length, the behaviour is different if you use one pooler vs two (the paper uses two). And in any case, there is still a disconnect between the original sequence length, as determined by the image tokens, and the output length after the pooler.
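A rough structural sketch of the two options, reusing the toy pooler class from the sketch above; the module names are hypothetical and this is not the proposed open_clip code.

```python
import torch.nn as nn

class TwoPoolerHead(nn.Module):
    """Paper-style: separate poolers with their own queries and projections."""
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.contrastive_pool = ToyAttentionalPooler(d_model, n_queries=1, n_heads=n_heads)
        self.caption_pool = ToyAttentionalPooler(d_model, n_queries=256, n_heads=n_heads)

    def forward(self, x):
        pooled = self.contrastive_pool(x)[:, 0]  # (batch, d_model) for the contrastive loss
        embeds = self.caption_pool(x)            # (batch, 256, d_model) for captioning
        return pooled, embeds

class OnePoolerHead(nn.Module):
    """Current behaviour: one pooler whose first output doubles as the pooled token."""
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.pool = ToyAttentionalPooler(d_model, n_queries=256, n_heads=n_heads)

    def forward(self, x):
        out = self.pool(x)
        return out[:, 0], out[:, 1:]             # 1 pooled token + 255 embeds
```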
@rwightman ah, now I see, I made a mistake reading the paper; I thought it worked the way I wrote it.
@gpucce hmm, yeah, that's a pickle. It obviously still works fairly well (not abnormal in the wonderful world of DL / AI). We should add support for two poolers (n=1 and n=256), keeping the current single one as an option for weight compat for now, and do a comparison. I think if anything it'd improve the contrastive behaviour more than the captioning.
@rwightman ok, indeed I was thinking about it. I believe that two poolers and one pooler with one extra query are the same, except for the shared linear layer inside MultiheadAttention.
I don't see how they'd be equivalent with the softmax there...
@rwightman maybe I am just in denial; however, each row of the attention is one query's dot product with all keys, the softmax is taken over each row, and each output vector is the weighted sum of all values based on one attention row. Since the keys and values would be the same in both poolers, this should make one pooler with one extra query the same as two poolers. However, the linear layer after attention that multiplies all the outputs could be messing things up, so we need to do as you said anyway.
@gpucce @rwightman thanks for the fast response. I didn't realize that in the paper they did this attention pooling, and I wanted to play around with the individual visual tokens. Even though the image tokens are not used directly, I think they can still be useful, no? When the fix comes, is there also a way to expose these tokens to the user?
@gpucce right, yeah, chunking q is fine and done in other situations, so this should be as well … I mixed up my dim for the softmax. But yeah, the projections being shared is definitely a difference between this and the paper; the question is how much it matters, it might not be that significant…
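The row-independence of the softmax can be checked directly: with a single shared `nn.MultiheadAttention`, pooling with 257 queries in one call matches pooling with 1 and 256 queries in two calls, so the only real difference from the paper's two poolers is that the projection weights are shared rather than separate. A small self-contained check with arbitrary shapes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

kv = torch.randn(2, 50, 768)    # image tokens used as keys/values
q = torch.randn(2, 257, 768)    # 1 "contrastive" query + 256 "caption" queries

with torch.no_grad():
    joint, _ = attn(q, kv, kv)             # one pooler, 257 queries
    part_a, _ = attn(q[:, :1], kv, kv)     # split: contrastive query only
    part_b, _ = attn(q[:, 1:], kv, kv)     # split: caption queries only
    split = torch.cat([part_a, part_b], dim=1)

# softmax is taken per query row, so chunking the queries changes nothing
print(torch.allclose(joint, split, atol=1e-5))  # True (up to float error)
```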
@vturrisi I am planning a PR that should improve the huggingface integration, plus a few other changes. I will add this in as soon as I start working on it and will tag you.
Sounds good!
Hi @gpucce, I think there might be a misunderstanding regarding the dual poolers. When there are two poolers, each contains its own MultiheadAttention component with its own set of projection layers, so the parameters of those projection layers are different and serve distinct purposes. The contrastive pooler has a single learnable token to extract information from the keys and values for contrastive learning, while the caption pooler has 256 learnable tokens to handle the captioning task. Given their entirely different objectives, their parameters are expected to vary significantly. Currently, a single pooler with 256 learnable tokens is used, where one token is taken for contrastive learning and the rest for captioning; this setup might lead to suboptimal results, or perhaps have no impact at all, it's hard to say for certain without further testing. This is my understanding of the paper. If you have time, you might want to experiment with this setup. Thank you for your contribution! Warm regards,
Hi @rwightman, I've been following your discussion and wanted to share my thoughts as well. I believe that having two poolers might result in a noticeable change in performance metrics. I'm looking forward to the new version of CoCa! Best regards,
I wonder how to extract 768-dim patch/local features from CoCa for downstream tasks. Should I use the attn_pool (for captioning) to get a (256, 768) output?
Hey,
When calling `_encode_image` from CoCa, it should return two tensors: the image-level features (cls token / global avg) and the individual token features, i.e. `(image_size / 14) ** 2` tokens, right? However, it's only returning 255 tokens, so it seems like there's a token missing. I've attached a minimal example below.
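The attached example is not reproduced above; the following is a rough sketch of the kind of check being described, assuming the `coca_ViT-L-14` config and that `_encode_image` returns the pooled features together with the token embeddings, as discussed in this thread.

```python
import torch
import open_clip

# hypothetical choice of CoCa config; no pretrained weights needed just to see the shapes
model, _, transform = open_clip.create_model_and_transforms("coca_ViT-L-14")
model.eval()

image = torch.randn(1, 3, 224, 224)  # dummy tensor in place of a real preprocessed image

with torch.no_grad():
    image_features, image_tokens = model._encode_image(image)

# With 14-pixel patches and a 224x224 input one might expect
# (224 / 14) ** 2 = 256 token features, but the attentional pooler fixes the
# output length to its number of queries, and one of those outputs is split
# off as the pooled token, leaving 255 token embeddings.
print(image_features.shape, image_tokens.shape)
```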