Long context CLIP #876
Replies: 7 comments 1 reply
-
BTW: maybe Long-CLIP is what you need?
-
Moving this to discussions now for reference.
-
I wonder whether replacing the default tokenizer with another one would work?
-
Just wanted to double-check: does this mean that all models in this library have a context length of at most 77?
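If it helps to check concretely: the stock tokenizer pads or truncates everything to the model's context length, so anything past 77 tokens is simply dropped. A minimal sketch with open_clip (the printed shape is what I'd expect from the default tokenizer, not something verified here):

```python
import open_clip

# A caption that is clearly longer than 77 tokens after BPE encoding
long_caption = "a photo of " + "a very ornate antique wooden chair " * 30

# The default tokenize() pads/truncates to context_length=77
tokens = open_clip.tokenize([long_caption])
print(tokens.shape)  # expected: torch.Size([1, 77]) -- the tail of the caption is cut off
```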
-
I think the first thing to solve on this topic is producing a good *open*, large-scale image-text dataset with long captions. Is there anything yet?
If not, it would probably mean running one of the small VLMs over 1B images.
On Tue, Oct 22, 2024, 16:13 Ross Wightman wrote:
@sachinruk <https://github.com/sachinruk> yes, they were trained from
scratch on noisy internet image-text web data (openai wit, laion,
datacomp, dfn, webli) that typically has fairly short captions, so 32-77
tokens is the range here.
Having quality text beyond that range requires either adapting an existing
longer-context LLM as part of a VLM, or, if training from scratch, a LOT of
image-text data with higher-quality, longer captions (which would be a
challenge at billion scale).
This one might be the one exception:
https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k
It was an experiment using an existing text encoder.
Could also fine-tune one of these existing models on a decent-sized
image-text dataset with longer captions and increase the context length...
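For the fine-tuning route, a rough (untested) sketch of what extending the context would involve: stretch the text positional embeddings from 77 to the new length and rebuild the attention mask before fine-tuning on long-caption data. Attribute names below follow open_clip's standard CLIP text tower and may differ for other model variants:

```python
import torch
import torch.nn.functional as F
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')

new_len = 256
old_pos = model.positional_embedding                # [77, width]

# Linearly interpolate the positional table to the new context length
new_pos = F.interpolate(
    old_pos.detach().T.unsqueeze(0),                # [1, width, 77]
    size=new_len, mode='linear', align_corners=False,
).squeeze(0).T                                      # [new_len, width]
model.positional_embedding = torch.nn.Parameter(new_pos)

# The causal attention mask (and the tokenizer's context_length) must also be
# rebuilt to match new_len before fine-tuning on long-caption data.
```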
-
The second very important thing is having tasks that evaluate how well long
text aligns with images.
I am guessing the interest here is in using CLIP as an evaluator or
conditioner for GenAI.
In that case you may want to consider whether you really need full-attention
understanding of the text and of its link with the image.
If you actually don't, then what about cutting your text into N pieces of
size 77 and pooling the resulting embeddings? You can get unlimited context
length that way.
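A minimal sketch of that chunk-and-pool idea with open_clip (mean pooling over L2-normalized chunk embeddings is just one choice, and the word-based chunking below is a simplification; a real version would chunk on tokens):

```python
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

def encode_long_text(text: str, chunk_words: int = 50) -> torch.Tensor:
    # Split the text into pieces that each fit in the 77-token window,
    # encode every piece, then average the normalized embeddings.
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    tokens = tokenizer(chunks)                       # [num_chunks, 77]
    with torch.no_grad():
        feats = model.encode_text(tokens)            # [num_chunks, dim]
        feats = feats / feats.norm(dim=-1, keepdim=True)
    pooled = feats.mean(dim=0)
    return pooled / pooled.norm()
```

Whether mean pooling preserves enough of the long-text meaning for a given use case is exactly the evaluation question above.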
-
Hi,
Is there any CLIP model that supports a context length longer than 77 (ideally >256)?
Is there a reason why the context length is set to 77? Does LAION have alt-texts that are overall too short?
Thanks!