
[Model] SiglipVisionModel ported from transformers #6942

Merged
merged 24 commits into from
Aug 5, 2024

Conversation

ChristopherCho
Contributor

@ChristopherCho ChristopherCho commented Jul 30, 2024

This PR implements SiglipVisionModel for VLMs.

  • Some pre-trained SiglipVisionModel checkpoints cannot use vLLM's Attention layer.
    Therefore, I implemented alternative attention layers for the cases where vLLM's Attention is not applicable.
  • I tried the vllm_flash_attn backend, but it fails with a CUDA error.
    Thus, only the basic attention mechanism works properly for now.
  • Modified PaliGemma to use the implemented SiglipVisionModel.

FIX #6941
FIX #7144


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337
Member

DarkLight1337 commented Jul 30, 2024

Just saw this. Thanks for the implementation, I'll leave the review to @ywang96 since he worked on PaliGemma.

@ywang96 ywang96 self-assigned this Jul 30, 2024
@jeejeelee
Contributor

The attention calculation can use the MEA (Memory Efficient Attention) ops from xformers; see: https://facebookresearch.github.io/xformers/components/ops.html

@ChristopherCho
Contributor Author

@jeejeelee

The attention calculation can use the MEA (Memory Efficient Attention) ops from xformers; see: https://facebookresearch.github.io/xformers/components/ops.html

Thanks! I added xformers MEA and torch sdpa to give various options.
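
For reference, here is a minimal sketch of how the two alternative backends can be called interchangeably over the image-patch sequence (the tensor layout and the use_xformers flag are illustrative assumptions, not this PR's actual code):

import torch
import torch.nn.functional as F
import xformers.ops as xops


def vit_attention(q, k, v, use_xformers=True):
    # q, k, v: [batch, seq_len, num_heads, head_dim] (assumed layout).
    if use_xformers:
        # xformers memory-efficient attention expects [B, M, H, K].
        return xops.memory_efficient_attention(q, k, v)
    # torch SDPA expects [B, H, M, K], so transpose in and out.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2))
    return out.transpose(1, 2)

Both paths compute plain (non-paged) multi-head attention over the full patch sequence, which is all the ViT encoder needs here.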

@jeejeelee
Contributor

Hi, thank you for your excellent work. I'd like to know if your implementation supports the following model:

import timm

model = timm.create_model(
    "vit_so400m_patch14_siglip_384.webli",
    pretrained=False,
    num_classes=0,
    dynamic_img_size=True,
    dynamic_img_pad=True,
)

@ChristopherCho
Contributor Author

@jeejeelee
Hi, I believe the pre-trained SigLIP model vit_so400m_patch14_siglip_384.webli is the same as the one on Hugging Face.
Also, as far as I know, the pre-trained PaliGemma model uses the same pre-trained SigLIP vision encoder.

When I loaded the pre-trained PaliGemma model with the following code, I could successfully load it and run inference.

import requests
from PIL import Image
from vllm import LLM, SamplingParams

model_id = "google/paligemma-3b-mix-224"
prompt = "What is on the flower?"
image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
image = Image.open(requests.get(image_file, stream=True).raw)

llm = LLM(model=model_id)
sampling_params = SamplingParams(
    temperature=0.0
)

input_dict = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    }
}
outputs = llm.generate(input_dict, sampling_params)

print(outputs[0].outputs[0].text)

So, since the pre-trained PaliGemma model works fine, the pre-trained SigLIP vision model should work as well.
But there are some caveats:

  1. Since the current version of the code only supports VLMs, it does not contain any of the textual parts of the SigLIP model (e.g. SiglipTextTransformer, SiglipTextModel). Therefore, the full SigLIP model cannot be loaded with this code.
  2. The aforementioned SigLIP model is exactly the case in which vLLM's attention cannot be used: the head size is 72 (the model's hidden size is 1152 with 16 heads), which is not among the supported head sizes.
    Since I implemented fallback logic for this case, it works fine but does not use vLLM's paged attention (see the sketch below).
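
A minimal sketch of the kind of head-size check behind this fallback (the list of supported sizes below is an assumption for illustration; vLLM's attention backend exposes the authoritative list, e.g. via PagedAttention.get_supported_head_sizes()):

# Illustrative only -- not this PR's actual code.
# Decide whether vLLM's paged attention can be used for the SigLIP ViT,
# falling back to a plain attention implementation otherwise.
ASSUMED_SUPPORTED_HEAD_SIZES = [64, 80, 96, 112, 128, 256]  # illustrative list

hidden_size = 1152
num_heads = 16
head_dim = hidden_size // num_heads  # 72 for the so400m SigLIP checkpoint

if head_dim in ASSUMED_SUPPORTED_HEAD_SIZES:
    backend = "vllm_paged_attention"
else:
    backend = "basic_attention_fallback"  # the path taken when head_dim == 72

print(head_dim, backend)  # -> 72 basic_attention_fallback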

@ywang96
Member

ywang96 commented Aug 1, 2024


Hey @ChristopherCho! Thank you very much for this PR and I really appreciate it - will review this tonight!

Member

@ywang96 ywang96 left a comment

Hey @ChristopherCho - Thank you very much for the PR!

I took a first pass and left some comments. Mostly, I'm wondering whether we should really use vLLM's attention module in the ViT when it's only run once per sequence; I suggest simply using the attention modules from transformers for now.

Comment on lines 262 to 268
self.attn = Attention(
    self.num_heads,
    self.head_dim,
    self.scale,
    cache_config=cache_config,
    quant_config=quant_config,
)
Member

Currently, for CLIPVisionModel, we don't use vLLM's internal Attention since the ViT encoder only runs once per sequence at prefill time, so I don't think there's much value in leveraging a KV cache for this.

Have you seen a significant performance speedup using vLLM Attention compared to transformers Attention? If not, I think we'd rather just use the one from transformers for simplicity for now, since this is not the major bottleneck in the whole inference pipeline.

Contributor Author

@ChristopherCho ChristopherCho Aug 5, 2024

Indeed, there were no significant improvements from using vLLM Attention.
I believe it is for the reason you mentioned: it does not leverage the advantages of a KV cache.
I removed the vLLM Attention part but kept the history at bb570c3 for future reference.

Comment on lines 311 to 310
# We only do sharding for language model and
# not vision model for now.
use_default_weight_loading = True
Member

Please revert this for now - if we're going to apply TP to the vision tower, we should do it in a separate PR together with CLIPVisionModel. Ideally, we should not apply an infrastructure change to only one model.

Contributor Author

Reverted via dee55d0

Comment on lines 441 to 446
SIGLIP_ATTENTION_CLASSES = {
    "eager": SiglipAttention,
    "flash_attention_2": SiglipFlashAttention2,
    "sdpa": SiglipSdpaAttention,
    "xformers": SiglipxFormersAttention,
}
Member

I really appreciate that you went out and implemented these (regardless of whether we're going to use them or not)!

(Two further review comments on vllm/model_executor/models/siglip.py were marked outdated and resolved.)
@ywang96 ywang96 mentioned this pull request Aug 2, 2024
ChristopherCho and others added 2 commits August 5, 2024 10:01
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
@ywang96
Member

ywang96 commented Aug 5, 2024

@ChristopherCho Now that we've merged #7020, I think there's some benefit to enabling TP for this model, given that SigLIP accounts for 400M of the 3B parameters in the PaliGemma model.

However, could you take a look at the implementation of attention here instead of using the vLLM attention? The latter creates a KV cache that I don't think we should be using for a ViT that only runs at the prefill stage.
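
For the TP follow-up, a rough sketch of what parallelizing the SigLIP attention projections with vLLM's TP linear layers might look like (the constructor arguments are assumptions based on how other vLLM models use these layers, and this requires the tensor-parallel runtime to already be initialized):

from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               RowParallelLinear)

hidden_size = 1152
num_heads = 16
head_dim = hidden_size // num_heads

# Fused, column-parallel QKV projection, sharded across TP ranks.
qkv_proj = QKVParallelLinear(
    hidden_size=hidden_size,
    head_size=head_dim,
    total_num_heads=num_heads,
    bias=True,
)

# Row-parallel output projection, all-reduced across TP ranks.
out_proj = RowParallelLinear(
    input_size=hidden_size,
    output_size=hidden_size,
    bias=True,
)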

@ChristopherCho
Contributor Author

@ywang96 I've removed vLLM Attention for now and utilized the basic attention mechanism, which is just the same as the SiglipAttention.

By the way, I've found that you mentioned the xformers MEA which is implemented here.
Would it be better to utilize this as the baseline attention mechanism? I've tested both the basic attention mechanism and the MEA and found that the MEA was a little bit slower in my environment.

@ywang96
Member

ywang96 commented Aug 5, 2024

@ywang96 I've removed vLLM Attention for now and utilized the basic attention mechanism, which is just the same as the SiglipAttention.

By the way, I've found that you mentioned the xformers MEA which is implemented here. Would it be better to utilize this as the baseline attention mechanism? I've tested both the basic attention mechanism and the MEA and found that the MEA was a little bit slower in my environment.

Let's use the default MHA implementation for this PR: I think if you use MEA, then we would necessarily need to apply TP to the attention block (since it's using the vLLM TP layers). We can leave a TODO here and do the TP in a later PR!

@ChristopherCho
Contributor Author

@ywang96

Okay, I've changed the code to use transformers' SiglipAttention for now, but kept the TP versions as a TODO in this PR.
By the way, thank you for your comments!
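
For context, a minimal sketch of reusing the attention module shipped with transformers for the vision encoder (the config values are illustrative; the real model builds them from the checkpoint's SiglipVisionConfig):

import torch
from transformers import SiglipVisionConfig
from transformers.models.siglip.modeling_siglip import SiglipAttention

# Illustrative config: hidden size 1152 with 16 heads gives head_dim == 72.
config = SiglipVisionConfig(hidden_size=1152, num_attention_heads=16)
attn = SiglipAttention(config)

# One image's patch embeddings, e.g. 729 patches for a 384x384 image with
# patch size 14; plain MHA over the full sequence, no KV cache involved.
hidden_states = torch.randn(1, 729, config.hidden_size)
attn_output, _ = attn(hidden_states)
print(attn_output.shape)  # torch.Size([1, 729, 1152])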

@ywang96
Member

ywang96 commented Aug 5, 2024

/ready

@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 5, 2024
@ywang96
Member

ywang96 commented Aug 5, 2024

Overall LGTM! I will just need to test this PR locally myself to make sure everything works fine!

Member

@ywang96 ywang96 left a comment

@ChristopherCho I've made a few more changes to this PR afterwards and verified it works with both TP=1 and TP>1. Thank you again for working on this!

@ywang96 ywang96 enabled auto-merge (squash) August 5, 2024 05:17
@ywang96 ywang96 merged commit c0d8f16 into vllm-project:main Aug 5, 2024
67 checks passed
sfc-gh-mkeralapura pushed a commit to sfc-gh-mkeralapura/vllm that referenced this pull request Aug 12, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Labels
ready ONLY add when PR is ready to merge/full CI is needed