[Feature]: Output state configuration of vision encoder In VLM #9186

litianjian · 2024-10-09T08:50:00Z

Anything you want to discuss about vllm.

When SigLIP or CLIP is used as the vision encoder of a multimodal model, there are several cases for which output state gets passed on:

- The output state of an intermediate layer is used, without layer normalization
- The output state of the last layer is used, without layer normalization
- The output state of the last layer is used, with layer normalization

For example, the LLaVA-Next code implementation does not use post_layernorm.
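To make the three cases concrete, here is a minimal sketch of what such a configuration could look like. It is a toy in plain Python, not vLLM code: `ToyVisionEncoder`, `feature_layer` and `apply_post_layernorm` are hypothetical names chosen for illustration, and the layer norm has no learned weights.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Plain layer normalization without learned scale or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


@dataclass
class ToyVisionEncoder:
    """Hypothetical stand-in for a CLIP/SigLIP vision tower."""

    layers: List[Callable[[Vector], Vector]]

    def forward(
        self,
        x: Vector,
        feature_layer: int = -1,            # hypothetical: -1 = last layer, -2 = second to last
        apply_post_layernorm: bool = True,  # hypothetical switch for the final norm
    ) -> Vector:
        num_layers = len(self.layers)
        # Convert a negative index into the number of layers to run.
        depth = feature_layer + num_layers + 1 if feature_layer < 0 else feature_layer
        if not 1 <= depth <= num_layers:
            raise ValueError(f"feature_layer {feature_layer} is out of range")
        # Only run the layers that are needed for the requested output state.
        for layer in self.layers[:depth]:
            x = layer(x)
        return layer_norm(x) if apply_post_layernorm else x


# Three toy "layers" that each add 1.0 to every element.
encoder = ToyVisionEncoder(layers=[lambda v: [e + 1.0 for e in v]] * 3)
inputs = [0.0, 2.0]

# Case 1: intermediate layer, no layer normalization.
intermediate = encoder.forward(inputs, feature_layer=-2, apply_post_layernorm=False)
# Case 2: last layer, no layer normalization.
last_raw = encoder.forward(inputs, feature_layer=-1, apply_post_layernorm=False)
# Case 3: last layer, with layer normalization.
last_normed = encoder.forward(inputs, feature_layer=-1, apply_post_layernorm=True)
```

Keeping the layer choice and the normalization switch as two separate options means each model definition can state which case it needs, instead of the encoder guessing.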

#8106 #8155



litianjian added the misc label Oct 9, 2024

litianjian mentioned this issue Oct 10, 2024

[Bugfix] Fix missing post_layernorm in CLIP #8155

Merged

DarkLight1337 changed the title ~~[Misc]: Output state configuration of vision encoder In VLM~~ [Feature]: Output state configuration of vision encoder In VLM Oct 10, 2024

DarkLight1337 added feature request and removed misc labels Oct 10, 2024

DarkLight1337 mentioned this issue Oct 10, 2024

[VLM] Post-layernorm override and quant config in vision encoder #9217

Merged

DarkLight1337 closed this as completed in #9217 Oct 23, 2024
