[Feature]: Output state configuration of vision encoder In VLM #9186

Closed
1 task done
litianjian opened this issue Oct 9, 2024 · 0 comments · Fixed by #9217

Comments

@litianjian
Contributor

Anything you want to discuss about vllm.

When SigLIP or CLIP is used as the vision encoder of a multimodal model, the output state can be taken in one of several ways:

  • The output state of an intermediate layer is used without layer normalization
  • The output state of the last layer is used without layer normalization
  • The output state of the last layer is used with layer normalization

For example, the LLaVA-Next code implementation does not use post_layernorm.
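
To make the three cases concrete, here is a minimal pure-Python sketch of how such a configuration could select the output state. It is not vLLM's API: `VisionOutputConfig`, `feature_layer`, `apply_post_layernorm`, and `select_output_state` are hypothetical names, each "hidden state" is a plain list of floats for a single token, and the layer norm has no learned weight or bias. Rejecting the "intermediate layer with layer normalization" combination is also an assumption of this sketch, made only because that combination is not among the cases listed above.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class VisionOutputConfig:
    """Hypothetical config, not part of vLLM."""
    feature_layer: int = -1             # index into per-layer outputs; -1 is the last layer
    apply_post_layernorm: bool = True   # whether to normalize the selected state


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Plain layer normalization without learned scale or shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def select_output_state(hidden_states: Sequence[Vector],
                        cfg: VisionOutputConfig) -> Vector:
    """Pick the encoder output according to cfg."""
    state = hidden_states[cfg.feature_layer]
    is_last = cfg.feature_layer in (-1, len(hidden_states) - 1)
    if cfg.apply_post_layernorm:
        if not is_last:
            # Assumption of this sketch: only the last layer is normalized.
            raise ValueError("post layer norm is only applied to the last layer")
        return layer_norm(state)
    return list(state)


# Toy per-layer outputs for one token (three layers, hidden size 3).
hidden = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 10.0, 20.0]]

# Case 1: intermediate layer, no layer normalization.
intermediate = select_output_state(hidden, VisionOutputConfig(-2, False))
# Case 2: last layer, no layer normalization.
last_raw = select_output_state(hidden, VisionOutputConfig(-1, False))
# Case 3: last layer, with layer normalization.
last_normed = select_output_state(hidden, VisionOutputConfig(-1, True))
```

With a setting like this, a model that skips post_layernorm would simply set the flag to false rather than needing a separate encoder code path.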

#8106 #8155

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@litianjian litianjian added the misc label Oct 9, 2024
@DarkLight1337 DarkLight1337 changed the title [Misc]: Output state configuration of vision encoder In VLM [Feature]: Output state configuration of vision encoder In VLM Oct 10, 2024

3 participants
@DarkLight1337 @litianjian and others