[Model] Enable Inference Support for the New Baichuan-M1 Model #12251

Open
wants to merge 1 commit into base: main
Conversation

@rainkert rainkert commented Jan 21, 2025

This pull request adds support for the Baichuan-M1 model to the vLLM framework.

HuggingFace pages:
https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base
https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct

Baichuan-M1 (the M stands for medicine) is a medically enhanced general-purpose large model, designed to deliver strong performance in healthcare applications while maintaining solid general capabilities. This update ensures that vLLM can seamlessly handle inference for Baichuan-M1, providing both compatibility and good performance across a wide range of natural language processing tasks, especially in the medical domain.
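For reference, a minimal offline-inference sketch using vLLM's Python API, assuming the Instruct checkpoint name from the HuggingFace links above and that the model's custom modeling code requires trust_remote_code:

from vllm import LLM, SamplingParams

# Checkpoint name taken from the HuggingFace links above; trust_remote_code
# is assumed to be needed because the model ships custom modeling code.
llm = LLM(model="baichuan-inc/Baichuan-M1-14B-Instruct", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["What are the common symptoms of iron-deficiency anemia?"]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)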


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of the following:

  • Add the ready label to the PR
  • Enable auto-merge

🚀


mergify bot commented Jan 21, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @rainkert.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added documentation Improvements or additions to documentation needs-rebase labels Jan 21, 2025
@mergify mergify bot removed the needs-rebase label Jan 21, 2025
@jeejeelee jeejeelee added the new model Requests to new models label Jan 21, 2025
@jameswu2014

LGTM

@rainkert
Author

@youkaichao @zhuohan123 @DarkLight1337 @WoosukKwon
We will be releasing our model on Hugging Face on January 24th (the day after tomorrow), but you can review the code beforehand to identify any issues so we can address them in advance.

Signed-off-by: dangshunya <dangshunya@baichuan-inc.com>
@rainkert
Author

ping @youkaichao @DarkLight1337 @njhill @comaniac @zhuohan123 @WoosukKwon @alexm-redhat
We've released our new model today; please review this PR and merge it as soon as possible.

@DarkLight1337
Member

DarkLight1337 commented Jan 24, 2025

The model itself LGTM, but I'm not so sure about the custom KV cache. Is anyone else familiar with this part of the code?

@DarkLight1337 DarkLight1337 mentioned this pull request Jan 24, 2025
@simon-mo
Collaborator

Regarding the SWA, can we minimize the code change for now by adopting #10584? Meanwhile, we will work on refactoring the memory manager in #11382 by @heheda12345.

@rainkert
Author

rainkert commented Jan 25, 2025

Regarding the SWA, can we minimize the code change for now by adopting #10584? Meanwhile, we will work on refactoring the memory manager in #11382 by @heheda12345.

Because the KV cache used by the ordinary layers and the SWA layers is inconsistent (we have 2 KV heads in normal attention but 8 KV heads in SWA), we cannot simply treat them the same way as in #10584; instead, we need to calculate the memory usage separately.
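To make the mismatch concrete, here is a small back-of-the-envelope sketch. The head dimension and dtype size are illustrative assumptions, not values from the model config; only the 2-vs-8 KV head split comes from this comment:

# Illustrative numbers only: HEAD_DIM and DTYPE_BYTES are assumptions;
# the 2-vs-8 KV head split is the one described in this comment.
HEAD_DIM = 128    # assumed head dimension
DTYPE_BYTES = 2   # fp16 / bf16

def kv_bytes_per_token(num_kv_heads: int) -> int:
    # Factor of 2 accounts for both the K and the V cache.
    return 2 * num_kv_heads * HEAD_DIM * DTYPE_BYTES

normal_layer = kv_bytes_per_token(num_kv_heads=2)  # 1024 bytes per token
swa_layer = kv_bytes_per_token(num_kv_heads=8)     # 4096 bytes per token

# A single uniform per-layer block size would over- or under-allocate one of
# the two groups, so memory usage has to be accounted per layer group.
print(normal_layer, swa_layer)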

@heheda12345
Collaborator

For the vLLM v1 engine, you can support normal attention layers with different hidden sizes by extending this function:

def get_kv_cache_config(vllm_config: VllmConfig, kv_cache_spec: KVCacheSpec,

Then you can try #10584 in v1 to support the mix of normal attention and SWA.
If that works, we can raise an error asking users to enable the vLLM v1 engine when running this model.
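As a rough illustration of the kind of per-group accounting such an extension would need, here is a simplified sketch. The class and function names below are hypothetical stand-ins, not the actual vLLM v1 data structures, and the layer counts in the usage comment are placeholders rather than the real Baichuan-M1 config:

from dataclasses import dataclass

@dataclass(frozen=True)
class LayerKVSpec:
    # Hypothetical, simplified stand-in for a per-layer KV cache spec.
    num_kv_heads: int
    head_size: int
    dtype_bytes: int
    block_size: int  # tokens per KV cache block

    @property
    def bytes_per_block(self) -> int:
        # K and V caches -> factor of 2.
        return 2 * self.num_kv_heads * self.head_size * self.dtype_bytes * self.block_size

def allocate_kv_cache(specs: dict[str, LayerKVSpec],
                      available_bytes: int) -> dict[str, int]:
    """Compute how many blocks fit when every layer must hold the same
    number of blocks, even though per-layer block footprints differ."""
    total_bytes_per_block = sum(s.bytes_per_block for s in specs.values())
    num_blocks = available_bytes // total_bytes_per_block
    return {name: num_blocks * s.bytes_per_block for name, s in specs.items()}

# Placeholder usage: a full-attention layer with 2 KV heads and an SWA layer
# with 8 KV heads, head size 128, fp16, 16-token blocks (all illustrative).
specs = {
    "full_attn_layer": LayerKVSpec(2, 128, 2, 16),
    "swa_layer": LayerKVSpec(8, 128, 2, 16),
}
print(allocate_kv_cache(specs, available_bytes=8 * 1024**3))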

@simon-mo simon-mo mentioned this pull request Jan 27, 2025