[Model] Enable Inference Support for the New Baichuan-M1 Model #12251
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM
@youkaichao @zhuohan123 @DarkLight1337 @WoosukKwon
Signed-off-by: dangshunya <dangshunya@baichuan-inc.com>
ping @youkaichao @DarkLight1337 @njhill @comaniac @zhuohan123 @WoosukKwon @alexm-redhat
The model itself LGTM, but I'm not so sure about the custom KV cache. Is anyone else familiar with this part of the code?
Regarding the SWA, can we minimize the code change for now by adopting #10584? Meanwhile, we will work on refactoring the memory manager in #11382 by @heheda12345.
Because the KV cache used by ordinary layers and SWA layers is inconsistent (we have 2 KV heads in normal attention but 8 KV heads in SWA), we cannot simply treat them the same way as in #10584; instead, we need to calculate the memory usage for each kind of layer separately.
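A minimal sketch of why the uniform-layer assumption breaks here. The `LayerKVSpec` class, the head dimension, and the layer counts are illustrative assumptions, not the PR's actual implementation; the point is that per-token KV cache memory must be summed per layer rather than derived from one layer size times the layer count:

```python
from dataclasses import dataclass

# Hypothetical per-layer description; not vLLM's actual internals.
@dataclass
class LayerKVSpec:
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int  # e.g. 2 for fp16/bf16

def kv_bytes_per_token(spec: LayerKVSpec) -> int:
    # Factor of 2 accounts for the key tensor plus the value tensor.
    return 2 * spec.num_kv_heads * spec.head_dim * spec.dtype_bytes

def total_kv_bytes_per_token(layers: list[LayerKVSpec]) -> int:
    # With heterogeneous layers, sum per layer instead of multiplying
    # one uniform per-layer size by the number of layers.
    return sum(kv_bytes_per_token(spec) for spec in layers)

# Illustrative mix: normal-attention layers with 2 KV heads interleaved
# with SWA layers with 8 KV heads (head_dim=128, bf16).
normal = [LayerKVSpec(num_kv_heads=2, head_dim=128, dtype_bytes=2)] * 20
swa = [LayerKVSpec(num_kv_heads=8, head_dim=128, dtype_bytes=2)] * 20
print(total_kv_bytes_per_token(normal + swa))
```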
For the vLLM v1 engine, you can support normal attention with different hidden sizes by extending this function: `vllm/v1/core/kv_cache_utils.py`, line 410 at commit bf21481.
Then you can try #10584 in v1 to support the mix of normal attention and SWA. If that works, we can raise an error asking users to run this model with the v1 engine if vLLM v1 is not enabled.
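A sketch of the kind of extension being suggested, under stated assumptions: the `KVCacheSpec` dataclass, its fields, and the grouping helper below are hypothetical stand-ins, not the actual API in `kv_cache_utils.py`. The idea is that layers whose KV cache pages have the same size can share one accounting path, while heterogeneous layers must be grouped and budgeted separately:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical stand-in for a per-layer KV cache spec; the real v1 code
# keeps richer information (dtype, block size, attention type, ...).
@dataclass(frozen=True)
class KVCacheSpec:
    num_kv_heads: int
    head_dim: int
    block_size: int
    dtype_bytes: int

    @property
    def page_size_bytes(self) -> int:
        # Key + value for one block of tokens in this layer.
        return (2 * self.block_size * self.num_kv_heads
                * self.head_dim * self.dtype_bytes)

def group_layers_by_page_size(
    specs: dict[str, KVCacheSpec],
) -> dict[int, list[str]]:
    # Layers with identical page sizes can be handled by the existing
    # uniform path; a mix (e.g. 2-KV-head normal attention vs. 8-KV-head
    # SWA) produces multiple groups that need separate memory accounting.
    groups: dict[int, list[str]] = defaultdict(list)
    for layer_name, spec in specs.items():
        groups[spec.page_size_bytes].append(layer_name)
    return dict(groups)
```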
This pull request adds the necessary support to the vLLM framework for the Baichuan-M1 model.
HuggingFace pages:
https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base
https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct
The Baichuan-M1 (M stands for medicine) model is a medical-enhanced general-purpose large model, designed to deliver exceptional performance in healthcare applications while maintaining strong general capabilities. This update ensures that vLLM can seamlessly handle inference for the Baichuan-M1 model, providing both compatibility and strong performance across a wide range of natural language processing tasks, especially in the medical domain.
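A minimal sketch of running the model once this PR lands, using vLLM's standard offline inference API; the `trust_remote_code` flag and the example prompt are assumptions, not requirements confirmed by the PR:

```python
from vllm import LLM, SamplingParams

# Assumes the checkpoint ships a custom config that needs
# trust_remote_code; drop the flag if the model is natively supported.
llm = LLM(
    model="baichuan-inc/Baichuan-M1-14B-Instruct",
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["What are the common symptoms of influenza?"], params
)
print(outputs[0].outputs[0].text)
```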