Integrate vllm with example Lora and Mistral #3077
Conversation
LGTM
"--prompt-json",
action=argparse.BooleanOptionalAction,
default=False,
help="Flag the imput prompt is a json format with prompt parameters",
typo: "imput" should be "input"
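For context, here is a minimal runnable sketch of the flag defined in the diff above (with the typo in the help string corrected). `argparse.BooleanOptionalAction` (Python 3.9+) automatically generates both a `--prompt-json` and a `--no-prompt-json` form:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--prompt-json",
    action=argparse.BooleanOptionalAction,  # adds --prompt-json / --no-prompt-json
    default=False,
    help="Flag that the input prompt is in JSON format with prompt parameters",
)

args = parser.parse_args(["--prompt-json"])
print(args.prompt_json)  # True
```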
enable_lora: true
max_loras: 4
max_cpu_loras: 4
max_num_seqs: 16
vllm uses paged attention, which typically allows for larger batch sizes. We need to figure out a way to saturate the engine, as setting batchSize == max_num_seqs will lead to underutilization.
We could use a strategy similar to the one used for micro-batching, so that enough requests are always available for the engine. Preferred would be an async mode that simply routes all requests to the backend and receives replies asynchronously (as discussed earlier).
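A rough sketch of the async mode described above: submit every incoming request to the engine immediately so it stays saturated, and collect replies in completion order rather than waiting on a fixed batch. All names here (`fake_engine`, `route_all`) are illustrative stand-ins, not TorchServe or vllm APIs:

```python
import asyncio

async def fake_engine(request: str) -> str:
    # Stand-in for the vllm engine: simulate generation latency.
    await asyncio.sleep(0.01)
    return f"reply:{request}"

async def route_all(requests):
    # Submit every request right away so the engine is never starved,
    # then gather replies asynchronously as they complete.
    tasks = [asyncio.create_task(fake_engine(r)) for r in requests]
    replies = []
    for task in asyncio.as_completed(tasks):
        replies.append(await task)
    return replies

replies = asyncio.run(route_all(["a", "b", "c"]))
```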
)

model_archiver.generate_model_archive(config)
shutil.move(LORA_SRC_PATH / "model", mar_file_path)
If we move the files and then delete them, the test can only run once before the files need to be put back manually. Can we use symbolic links instead?
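A minimal sketch of the symlink suggestion: replacing `shutil.move` with `os.symlink` leaves the source files in place, so the test stays repeatable. The paths below are illustrative stand-ins for `LORA_SRC_PATH / "model"` and `mar_file_path`:

```python
import os
import tempfile
from pathlib import Path

# Illustrative stand-in for LORA_SRC_PATH / "model".
src_root = Path(tempfile.mkdtemp())
lora_model = src_root / "model"
lora_model.mkdir()
(lora_model / "weights.bin").write_text("dummy")

# Illustrative stand-in for mar_file_path.
mar_file_path = src_root / "model_store" / "model"
mar_file_path.parent.mkdir()

# Link instead of moving: the original files remain untouched.
os.symlink(lora_model, mar_file_path, target_is_directory=True)
```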
Description
Please read our CONTRIBUTING.md prior to creating your first pull request.
Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
Fixes #(issue)
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Checklist: