Example llama3 on inf2 #3133
Conversation
LGTM. Left some comments; please address moving the neuron handlers into the example folder before merging.
@@ -1,6 +1,6 @@
 # Large model inference on Inferentia2

-This folder briefs on serving the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion with TorchServe's features:
+This folder briefs on serving the [Llama 2 and Llama 3](https://huggingface.co/meta-llama) model a on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for text completion with TorchServe's features:
"...model on an AWS Inferentia2..."?
Seems like something went wrong with your previous PR, where you said you moved the handlers into the example dir. Please make sure the file in ts/torch_handler/distributed/ gets cleaned up. My assumption is that this file got moved here. Please clarify if there were changes.
The previous PR was about the microbatching+streamer handler; this one is about continuous batching. They are two different base handlers.
I see, I'm not sure we're referring to the same PR. I meant #3035, which touched ts/torch_handler/distributed/base_neuronx_continuous_batching_handler.py as well as files under examples/large_models/inferentia2/llama2/continuous_batching, so I assumed we were talking about moving the cb_handler as well. Anyway, please make sure to remove ts/torch_handler/distributed/base_neuronx_continuous_batching_handler.py with this PR.
tp_degree: 24
max_length: 256
max_new_tokens: 50
batch_size: 8
Is paged attention supported on Inferentia? Does a batch size of 8 give enough flexibility in that case? It would be good to discuss this in the documentation.
Will run benchmark and update batch size.
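For context on the batch-size question, here is a minimal sketch of how the four keys quoted in the diff might sit inside a fuller model-config.yaml. The frontend keys (minWorkers, maxWorkers, batchSize, continuousBatching, responseTimeout) follow common TorchServe model-config conventions; the remaining handler keys and all values are illustrative assumptions, not this PR's actual config:

```yaml
# Sketch of a fuller model-config.yaml for the continuous-batching example.
# tp_degree / max_length / max_new_tokens / batch_size come from the diff above;
# every other key and value here is assumed or illustrative.
minWorkers: 1
maxWorkers: 1
batchSize: 8              # frontend batch size; presumably kept in sync with handler batch_size
continuousBatching: true  # enable TorchServe's continuous-batching frontend
responseTimeout: 10800    # generous timeout: Neuron model compilation on first load is slow

handler:
    model_checkpoint_dir: "llama-3-8b-split"  # hypothetical path to pre-sharded weights
    amp: "bf16"                               # assumed precision setting
    tp_degree: 24          # tensor-parallel degree; 24 matches the NeuronCore count of an inf2.48xlarge
    max_length: 256        # total sequence budget (prompt + generated tokens)
    max_new_tokens: 50     # cap on generated tokens per request
    batch_size: 8          # max number of sequences in flight at once
```

Under continuous batching, batch_size caps how many sequences share the NeuronCores concurrently, so whether 8 leaves enough headroom is exactly what the benchmark promised above should settle.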