Example for Llama2 on Inf2 #2458
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2458 +/- ##
==========================================
- Coverage 70.87% 70.29% -0.59%
==========================================
Files 83 84 +1
Lines 3839 3871 +32
Branches 58 58
==========================================
Hits 2721 2721
- Misses 1114 1146 +32
Partials 4 4
... and 1 file with indirect coverage changes
save_split_checkpoints.py is common. Please move it to large_models/util/ and rename it as inf2_save_split_checkpoints.py.
Please add feature
Issue to track follow up tasks on this PR: #2600
tp_degree=tp_degree,
)
logger.info("Starting to compile the model")
self.model.to_neuron()
@namannandan I am wondering whether compilation can be done ahead of time, so that here we just load the compiled graphs, the way it worked for Inf1?
I tested _save_compiled_artifacts. It is able to generate a Neuron model; however, transformers_neuronx still needs to recompile. I have already let the Neuron team know that the experimental _save_compiled_artifacts feature needs more work.
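The ahead-of-time idea discussed above can be sketched generically: cache compiled artifacts on disk and skip compilation when they already exist. Everything in this sketch is hypothetical — the DummyModel class, its methods, and the artifact path are illustrative stand-ins, not the transformers_neuronx API (whose experimental _save_compiled_artifacts, as noted above, still triggers recompilation).

```python
import os


class DummyModel:
    """Hypothetical stand-in for a Neuron model; real compilation runs on device."""

    def __init__(self):
        self.compiled = False

    def to_neuron(self):
        # Stands in for the expensive compile step in the real flow.
        self.compiled = True

    def save_artifacts(self, path):
        # Stands in for an experimental save-compiled-artifacts API.
        with open(path, "w") as f:
            f.write("compiled-graph")

    def load_artifacts(self, path):
        # Stands in for loading precompiled graphs instead of recompiling.
        with open(path) as f:
            assert f.read() == "compiled-graph"
        self.compiled = True


def load_or_compile(model, artifacts_path):
    """Reuse precompiled artifacts when present; otherwise compile and cache them."""
    if os.path.exists(artifacts_path):
        model.load_artifacts(artifacts_path)
    else:
        model.to_neuron()
        model.save_artifacts(artifacts_path)
    return model
```

The first cold start pays the compile cost and writes the cache; later starts load the cached artifacts instead, which is the behavior the comment hopes the Neuron team will eventually support end to end.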
Addressed review comments. Follow up tasks tracked here: #2600
Description
This PR adds an example that details the steps to compile and run the Llama2 model on Inferentia2 for text completion with micro batching and response streaming support.
Model: https://huggingface.co/meta-llama/Llama-2-13b-hf
Instance type:
inf2.24xlarge
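The micro batching mentioned in the description can be sketched as a simple splitter: a batch of incoming requests is grouped into fixed-size micro batches that are fed to the model one at a time. The function name and batch size below are illustrative, not taken from the PR's handler code.

```python
from typing import List, Sequence


def to_micro_batches(requests: Sequence[str], micro_batch_size: int) -> List[List[str]]:
    """Split a batch of requests into micro batches of at most micro_batch_size."""
    if micro_batch_size <= 0:
        raise ValueError("micro_batch_size must be positive")
    # Slice the sequence into consecutive chunks; the last chunk may be smaller.
    return [
        list(requests[i : i + micro_batch_size])
        for i in range(0, len(requests), micro_batch_size)
    ]
```

For example, five requests with a micro batch size of 2 yield three micro batches of sizes 2, 2, and 1, which the handler would run through the model sequentially while streaming responses back per micro batch.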
Type of change
Test