Improve llama models performance #587
Conversation
This reduces encoding (prefill) time.
The TGI tests are failing because I need to remove the test llama neuron model under the …
LGTM
That looks awesome @dacorvo !
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
What does this PR do?
This modifies the default Neuron configuration when exporting Llama models for inference, setting the attention layout to "BSH" (batch, sequence, hidden) instead of "HSB" (hidden, sequence, batch).
This configuration has almost no impact on the token generation time (a.k.a. decode), and significantly reduces the context encoding time (a.k.a. prefill) for Llama2-7b and Llama3-8B.

Benchmark updates:
In the process, the TGI router version is bumped to 2.0.2.
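For context, here is a minimal sketch of how a Llama model is exported for inference with optimum-neuron, the export path this change affects. The checkpoint name and export parameters below are illustrative assumptions, not taken from the PR; with this change, the exporter is assumed to default to the "BSH" attention layout, so no extra flag is needed.

```python
# Sketch: export a Llama checkpoint to Neuron with optimum-neuron.
# After this PR, the exporter's default attention layout is "BSH",
# so the faster prefill applies without any additional configuration.
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint
    export=True,                 # compile the model for Neuron at load time
    batch_size=1,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="fp16",
)
model.save_pretrained("llama2-7b-neuron")  # reuse the compiled artifacts later
```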