Refactor documentation and improve tgi deployment #610
Conversation
This allows the checkpoint files to be visible even if they were created by another user (like the Docker root user).
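For context, relaxing the permissions of the checkpoint files could look roughly like the sketch below. This is illustrative only: the helper name and the exact permission bits are assumptions, not the code from this PR.

```python
# Illustrative sketch (not the PR's code): make checkpoint files readable by any
# user, e.g. when they were created by the Docker root user.
import os


def make_checkpoints_readable(directory: str) -> None:
    for root, dirs, files in os.walk(directory):
        for name in dirs:
            # Directories need the execute bit so other users can traverse them.
            os.chmod(os.path.join(root, name), 0o755)
        for name in files:
            # Owner read/write, everyone else read-only.
            os.chmod(os.path.join(root, name), 0o644)
```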
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
c56c6ff to fed15f3
Although pre-compilation avoids overhead during inference, a compiled Neuron model has some limitations:
* the input shapes and data types used during the compilation cannot be changed.
* Neuron models are specialized for each hardware and SDK version, which means:
  * Models compiled with Neuron can no longer be executed in a non-Neuron environment.
What does it mean?
That was in the original file. I think it means you cannot run them on CPU.
>>> from optimum.exporters.tasks import TasksManager
>>> from optimum.exporters.neuron.model_configs import *  # Register neuron specific configs to the TasksManager

# Save the model
>>> model.save_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")

>>> distilbert_tasks = list(TasksManager.get_supported_tasks_for_model_type("distilbert", "neuron").keys())
>>> print(distilbert_tasks)
['feature-extraction', 'fill-mask', 'multiple-choice', 'question-answering', 'text-classification', 'token-classification']
Should we make a command in the CLI to list these at some point?
Yes, why not.
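For reference, a purely hypothetical sketch of such a CLI command, reusing only the `TasksManager` API shown in the snippet above (the command name and arguments are invented for illustration and are not part of this PR):

```python
# Hypothetical CLI sketch, not part of this PR: list the Neuron-supported tasks
# for a given model type, reusing the TasksManager API from the snippet above.
import argparse

from optimum.exporters.tasks import TasksManager
from optimum.exporters.neuron.model_configs import *  # noqa: F403 -- register neuron specific configs


def main():
    parser = argparse.ArgumentParser(description="List Neuron-supported tasks for a model type")
    parser.add_argument("model_type", help="e.g. distilbert, bert, roberta")
    args = parser.parse_args()
    tasks = TasksManager.get_supported_tasks_for_model_type(args.model_type, "neuron").keys()
    print("\n".join(sorted(tasks)))


if __name__ == "__main__":
    main()
```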
text-generation-inference/tgi_env.py
Outdated
@@ -184,6 +176,7 @@ def main():
    work properly
    :return:
    """
    logging.basicConfig(level=logging.DEBUG, force=True)
We want to be at the DEBUG level by default?
Good catch. I need to revert this.
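For what it's worth, one way to keep INFO as the default while still allowing verbose runs would be an environment-driven level, along these lines (a sketch only; the variable name is hypothetical and this is not the fix applied in the PR):

```python
# Hypothetical alternative, not the change made in this PR: default to INFO and
# only switch to DEBUG when explicitly requested via an environment variable.
import logging
import os

level_name = os.environ.get("TGI_ENV_LOG_LEVEL", "INFO")  # hypothetical variable name
logging.basicConfig(level=getattr(logging, level_name.upper(), logging.INFO), force=True)
```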
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
Thanks for reorganising the doc. For the export / inference part, I feel like there are some tips/reminders about inference that shouldn't be removed.
docs/source/guides/export_model.mdx
Outdated
@@ -167,9 +159,18 @@ Input shapes:
```

In the last section, you can see some input shape options to pass for exporting a static Neuron model, meaning that the exact input shapes given during compilation must be used during inference. If you are going to use variable-size inputs, you can pad your inputs to the shapes used for compilation as a workaround. If you want the batch size to be dynamic, you can pass `--dynamic-batch-size` to enable dynamic batching, which means that you will be able to use inputs with different batch sizes during inference, but it comes with a potential tradeoff in terms of latency.
### Exporting standard (non-LLM) NLP models
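To illustrate the `--dynamic-batch-size` option described in the paragraph above, an export could look roughly like this. It is a sketch under the assumption that the `dynamic_batch_size` keyword argument mirrors the CLI flag; the checkpoint name is only an example.

```python
# Sketch: export with static shapes but a dynamic batch dimension, assuming the
# dynamic_batch_size export argument mirrors the --dynamic-batch-size CLI flag.
from optimum.neuron import NeuronModelForSequenceClassification

model = NeuronModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint
    export=True,
    batch_size=1,             # compilation batch size
    sequence_length=128,      # compilation sequence length
    dynamic_batch_size=True,  # allow other batch sizes at inference, with a latency tradeoff
)
```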
Suggested change: replace `### Exporting standard (non-LLM) NLP models` with `### Exporting non-LLM models`.
The traced model could be applied to ViT, Timm models and audio models as well.
# 'POSITIVE'
```

`compiler_args` are optional arguments for the compiler. These arguments usually control how the compiler makes a tradeoff between inference performance (latency and throughput) and accuracy. Here we cast FP32 operations to BF16 using the Neuron matrix-multiplication engine.
Explaining `compiler_args` is important as well.
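For reference, passing `compiler_args` at export time could look like the sketch below, assuming the dict-style `auto_cast` / `auto_cast_type` export arguments; here FP32 matrix multiplications are cast to BF16 as described in the paragraph above.

```python
# Sketch of exporting with compiler arguments that cast FP32 matmuls to BF16.
# The auto_cast/auto_cast_type argument names are assumptions, not code from this PR.
from optimum.neuron import NeuronModelForSequenceClassification

compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 1, "sequence_length": 64}

model = NeuronModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint
    export=True,
    **compiler_args,
    **input_shapes,
)
```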
<Tip>

Be careful, the input shapes used for compilation should be inferior than the size of inputs that you will feed into the model during the inference.
I don't feel like this tip should be removed. We should remind users that padding is put in place, and that compiling with large static shapes means wasted compute.
What you suggest is a bit different from what was written originally. Can you suggest a new tip and indicate where I should insert it?
something like:
Be careful, we pad the inputs to the shapes used for the compilation. But the inputs that you will feed into the model during the inference should have shapes inferior to the static shapes for compilation. And the padding brings computation overhead.
under the snippet for re-loading the pre-compiled model, I think (under line 247)
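To make the suggested tip concrete, the behaviour would look roughly like this sketch (illustrative only; the directory and checkpoint names are taken from the snippets above):

```python
# Sketch: re-load a pre-compiled model and feed it a short input. The input is
# padded up to the static sequence length used at compilation (wasted compute);
# inputs longer than that length would not fit the compiled shape.
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForSequenceClassification

model = NeuronModelForSequenceClassification.from_pretrained(
    "./distilbert-base-uncased-finetuned-sst-2-english_neuron/"
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("Padding fills the rest of the static shape.", return_tensors="pt")
outputs = model(**inputs)
```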
Co-authored-by: Jingya HUANG <44135271+JingyaHuang@users.noreply.github.com>
Thanks for the PR, the doc looks great now!
What does this PR do?
This first refactors the documentation to merge two similar pages related to model export and inference.
This then improves the TGI deployment user experience in several ways: