
Refactor documentation and improve tgi deployment #610

Merged: 14 commits merged into main on May 28, 2024

Conversation

@dacorvo (Collaborator) commented May 27, 2024

What does this PR do?

This PR first refactors the documentation to merge two similar pages related to model export and inference.

It then improves the TGI deployment user experience in several ways:

  • also export the tokenizer when exporting LLM models (see the sketch after this list),
  • remove the redundant HF_BATCH_SIZE and HF_SEQUENCE_LENGTH environment variables,
  • reduce CPU usage when launching the service on a local model,
  • add a dedicated TGI documentation page with simplified instructions,
  • add a reference to the export documentation when no cached configuration is found.
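
As a rough illustration of the export behaviour described above (not the PR's actual code), here is a minimal Python sketch assuming the `optimum.neuron` API; the model id, shapes, and output path are placeholders and the keyword arguments should be checked against the optimum-neuron documentation:

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Placeholder model id; any supported decoder model works the same way.
model_id = "gpt2"

# The static shapes are fixed at export time, which is why the
# HF_BATCH_SIZE / HF_SEQUENCE_LENGTH environment variables become redundant.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=1024,
)

# Save the compiled model together with its tokenizer so that the exported
# directory is self-contained for TGI deployment.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.save_pretrained("./exported_neuron_model")
tokenizer.save_pretrained("./exported_neuron_model")
```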

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dacorvo dacorvo marked this pull request as ready for review May 28, 2024 07:19
docs/source/guides/export_model.mdx (outdated)
Although pre-compilation avoids overhead during inference, a compiled Neuron model has some limitations:
* the input shapes and data types used during the compilation cannot be changed.
* Neuron models are specialized for each hardware and SDK version, which means:
  * Models compiled with Neuron can no longer be executed in a non-Neuron environment.
Member: What does it mean?

dacorvo (author): That was in the original file. I think it means you cannot run them on CPU.

Comment on lines +203 to +208
>>> from optimum.exporters.tasks import TasksManager
>>> from optimum.exporters.neuron.model_configs import * # Register neuron specific configs to the TasksManager

# Save the model
>>> model.save_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")
>>> distilbert_tasks = list(TasksManager.get_supported_tasks_for_model_type("distilbert", "neuron").keys())
>>> print(distilbert_tasks)
['feature-extraction', 'fill-mask', 'multiple-choice', 'question-answering', 'text-classification', 'token-classification']
Member: Should we make a command in the CLI to list these at some point?

dacorvo (author): Yes, why not.
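
No such command exists yet; purely to illustrate the idea, here is a hypothetical sketch that wraps the `TasksManager` call quoted above (the command name and options are invented for this sketch):

```python
import argparse

from optimum.exporters.tasks import TasksManager
from optimum.exporters.neuron.model_configs import *  # noqa: F403 -- register neuron-specific configs


def main():
    # Hypothetical "list supported tasks" helper, not an existing optimum-cli command.
    parser = argparse.ArgumentParser(description="List Neuron-supported tasks for a model type")
    parser.add_argument("model_type", help="e.g. 'distilbert'")
    args = parser.parse_args()

    tasks = TasksManager.get_supported_tasks_for_model_type(args.model_type, "neuron").keys()
    for task in sorted(tasks):
        print(task)


if __name__ == "__main__":
    main()
```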

docs/source/guides/export_model.mdx (outdated)
@@ -184,6 +176,7 @@ def main():
work properly
:return:
"""
logging.basicConfig(level=logging.DEBUG, force=True)
Member: We want to be at the DEBUG level by default?

dacorvo (author): Good catch. I need to revert this.
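
For context, a minimal sketch of what the revert might look like, assuming the default level simply goes back to INFO; the actual fix in the PR may differ:

```python
import logging

# Keep the default at INFO; DEBUG was only useful while developing the change.
logging.basicConfig(level=logging.INFO, force=True)
```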

dacorvo and others added 2 commits May 28, 2024 13:45
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
@JingyaHuang (Collaborator) left a comment:

Thanks for reorganising the doc. For the export / inference part, I feel like some of the tips/reminders about inference shouldn't be removed.

optimum/exporters/neuron/__main__.py
@@ -167,9 +159,18 @@ Input shapes:

```

In the last section, you can see some input shape options to pass for exporting a static neuron model, meaning that inputs with exactly the shapes given during compilation should be used during inference. If you are going to use variable-size inputs, you can pad your inputs to the shape used for compilation as a workaround. If you want the batch size to be dynamic, you can pass `--dynamic-batch-size` to enable dynamic batching, which means that you will be able to use inputs with different batch sizes during inference, but it comes with a potential tradeoff in terms of latency.
### Exporting standard (non-LLM) NLP models
Collaborator:

Suggested change
### Exporting standard (non-LLM) NLP models
### Exporting non-LLM models

Tracing can be applied to ViT, Timm models, and audio models as well.
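
To make the static-shapes paragraph quoted above more concrete, here is a hedged sketch of an export with dynamic batching through the Python API; `dynamic_batch_size` is assumed to mirror the `--dynamic-batch-size` CLI flag and should be verified against the optimum-neuron API:

```python
from optimum.neuron import NeuronModelForSequenceClassification

# Export with a fixed sequence length but a dynamic batch size, which trades
# some latency for the ability to vary the batch size at inference time.
model = NeuronModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
    batch_size=1,
    sequence_length=128,
    dynamic_batch_size=True,
)
model.save_pretrained("./distilbert_neuron_dynamic_batch")
```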

docs/source/guides/models.mdx
# 'POSITIVE'
```

`compiler_args` are optional arguments for the compiler; they usually control how the compiler trades off inference performance (latency and throughput) against accuracy. Here we cast FP32 operations to BF16 using the Neuron matrix-multiplication engine.
Collaborator: Explaining compiler_args is important as well.
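
For reference, a hedged sketch of how `compiler_args` could appear alongside the explanation quoted above; the exact argument names and accepted values ("matmul", "bf16") are assumptions to be checked against the optimum-neuron and Neuron compiler documentation:

```python
from optimum.neuron import NeuronModelForSequenceClassification

# auto_cast/auto_cast_type ask the Neuron compiler to run FP32 matrix
# multiplications in BF16, trading a little accuracy for latency/throughput.
compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 1, "sequence_length": 64}

model = NeuronModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
    **compiler_args,
    **input_shapes,
)
```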


<Tip>

Be careful, the input shapes used for compilation should be inferior to the size of the inputs that you will feed into the model during inference.
Collaborator: I don't feel like this tip should be removed. We should remind users that padding is put in place, and that compiling with large static shapes means wasted compute.

dacorvo (author): What you suggest is a bit different from what was written originally. Can you suggest a new tip and indicate where I should insert it?

Collaborator:

Something like:

Be careful, we pad the inputs to the shapes used for compilation, but the inputs that you feed into the model during inference should have shapes smaller than the static shapes used for compilation, and the padding brings computation overhead.

It should go under the snippet about re-loading a pre-compiled model, I think (under line 247).
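
A small sketch of the padding behaviour the suggested tip describes, assuming a model compiled with a static sequence length of 128 (the model id and shapes here are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Assuming the model was compiled with sequence_length=128: shorter inputs are
# padded up to that length, and the padded positions are still computed, which
# is the overhead the tip warns about.
inputs = tokenizer(
    "Hamilton is considered to be the best musical of past years.",
    padding="max_length",
    max_length=128,
    truncation=True,  # inputs longer than the compiled shape must be truncated
    return_tensors="pt",
)
```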

docs/source/guides/neuronx_tgi.mdx (outdated)
dacorvo and others added 3 commits May 28, 2024 14:43
Co-authored-by: Jingya HUANG <44135271+JingyaHuang@users.noreply.github.com>
@JingyaHuang (Collaborator) left a comment:

Thanks for the PR, the doc looks great now!

@dacorvo dacorvo merged commit ad9e51b into main May 28, 2024
13 checks passed
@dacorvo dacorvo deleted the improve_tgi_deployment branch May 28, 2024 14:50