
Refactor documentation and improve tgi deployment #610

Merged: 14 commits merged into main on May 28, 2024

Conversation

@dacorvo (Collaborator) commented May 27, 2024

What does this PR do?

This PR first refactors the documentation to merge two similar pages related to model export and inference.

It then improves the TGI deployment user experience in several ways:

  • also export the tokenizer when exporting LLM models (see the sketch after this list),
  • remove the redundant HF_BATCH_SIZE and HF_SEQUENCE_LENGTH environment variables,
  • reduce CPU usage when launching the service on a local model,
  • add a dedicated TGI documentation page with simplified instructions,
  • add a reference to the export documentation when no cached configuration is found.
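
As a rough illustration of the export behaviour described above (not the PR's actual code), here is a minimal Python sketch assuming the `optimum.neuron` API; the model id, shapes, and output path are placeholders and the keyword arguments should be checked against the optimum-neuron documentation:

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Placeholder model id; any supported decoder model works the same way.
model_id = "gpt2"

# The static shapes are fixed at export time, which is why the
# HF_BATCH_SIZE / HF_SEQUENCE_LENGTH environment variables become redundant.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=1024,
)

# Save the compiled model together with its tokenizer so that the exported
# directory is self-contained for TGI deployment.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.save_pretrained("./exported_neuron_model")
tokenizer.save_pretrained("./exported_neuron_model")
```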

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dacorvo dacorvo marked this pull request as ready for review May 28, 2024 07:19
docs/source/guides/export_model.mdx (outdated)
Although pre-compilation avoids overhead during inference, a compiled Neuron model has some limitations:
* the input shapes and data types used during the compilation cannot be changed.
* Neuron models are specialized for each hardware and SDK version, which means:
  * Models compiled with Neuron can no longer be executed in a non-Neuron environment.
Member: What does it mean?

dacorvo (author): That was in the original file. I think it means you cannot run them on CPU.

Comment on lines +203 to +208
>>> from optimum.exporters.tasks import TasksManager
>>> from optimum.exporters.neuron.model_configs import * # Register neuron specific configs to the TasksManager

# Save the model
>>> model.save_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")
>>> distilbert_tasks = list(TasksManager.get_supported_tasks_for_model_type("distilbert", "neuron").keys())
>>> print(distilbert_tasks)
['feature-extraction', 'fill-mask', 'multiple-choice', 'question-answering', 'text-classification', 'token-classification']
Member: Should we make a command in the CLI to list these at some point?

dacorvo (author): Yes, why not.
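
No such command exists yet; purely to illustrate the idea, here is a hypothetical sketch that wraps the `TasksManager` call quoted above (the command name and options are invented for this sketch):

```python
import argparse

from optimum.exporters.tasks import TasksManager
from optimum.exporters.neuron.model_configs import *  # noqa: F403 -- register neuron-specific configs


def main():
    # Hypothetical "list supported tasks" helper, not an existing optimum-cli command.
    parser = argparse.ArgumentParser(description="List Neuron-supported tasks for a model type")
    parser.add_argument("model_type", help="e.g. 'distilbert'")
    args = parser.parse_args()

    tasks = TasksManager.get_supported_tasks_for_model_type(args.model_type, "neuron").keys()
    for task in sorted(tasks):
        print(task)


if __name__ == "__main__":
    main()
```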

docs/source/guides/export_model.mdx (outdated)
@@ -184,6 +176,7 @@ def main():
work properly
:return:
"""
logging.basicConfig(level=logging.DEBUG, force=True)
Member: We want to be at the DEBUG level by default?

dacorvo (author): Good catch. I need to revert this.
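
For context, a minimal sketch of what the revert might look like, assuming the default level simply goes back to INFO; the actual fix in the PR may differ:

```python
import logging

# Keep the default at INFO; DEBUG was only useful while developing the change.
logging.basicConfig(level=logging.INFO, force=True)
```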

dacorvo and others added 2 commits May 28, 2024 13:45
Co-authored-by: Michael Benayoun <mickbenayoun@gmail.com>
@JingyaHuang (Collaborator) left a comment:

Thanks for reorganising the doc. For the export / inference part, I feel like some of the tips/reminders about inference shouldn't be removed.

optimum/exporters/neuron/__main__.py
@@ -167,9 +159,18 @@ Input shapes:

```

In the last section, you can see some input shape options to pass for exporting a static neuron model, meaning that inputs with exactly the shapes given during compilation should be used during inference. If you are going to use variable-size inputs, you can pad your inputs to the shape used for compilation as a workaround. If you want the batch size to be dynamic, you can pass `--dynamic-batch-size` to enable dynamic batching, which means that you will be able to use inputs with different batch sizes during inference, but it comes with a potential tradeoff in terms of latency.
### Exporting standard (non-LLM) NLP models
Collaborator:

Suggested change
### Exporting standard (non-LLM) NLP models
### Exporting non-LLM models

Tracing can be applied to ViT, Timm models, and audio models as well.
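
To make the static-shapes paragraph quoted above more concrete, here is a hedged sketch of an export with dynamic batching through the Python API; `dynamic_batch_size` is assumed to mirror the `--dynamic-batch-size` CLI flag and should be verified against the optimum-neuron API:

```python
from optimum.neuron import NeuronModelForSequenceClassification

# Export with a fixed sequence length but a dynamic batch size, which trades
# some latency for the ability to vary the batch size at inference time.
model = NeuronModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
    batch_size=1,
    sequence_length=128,
    dynamic_batch_size=True,
)
model.save_pretrained("./distilbert_neuron_dynamic_batch")
```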

docs/source/guides/models.mdx
# 'POSITIVE'
```

`compiler_args` are optional arguments for the compiler; they usually control how the compiler trades off inference performance (latency and throughput) against accuracy. Here we cast FP32 operations to BF16 using the Neuron matrix-multiplication engine.
Collaborator: Explaining compiler_args is important as well.
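
For reference, a hedged sketch of how `compiler_args` could appear alongside the explanation quoted above; the exact argument names and accepted values ("matmul", "bf16") are assumptions to be checked against the optimum-neuron and Neuron compiler documentation:

```python
from optimum.neuron import NeuronModelForSequenceClassification

# auto_cast/auto_cast_type ask the Neuron compiler to run FP32 matrix
# multiplications in BF16, trading a little accuracy for latency/throughput.
compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 1, "sequence_length": 64}

model = NeuronModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True,
    **compiler_args,
    **input_shapes,
)
```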


<Tip>

Be careful, the input shapes used for compilation should be inferior to the size of the inputs that you will feed into the model during inference.
Collaborator: I don't feel like this tip should be removed. We should remind users that padding is put in place, and that compiling with large static shapes means wasted compute.

dacorvo (author): What you suggest is a bit different from what was written originally. Can you suggest a new tip and indicate where I should insert it?

Collaborator:

Something like:

Be careful, we pad the inputs to the shapes used for compilation, but the inputs that you feed into the model during inference should have shapes smaller than the static shapes used for compilation, and the padding brings computation overhead.

It should go under the snippet about re-loading a pre-compiled model, I think (under line 247).
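
A small sketch of the padding behaviour the suggested tip describes, assuming a model compiled with a static sequence length of 128 (the model id and shapes here are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Assuming the model was compiled with sequence_length=128: shorter inputs are
# padded up to that length, and the padded positions are still computed, which
# is the overhead the tip warns about.
inputs = tokenizer(
    "Hamilton is considered to be the best musical of past years.",
    padding="max_length",
    max_length=128,
    truncation=True,  # inputs longer than the compiled shape must be truncated
    return_tensors="pt",
)
```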

docs/source/guides/neuronx_tgi.mdx (outdated)
dacorvo and others added 3 commits May 28, 2024 14:43
Co-authored-by: Jingya HUANG <44135271+JingyaHuang@users.noreply.github.com>
@JingyaHuang (Collaborator) left a comment:

Thanks for the PR, the doc looks great now!

@dacorvo dacorvo merged commit ad9e51b into main May 28, 2024
13 checks passed
@dacorvo dacorvo deleted the improve_tgi_deployment branch May 28, 2024 14:50