[DOCS] Latency highlight for OV devices + update of Optimize Inference for master #23575

Merged
@@ -23,44 +23,43 @@ Optimize Inference
optimizations that can be done independently. Inference
speed depends on latency and throughput.


Runtime optimization, or deployment optimization, focuses on tuning inference parameters and execution means (e.g., the optimum number of requests executed simultaneously). Unlike model-level optimizations, runtime optimizations are highly specific to the hardware and the use case they target, and often come at a cost.
``ov::hint::inference_precision`` is a "typical runtime configuration" which trades accuracy for performance, allowing ``fp16/bf16`` execution for the layers that remain in ``fp32`` after quantization of the original ``fp32`` model.

Therefore, optimization should start with defining the use case. For example, if it is about processing millions of samples by overnight jobs in data centers, throughput could be prioritized over latency. On the other hand, real-time usages would likely trade off throughput to deliver the results at minimal latency. A combined scenario is also possible, targeting the highest possible throughput, while maintaining a specific latency threshold.
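The trade-off described above can be made concrete with a small sketch. It is not part of OpenVINO; the helper names ``summarize`` and ``pick_config`` and all numbers in the usage note are hypothetical. It assumes you have already measured per-request latencies and total wall-clock time for each candidate configuration, and shows how the combined scenario (highest throughput under a latency threshold) might be expressed.

```python
from statistics import median


def summarize(latencies_ms, wall_time_s):
    """Return (median latency in ms, throughput in inferences per second).

    latencies_ms: per-request latencies measured by the application.
    wall_time_s:  total wall-clock time in which all requests completed.
    """
    return median(latencies_ms), len(latencies_ms) / wall_time_s


def pick_config(candidates, latency_budget_ms):
    """Pick the highest-throughput candidate that stays within the budget.

    candidates: dict mapping a configuration name to a
                (latency_ms, throughput_per_s) tuple of measured values.
    Returns the chosen name, or None if no candidate meets the budget.
    """
    feasible = {
        name: stats
        for name, stats in candidates.items()
        if stats[0] <= latency_budget_ms
    }
    if not feasible:
        return None
    # Among the configurations that meet the latency budget,
    # prefer the one with the highest throughput.
    return max(feasible, key=lambda name: feasible[name][1])
```

For instance, with hypothetical measurements ``{"a": (8.0, 100.0), "b": (20.0, 300.0), "c": (45.0, 500.0)}`` and a 25 ms budget, ``pick_config`` selects ``"b"``: configuration ``"c"`` has higher throughput but exceeds the latency threshold.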

It is also important to understand how the full-stack application would use the inference component "end-to-end." For example, it helps to know which stages need to be orchestrated to reduce the work devoted to fetching and preparing input data.

For more information on this topic, see the following articles:

* :doc:`Supported Devices <../../about-openvino/compatibility-and-support/supported-devices>`
* :doc:`Inference Devices and Modes <inference-devices-and-modes>`
* :ref:`Inputs Pre-processing with the OpenVINO <inputs_pre_processing>`
* :ref:`Async API <async_api>`
* :ref:`The 'get_tensor' Idiom <tensor_idiom>`
* For variably-sized inputs, consider :doc:`dynamic shapes <dynamic-shapes>`


See the :doc:`latency <optimize-inference/optimizing-latency>` and :doc:`throughput <optimize-inference/optimizing-throughput>` optimization guides for **use-case-specific optimizations**.

Writing Performance-Portable Inference Applications
###################################################

Although inference performed in OpenVINO Runtime can be configured with a multitude of low-level performance settings, doing so is not recommended in most cases. Firstly, achieving the best performance with such adjustments requires a deep understanding of the device architecture and the inference engine.


Secondly, such optimization may not translate well to other device-model combinations. In other words, one set of execution parameters is likely to result in different performance when used under different conditions. For example:

* Both the CPU and GPU support the notion of :doc:`streams <./optimize-inference/optimizing-throughput/advanced_throughput_options>`, yet they deduce their optimal number very differently.
* Even among devices of the same type, different execution configurations can be considered optimal, as in the case of instruction sets or the number of cores for the CPU and the batch size for the GPU.
* Different models have different optimal parameter configurations, considering factors such as compute vs memory-bandwidth, inference precision, and possible model quantization.
* Execution "scheduling" impacts performance strongly and is highly device-specific. For example, GPU-oriented optimizations like batching (combining multiple inputs to achieve the optimal throughput) :doc:`do not always map well to the CPU <optimize-inference/optimizing-low-level-implementation>`.


To make the configuration process much easier and its performance optimization more portable, the option of :doc:`Performance Hints <optimize-inference/high-level-performance-hints>` has been introduced. It comprises two high-level "presets" focused on either **latency** or **throughput** and, essentially, makes execution specifics irrelevant.

The Performance Hints functionality makes configuration transparent to the application. For example, it anticipates the need for explicit (application-side) batching or streams, and facilitates parallel processing of separate infer requests for different input sources.

Runtime, or deployment, optimization focuses on tuning inference and execution parameters. Unlike
model-level optimization, it is highly specific to the hardware you use and the goal you want
to achieve. You need to plan whether to prioritize accuracy or performance,
:doc:`throughput <optimize-inference/optimizing-throughput>` or :doc:`latency <optimize-inference/optimizing-latency>`,
or aim at the golden mean. You should also predict how scalable your application needs to be
and how exactly it is going to work with the inference component. This way, you will be able
to achieve the best results for your product.

.. note::

   For more information on this topic, see the following articles:

   * :doc:`Inference Devices and Modes <inference-devices-and-modes>`
   * :ref:`Inputs Pre-processing with the OpenVINO <inputs_pre_processing>`
   * :ref:`Async API <async_api>`
   * :ref:`The 'get_tensor' Idiom <tensor_idiom>`
   * For variably-sized inputs, consider :doc:`dynamic shapes <dynamic-shapes>`

Performance-Portable Inference
################################

To make configuration easier and performance optimization more portable, OpenVINO offers the
:doc:`Performance Hints <optimize-inference/high-level-performance-hints>` feature. It comprises
two high-level “presets” focused on latency **(default)** or throughput.

Although inference with OpenVINO Runtime can be configured with a multitude
of low-level performance settings, it is not recommended, as:

* It requires a deep understanding of the device architecture and the inference engine.
* It may not translate well to other device-model combinations. For example:

  * CPU and GPU deduce their optimal number of streams differently.
  * Different devices of the same type favor different execution configurations.
  * Different models favor different parameter configurations (e.g., compute vs memory-bandwidth,
    inference precision, and possible model quantization).
  * Execution “scheduling” impacts performance strongly and is highly device-specific. GPU-oriented
    optimizations :doc:`do not always map well to the CPU <optimize-inference/optimizing-low-level-implementation>`.

Additional Resources
####################
@@ -21,9 +21,9 @@ The hints, in contrast, respect the actual model, so the parameters for optimal
Performance Hints: Latency and Throughput
#########################################

As discussed in the :doc:`Optimization Guide <../optimize-inference>` there are a few different metrics associated with inference speed. Throughput and latency are some of the most widely used metrics that measure the overall performance of an application.
As discussed in the :doc:`Optimization Guide <../optimize-inference>` there are a few different metrics associated with inference speed. Latency and throughput are some of the most widely used metrics that measure the overall performance of an application.

Therefore, in order to ease the configuration of the device, OpenVINO offers two dedicated hints, namely ``ov::hint::PerformanceMode::THROUGHPUT`` and ``ov::hint::PerformanceMode::LATENCY``.
Therefore, in order to ease the configuration of the device, OpenVINO offers two dedicated hints, namely ``ov::hint::PerformanceMode::LATENCY`` **(default)** and ``ov::hint::PerformanceMode::THROUGHPUT``.
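As a rough illustration of how an application might select between the two hints, the sketch below builds a plain configuration dictionary. The helper ``hint_config`` and the use-case labels ``"real_time"`` and ``"batch"`` are hypothetical; the string keys ``PERFORMANCE_HINT`` and ``INFERENCE_PRECISION_HINT`` are assumed to be the string forms of ``ov::hint::performance_mode`` and ``ov::hint::inference_precision``. The block does not import OpenVINO, so it only shows the shape of the configuration, not the compilation itself.

```python
def hint_config(use_case, precision=None):
    """Build a device configuration dict from a coarse use-case label.

    use_case:  "real_time" (minimize latency) or "batch" (maximize throughput).
               These labels are hypothetical, chosen for this sketch only.
    precision: optional inference precision hint, e.g. "f16" or "bf16".
    """
    hints = {"real_time": "LATENCY", "batch": "THROUGHPUT"}
    if use_case not in hints:
        raise ValueError(f"unknown use case: {use_case!r}")

    config = {"PERFORMANCE_HINT": hints[use_case]}
    if precision is not None:
        # Optional accuracy/performance trade-off, left unset by default.
        config["INFERENCE_PRECISION_HINT"] = precision
    return config


# Possible usage, assuming OpenVINO is installed and "model.xml" is a
# placeholder path to your own model:
#
#   import openvino as ov
#   compiled = ov.Core().compile_model("model.xml", "CPU", hint_config("batch"))
```

Keeping the mapping in one place means the rest of the application only states its intent (latency or throughput), leaving device-specific details such as streams and batching to the runtime.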

For more information on conducting performance measurements with the ``benchmark_app``, refer to the last section in this document.
