From eaabf9ecf7eedc6e5c02235e84f298c93bc851ef Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Mon, 9 Oct 2023 20:19:44 -0700 Subject: [PATCH 01/11] Add documentation on request cancellation --- docs/README.md | 3 ++ docs/user_guide/request_cancellation.md | 69 +++++++++++++++++++++++++ 2 files changed, 72 insertions(+) create mode 100644 docs/user_guide/request_cancellation.md diff --git a/docs/README.md b/docs/README.md index 6fa3da5180..6528634633 100644 --- a/docs/README.md +++ b/docs/README.md @@ -69,6 +69,7 @@ The User Guide describes how to configure Triton, organize and configure your mo * Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)] * Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)] * Using the Client API [[Overview](README.md#client-libraries-and-examples) || [Details](https://github.com/triton-inference-server/client)] +* Cancelling Inflight Inference Request [[Overview](README.md#cancelling-inflight-inference-request) || [Details](user_guide/request_cancellation.md)] * Analyzing Performance [[Overview](README.md#performance-analysis)] * Deploying on edge (Jetson) [[Overview](README.md#jetson-and-jetpack)] * Debugging Guide [Details](./user_guide/debugging_guide.md) @@ -165,6 +166,8 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t - [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java) - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) - [Shared Memory Extension](protocol/extension_shared_memory.md) +### Cancelling Inflight Inference Request +Triton can detect and handle request that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature. 
### Performance Analysis Understanding inference performance is key to better resource utilization. Use Triton's Tools to customize your deployment. - [Performance Tuning Guide](user_guide/performance_tuning.md) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md new file mode 100644 index 0000000000..ca98d0f03d --- /dev/null +++ b/docs/user_guide/request_cancellation.md @@ -0,0 +1,69 @@ + + +# Request Cancellation + +Starting from 23.10, Triton supports handling request cancellation received +from the gRPC client or a C API user. Long-running inference requests such +as for auto-generative large language models may run for an indeterminate +amount of time or indeterminate number of steps. Additionally, clients may +enqueue a large number of requests as part of a sequence or request stream +and later determine the results are no longer needed. Continuing to process +requests whose results are no longer required can significantly impact server +resources. + +[In-Process Triton Server C API](../customization_guide/inference_protocols.md#in-process-triton-server-api) has been enhanced with `TRITONSERVER_InferenceRequestCancel` +and `TRITONSERVER_InferenceRequestIsCancelled` to cancel and query the cancellation +status of an inflight request. Read more about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + +In addition, [gRPC endpoint](../customization_guide/inference_protocols.md#httprest-and-grpc-protocols) can +now detect cancelation from the client and attempt to terminate request. +At present, only gRPC python client supports issuing request cancellation +to the server endpoint. See [request-cancellation](https://github.com/triton-inference-server/client#request-cancellation) +for more details on how to issue requests from the client-side. +See gRPC guide on RPC [cancellation](https://grpc.io/docs/guides/cancellation/) for
finer details. 
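The client-to-server flow the added document describes — a client drops interest in a result, and the server notices at its next check — can be pictured with a small stdlib sketch. This is an analogue only: the real mechanism is gRPC RPC cancellation issued through the Triton Python client linked above, and the `InflightRequest`/`serve` names here are invented for illustration:

```python
import threading
import time

class InflightRequest:
    """Illustrative stand-in for an in-flight inference call; not a Triton client API."""

    def __init__(self):
        self._cancelled = threading.Event()
        self.result = None

    def cancel(self):
        # Client side: signal that the result is no longer needed.
        self._cancelled.set()

    def is_cancelled(self):
        return self._cancelled.is_set()

def serve(request, num_steps=100):
    # Server side: a long-running, step-wise execution loop (think token
    # generation) that polls for cancellation between steps instead of
    # always running to completion.
    for _ in range(num_steps):
        if request.is_cancelled():
            return "CANCELLED"
        time.sleep(0.01)  # one "generation step"
    return "OK"

request = InflightRequest()
request.cancel()  # client decides the result is no longer needed
worker = threading.Thread(target=lambda: setattr(request, "result", serve(request)))
worker.start()
worker.join()
print(request.result)  # CANCELLED
```

The point of the sketch is the polling contract: cancellation is cooperative, so work stops only at the next check, not instantaneously.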
+ +Upon receiving request cancellation, triton does its best to cancel request +at various points. However, once a request has been given to the backend +for execution, it is up to the individual backends to detect and handle +request termination. +Currently, following backend(s) support(s) early termination: + - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) + +**For the backend developer**: The backend APIs have also been enhanced to let the +backend detect whether the request received from Triton core has been cancelled. +See `TRITONBACKEND_RequestIsCancelled` and `TRITONBACKEND_ResponseFactoryIsCancelled` +in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) +for more details. The backend upon detecting request cancellation can stop processing +it any further. +The python models running behind python backend can also query the cancellation status +of request and response_sender. See [this](https://github.com/triton-inference-server/python_backend#request-cancellation-handling) +section in python backend documentation for more details. + From 99cb38e97ee0efa2e1c4d8ec069b585d305c253d Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 11:26:11 -0700 Subject: [PATCH 02/11] Include python backend --- docs/user_guide/request_cancellation.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index ca98d0f03d..03519dc1b8 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -42,12 +42,12 @@ and `TRITONSERVER_InferenceRequestIsCancelled` to cancel and query the cancellat status of an inflight request. Read more about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). 
In addition, [gRPC endpoint](../customization_guide/inference_protocols.md#httprest-and-grpc-protocols) can -now detect cancelation from the client and attempt to terminate request. +now detect cancellation from the client and attempt to terminate request. At present, only gRPC python client supports issuing request cancellation to the server endpoint. See [request-cancellation](https://github.com/triton-inference-server/client#request-cancellation) for more details on how to issue requests from the client-side. See gRPC guide on RPC [cancellation](https://grpc.io/docs/guides/cancellation/) for -finer details. +finer details. Upon receiving request cancellation, triton does its best to cancel request @@ -56,9 +56,14 @@ for execution, it is upto the individual backends to detect and handle request termination. Currently, following backend(s) support(s) early termination: - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) + - [python backend](https://github.com/triton-inference-server/python_backend) + +Python backend is a special case where we expose the APIs to detect cancellation +status of the request but it is upto the `model.py` developer to detect whether +the request is cancelled and terminate further execution. **For the backend developer**: The backend APIs have also been enhanced to let the -backend detect whether the request received from Triton core has been cancelled. +backend detect whether the request received from Triton core has been cancelled. See `TRITONBACKEND_RequestIsCancelled` and `TRITONBACKEND_ResponseFactoryIsCancelled` in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) for more details. 
The backend upon detecting request cancellation can stop processing From eb4702da0db59717367465afcbf50264d551fff3 Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 11:34:14 -0700 Subject: [PATCH 03/11] Update docs/user_guide/request_cancellation.md Co-authored-by: Iman Tabrizian --- docs/user_guide/request_cancellation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 03519dc1b8..4cf19fb2ee 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -68,7 +68,7 @@ See `TRITONBACKEND_RequestIsCancelled` and `TRITONBACKEND_ResponseFactoryIsCance in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) for more details. The backend upon detecting request cancellation can stop processing it any further. -The python models running behind python backend can also query the cancellation status +The Python models running behind Python backend can also query the cancellation status of request and response_sender. See [this](https://github.com/triton-inference-server/python_backend#request-cancellation-handling) section in python backend documentation for more details. 
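For Python models that stream results (the `response_sender` mentioned above), the useful pattern is to check for cancellation before each partial response. A rough stdlib sketch of that pattern — `FakeResponseSender` is a hand-rolled stand-in so the snippet is self-contained; real code would use the `response_sender` object provided by the Python backend, per the linked documentation:

```python
class FakeResponseSender:
    """Stand-in for a decoupled-model response sender; illustrative only."""

    def __init__(self, cancel_after):
        self._sent = 0
        self._cancel_after = cancel_after  # pretend the client cancels after N responses

    def is_cancelled(self):
        return self._sent >= self._cancel_after

    def send(self, response):
        self._sent += 1

def stream_responses(sender, tokens):
    # Streaming-model pattern: before producing each incremental response,
    # check whether the client has cancelled and stop streaming if so.
    sent = []
    for tok in tokens:
        if sender.is_cancelled():
            break
        sender.send(tok)
        sent.append(tok)
    return sent

sender = FakeResponseSender(cancel_after=2)
print(stream_responses(sender, ["a", "b", "c", "d"]))  # ['a', 'b']
```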
From f25f6a071d3c7806ecbc6ae1d93545de33d38f60 Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 11:48:10 -0700 Subject: [PATCH 04/11] Update docs/user_guide/request_cancellation.md Co-authored-by: Neelay Shah --- docs/user_guide/request_cancellation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 4cf19fb2ee..e25bef5dd3 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -59,7 +59,7 @@ Currently, following backend(s) support(s) early termination: - [python backend](https://github.com/triton-inference-server/python_backend) Python backend is a special case where we expose the APIs to detect cancellation -status of the request but it is upto the `model.py` developer to detect whether +status of the request but it is up to the `model.py` developer to detect whether the request is cancelled and terminate further execution. 
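The `model.py` responsibility described in this patch amounts to a guard in the per-request loop of `execute`. A minimal sketch — `FakeRequest` is a stand-in so the snippet runs on its own; a real model would receive the backend's request objects and build proper inference responses (e.g. via `triton_python_backend_utils`) instead of returning strings:

```python
class FakeRequest:
    """Stand-in for an inference request; only models the cancellation query."""

    def __init__(self, cancelled=False):
        self._cancelled = cancelled

    def is_cancelled(self):
        return self._cancelled

class TritonPythonModel:
    # Sketch of the pattern a model.py author might use: query the request's
    # cancellation status up front and skip the expensive computation if set.
    def execute(self, requests):
        responses = []
        for request in requests:
            if request.is_cancelled():
                # Real code would return a response flagged as cancelled here.
                responses.append("CANCELLED")
                continue
            responses.append("OK")  # placeholder for actual inference work
        return responses

model = TritonPythonModel()
print(model.execute([FakeRequest(False), FakeRequest(True)]))  # ['OK', 'CANCELLED']
```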
**For the backend developer**: The backend APIs have also been enhanced to let the From 7918a17ad0bacde0601354f3d81e71e38936ed82 Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 13:05:08 -0700 Subject: [PATCH 05/11] Update docs/README.md Co-authored-by: Neelay Shah --- docs/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/README.md b/docs/README.md index 6528634633..2d6511ad92 100644 --- a/docs/README.md +++ b/docs/README.md @@ -69,7 +69,7 @@ The User Guide describes how to configure Triton, organize and configure your mo * Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)] * Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)] * Using the Client API [[Overview](README.md#client-libraries-and-examples) || [Details](https://github.com/triton-inference-server/client)] -* Cancelling Inflight Inference Request [[Overview](README.md#cancelling-inflight-inference-request) || [Details](user_guide/request_cancellation.md)] +* Cancelling Inference Requests [[Overview](README.md#cancelling-inference-requests) || [Details](user_guide/request_cancellation.md)] * Analyzing Performance [[Overview](README.md#performance-analysis)] * Deploying on edge (Jetson) [[Overview](README.md#jetson-and-jetpack)] * Debugging Guide [Details](./user_guide/debugging_guide.md) From e1083a5577d9883653ec3e7d6ade1460567a483c Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 13:05:19 -0700 Subject: [PATCH 06/11] Update docs/user_guide/request_cancellation.md Co-authored-by: Ryan McCormick --- docs/user_guide/request_cancellation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index e25bef5dd3..38703ea3a0 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -54,7 +54,7 
@@ Upon receiving request cancellation, triton does its best to cancel request at various points. However, once a request has been given to the backend for execution, it is up to the individual backends to detect and handle request termination. -Currently, following backend(s) support(s) early termination: +Currently, the following backends support early termination: - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) - [python backend](https://github.com/triton-inference-server/python_backend) From 16a99bf826b5971c391ca822ede77cf0c47755ed Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 13:08:02 -0700 Subject: [PATCH 07/11] Remove inflight term from the main documentation --- docs/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/README.md b/docs/README.md index 2d6511ad92..8a893b5b1f 100644 --- a/docs/README.md +++ b/docs/README.md @@ -166,7 +166,7 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t - [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java) - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) - [Shared Memory Extension](protocol/extension_shared_memory.md) -### Cancelling Inflight Inference Request +### Cancelling Inference Requests Triton can detect and handle request that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature. ### Performance Analysis Understanding inference performance is key to better resource utilization. Use Triton's Tools to customize your deployment. 
From f630170c3bc77f33a022768069423129c5595751 Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 15:27:43 -0700 Subject: [PATCH 08/11] Address review comments --- docs/README.md | 2 +- docs/user_guide/request_cancellation.md | 26 ++++++++++++++++++++++++- 2 files changed, 26 insertions(+), 2 deletions(-) diff --git a/docs/README.md b/docs/README.md index 8a893b5b1f..22e0c0d691 100644 --- a/docs/README.md +++ b/docs/README.md @@ -167,7 +167,7 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) - [Shared Memory Extension](protocol/extension_shared_memory.md) ### Cancelling Inference Requests -Triton can detect and handle request that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature. +Triton can detect and handle requests that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature. ### Performance Analysis Understanding inference performance is key to better resource utilization. Use Triton's Tools to customize your deployment. - [Performance Tuning Guide](user_guide/performance_tuning.md) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 38703ea3a0..5873cc5f51 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -37,10 +37,16 @@ and later determine the results are no longer needed. Continuing to process requests whose results are no longer required can significantly impact server resources. 
+## Issuing Request Cancellation + +### Triton C API + [In-Process Triton Server C API](../customization_guide/inference_protocols.md#in-process-triton-server-api) has been enhanced with `TRITONSERVER_InferenceRequestCancel` and `TRITONSERVER_InferenceRequestIsCancelled` to cancel and query the cancellation status of an inflight request. Read more about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). +### gRPC Endpoint + In addition, [gRPC endpoint](../customization_guide/inference_protocols.md#httprest-and-grpc-protocols) can now detect cancellation from the client and attempt to terminate request. At present, only gRPC python client supports issuing request cancellation @@ -49,8 +55,26 @@ for more details on how to issue requests from the client-side. See gRPC guide on RPC [cancellation](https://grpc.io/docs/guides/cancellation/) for finer details. +## Handling in Triton Core + +Triton core checks for requests that have been cancelled at some critical points +when using [dynamic](./model_configuration.md#dynamic-batcher) or +[sequence batching](./model_configuration.md#sequence-batcher). We also test for +the cancelled requests after every [ensemble](./model_configuration.md#ensemble-scheduler) +step and terminate further processing the requests. + +On detecting a cancelled request, Triton core responds with CANCELLED status. If a request +is cancelled when using [sequence_batching](./model_configuration.md#sequence-batcher), +then all the pending requests in the same sequence will also be cancelled. The sequence +is represented by the requests that have the same sequence ID. + +**Note**: Currently, Triton core does not detect cancellation status of a request once +it is forwarded to the [rate limiter](./rate_limiter.md). Improving the request cancellation +detection and handling within Triton core is work in progress. 
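The core-side behaviour this hunk describes — checking between scheduling/ensemble steps and fanning a cancellation out across a whole sequence — can be sketched in a few lines of toy Python. This is not Triton core code; the dict-based request representation and function names are invented for illustration:

```python
def run_ensemble(request, steps):
    # Mimics the core behaviour: between each ensemble step, test whether the
    # request was cancelled and terminate further processing if so.
    for step in steps:
        if request["cancelled"]:
            return "CANCELLED"  # respond with CANCELLED status
        step(request)
    return "OK"

def cancel_sequence(requests, sequence_id):
    # Cancelling a request in a sequence cancels all pending requests that
    # carry the same sequence id.
    for r in requests:
        if r["sequence_id"] == sequence_id:
            r["cancelled"] = True

requests = [
    {"sequence_id": 7, "cancelled": False},
    {"sequence_id": 7, "cancelled": False},
    {"sequence_id": 9, "cancelled": False},
]
cancel_sequence(requests, sequence_id=7)
results = [run_ensemble(r, steps=[lambda r: None]) for r in requests]
print(results)  # ['CANCELLED', 'CANCELLED', 'OK']
```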
+ +## Handling in Backend -Upon receiving request cancellation, triton does its best to cancel request +Upon receiving request cancellation, Triton does its best to terminate the request at various points. However, once a request has been given to the backend for execution, it is up to the individual backends to detect and handle request termination. Currently, the following backends support early termination: - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) - [python backend](https://github.com/triton-inference-server/python_backend) Python backend is a special case where we expose the APIs to detect cancellation status of the request but it is up to the `model.py` developer to detect whether the request is cancelled and terminate further execution. From c230a35cc0829b090ba128813274b22a6fac81ff Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 15:44:19 -0700 Subject: [PATCH 09/11] Fix --- docs/user_guide/request_cancellation.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 5873cc5f51..8068de6ba2 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -79,8 +79,8 @@ at various points. However, once a request has been given to the backend for execution, it is up to the individual backends to detect and handle request termination. 
Currently, the following backends support early termination: - - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) - - [python backend](https://github.com/triton-inference-server/python_backend) +- [vLLM backend](https://github.com/triton-inference-server/vllm_backend) +- [python backend](https://github.com/triton-inference-server/python_backend) Python backend is a special case where we expose the APIs to detect cancellation status of the request but it is up to the `model.py` developer to detect whether the request is cancelled and terminate further execution. From 8e6918418694549ff6f59191e64a4cbf42a016a3 Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 18:04:14 -0700 Subject: [PATCH 10/11] Update docs/user_guide/request_cancellation.md Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> --- docs/user_guide/request_cancellation.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 8068de6ba2..00f3545640 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -59,9 +59,10 @@ finer details. Triton core checks for requests that have been cancelled at some critical points when using [dynamic](./model_configuration.md#dynamic-batcher) or -[sequence batching](./model_configuration.md#sequence-batcher). We also test for -the cancelled requests after every [ensemble](./model_configuration.md#ensemble-scheduler) -step and terminate further processing the requests. +[sequence](./model_configuration.md#sequence-batcher) batching. The checking is +also performed between each +[ensemble](./model_configuration.md#ensemble-scheduler) step and terminates +further processing if the request is cancelled. On detecting a cancelled request, Triton core responds with CANCELLED status. 
If a request is cancelled when using [sequence_batching](./model_configuration.md#sequence-batcher), From 11ade3d53d4e0a6c03c82b7e6d4d2067904094af Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 18:09:30 -0700 Subject: [PATCH 11/11] Fix --- docs/user_guide/request_cancellation.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 00f3545640..49865f25c8 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -42,8 +42,10 @@ resources. ### Triton C API [In-Process Triton Server C API](../customization_guide/inference_protocols.md#in-process-triton-server-api) has been enhanced with `TRITONSERVER_InferenceRequestCancel` -and `TRITONSERVER_InferenceRequestIsCancelled` to cancel and query the cancellation -status of an inflight request. Read more about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). +and `TRITONSERVER_InferenceRequestIsCancelled` to issue cancellation and query +whether cancellation has been issued on an inflight request, respectively. Read more +about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + ### gRPC Endpoint