From eaabf9ecf7eedc6e5c02235e84f298c93bc851ef Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Mon, 9 Oct 2023 20:19:44 -0700 Subject: [PATCH 01/11] Add documentation on request cancellation --- docs/README.md | 3 ++ docs/user_guide/request_cancellation.md | 69 +++++++++++++++++++++++++ 2 files changed, 72 insertions(+) create mode 100644 docs/user_guide/request_cancellation.md diff --git a/docs/README.md b/docs/README.md index 6fa3da5180..6528634633 100644 --- a/docs/README.md +++ b/docs/README.md @@ -69,6 +69,7 @@ The User Guide describes how to configure Triton, organize and configure your mo * Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)] * Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)] * Using the Client API [[Overview](README.md#client-libraries-and-examples) || [Details](https://github.com/triton-inference-server/client)] +* Cancelling Inflight Inference Request [[Overview](README.md#cancelling-inflight-inference-request) || [Details](user_guide/request_cancellation.md)] * Analyzing Performance [[Overview](README.md#performance-analysis)] * Deploying on edge (Jetson) [[Overview](README.md#jetson-and-jetpack)] * Debugging Guide [Details](./user_guide/debugging_guide.md) @@ -165,6 +166,8 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t - [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java) - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) - [Shared Memory Extension](protocol/extension_shared_memory.md) +### Cancelling Inflight Inference Request +Triton can detect and handle request that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature. 
### Performance Analysis Understanding inference performance is key to better resource utilization. Use Triton's Tools to customize your deployment. - [Performance Tuning Guide](user_guide/performance_tuning.md) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md new file mode 100644 index 0000000000..ca98d0f03d --- /dev/null +++ b/docs/user_guide/request_cancellation.md @@ -0,0 +1,69 @@ + + +# Request Cancellation + +Starting from 23.10, Triton supports handling request cancellation received +from the gRPC client or a C API user. Long-running inference requests such +as for auto-generative large language models may run for an indeterminate +amount of time or indeterminate number of steps. Additionally, clients may +enqueue a large number of requests as part of a sequence or request stream +and later determine the results are no longer needed. Continuing to process +requests whose results are no longer required can significantly impact server +resources. + +[In-Process Triton Server C API](../customization_guide/inference_protocols.md#in-process-triton-server-api) has been enhanced with `TRITONSERVER_InferenceRequestCancel` +and `TRITONSERVER_InferenceRequestIsCancelled` to cancel and query the cancellation +status of an inflight request. Read more about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + +In addition, [gRPC endpoint](../customization_guide/inference_protocols.md#httprest-and-grpc-protocols) can +now detect cancelation from the client and attempt to terminate request. +At present, only gRPC python client supports issuing request cancellation +to the server endpoint. See [request-cancellation](https://github.com/triton-inference-server/client#request-cancellation) +for more details on how to issue requests from the client-side. +See gRPC guide on RPC [cancellation](https://grpc.io/docs/guides/cancellation/) for
finer details. 
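The client-to-server flow the added document describes — a client drops interest in a result, and the server notices at its next check — can be pictured with a small stdlib sketch. This is an analogue only: the real mechanism is gRPC RPC cancellation issued through the Triton Python client linked above, and the `InflightRequest`/`serve` names here are invented for illustration:

```python
import threading
import time

class InflightRequest:
    """Illustrative stand-in for an in-flight inference call; not a Triton client API."""

    def __init__(self):
        self._cancelled = threading.Event()
        self.result = None

    def cancel(self):
        # Client side: signal that the result is no longer needed.
        self._cancelled.set()

    def is_cancelled(self):
        return self._cancelled.is_set()

def serve(request, num_steps=100):
    # Server side: a long-running, step-wise execution loop (think token
    # generation) that polls for cancellation between steps instead of
    # always running to completion.
    for _ in range(num_steps):
        if request.is_cancelled():
            return "CANCELLED"
        time.sleep(0.01)  # one "generation step"
    return "OK"

request = InflightRequest()
request.cancel()  # client decides the result is no longer needed
worker = threading.Thread(target=lambda: setattr(request, "result", serve(request)))
worker.start()
worker.join()
print(request.result)  # CANCELLED
```

The point of the sketch is the polling contract: cancellation is cooperative, so work stops only at the next check, not instantaneously.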
+ +Upon receiving request cancellation, triton does its best to cancel request +at various points. However, once a request has been given to the backend +for execution, it is up to the individual backends to detect and handle +request termination. +Currently, following backend(s) support(s) early termination: + - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) + +**For the backend developer**: The backend APIs have also been enhanced to let the +backend detect whether the request received from Triton core has been cancelled. +See `TRITONBACKEND_RequestIsCancelled` and `TRITONBACKEND_ResponseFactoryIsCancelled` +in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) +for more details. The backend upon detecting request cancellation can stop processing +it any further. +The python models running behind python backend can also query the cancellation status +of request and response_sender. See [this](https://github.com/triton-inference-server/python_backend#request-cancellation-handling) +section in python backend documentation for more details. + From 99cb38e97ee0efa2e1c4d8ec069b585d305c253d Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 11:26:11 -0700 Subject: [PATCH 02/11] Include python backend --- docs/user_guide/request_cancellation.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index ca98d0f03d..03519dc1b8 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -42,12 +42,12 @@ and `TRITONSERVER_InferenceRequestIsCancelled` to cancel and query the cancellat status of an inflight request. Read more about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). 
In addition, [gRPC endpoint](../customization_guide/inference_protocols.md#httprest-and-grpc-protocols) can -now detect cancelation from the client and attempt to terminate request. +now detect cancellation from the client and attempt to terminate request. At present, only gRPC python client supports issuing request cancellation to the server endpoint. See [request-cancellation](https://github.com/triton-inference-server/client#request-cancellation) for more details on how to issue requests from the client-side. See gRPC guide on RPC [cancellation](https://grpc.io/docs/guides/cancellation/) for -finer details. +finer details. Upon receiving request cancellation, triton does its best to cancel request @@ -56,9 +56,14 @@ for execution, it is upto the individual backends to detect and handle request termination. Currently, following backend(s) support(s) early termination: - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) + - [python backend](https://github.com/triton-inference-server/python_backend) + +Python backend is a special case where we expose the APIs to detect cancellation +status of the request but it is upto the `model.py` developer to detect whether +the request is cancelled and terminate further execution. **For the backend developer**: The backend APIs have also been enhanced to let the -backend detect whether the request received from Triton core has been cancelled. +backend detect whether the request received from Triton core has been cancelled. See `TRITONBACKEND_RequestIsCancelled` and `TRITONBACKEND_ResponseFactoryIsCancelled` in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) for more details. 
The backend upon detecting request cancellation can stop processing From eb4702da0db59717367465afcbf50264d551fff3 Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 11:34:14 -0700 Subject: [PATCH 03/11] Update docs/user_guide/request_cancellation.md Co-authored-by: Iman Tabrizian --- docs/user_guide/request_cancellation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 03519dc1b8..4cf19fb2ee 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -68,7 +68,7 @@ See `TRITONBACKEND_RequestIsCancelled` and `TRITONBACKEND_ResponseFactoryIsCance in [tritonbackend.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h) for more details. The backend upon detecting request cancellation can stop processing it any further. -The python models running behind python backend can also query the cancellation status +The Python models running behind Python backend can also query the cancellation status of request and response_sender. See [this](https://github.com/triton-inference-server/python_backend#request-cancellation-handling) section in python backend documentation for more details. 
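For Python models that stream results (the `response_sender` mentioned above), the useful pattern is to check for cancellation before each partial response. A rough stdlib sketch of that pattern — `FakeResponseSender` is a hand-rolled stand-in so the snippet is self-contained; real code would use the `response_sender` object provided by the Python backend, per the linked documentation:

```python
class FakeResponseSender:
    """Stand-in for a decoupled-model response sender; illustrative only."""

    def __init__(self, cancel_after):
        self._sent = 0
        self._cancel_after = cancel_after  # pretend the client cancels after N responses

    def is_cancelled(self):
        return self._sent >= self._cancel_after

    def send(self, response):
        self._sent += 1

def stream_responses(sender, tokens):
    # Streaming-model pattern: before producing each incremental response,
    # check whether the client has cancelled and stop streaming if so.
    sent = []
    for tok in tokens:
        if sender.is_cancelled():
            break
        sender.send(tok)
        sent.append(tok)
    return sent

sender = FakeResponseSender(cancel_after=2)
print(stream_responses(sender, ["a", "b", "c", "d"]))  # ['a', 'b']
```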
From f25f6a071d3c7806ecbc6ae1d93545de33d38f60 Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 11:48:10 -0700 Subject: [PATCH 04/11] Update docs/user_guide/request_cancellation.md Co-authored-by: Neelay Shah --- docs/user_guide/request_cancellation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 4cf19fb2ee..e25bef5dd3 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -59,7 +59,7 @@ Currently, following backend(s) support(s) early termination: - [python backend](https://github.com/triton-inference-server/python_backend) Python backend is a special case where we expose the APIs to detect cancellation -status of the request but it is upto the `model.py` developer to detect whether +status of the request but it is up to the `model.py` developer to detect whether the request is cancelled and terminate further execution. 
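The `model.py` responsibility described in this patch amounts to a guard in the per-request loop of `execute`. A minimal sketch — `FakeRequest` is a stand-in so the snippet runs on its own; a real model would receive the backend's request objects and build proper inference responses (e.g. via `triton_python_backend_utils`) instead of returning strings:

```python
class FakeRequest:
    """Stand-in for an inference request; only models the cancellation query."""

    def __init__(self, cancelled=False):
        self._cancelled = cancelled

    def is_cancelled(self):
        return self._cancelled

class TritonPythonModel:
    # Sketch of the pattern a model.py author might use: query the request's
    # cancellation status up front and skip the expensive computation if set.
    def execute(self, requests):
        responses = []
        for request in requests:
            if request.is_cancelled():
                # Real code would return a response flagged as cancelled here.
                responses.append("CANCELLED")
                continue
            responses.append("OK")  # placeholder for actual inference work
        return responses

model = TritonPythonModel()
print(model.execute([FakeRequest(False), FakeRequest(True)]))  # ['OK', 'CANCELLED']
```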
**For the backend developer**: The backend APIs have also been enhanced to let the From 7918a17ad0bacde0601354f3d81e71e38936ed82 Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 13:05:08 -0700 Subject: [PATCH 05/11] Update docs/README.md Co-authored-by: Neelay Shah --- docs/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/README.md b/docs/README.md index 6528634633..2d6511ad92 100644 --- a/docs/README.md +++ b/docs/README.md @@ -69,7 +69,7 @@ The User Guide describes how to configure Triton, organize and configure your mo * Collecting Server Metrics [[Overview](README.md#metrics) || [Details](user_guide/metrics.md)] * Supporting Custom Ops/layers [[Overview](README.md#framework-custom-operations) || [Details](user_guide/custom_operations.md)] * Using the Client API [[Overview](README.md#client-libraries-and-examples) || [Details](https://github.com/triton-inference-server/client)] -* Cancelling Inflight Inference Request [[Overview](README.md#cancelling-inflight-inference-request) || [Details](user_guide/request_cancellation.md)] +* Cancelling Inference Requests [[Overview](README.md#cancelling-inference-requests) || [Details](user_guide/request_cancellation.md)] * Analyzing Performance [[Overview](README.md#performance-analysis)] * Deploying on edge (Jetson) [[Overview](README.md#jetson-and-jetpack)] * Debugging Guide [Details](./user_guide/debugging_guide.md) From e1083a5577d9883653ec3e7d6ade1460567a483c Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 13:05:19 -0700 Subject: [PATCH 06/11] Update docs/user_guide/request_cancellation.md Co-authored-by: Ryan McCormick --- docs/user_guide/request_cancellation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index e25bef5dd3..38703ea3a0 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -54,7 +54,7 
@@ Upon receiving request cancellation, triton does its best to cancel request at various points. However, once a request has been given to the backend for execution, it is up to the individual backends to detect and handle request termination. -Currently, following backend(s) support(s) early termination: +Currently, the following backends support early termination: - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) - [python backend](https://github.com/triton-inference-server/python_backend) From 16a99bf826b5971c391ca822ede77cf0c47755ed Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 13:08:02 -0700 Subject: [PATCH 07/11] Remove inflight term from the main documentation --- docs/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/README.md b/docs/README.md index 2d6511ad92..8a893b5b1f 100644 --- a/docs/README.md +++ b/docs/README.md @@ -166,7 +166,7 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t - [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java) - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) - [Shared Memory Extension](protocol/extension_shared_memory.md) -### Cancelling Inflight Inference Request +### Cancelling Inference Requests Triton can detect and handle request that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature. ### Performance Analysis Understanding inference performance is key to better resource utilization. Use Triton's Tools to customize your deployment. 
From f630170c3bc77f33a022768069423129c5595751 Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 15:27:43 -0700 Subject: [PATCH 08/11] Address review comments --- docs/README.md | 2 +- docs/user_guide/request_cancellation.md | 26 ++++++++++++++++++++++++- 2 files changed, 26 insertions(+), 2 deletions(-) diff --git a/docs/README.md b/docs/README.md index 8a893b5b1f..22e0c0d691 100644 --- a/docs/README.md +++ b/docs/README.md @@ -167,7 +167,7 @@ Use the [Triton Client](https://github.com/triton-inference-server/client) API t - [Javascript](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/javascript) - [Shared Memory Extension](protocol/extension_shared_memory.md) ### Cancelling Inference Requests -Triton can detect and handle request that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature. +Triton can detect and handle requests that have been cancelled from the client-side. This [document](user_guide/request_cancellation.md) discusses scope and limitations of the feature. ### Performance Analysis Understanding inference performance is key to better resource utilization. Use Triton's Tools to customize your deployment. - [Performance Tuning Guide](user_guide/performance_tuning.md) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 38703ea3a0..5873cc5f51 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -37,10 +37,16 @@ and later determine the results are no longer needed. Continuing to process requests whose results are no longer required can significantly impact server resources. 
+## Issuing Request Cancellation + +### Triton C API + [In-Process Triton Server C API](../customization_guide/inference_protocols.md#in-process-triton-server-api) has been enhanced with `TRITONSERVER_InferenceRequestCancel` and `TRITONSERVER_InferenceRequestIsCancelled` to cancel and query the cancellation status of an inflight request. Read more about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). +### gRPC Endpoint + In addition, [gRPC endpoint](../customization_guide/inference_protocols.md#httprest-and-grpc-protocols) can now detect cancellation from the client and attempt to terminate request. At present, only gRPC python client supports issuing request cancellation @@ -49,8 +55,26 @@ for more details on how to issue requests from the client-side. See gRPC guide on RPC [cancellation](https://grpc.io/docs/guides/cancellation/) for finer details. +## Handling in Triton Core + +Triton core checks for requests that have been cancelled at some critical points +when using [dynamic](./model_configuration.md#dynamic-batcher) or +[sequence batching](./model_configuration.md#sequence-batcher). We also test for +the cancelled requests after every [ensemble](./model_configuration.md#ensemble-scheduler) +step and terminate further processing the requests. + +On detecting a cancelled request, Triton core responds with CANCELLED status. If a request +is cancelled when using [sequence_batching](./model_configuration.md#sequence-batcher), +then all the pending requests in the same sequence will also be cancelled. The sequence +is represented by the requests that have the same sequence ID. + +**Note**: Currently, Triton core does not detect cancellation status of a request once +it is forwarded to the [rate limiter](./rate_limiter.md). Improving the request cancellation +detection and handling within Triton core is work in progress. 
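The core-side behaviour this hunk describes — checking between scheduling/ensemble steps and fanning a cancellation out across a whole sequence — can be sketched in a few lines of toy Python. This is not Triton core code; the dict-based request representation and function names are invented for illustration:

```python
def run_ensemble(request, steps):
    # Mimics the core behaviour: between each ensemble step, test whether the
    # request was cancelled and terminate further processing if so.
    for step in steps:
        if request["cancelled"]:
            return "CANCELLED"  # respond with CANCELLED status
        step(request)
    return "OK"

def cancel_sequence(requests, sequence_id):
    # Cancelling a request in a sequence cancels all pending requests that
    # carry the same sequence id.
    for r in requests:
        if r["sequence_id"] == sequence_id:
            r["cancelled"] = True

requests = [
    {"sequence_id": 7, "cancelled": False},
    {"sequence_id": 7, "cancelled": False},
    {"sequence_id": 9, "cancelled": False},
]
cancel_sequence(requests, sequence_id=7)
results = [run_ensemble(r, steps=[lambda r: None]) for r in requests]
print(results)  # ['CANCELLED', 'CANCELLED', 'OK']
```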
+ +## Handling in Backend -Upon receiving request cancellation, triton does its best to cancel request +Upon receiving request cancellation, Triton does its best to terminate the request at various points. However, once a request has been given to the backend for execution, it is up to the individual backends to detect and handle request termination. Currently, the following backends support early termination: - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) - [python backend](https://github.com/triton-inference-server/python_backend) Python backend is a special case where we expose the APIs to detect cancellation status of the request but it is up to the `model.py` developer to detect whether the request is cancelled and terminate further execution. From c230a35cc0829b090ba128813274b22a6fac81ff Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 15:44:19 -0700 Subject: [PATCH 09/11] Fix --- docs/user_guide/request_cancellation.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 5873cc5f51..8068de6ba2 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -79,8 +79,8 @@ at various points. However, once a request has been given to the backend for execution, it is up to the individual backends to detect and handle request termination. 
Currently, the following backends support early termination: - - [vLLM backend](https://github.com/triton-inference-server/vllm_backend) - - [python backend](https://github.com/triton-inference-server/python_backend) +- [vLLM backend](https://github.com/triton-inference-server/vllm_backend) +- [python backend](https://github.com/triton-inference-server/python_backend) Python backend is a special case where we expose the APIs to detect cancellation status of the request but it is up to the `model.py` developer to detect whether the request is cancelled and terminate further execution. From 8e6918418694549ff6f59191e64a4cbf42a016a3 Mon Sep 17 00:00:00 2001 From: Tanmay Verma Date: Tue, 10 Oct 2023 18:04:14 -0700 Subject: [PATCH 10/11] Update docs/user_guide/request_cancellation.md Co-authored-by: Jacky <18255193+kthui@users.noreply.github.com> --- docs/user_guide/request_cancellation.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 8068de6ba2..00f3545640 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -59,9 +59,10 @@ finer details. Triton core checks for requests that have been cancelled at some critical points when using [dynamic](./model_configuration.md#dynamic-batcher) or -[sequence batching](./model_configuration.md#sequence-batcher). We also test for -the cancelled requests after every [ensemble](./model_configuration.md#ensemble-scheduler) -step and terminate further processing the requests. +[sequence](./model_configuration.md#sequence-batcher) batching. The checking is +also performed between each +[ensemble](./model_configuration.md#ensemble-scheduler) step and terminates +further processing if the request is cancelled. On detecting a cancelled request, Triton core responds with CANCELLED status. 
If a request is cancelled when using [sequence_batching](./model_configuration.md#sequence-batcher), From 11ade3d53d4e0a6c03c82b7e6d4d2067904094af Mon Sep 17 00:00:00 2001 From: tanmayv25 Date: Tue, 10 Oct 2023 18:09:30 -0700 Subject: [PATCH 11/11] Fix --- docs/user_guide/request_cancellation.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/user_guide/request_cancellation.md b/docs/user_guide/request_cancellation.md index 00f3545640..49865f25c8 100644 --- a/docs/user_guide/request_cancellation.md +++ b/docs/user_guide/request_cancellation.md @@ -42,8 +42,10 @@ resources. ### Triton C API [In-Process Triton Server C API](../customization_guide/inference_protocols.md#in-process-triton-server-api) has been enhanced with `TRITONSERVER_InferenceRequestCancel` -and `TRITONSERVER_InferenceRequestIsCancelled` to cancel and query the cancellation -status of an inflight request. Read more about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). +and `TRITONSERVER_InferenceRequestIsCancelled` to issue cancellation and query +whether cancellation has been issued on an inflight request, respectively. Read more +about the APIs in [tritonserver.h](https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonserver.h). + ### gRPC Endpoint