Feature/grpc streaming #2186
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2186 +/- ##
==========================================
- Coverage 71.45% 71.32% -0.13%
==========================================
Files 73 73
Lines 3296 3306 +10
Branches 57 57
==========================================
+ Hits 2355 2358 +3
- Misses 941 948 +7
Left a bunch of minor feedback, but I'm not sure I have enough context on what this PR is trying to do to give system-level feedback.
docs/grpc_api.md
Outdated
@@ -70,3 +71,28 @@ python ts_scripts/torchserve_grpc_client.py infer densenet161 examples/image_cla
```bash
python ts_scripts/torchserve_grpc_client.py unregister densenet161
```
## GRPC Server Side Streaming
TorchServe GRPC APIs adds a server side streaming of the inference API "StreamPredictions" to allow a sequence of inference responses to be sent over the same GRPC stream. This new API is only recommended for the use case when the inference full response latency is high, and the inference intermediate results are sent to client. This new API automatically forces the batchSize to be one.
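For context, a minimal Python client sketch of this streaming API could look like the following. The stub names `inference_pb2` / `inference_pb2_grpc`, the `InferenceAPIsServiceStub` service, the port 7070, and the `echo_stream` model name are assumptions for illustration, not part of this diff.

```python
import grpc

# Assumed: stubs generated from TorchServe's inference proto
# (inference_pb2 / inference_pb2_grpc) and the default gRPC inference port 7070.
import inference_pb2
import inference_pb2_grpc


def stream_predictions(model_name, data):
    with grpc.insecure_channel("localhost:7070") as channel:
        stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
        # StreamPredictions returns an iterator; each item is one partial
        # PredictionResponse sent over the same gRPC stream.
        responses = stub.StreamPredictions(
            inference_pb2.PredictionsRequest(
                model_name=model_name, input={"data": data}
            )
        )
        for resp in responses:
            print(resp.prediction.decode("utf-8"))


if __name__ == "__main__":
    # "echo_stream" is a placeholder model name for illustration only.
    stream_predictions("echo_stream", b"hello")
```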
n00b q: what does intermediate response mean? I initially understood this feature as sending partial batches back, so what's the scenario in which it'd be useful to use this feature? Or is this internal only to the large model work?
@msaroufim The TS backend message protocol does not allow sending a partial batch (e.g. batchSize=10 but only 5 results sent) to the frontend (see code).
This feature is for use cases such as generative AI, where the latency to generate the full result is quite high. It allows users to send partial results back to the client gradually.
if type(data) is list:
    for i in range(3):
        send_intermediate_predict_response(["hello"], context.request_ids, "Intermediate Prediction success", 200, context)
    return ["hello world "]
should this be an async request
It can be async. A customized handler can decide between sync and async based on its real use case.
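As a rough sketch of the async option (not part of this PR): a handler could generate results on a worker thread and stream each one back as it becomes available. The import path of `send_intermediate_predict_response` and the `handle_stream` function name are assumptions; only the helper itself appears in the example above.

```python
import queue
import threading

# Assumed import path for the helper used in the example handler above;
# it may live elsewhere depending on the TorchServe version.
from ts.protocol.otf_message_handler import send_intermediate_predict_response


def handle_stream(data, context):
    chunks = queue.Queue()

    def generate():
        # Stand-in for a real token-by-token generation loop.
        for token in ["hello", "world"]:
            chunks.put(token)
        chunks.put(None)  # sentinel: generation finished

    threading.Thread(target=generate, daemon=True).start()

    while True:
        token = chunks.get()
        if token is None:
            break
        # Push each partial result to the client as soon as it is ready.
        send_intermediate_predict_response(
            [token], context.request_ids, "Intermediate Prediction success", 200, context
        )
    # The value returned here becomes the final message that closes the stream.
    return ["done"]
```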
@@ -56,8 +62,8 @@ public BaseModelRequest getRequest(String threadName, WorkerState state)
}

public void sendResponse(ModelWorkerResponse message) {
    boolean jobDone = true;
this variable name is a bit confusing outside of the context of streaming - the job is not done yet; maybe streamComplete or something of the sort would be clearer
Most use cases are non-streaming; they only require retrieving a single message for a batch of jobs. The variable "jobDone" reflects whether message retrieval is complete for a batch of jobs.
@@ -201,8 +200,9 @@ public void pollBatch(String threadId, long waitTime, Map<String, Job> jobsRepo)
    logger.trace("get first job: {}", Objects.requireNonNull(j).getJobId());

    jobsRepo.put(j.getJobId(), j);
    // describe request job batch size always is 1
    if (j.getCmd() == WorkerCommands.DESCRIBE) {
    // batch size always is 1 for describe request job and stream prediction request job
Not sure I understood this limitation. Why batch size 1?
For generative AI, it is expensive to process even a single request. Latency will be higher if the batch size is > 1.
Not really true. We might want a batch of streams as well. It is up to the client.
I think another issue with batch_size > 1 is that there isn't a way to differentiate which stream chunk belongs to which request. Maybe we can utilize the requestId in the job to associate each chunk with its request, but that is assigned as a uuid when the frontend receives a request, so the client is unable to differentiate.
There are two issues when batch size is set > 1:
- It breaks the current protocol between frontend and backend, e.g. some requests' intermediate results succeed while others fail.
- The latency will most likely be even higher if the batch size is larger than 1.
    )
)

print(response.msg)
Are these prints necessary? I'm slightly worried that the messages will fill up our CI logs, which are already long enough to make searching frustrating.
It is helpful for debugging the regression test failure point (i.e. which model registration fails).
    for resp in responses:
        prediction = resp.prediction.decode("utf-8")
        print(prediction)
except grpc.RpcError as e:
Should we also catch the UnicodeDecodeError?
It is not necessary to exit the client for a UTF-8 decode error. Only an RPC error is fatal.
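For illustration, that distinction could look like the sketch below in the client loop. The `print_stream` helper is hypothetical; only the `grpc.RpcError` handling appears in the actual diff.

```python
import sys

import grpc


def print_stream(responses):
    try:
        for resp in responses:
            try:
                print(resp.prediction.decode("utf-8"))
            except UnicodeDecodeError:
                # A chunk that is not valid utf-8 is skipped, not fatal.
                print("skipping non-utf-8 chunk", file=sys.stderr)
    except grpc.RpcError as e:
        # Only an RPC failure aborts the client.
        print(f"stream failed: {e}", file=sys.stderr)
        sys.exit(1)
```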
for resp in responses:
    prediction.append(resp.prediction.decode("utf-8"))

return " ".join(prediction)
Will this be a list of partial predictions for a single prediction, or a list of multiple predictions with batch size 1?
prediction is a list of partial prediction responses. Here we join all of the partial responses together to make the later comparison with expected values much easier.
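A self-contained sketch of that test helper, under the assumption that it uses the same generated `inference_pb2` stub as the client fragments above; the function signature here is illustrative, not the exact code in the PR.

```python
import inference_pb2  # assumed: stub generated from TorchServe's inference proto


def infer_stream(stub, model_name, model_input):
    with open(model_input, "rb") as f:
        data = f.read()

    # Collect every partial response from the stream, then join them so the
    # result can be compared against a single expected string in the test.
    responses = stub.StreamPredictions(
        inference_pb2.PredictionsRequest(model_name=model_name, input={"data": data})
    )
    prediction = []
    for resp in responses:
        prediction.append(resp.prediction.decode("utf-8"))

    return " ".join(prediction)
```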
docs/grpc_api.md
Outdated
Suggested change:
TorchServe GRPC APIs add server side streaming of the inference API "StreamPredictions" to allow a sequence of inference responses to be sent over the same GRPC stream. This new API is only recommended for the use case when the inference latency of the full response is high and the inference intermediate results are sent to the client. An example could be LLMs for generative applications, where generating "n" tokens can have high latency; in this case the user can receive each generated token once it is ready, until the full response completes. This new API automatically forces the batchSize to be one.
ModelInferenceRequest inferReq = (ModelInferenceRequest) req;
boolean streamNext = true;
while (streamNext) {
    reply = replies.poll(responseTimeout, TimeUnit.SECONDS);
Looks like responseTimeout is the same for streaming and non-streaming. Clients might want different timeouts for the streaming and non-streaming APIs, right?
responseTimeout is planned to move to a model-level config.
Description
This PR supports the new feature of GRPC server side streaming. It includes:
Fixes #2180
Type of change
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
reg.txt
Checklist:
Did you have fun?
Have you added tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?