stateful inference #2513

lxning · 2023-08-01T20:21:53Z

Description

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
New feature (non-breaking change which adds functionality)
This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

Regression test
reg.txt
Normal Sequential Inference

# model_store/stateful/model-config.yaml
minWorkers: 2
maxWorkers: 2
batchSize: 4
maxBatchDelay: 100
sequenceMaxIdleMSec: 600000
maxNumSequence: 4
maxSequenceJobQueueSize: 10

handler:
  cache:
    capacity: 4

# Start the model server and load the example model stateful.mar, which responds with the accumulated value of the sequential input
torchserve --ncs --start --model-store model_store --models stateful.mar --ts-config benchmarks/config.properties

# Run sequential inference
python ts_scripts/torchserve_grpc_client.py infer_stream2 stateful seq_0 examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt
InferStream2 started
prediction: "1"

prediction: "3"

prediction: "6"

Sequence completed!
InferStream2 closed
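
The accumulating behaviour shown above can be sketched as a per-sequence running total. This is only an illustration, not the handler shipped in examples/stateful: the class name `SequenceAccumulator`, the least-recently-used eviction policy, and the sample values 1, 2, 3 are assumptions (the values are consistent with the predictions 1, 3, 6 but are not confirmed by this PR). The `capacity` argument mirrors `handler.cache.capacity` from the config.

```python
from collections import OrderedDict


class SequenceAccumulator:
    """Hypothetical helper: one running total per sequence id, bounded LRU cache."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.totals = OrderedDict()  # sequence id -> running total

    def add(self, sequence_id: str, value: int) -> int:
        # Remove the old entry (if any) so re-inserting marks it most recently used.
        total = self.totals.pop(sequence_id, 0) + value
        self.totals[sequence_id] = total
        if len(self.totals) > self.capacity:
            self.totals.popitem(last=False)  # evict the least recently used sequence
        return total


acc = SequenceAccumulator(capacity=4)
# Assumed sample values 1, 2, 3 (not confirmed by the source).
outputs = [acc.add("seq_0", v) for v in (1, 2, 3)]
print(outputs)
```

Under these assumptions, each request returns the sum of all values seen so far for its sequence id, and a sequence evicted from the cache would start again from zero.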

Expired or Streaming Closed Sequential Inference: the second sequential inference call gets an error.

# model_store/stateful/model-config.yaml
minWorkers: 2
maxWorkers: 2
batchSize: 4
maxBatchDelay: 5000
sequenceMaxIdleMSec: 600000
maxNumSequence: 4
maxSequenceJobQueueSize: 10

handler:
  cache:
    capacity: 4

# Start the model server and load the example model stateful.mar, which responds with the accumulated value of the sequential input
torchserve --ncs --start --model-store model_store --models stateful.mar --ts-config benchmarks/config.properties

# Run the first sequential inference
python ts_scripts/torchserve_grpc_client.py infer_stream2 stateful seq_0 examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt
InferStream2 started
prediction: "1"

prediction: "3"

prediction: "6"

Sequence completed!
InferStream2 closed

# Run the 2nd sequential inference
python ts_scripts/torchserve_grpc_client.py infer_stream2 stateful seq_0 examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt
InferStream2 started
status {
  code: 13
  message: "Model \"stateful\" please check if the sequence is closed, or expired; or exceeds maxSequenceJobQueueSize in log"
  details {
    type_url: "type.googleapis.com/google.rpc.ErrorInfo"
    value: "\n\032InternalServerException.()"
  }
}

status {
  code: 13
  message: "Model \"stateful\" please check if the sequence is closed, or expired; or exceeds maxSequenceJobQueueSize in log"
  details {
    type_url: "type.googleapis.com/google.rpc.ErrorInfo"
    value: "\n\032InternalServerException.()"
  }
}

status {
  code: 13
  message: "Model \"stateful\" please check if the sequence is closed, or expired; or exceeds maxSequenceJobQueueSize in log"
  details {
    type_url: "type.googleapis.com/google.rpc.ErrorInfo"
    value: "\n\032InternalServerException.()"
  }
}

Sequence completed!
InferStream2 closed
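
The error message names three rejection conditions: the sequence is closed, it has expired, or its job queue is full. A minimal sketch of such a check follows. It is not the frontend's actual Java implementation; `SequenceJobQueue`, `SequenceRejected`, and `try_submit` are hypothetical names, and the arguments only mirror `maxSequenceJobQueueSize` and `sequenceMaxIdleMSec` from the config.

```python
import time
from collections import deque


class SequenceRejected(Exception):
    """Raised when a request cannot be queued for its sequence."""


class SequenceJobQueue:
    """Hypothetical bounded job queue for a single sequence."""

    def __init__(self, max_queue_size, max_idle_msec, clock=time.monotonic):
        self.jobs = deque()
        self.max_queue_size = max_queue_size
        self.max_idle_sec = max_idle_msec / 1000.0
        self.clock = clock
        self.last_seen = clock()
        self.closed = False

    def close(self):
        self.closed = True

    def submit(self, job):
        now = self.clock()
        if self.closed:
            raise SequenceRejected("sequence is closed")
        if now - self.last_seen > self.max_idle_sec:
            self.closed = True  # an expired sequence stays unusable
            raise SequenceRejected("sequence is expired")
        if len(self.jobs) >= self.max_queue_size:
            raise SequenceRejected("exceeds maxSequenceJobQueueSize")
        self.last_seen = now
        self.jobs.append(job)


def try_submit(queue, job):
    """Return "ok" or the rejection reason instead of raising."""
    try:
        queue.submit(job)
        return "ok"
    except SequenceRejected as err:
        return str(err)


# A fake clock keeps the example deterministic.
now = [0.0]
seq_0 = SequenceJobQueue(max_queue_size=10, max_idle_msec=600000, clock=lambda: now[0])
first_call = [try_submit(seq_0, f"sample{i}") for i in (1, 2, 3)]
seq_0.close()  # the stream ends, so the sequence is closed
second_call = [try_submit(seq_0, f"sample{i}") for i in (1, 2, 3)]
print(first_call)
print(second_call)
```

In this sketch the first call queues all three jobs, while every request of the second call on the same sequence id is rejected, matching the three error statuses in the log above.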

Concurrently Run 2 Sequential Inferences on the same worker

# model_store/stateful/model-config.yaml
minWorkers: 2
maxWorkers: 2
batchSize: 4
maxBatchDelay: 5000
sequenceMaxIdleMSec: 600000
maxNumSequence: 4
maxSequenceJobQueueSize: 10

handler:
  cache:
    capacity: 4

# Start the model server and load the example model stateful.mar, which responds with the accumulated value of the sequential input
torchserve --ncs --start --model-store model_store --models stateful.mar --ts-config benchmarks/config.properties

# The first sequential inference
python ts_scripts/torchserve_grpc_client.py infer_stream2 stateful seq_0 examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt,examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt,examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt,examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt
InferStream2 started
prediction: "1"

prediction: "3"

prediction: "6"

prediction: "7"

status {
  code: 13
  message: "Model \"stateful\" please check if the sequence is closed, or expired; or exceeds maxSequenceJobQueueSize in log"
  details {
    type_url: "type.googleapis.com/google.rpc.ErrorInfo"
    value: "\n\032InternalServerException.()"
  }
}

prediction: "9"

prediction: "12"

prediction: "13"

prediction: "15"

prediction: "18"

prediction: "19"

prediction: "21"

Sequence completed!
InferStream2 closed

# The 2nd sequential inference
python ts_scripts/torchserve_grpc_client.py infer_stream2 stateful seq_1 examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt,examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt
InferStream2 started
prediction: "1"

prediction: "3"

prediction: "6"

prediction: "7"

prediction: "9"

prediction: "12"

Sequence completed!
InferStream2 closed

Checklist:

Did you have fun?
Have you added tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?

calebho

Not familiar with the implementation details so I can only comment on the API

calebho · 2023-08-18T17:29:37Z

examples/stateful/stateful_handler.py

+
+        self.sequence_ids = {}
+        results = []
+        for idx, row in enumerate(data):


To confirm, is it the case that batchSize is the least upper bound of len(data), i.e. len(data) <= batchSize and, for all l such that len(data) <= l, batchSize <= l?

Is it possible for two separate requests to get batched to this worker? If so, suppose there are two separate streaming requests that are batched to this worker. What happens if one client is much much faster than the other? Do we throttle the faster client to match the speed of the slower one by buffering the faster client's messages?

lxning · 2023-08-18

Q1: Yes, len(data) <= batchSize. data is the batch of requests received in real time.

Q2: Yes, the requests in a batch come from different sequences. E.g., len(data) = 4 means there are 4 sequences. Each sequence has its own dedicated job queue. Only the parameter "maxBatchDelay" determines how many milliseconds are spent batching a group of requests from different sequences. In other words, differences in traffic volume between sequences have no impact on batching latency.
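
A rough sketch of the batching described in Q2: each sequence owns a queue, and a batch takes at most one pending job per sequence, up to batchSize. The function `next_batch` and the queue layout are hypothetical, and the sketch leaves out the wait of up to maxBatchDelay that the real aggregator performs.

```python
from collections import deque


def next_batch(job_queues, batch_size):
    """Take at most one pending job from each sequence queue, up to batch_size."""
    batch = []
    for sequence_id, jobs in job_queues.items():
        if len(batch) == batch_size:
            break
        if jobs:
            batch.append((sequence_id, jobs.popleft()))
    return batch


# seq_1 produces data much faster than seq_0.
job_queues = {
    "seq_0": deque(["d_0_0"]),
    "seq_1": deque(["d_1_0", "d_1_1", "d_1_2"]),
}
first = next_batch(job_queues, batch_size=4)
second = next_batch(job_queues, batch_size=4)
print(first)
print(second)
```

Here the faster sequence is not throttled to the slower one. Its extra jobs stay in its own queue, and the next batch simply contains fewer entries when the slower sequence has nothing pending.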

calebho · 2023-08-21

Ok but if two streams produce data at drastically different rates, how do you keep the batch index coherent? For instance, fix a stateful worker. At time t_0, the worker receives data d_0_0 and d_1_0 from two streams. So then len(data) == 2 and data[0] is the payload for stream 0 and data[1] is the payload for stream 1.

At t_1, stream 0 does not produce any data because it took longer than maxBatchDelay, but stream 1 produces data d_1_1. So then len(data) == 1 and data[0] is the payload for stream 1. In the line below, idx == 0, so then you fetch the sequence ID for index 0. It seems like this would fetch the sequence ID for stream 0,

sequence_id = self.context.get_sequence_id(idx)

but you actually want the sequence ID for stream 1. Am I understanding the API semantics correctly? Perhaps I am misunderstanding how context.get_sequence_id works. Does it keep track of which stream corresponds to the elements of the data list passed to the handler?

lxning · 2023-09-29

Each request's sequence id is added to its header under the key "ts_request_sequence_id", so the backend can read a request's sequence id from its header. This guarantees that the sequence id can always be retrieved, regardless of whether the actual batch size changes or a sequence's request lands in a different batch slot.
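
To illustrate the header-based lookup with the two-stream scenario from the question, here is a minimal sketch. `FakeContext` is a hypothetical stand-in for the handler context, and `handle` is not the handler from this PR. Only the header key "ts_request_sequence_id" is taken from the reply above.

```python
SEQUENCE_ID_HEADER = "ts_request_sequence_id"


class FakeContext:
    """Hypothetical stand-in for the handler context: one header dict per batch slot."""

    def __init__(self, headers):
        self.headers = headers

    def get_request_header(self, idx, key):
        return self.headers[idx].get(key)


def handle(data, context, totals):
    """Accumulate each payload into the total of the sequence named in its header."""
    results = []
    for idx, value in enumerate(data):
        # The sequence id travels with the request, not with the batch slot.
        sequence_id = context.get_request_header(idx, SEQUENCE_ID_HEADER)
        totals[sequence_id] = totals.get(sequence_id, 0) + value
        results.append((sequence_id, totals[sequence_id]))
    return results


totals = {}
# t_0: both streams contribute; stream_0 is in slot 0, stream_1 in slot 1.
r0 = handle(
    [1, 10],
    FakeContext([{SEQUENCE_ID_HEADER: "stream_0"}, {SEQUENCE_ID_HEADER: "stream_1"}]),
    totals,
)
# t_1: only stream_1 has data, so it now occupies slot 0.
r1 = handle([20], FakeContext([{SEQUENCE_ID_HEADER: "stream_1"}]), totals)
print(r0)
print(r1)
```

Because the id is read from the request in each slot, the second batch updates stream_1's state even though its request moved from slot 1 to slot 0.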

codecov · 2023-09-29T05:33:44Z

Codecov Report

Merging #2513 (40991b3) into master (7f4419f) will decrease coverage by 0.02%.
Report is 2 commits behind head on master.
The diff coverage is 50.00%.

❗ Current head 40991b3 differs from pull request most recent head 0a90a87. Consider uploading reports for the commit 0a90a87 to get more accurate results

@@            Coverage Diff             @@
##           master    #2513      +/-   ##
==========================================
- Coverage   72.44%   72.43%   -0.02%     
==========================================
  Files          85       85              
  Lines        3963     3965       +2     
  Branches       58       58              
==========================================
+ Hits         2871     2872       +1     
- Misses       1088     1089       +1     
  Partials        4        4

Files	Coverage Δ
ts/context.py	`77.21% <50.00%> (-0.71%)`	⬇️

mreso · 2023-10-12T03:38:47Z

frontend/server/src/main/java/org/pytorch/serve/wlm/WorkerThread.java

                                                && model.getParallelLevel() > 1
                                                && model.getParallelType()
                                                        != ModelConfig.ParallelType.PP)
                                ? model.getParallelLevel()
                                : 1;
+                List<CompletableFuture<Void>> futureRequests = new ArrayList<>(repeats);
                for (int i = 0; backendChannel.size() > 0 && i < repeats; i++) {


Got it, in that case we should move the check out of the loop condition and start from the beginning. Otherwise we're getting an undefined delay before we retry sending the job through the check for results (that cannot be there as we never sent the request).

mreso · 2023-10-26T16:59:05Z

frontend/server/src/main/java/org/pytorch/serve/wlm/SequenceBatchAggregator.java

+                        CompletableFuture.runAsync(
+                                () -> {
+                                    Job job = jobGroup.pollJob((long) model.getMaxBatchDelay());
+                                    if (job != null) {


Can you change this part into pushing the jobs instead of polling?

mreso

We're already in good shape; I left some comments.

mreso · 2023-10-31T18:48:33Z

frontend/server/src/main/java/org/pytorch/serve/job/Job.java

+                break;
+        }
+
+        if (cmd == WorkerCommands.STREAMPREDICT2) {


duplicate still persists

examples/large_models/Huggingface_accelerate/llama2/custom_handler_code.py

examples/stateful/Readme.md

frontend/server/src/main/java/org/pytorch/serve/util/ApiUtils.java

test/postman/inference_stream2_data.json

test/pytest/test_parallelism.py

frontend/server/src/main/java/org/pytorch/serve/wlm/SequenceBatchAggregator.java

mreso

LGTM

mreso · 2023-11-07T19:34:44Z

examples/large_models/Huggingface_accelerate/llama2/custom_handler_code.py

This should be deleted.

stateful inference-core layer

ed5239e

lxning self-assigned this Aug 1, 2023

lxning added 3 commits August 4, 2023 17:07

add grpc layer

0794f54

add google rpc submodule

4d55643

fmt

a857307

lxning requested review from mreso, chauhang, HamidShojanazeri and calebho August 15, 2023 01:11

lxning added 4 commits August 14, 2023 18:14

update sequence batch img

4ae7404

update sequence batch img

5c0dd97

fmt

0651806

delete used file

3e16993

lxning changed the title ~~[WIP] stateful inference~~ stateful inference Aug 15, 2023

lxning added 5 commits August 14, 2023 19:15

fmt

6aee437

fmt

91b9f99

fix log and update doc

a3f84eb

update log

c60f390

fmt

f5c7707

calebho reviewed Aug 18, 2023

msaroufim added the enhancement New feature or request label Aug 25, 2023

lxning added 6 commits September 27, 2023 16:08

merge master and fix conflict

dd23216

make BatchAggregator as base

1ea33cf

fix conflict

fdb03c9

fix conflict

c3a2cca

add SequenceBatchAggregator

ba1bc45

update ci for submodule

f6c888d

lxning added 2 commits October 3, 2023 23:11

merge master

d723754

refactor

077bf27

mreso requested changes Oct 26, 2023

lxning and others added 13 commits October 26, 2023 18:21

update readme

4749b74

allow number ofjobGroup is larger than batchsize

80053ca

fmt

44d3986

Merge branch 'master' into feat/stateful

8879393

fix typo

b05e653

add stateful test data

5f7125e

fmt

9bb9245

Merge branch 'master' into feat/stateful

2f83255

fmt

a592f10

fmt

4b9145b

fmt

5fe05cd

Merge branch 'master' into feat/stateful

8e7ce9e

set default maxNumSequence

0a90a87

mreso requested changes Nov 1, 2023

lxning and others added 2 commits November 3, 2023 11:50

fmt

fb9cdb5

Merge branch 'master' into feat/stateful

627a31e

lxning enabled auto-merge November 3, 2023 19:22

lxning and others added 5 commits November 3, 2023 15:11

fmt

876d83d

Merge branch 'master' into feat/stateful

4b19885

revert back config.properties

c5a0708

fmt

6dc374a

Merge branch 'master' into feat/stateful

d4ea03d

mreso approved these changes Nov 8, 2023

lxning added this pull request to the merge queue Nov 8, 2023

Merged via the queue into master with commit e1c31e1 Nov 8, 2023
13 checks passed

lxning added this to the v0.10.0 milestone Mar 13, 2024

MaelitoP mentioned this pull request Apr 16, 2024

TorchServe crashes in production with `WorkerThread - IllegalStateException error' #3087

Closed
