Asynchronous worker communication and vllm integration #3146
Conversation
async def preprocess(self, requests):
    input_batch = []
    assert len(requests) == 1, "Expecting batch_size = 1"
Is this to ensure that the async request handling in the frontend immediately forwards the request to the backend without waiting to form a batch?
Correct. In engines like vLLM there is no classical batch size setting; the number of concurrent requests is instead determined by the available memory as well as the request lengths. Therefore, sending each request to the engine as soon as possible is the best option in my opinion.
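To illustrate the idea, here is a minimal, hypothetical sketch (not the handler from this PR; the model name is a placeholder and the exact vLLM API can differ between versions): each request is handed to the async engine immediately, and the engine itself decides how many requests it can serve concurrently.

```python
# Hypothetical sketch: forward a single request straight to vLLM's async engine
# instead of waiting to form a batch. API details may vary by vLLM version.
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="facebook/opt-125m")  # illustrative model
)

async def handle_request(request_id: str, prompt: str):
    params = SamplingParams(max_tokens=64)
    # generate() is an async generator; vLLM schedules as many concurrent
    # requests as memory and request lengths allow, so no batch is formed here.
    async for output in engine.generate(prompt, params, request_id):
        yield output.outputs[0].text
```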
} else {
    logger.warn(
            "Drop response for inference request {} due to client timeout",
            job.getPayload().getRequestId());
}
For my understanding: when a job expires, the behavior we expect here is that the backend continues to stream responses until the stopping criteria are reached, but the frontend does not send them back to the client? Do we have a way to signal expired jobs to the backend and clean up jobs_in_backend here?
Yes, it would be great to have this kind of functionality, but it will require some form of command signaling into the backend which we can leverage, e.g. allowing a "C" command in the OTF protocol, which we discussed in the past in the context of up- and downscaling thread workers as well. We can add this functionality in a follow-up PR. For now the behavior will be as you described.
frontend/server/src/main/java/org/pytorch/serve/wlm/AsyncBatchAggregator.java
…Aggregator.java Remove job from jobs_in_backend on error Co-authored-by: Naman Nandan <namankt55@gmail.com>
def __init__(self, service):
    self.service = service
    self.service.predict = types.MethodType(predict, self.service)
    self.in_queue = Queue()
Curious why we need a thread-safe queue for the in_queue, or could we safely use the non-thread-safe async queue here as well, similar to out_queue?
Yes, that's a good point. We need the thread-safe queue here because the async queue does not provide a timeout on get(). The timeout is actually not strictly necessary for the vLLM use case, but I wanted to recreate the same batching functionality as in the frontend for future reference, e.g. if someone wants to run multiple workers with batching in the backend.
while len(batch) < BATCH_SIZE and (time.time() - st) < MAX_WAIT:
    timeout = max(0, MAX_WAIT - (time.time() - st))
    request = self.in_queue.get(timeout=timeout)
    batch += request
Nit: We may need to check here whether we actually got a request or timed out before adding to the batch.
When .get() times out, which is the only alternative to getting a request, an Empty exception is raised; it is caught in line 126 and simply restarts the batch window.
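For reference, a condensed sketch of that batching pattern (the constant values are illustrative, not necessarily the PR's defaults): queue.Queue.get(timeout=...) raises queue.Empty when no request arrives in time, which simply ends the current batch window.

```python
import time
from queue import Empty, Queue

BATCH_SIZE = 4   # illustrative value
MAX_WAIT = 0.1   # illustrative wait budget in seconds

def fetch_batch(in_queue: Queue) -> list:
    """Collect up to BATCH_SIZE requests, or whatever arrived within MAX_WAIT."""
    batch = []
    st = time.time()
    try:
        while len(batch) < BATCH_SIZE and (time.time() - st) < MAX_WAIT:
            timeout = max(0, MAX_WAIT - (time.time() - st))
            # Raises queue.Empty on timeout; asyncio.Queue.get() offers no such timeout.
            batch += in_queue.get(timeout=timeout)
    except Empty:
        pass  # timed out with a partial (possibly empty) batch: just return it
    return batch
```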
ts/async_service.py
fetch = Thread(target=self.fetch_batches)
fetch.start()
receive = Thread(target=self.receive_requests)
receive.start()
send = Thread(target=self.send_responses)
send.start()
If one or more of these threads exits or terminates due to an error while the others continue to run, what would recovery look like? Would we have to restart the thread, or somehow trigger an entire worker restart to ensure simple recovery?
Good point, let me see if I can make this more fail-resistant.
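One possible direction (an illustrative sketch under my own assumptions, not what this PR implements): wrap each loop in a small supervisor that restarts it a few times and otherwise exits the worker process so the frontend can recover the worker as a whole.

```python
import logging
import os
import threading

def supervised_thread(name, target, max_restarts=3):
    """Run `target` in a daemon thread; restart on failure, exit the worker if it keeps dying."""
    def runner():
        for attempt in range(1, max_restarts + 1):
            try:
                target()
                return  # clean exit
            except Exception:
                logging.exception("%s failed (attempt %d/%d)", name, attempt, max_restarts)
        # Unrecoverable: terminate the worker process so the frontend restarts it.
        os._exit(1)

    thread = threading.Thread(target=runner, name=name, daemon=True)
    thread.start()
    return thread
```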
A few follow-up questions on the backend async service implementation, otherwise looks good to me!
LGTM
…e_model_inference.md
Description
This PR adds a new asynchronous communication mode to TorchServe in order to accommodate inference engines like vLLM.
Context:
These engines usually include techniques like paged attention, which improves memory utilization by fitting as many requests into the engine as possible and evicting requests when space gets tight. This means the classical synchronous communication model of TorchServe (batch of N requests in -> batch of N requests out) does not work well with these engines. This feature forwards all incoming requests to the backend, where they are fed to the engine, which starts producing responses that are streamed out asynchronously.
Multi-worker note:
While this in theory works with multiple workers, incoming requests would be distributed in a round-robin fashion, which might lead to suboptimal worker/hardware utilization. It is therefore advised to use only a single worker and rely on tensor parallelism to distribute the model over multiple GPUs, as sketched below.
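For illustration only (the model name is a placeholder and this is not configuration taken from the PR), a single worker can still use all GPUs by sharding the model with vLLM's tensor parallelism instead of adding TorchServe workers:

```python
# Hypothetical: one worker, model sharded over 4 GPUs via vLLM tensor parallelism.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model name
    tensor_parallel_size=4,            # shard across 4 GPUs within a single worker
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```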
Fixes #(issue)
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Checklist: