What happened + What you expected to happen
Hello,
The documentation about model multiplexing says:

> Internally, serve router will route the traffic to the corresponding replica based on the model id in the request header. If all replicas holding the model are over-subscribed, ray serve sends the request to a new replica that doesn't have the model loaded. The replica will load the model from the s3 bucket and cache it.
This suggests that when a replica that already has model X loaded is under pressure, Serve should route requests for model X to another replica that does not have it loaded yet, which would then load it. However, I have not been able to observe this behavior.
My environment:
kube-ray: 1.2.2 (deployed with Argo CD)
ray: 2.38.0 (custom Docker image for head and workers)
2 Ray workers
1 Ray Serve application (multiplexed) with the following settings (a minimal sketch of this setup follows the list):
max_ongoing_requests = 20
max_queued_requests = 5
max_num_models_per_replica = 5
min_replicas = 1
max_replicas = 2
target_ongoing_requests = 2
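For reference, here is a minimal sketch of a deployment with these settings, following the multiplexing API from the Serve docs (the class name and the model loader are placeholders, not my actual code):

```python
from ray import serve


@serve.deployment(
    max_ongoing_requests=20,
    max_queued_requests=5,
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 2,
        "target_ongoing_requests": 2,
    },
)
class Multiplexer:
    @serve.multiplexed(max_num_models_per_replica=5)
    async def get_model(self, model_id: str):
        # Placeholder loader: the real application downloads the model
        # from an S3 bucket here and returns the loaded model object.
        return lambda payload: f"prediction from {model_id}"

    async def __call__(self, request):
        # Serve resolves the model id from the
        # `serve_multiplexed_model_id` request header.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model(request)


app = Multiplexer.bind()
```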
With this configuration, when I launch a benchmark (Locust) with 50 virtual users, the model is only loaded on the first replica and all traffic is routed there; the second replica stays idle and never loads the model.
The only way I managed to get the same model loaded on two different replicas, and autoscaling to kick in, was to set max_num_models_per_replica=1.
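The benchmark is roughly the following Locust script (host and model id are placeholders); every virtual user requests the same model id, which Serve reads from the `serve_multiplexed_model_id` header:

```python
from locust import HttpUser, constant, task


class MultiplexedUser(HttpUser):
    wait_time = constant(0)

    @task
    def query_model(self):
        # All 50 users ask for the same model; Serve picks the replica
        # based on this header.
        self.client.get(
            "/",
            headers={"serve_multiplexed_model_id": "model_x"},
        )
```

Run with something like `locust -f locustfile.py --headless --users 50 --spawn-rate 50 --host http://<serve-endpoint>:8000`.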
Versions / Dependencies
kube-ray: 1.2.2
ray: 2.38.0
Reproduction script
You can also reproduce this behaviour with the example in the Serve model multiplexing documentation.
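For a quick single-request check against that example (endpoint and model id are placeholders):

```python
import requests

# Serve routes this request to a replica holding "model_1",
# loading it there first if necessary.
resp = requests.get(
    "http://localhost:8000/",
    headers={"serve_multiplexed_model_id": "model_1"},
)
print(resp.status_code, resp.text)
```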
Issue Severity
Medium: It is a significant difficulty but I can work around it.