What happened + What you expected to happen
Hello,
The documentation about model multiplexing says:

> Internally, serve router will route the traffic to the corresponding replica based on the model id in the request header. If all replicas holding the model are over-subscribed, ray serve sends the request to a new replica that doesn't have the model loaded. The replica will load the model from the s3 bucket and cache it.
This suggests that when a replica that already has model X loaded is under pressure, Serve should route requests for model X to another replica that does not have it loaded yet, which would then load it. However, I have not been able to observe this behavior.
My environment:
kube-ray: 1.2.2 (deployed with Argo CD)
ray: 2.38.0 (custom Docker image for head and workers)
2 Ray workers
1 Ray Serve application (multiplexed) with the following settings (a minimal sketch of this setup follows the list):
max_ongoing_requests = 20
max_queued_requests = 5
max_num_models_per_replica = 5
min_replicas = 1
max_replicas = 2
target_ongoing_requests = 2
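For reference, here is a minimal sketch of a deployment with these settings, following the multiplexing API from the Serve docs (the class name and the model loader are placeholders, not my actual code):

```python
from ray import serve


@serve.deployment(
    max_ongoing_requests=20,
    max_queued_requests=5,
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 2,
        "target_ongoing_requests": 2,
    },
)
class Multiplexer:
    @serve.multiplexed(max_num_models_per_replica=5)
    async def get_model(self, model_id: str):
        # Placeholder loader: the real application downloads the model
        # from an S3 bucket here and returns the loaded model object.
        return lambda payload: f"prediction from {model_id}"

    async def __call__(self, request):
        # Serve resolves the model id from the
        # `serve_multiplexed_model_id` request header.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model(request)


app = Multiplexer.bind()
```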
With this configuration, when I launch a benchmark (Locust) with 50 virtual users, the model is only loaded on the first replica and all traffic is routed there; the second replica stays idle and never loads the model.
The only way I managed to get the same model loaded on two different replicas, and autoscaling to kick in, was to set max_num_models_per_replica=1.
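The benchmark is roughly the following Locust script (host and model id are placeholders); every virtual user requests the same model id, which Serve reads from the `serve_multiplexed_model_id` header:

```python
from locust import HttpUser, constant, task


class MultiplexedUser(HttpUser):
    wait_time = constant(0)

    @task
    def query_model(self):
        # All 50 users ask for the same model; Serve picks the replica
        # based on this header.
        self.client.get(
            "/",
            headers={"serve_multiplexed_model_id": "model_x"},
        )
```

Run with something like `locust -f locustfile.py --headless --users 50 --spawn-rate 50 --host http://<serve-endpoint>:8000`.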
Versions / Dependencies
kube-ray: 1.2.2
ray: 2.38.0
Reproduction script
You can also reproduce this behaviour with the example in the Serve model multiplexing documentation.
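For a quick single-request check against that example (endpoint and model id are placeholders):

```python
import requests

# Serve routes this request to a replica holding "model_1",
# loading it there first if necessary.
resp = requests.get(
    "http://localhost:8000/",
    headers={"serve_multiplexed_model_id": "model_1"},
)
print(resp.status_code, resp.text)
```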
Issue Severity
Medium: It is a significant difficulty but I can work around it.