[serve] [multiplexing] how to scale up single model on different replicas? #48741

Open
CuriousDolphin opened this issue Nov 14, 2024 · 0 comments
Labels
bug Something that is supposed to be working; but isn't serve Ray Serve Related Issue triage Needs triage (eg: priority, bug/not-bug, and owning component)

What happened + What you expected to happen

Hello,
The documentation on model multiplexing says:

Internally, serve router will route the traffic to the corresponding replica based on the model id in the request header. If all replicas holding the model are over-subscribed, ray serve sends the request to a new replica that doesn’t have the model loaded. The replica will load the model from the s3 bucket and cache it.

I read this as: when a replica that has model X loaded is over-subscribed, Serve routes requests for model X to another replica that does not yet have it loaded, and that replica then loads the model. However, I have not been able to reproduce this behavior.

My environment:
kube-ray: 1.2.2 (deployed with Argo CD)
ray: 2.38.0 (custom Docker image for head and workers)
2 Ray workers
1 Ray application (multiplexed) with:

max_ongoing_requests=20
max_queued_requests=5
max_num_models_per_replica=5
min_replicas=1
max_replicas=2
target_ongoing_requests=2
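
For reference, the settings above could map onto a Serve deployment roughly as follows. This is a minimal sketch, not my actual application: the class name `MultiplexedModel`, the helper `build_app`, and the placeholder model loader are hypothetical. The option names are the ones Ray Serve uses (`min_replicas`, `max_replicas`, and `target_ongoing_requests` inside `autoscaling_config`, and `max_num_models_per_replica` on `serve.multiplexed`).

```python
# Sketch only: names marked "hypothetical" are not from the original report.

# Deployment-level options taken from the configuration listed above.
DEPLOYMENT_OPTIONS = {
    "max_ongoing_requests": 20,
    "max_queued_requests": 5,
    "autoscaling_config": {
        "min_replicas": 1,
        "max_replicas": 2,
        "target_ongoing_requests": 2,
    },
}

# Per-replica model cache size, passed to serve.multiplexed.
MAX_NUM_MODELS_PER_REPLICA = 5


def build_app():
    """Build the Serve application (hypothetical helper).

    Ray is imported lazily so the constants above can be inspected
    on a machine where Ray is not installed.
    """
    from ray import serve

    @serve.deployment(**DEPLOYMENT_OPTIONS)
    class MultiplexedModel:  # hypothetical class name
        @serve.multiplexed(max_num_models_per_replica=MAX_NUM_MODELS_PER_REPLICA)
        async def get_model(self, model_id: str):
            # Placeholder loader: a real one would fetch weights, e.g. from S3.
            return lambda payload: {"model_id": model_id, "num_bytes": len(payload)}

        async def __call__(self, request):
            # The model ID is read from the `serve_multiplexed_model_id` header.
            model_id = serve.get_multiplexed_model_id()
            model = await self.get_model(model_id)
            return model(await request.body())

    return MultiplexedModel.bind()
```

With Ray installed, `serve.run(build_app())` should start the application on the default HTTP port.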

With this configuration, when I launch a benchmark (Locust) with 50 virtual users, the model is loaded only on the first replica and all traffic is routed to that replica, while the second replica never loads the model.

The only way I managed to get the same model loaded on two different replicas, and to get autoscaling to work, was to set max_num_models_per_replica=1.

Versions / Dependencies

kube-ray: 1.2.2
ray: 2.38

Reproduction script

You can also reproduce this behavior with the example in the Serve model multiplexing documentation.
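
A minimal load-generation sketch, using only the Python standard library instead of Locust, could look like the following. The URL, the model ID `model_x`, the empty JSON payload, and the request counts are assumptions for illustration. The `serve_multiplexed_model_id` header is the one the Serve documentation uses to select the model.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVE_URL = "http://localhost:8000/"  # assumed default Serve HTTP address
MODEL_ID = "model_x"                  # hypothetical model ID


def build_request(url: str, model_id: str) -> urllib.request.Request:
    """Build a POST request that targets a single multiplexed model."""
    req = urllib.request.Request(url, data=b"{}", method="POST")
    # Serve reads the model ID for routing from this header.
    req.add_header("serve_multiplexed_model_id", model_id)
    return req


def send_one(_: int) -> int:
    """Send one request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(build_request(SERVE_URL, MODEL_ID), timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (e.g. rejected requests) raise HTTPError;
        # report their status code instead of failing the whole run.
        return err.code


def run_load(concurrency: int = 50, total_requests: int = 500) -> list:
    """Send total_requests requests for the same model from `concurrency` threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_one, range(total_requests)))
```

Calling `run_load()` while the application is running sends every request for the same model ID, which is the situation described above. The Ray dashboard or `serve status` can then show whether a second replica was started.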

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@CuriousDolphin CuriousDolphin added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 14, 2024
@jcotant1 jcotant1 added serve Ray Serve Related Issue and removed serve Ray Serve Related Issue labels Nov 14, 2024