
[serve] Fix HTTP proxy downscaling issues #36652

Conversation

@GeneDer (Contributor) commented Jun 21, 2023

Why are these changes needed?

This PR fixes two downscaling issues (a minimal illustrative sketch follows this list):

  • When there are only small requests, a node with ongoing requests can get downscaled. We fixed this by adding an `_ongoing_requests` counter and an `_ongoing_requests_dummy_obj_ref` object in the object store, so the autoscaler will not kill those nodes.
    • Note that `/-/healthz` and `/-/routes` do not count towards ongoing requests and will not create the dummy object if they are the only requests.
  • Large requests can put request data into the object store and prevent downscaling even when there are no replicas on the node. We added a new HTTP proxy status, `DRAINING`, which is set when there are no replicas on the node (the head node is excluded). While the HTTP proxy is `DRAINING`, both `/-/healthz` and `/-/routes` return 503 to signal that traffic should not be routed to the node. Ongoing requests are drained normally, or via timeout if long-running, and no new requests should be sent to the node. This lets the autoscaler downscale the node once all requests are drained and no object store references from those requests remain.
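To make the mechanics above concrete, here is a minimal, hypothetical Python sketch of the two mechanisms (the ongoing-request counter with a dummy object reference, and the `DRAINING` behavior for `/-/healthz`). This is not the actual Ray Serve implementation: the class `DrainingAwareProxy`, its method names, and the use of `ray.put` for the dummy object are illustrative assumptions; only the `_ongoing_requests` / `_ongoing_requests_dummy_obj_ref` names, the `DRAINING` status, and the 503 behavior come from the description above.

```python
# Minimal sketch of the two mechanisms described above, NOT the actual Ray Serve
# implementation. Class and helper names here are illustrative placeholders.
import ray
from starlette.responses import PlainTextResponse


class DrainingAwareProxy:
    def __init__(self, is_head_node: bool):
        self._is_head_node = is_head_node
        self._draining = False
        self._ongoing_requests = 0
        # Dummy object kept in the object store while requests are in flight so
        # the autoscaler sees the node as in use and does not kill it.
        self._ongoing_requests_dummy_obj_ref = None

    def update_draining(self, has_replicas_on_node: bool) -> None:
        # The head node proxy never drains; a worker proxy drains when no
        # replicas remain on its node.
        self._draining = (not self._is_head_node) and (not has_replicas_on_node)

    def _inc_ongoing_requests(self) -> None:
        self._ongoing_requests += 1
        if self._ongoing_requests_dummy_obj_ref is None:
            self._ongoing_requests_dummy_obj_ref = ray.put("ongoing-requests-marker")

    def _dec_ongoing_requests(self) -> None:
        self._ongoing_requests -= 1
        if self._ongoing_requests == 0:
            # Drop the reference so the autoscaler is free to downscale the node.
            self._ongoing_requests_dummy_obj_ref = None

    async def healthz(self, request) -> PlainTextResponse:
        # Health/route endpoints do not count towards ongoing requests.
        if self._draining:
            return PlainTextResponse("This node is draining.", status_code=503)
        return PlainTextResponse("success", status_code=200)
```

The key design point is that the dummy object reference is held only while requests are in flight, so the autoscaler's downscaling decision naturally follows request drain.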

Related issue number

Closes #21404

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

GeneDer added 8 commits June 17, 2023 22:47
…stop routing to nodes without replica

issue: anyscale/product#21404

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
…request

Signed-off-by: Gene Su <e870252314@gmail.com>
…h_running_replicas()

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
…request

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
@GeneDer (Contributor, Author) left a comment

Did some local testing by dropping the head node exclusion logic and saw the INACTIVE status and `/-/healthz` and `/-/routes` behave as expected with and without replicas. Will test on the Anyscale platform when the wheel is built.

GeneDer added 2 commits June 21, 2023 10:36
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
@shrekris-anyscale (Contributor) left a comment

Nice work so far!

Review comments (outdated, resolved) were left on: python/ray/serve/_private/common.py, python/ray/serve/_private/deployment_state.py, python/ray/serve/_private/http_proxy.py, python/ray/serve/_private/http_state.py, dashboard/client/src/components/StatusChip.tsx
GeneDer added 3 commits June 21, 2023 14:00
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
@GeneDer (Contributor, Author) left a comment

Tested on Anyscale by deploying a service with 10 replicas and saw the additional worker node spin up.
Screenshot 2023-06-21 at 4 48 38 PM

Went to the terminal, ran `serve.delete(name="default")` in a Python console, and saw the new DRAINING status on the worker node HTTP proxy.
Screenshot 2023-06-21 at 4 49 19 PM

Also curled the internal IP and saw `/-/healthz` on the worker node no longer succeeding, while it continues to succeed on the head node.
Screenshot 2023-06-21 at 4 49 32 PM
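For reference, this is roughly how the manual check above could be reproduced; a hypothetical sketch assuming the default Serve HTTP port (8000) and a placeholder worker-node address:

```python
# Hypothetical reproduction sketch of the manual check above; the worker-node
# address is a placeholder and port 8000 is the default Serve HTTP port.
import requests
from ray import serve

serve.delete(name="default")  # tear down the app; worker-node proxies should enter DRAINING

# A draining worker proxy should fail its health check with 503, while the
# head node proxy keeps returning 200.
resp = requests.get("http://<worker-node-internal-ip>:8000/-/healthz")
print(resp.status_code)  # expect 503 on a draining worker node, 200 on the head node
```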

Review comments (outdated, resolved) were left on: python/ray/serve/_private/deployment_state.py, python/ray/serve/_private/http_proxy.py, python/ray/serve/_private/http_state.py, python/ray/serve/controller.py
GeneDer added 3 commits June 22, 2023 17:54
Signed-off-by: Gene Su <e870252314@gmail.com>
…request

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
GeneDer added 3 commits June 23, 2023 10:52
…ining state

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
…request

Signed-off-by: Gene Su <e870252314@gmail.com>
@GeneDer (Contributor, Author) left a comment

Verified that when the service is deployed from the master branch with only small requests, some requests can get dropped during the downscaling process.
Screenshot 2023-06-22 at 10 27 12 PM

Also verified that with only large requests, the cluster is unable to scale down even when there are no replicas.
Screenshot 2023-06-22 at 10 39 12 PM

Went through the same tests again with the latest changes.

During downscaling with only small requests, the dummy object is stored in the object store.
Screenshot 2023-06-23 at 2 37 08 PM
The new DRAINING status shows on the HTTP proxies.
Screenshot 2023-06-23 at 2 37 19 PM
Waited a bit and saw the nodes successfully downscale with no requests dropped.
Screenshot 2023-06-23 at 2 44 33 PM
Screenshot 2023-06-23 at 2 44 39 PM
Screenshot 2023-06-23 at 2 44 49 PM

During downscaling with only large requests, the object store is in use.
Screenshot 2023-06-23 at 2 53 00 PM
Waited a bit and saw the nodes successfully downscale with no issues in the requests.
Screenshot 2023-06-23 at 3 04 05 PM
Screenshot 2023-06-23 at 3 04 14 PM
Screenshot 2023-06-23 at 3 04 24 PM

…request

Signed-off-by: Gene Su <e870252314@gmail.com>
@@ -317,6 +334,8 @@ async def run_control_loop(self) -> None:
            except Exception:
                logger.exception("Exception updating application state.")

            self._update_active_nodes()
A Contributor commented:

Maybe put this just above `http_state.update` given they're very related and it'll keep them consistent.
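For clarity, a hypothetical, simplified sketch of the suggested ordering (only `_update_active_nodes()` and the `http_state.update` reference come from the diff and comment above; the controller class and other call names are placeholders):

```python
# Hypothetical, simplified sketch of the suggested ordering; not the real controller.
import logging

logger = logging.getLogger(__name__)


class SimplifiedController:
    def __init__(self, application_state_manager, http_state):
        # Both collaborators are placeholders standing in for the real managers.
        self.application_state_manager = application_state_manager
        self.http_state = http_state

    def _update_active_nodes(self) -> None:
        """Recompute which nodes should keep an active HTTP proxy."""
        ...

    async def run_control_loop_iteration(self) -> None:
        try:
            self.application_state_manager.update()
        except Exception:
            logger.exception("Exception updating application state.")

        # Suggested ordering: keep these two adjacent so they stay consistent.
        self._update_active_nodes()
        self.http_state.update()
```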

@GeneDer (Contributor, Author) replied:

Will follow up on this!

@edoakes merged commit 921e879 into ray-project:master Jun 26, 2023
@GeneDer deleted the fix-issue-http-proxy-downscaling-on-large-request branch June 26, 2023 17:57
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>