
[serve] Fix HTTP proxy downscaling issues #36652

Conversation

@GeneDer (Contributor) commented Jun 21, 2023

Why are these changes needed?

This PR fixes two downscaling issues (a minimal illustrative sketch follows this list):

  • When there are only small requests, a node with ongoing requests can get downscaled. We fixed this by adding an `_ongoing_requests` counter and an `_ongoing_requests_dummy_obj_ref` object in the object store, so the autoscaler will not kill those nodes.
    • Note that `/-/healthz` and `/-/routes` do not count towards ongoing requests and will not create the dummy object if they are the only requests.
  • Large requests can put request data into the object store and prevent downscaling even when there are no replicas on the node. We added a new HTTP proxy status, `DRAINING`, which is set when there are no replicas on the node (the head node is excluded). While the HTTP proxy is `DRAINING`, both `/-/healthz` and `/-/routes` return 503 to signal that traffic should not be routed to the node. Ongoing requests are drained normally, or via timeout if long-running, and no new requests should be sent to the node. This lets the autoscaler downscale the node once all requests are drained and no object store references from those requests remain.
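To make the mechanics above concrete, here is a minimal, hypothetical Python sketch of the two mechanisms (the ongoing-request counter with a dummy object reference, and the `DRAINING` behavior for `/-/healthz`). This is not the actual Ray Serve implementation: the class `DrainingAwareProxy`, its method names, and the use of `ray.put` for the dummy object are illustrative assumptions; only the `_ongoing_requests` / `_ongoing_requests_dummy_obj_ref` names, the `DRAINING` status, and the 503 behavior come from the description above.

```python
# Minimal sketch of the two mechanisms described above, NOT the actual Ray Serve
# implementation. Class and helper names here are illustrative placeholders.
import ray
from starlette.responses import PlainTextResponse


class DrainingAwareProxy:
    def __init__(self, is_head_node: bool):
        self._is_head_node = is_head_node
        self._draining = False
        self._ongoing_requests = 0
        # Dummy object kept in the object store while requests are in flight so
        # the autoscaler sees the node as in use and does not kill it.
        self._ongoing_requests_dummy_obj_ref = None

    def update_draining(self, has_replicas_on_node: bool) -> None:
        # The head node proxy never drains; a worker proxy drains when no
        # replicas remain on its node.
        self._draining = (not self._is_head_node) and (not has_replicas_on_node)

    def _inc_ongoing_requests(self) -> None:
        self._ongoing_requests += 1
        if self._ongoing_requests_dummy_obj_ref is None:
            self._ongoing_requests_dummy_obj_ref = ray.put("ongoing-requests-marker")

    def _dec_ongoing_requests(self) -> None:
        self._ongoing_requests -= 1
        if self._ongoing_requests == 0:
            # Drop the reference so the autoscaler is free to downscale the node.
            self._ongoing_requests_dummy_obj_ref = None

    async def healthz(self, request) -> PlainTextResponse:
        # Health/route endpoints do not count towards ongoing requests.
        if self._draining:
            return PlainTextResponse("This node is draining.", status_code=503)
        return PlainTextResponse("success", status_code=200)
```

The key design point is that the dummy object reference is held only while requests are in flight, so the autoscaler's downscaling decision naturally follows request drain.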

Related issue number

Closes #21404

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

GeneDer added 8 commits June 17, 2023 22:47
…stop routing to nodes without replica

issue: anyscale/product#21404

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
…request

Signed-off-by: Gene Su <e870252314@gmail.com>
…h_running_replicas()

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
…request

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
@GeneDer (Contributor, Author) left a comment

Did some local testing by dropping the head node exclusion logic and saw the INACTIVE status and `/-/healthz` and `/-/routes` behave as expected with and without replicas. Will test on the Anyscale platform when the wheel is built.

GeneDer added 2 commits June 21, 2023 10:36
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
@shrekris-anyscale (Contributor) left a comment

Nice work so far!

Review comments (outdated, resolved) were left on: python/ray/serve/_private/common.py, python/ray/serve/_private/deployment_state.py, python/ray/serve/_private/http_proxy.py, python/ray/serve/_private/http_state.py, dashboard/client/src/components/StatusChip.tsx
GeneDer added 3 commits June 21, 2023 14:00
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
@GeneDer (Contributor, Author) left a comment

Tested on Anyscale by deploying a service with 10 replicas and saw the additional worker node spin up.
Screenshot 2023-06-21 at 4 48 38 PM

Went to the terminal, ran `serve.delete(name="default")` in a Python console, and saw the new DRAINING status on the worker node HTTP proxy.
Screenshot 2023-06-21 at 4 49 19 PM

Also curled the internal IP and saw `/-/healthz` on the worker node no longer succeeding, while it continues to succeed on the head node.
Screenshot 2023-06-21 at 4 49 32 PM
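For reference, this is roughly how the manual check above could be reproduced; a hypothetical sketch assuming the default Serve HTTP port (8000) and a placeholder worker-node address:

```python
# Hypothetical reproduction sketch of the manual check above; the worker-node
# address is a placeholder and port 8000 is the default Serve HTTP port.
import requests
from ray import serve

serve.delete(name="default")  # tear down the app; worker-node proxies should enter DRAINING

# A draining worker proxy should fail its health check with 503, while the
# head node proxy keeps returning 200.
resp = requests.get("http://<worker-node-internal-ip>:8000/-/healthz")
print(resp.status_code)  # expect 503 on a draining worker node, 200 on the head node
```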

Review comments (outdated, resolved) were left on: python/ray/serve/_private/deployment_state.py, python/ray/serve/_private/http_proxy.py, python/ray/serve/_private/http_state.py, python/ray/serve/controller.py
GeneDer added 3 commits June 22, 2023 17:54
Signed-off-by: Gene Su <e870252314@gmail.com>
…request

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
GeneDer added 3 commits June 23, 2023 10:52
…ining state

Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
…request

Signed-off-by: Gene Su <e870252314@gmail.com>
@GeneDer (Contributor, Author) left a comment

Verified that when the service is deployed from the master branch with only small requests, some requests can get dropped during the downscaling process.
Screenshot 2023-06-22 at 10 27 12 PM

Also verified that with only large requests, the cluster is unable to scale down even when there are no replicas.
Screenshot 2023-06-22 at 10 39 12 PM

Went through the same tests again with the latest changes.

During downscaling with only small requests, the dummy object is stored in the object store.
Screenshot 2023-06-23 at 2 37 08 PM
The new DRAINING status shows on the HTTP proxies.
Screenshot 2023-06-23 at 2 37 19 PM
Waited a bit and saw the nodes successfully downscale with no requests dropped.
Screenshot 2023-06-23 at 2 44 33 PM
Screenshot 2023-06-23 at 2 44 39 PM
Screenshot 2023-06-23 at 2 44 49 PM

During downscaling with only large requests, the object store is in use.
Screenshot 2023-06-23 at 2 53 00 PM
Waited a bit and saw the nodes successfully downscale with no issues in the requests.
Screenshot 2023-06-23 at 3 04 05 PM
Screenshot 2023-06-23 at 3 04 14 PM
Screenshot 2023-06-23 at 3 04 24 PM

…request

Signed-off-by: Gene Su <e870252314@gmail.com>
@@ -317,6 +334,8 @@ async def run_control_loop(self) -> None:
            except Exception:
                logger.exception("Exception updating application state.")

            self._update_active_nodes()
A Contributor commented:

Maybe put this just above `http_state.update` given they're very related and it'll keep them consistent.
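For clarity, a hypothetical, simplified sketch of the suggested ordering (only `_update_active_nodes()` and the `http_state.update` reference come from the diff and comment above; the controller class and other call names are placeholders):

```python
# Hypothetical, simplified sketch of the suggested ordering; not the real controller.
import logging

logger = logging.getLogger(__name__)


class SimplifiedController:
    def __init__(self, application_state_manager, http_state):
        # Both collaborators are placeholders standing in for the real managers.
        self.application_state_manager = application_state_manager
        self.http_state = http_state

    def _update_active_nodes(self) -> None:
        """Recompute which nodes should keep an active HTTP proxy."""
        ...

    async def run_control_loop_iteration(self) -> None:
        try:
            self.application_state_manager.update()
        except Exception:
            logger.exception("Exception updating application state.")

        # Suggested ordering: keep these two adjacent so they stay consistent.
        self._update_active_nodes()
        self.http_state.update()
```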

@GeneDer (Contributor, Author) replied:

Will follow up on this!

@edoakes merged commit 921e879 into ray-project:master Jun 26, 2023
@GeneDer deleted the fix-issue-http-proxy-downscaling-on-large-request branch June 26, 2023 17:57
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>