[serve] Fix http proxy downscaling issues #36652
Conversation
Commits:
- …stop routing to nodes without replica (issue: anyscale/product#21404)
- …request
- …h_running_replicas()
Did some local testing dropping the head node exclusion logic, and saw the `INACTIVE` status, `/-/healthz`, and `/-/routes` behave as expected with and without replicas. Will test on the Anyscale platform when the wheel is built.
Nice work so far!
Tested on Anyscale by deploying a service with 10 replicas and seeing the additional worker node spin up. Then went to the terminal, ran `serve.delete(name="default")` in a Python console, and saw the new `DRAINING` status on the worker node HTTP proxy. Also curled the internal IPs and saw `/-/healthz` on the worker node no longer succeeding while continuing to succeed on the head node.
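For reference, a rough sketch of this manual test: `serve.delete(name="default")` and the `/-/healthz` path come from the comment above, while the node IPs and port 8000 (Serve's default HTTP port) are placeholders, not the actual values used.

```python
# Rough sketch of the manual test above; not the exact commands used.
import requests
from ray import serve

# Tear down the running application; replicas on the worker node go away, so
# the worker-node HTTP proxy should transition to the new DRAINING status.
serve.delete(name="default")

WORKER_NODE_IP = "10.0.0.2"  # placeholder internal IP of the worker node
HEAD_NODE_IP = "10.0.0.1"    # placeholder internal IP of the head node

# Expected after this change: the worker-node proxy stops reporting healthy (503)...
print(requests.get(f"http://{WORKER_NODE_IP}:8000/-/healthz").status_code)

# ...while the head-node proxy keeps succeeding (200).
print(requests.get(f"http://{HEAD_NODE_IP}:8000/-/healthz").status_code)
```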
Verified that if the service is deployed from the master branch with only small requests, some requests can get dropped during the downscaling process. Also verified that with only large requests, the cluster is unable to scale down even when there are no replicas.

Went through the same tests again with the latest changes:
- During downscaling with only small requests, saw the dummy object stored in the object store and the new `DRAINING` status on the HTTP proxies. After waiting a bit, the nodes are successfully downscaled and no requests are dropped.
- During downscaling with only large requests, saw the object store being used. After waiting a bit, the nodes are successfully downscaled with no issues in the requests.
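As an illustration only, a hypothetical driver for the "small requests" vs. "large requests" scenarios above; the echo deployment, payload sizes, application name, and port are assumptions, not the script used for this verification.

```python
# Hypothetical test driver: an echo deployment receives either tiny payloads
# ("small requests") or multi-megabyte payloads ("large requests", which are
# large enough to be stored in the object store rather than inlined).
import requests
from ray import serve


@serve.deployment(num_replicas=10)
class Echo:
    async def __call__(self, request):
        body = await request.body()
        return {"bytes_received": len(body)}


serve.run(Echo.bind(), name="default")

small_payload = b"x" * 128                  # small request
large_payload = b"x" * (10 * 1024 * 1024)   # large request

for payload in (small_payload, large_payload):
    resp = requests.post("http://127.0.0.1:8000/", data=payload)
    print(resp.json())

# Trigger downscaling and watch proxy statuses / object store usage.
serve.delete(name="default")
```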
```diff
@@ -317,6 +334,8 @@ async def run_control_loop(self) -> None:
             except Exception:
                 logger.exception("Exception updating application state.")
+
+            self._update_active_nodes()
```
Maybe put this just above `http_state.update`, given they're very related and it'll keep them consistent.
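To make the suggestion concrete, here is a sketch of the proposed ordering, using the names that appear in the diff (`run_control_loop`, `_update_active_nodes`, `http_state.update`); the rest of the loop body is assumed for illustration and is not the actual controller code.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

CONTROL_LOOP_PERIOD_S = 0.1  # assumed loop period, illustration only


class ControllerSketch:
    """Illustrative stand-in for the Serve controller, not the real class."""

    def __init__(self, application_state_manager, http_state):
        self.application_state_manager = application_state_manager
        self.http_state = http_state

    def _update_active_nodes(self):
        """Refresh the set of nodes that still have running replicas."""

    async def run_control_loop(self) -> None:
        while True:
            try:
                self.application_state_manager.update()
            except Exception:
                logger.exception("Exception updating application state.")

            # Suggested placement: call _update_active_nodes() immediately
            # before http_state.update(), so the active-node set and the HTTP
            # proxy states are refreshed together in the same iteration.
            self._update_active_nodes()
            self.http_state.update()

            await asyncio.sleep(CONTROL_LOOP_PERIOD_S)
```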
Will follow up on this!
Why are these changes needed?
Two downscaling issues are fixed by this PR:
- When there are only small requests, a node with ongoing requests can get downscaled. We fixed this by adding an `_ongoing_requests` counter and an `_ongoing_requests_dummy_obj_ref` object stored in the object store so the autoscaler will not kill those nodes.
  - Also note, `/-/healthz` and `/-/routes` are not counted towards the ongoing requests and do not create the dummy object if those are the only requests.
- Large requests can put request data into the object store and prevent downscaling even when there are no replicas on the node. We added a new HTTP proxy status, `DRAINING`, which is set when there are no replicas on the node and the node is not the head node. When the HTTP proxy is `DRAINING`, both `/-/healthz` and `/-/routes` return 503 and signal that no traffic should be routed to the node. Ongoing requests are drained normally, or via timeout if long-running, and no new requests should be sent to the node. This lets the autoscaler continue downscaling the node once all requests are drained and no object store data is referenced by those requests. (A rough sketch of these two mechanisms follows below.)
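A minimal sketch of the two mechanisms described above, not the actual Ray Serve implementation: the names `_ongoing_requests`, `_ongoing_requests_dummy_obj_ref`, `DRAINING`, `/-/healthz`, and `/-/routes` come from this PR, while the class, method names, and enum below are illustrative assumptions.

```python
from enum import Enum
from typing import Optional

import ray


class ProxyStatus(Enum):
    ACTIVE = "ACTIVE"
    DRAINING = "DRAINING"


class ProxyRequestTracker:
    def __init__(self) -> None:
        self.status = ProxyStatus.ACTIVE
        self._ongoing_requests = 0
        # Dummy object pinned in the object store while requests are in flight,
        # so the autoscaler sees the node as "in use" and does not kill it.
        self._ongoing_requests_dummy_obj_ref: Optional[ray.ObjectRef] = None

    def on_request_start(self, path: str) -> None:
        # Health/route checks are not counted as ongoing requests and do not
        # create the dummy object-store reference.
        if path in ("/-/healthz", "/-/routes"):
            return
        self._ongoing_requests += 1
        if self._ongoing_requests_dummy_obj_ref is None:
            self._ongoing_requests_dummy_obj_ref = ray.put("ongoing-requests-marker")

    def on_request_end(self, path: str) -> None:
        if path in ("/-/healthz", "/-/routes"):
            return
        self._ongoing_requests -= 1
        if self._ongoing_requests == 0:
            # Release the dummy reference so the node can be downscaled.
            self._ongoing_requests_dummy_obj_ref = None

    def health_response_status(self) -> int:
        # When the proxy is DRAINING (no replicas on a worker node), health and
        # route endpoints return 503 so no new traffic is routed to this node.
        return 503 if self.status == ProxyStatus.DRAINING else 200
```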
Related issue number
Closes #21404
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.