Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Handle multiple changed objects per LongPollHost.listen_for_change RPC #48803

Conversation

JoshKarpel
Copy link
Contributor

@JoshKarpel JoshKarpel commented Nov 19, 2024

Why are these changes needed?

Currently, in the LongPollHost/LongPollClient, if multiple objects are updated that a listen_for_change request is waiting for before the async task in the host can run again, only one of those updated objects will be returned. This is inefficient because the LongPollClient will immediately do a listen_for_change RPC again, and that will see outdated snapshot IDs for the updates that weren't returned and get all of the missed updates.

This is because of an asymmetry between

# If there are any keys with outdated snapshot ids,
# return their updated values immediately.
updated_objects = {}
for key, client_snapshot_id in keys_to_snapshot_ids.items():
try:
existing_id = self.snapshot_ids[key]
except KeyError:
# The caller may ask for keys that we don't know about (yet),
# just ignore them.
# This can happen when, for example,
# a deployment handle is manually created for an app
# that hasn't been deployed yet (by bypassing the safety checks).
continue
if existing_id != client_snapshot_id:
updated_objects[key] = UpdatedObject(
self.object_snapshots[key], existing_id
)
if len(updated_objects) > 0:
self._count_send(updated_objects)
return updated_objects
, which looks for all outdated keys, and
updated_object_key: str = async_task_to_watched_keys[done.pop()]
, which only looks at a single complete Event, even if multiple events completed during the wait.

To prove that the wait can indeed see multiple completed Events, see this example:

from asyncio import wait, Event, run, create_task, FIRST_COMPLETED


async def main():
    a = Event()
    b = Event()

    wait_for_a = create_task(a.wait())
    wait_for_b = create_task(b.wait())

    a.set()
    b.set()

    done, pending = await wait([wait_for_a, wait_for_b], return_when=FIRST_COMPLETED)

    print(f"{len(done)=}")
    print(f"{len(pending)=}")

run(main())

# len(done)=2
# len(pending)=0

Generally this won't be a big issue because most listen_for_change requests in the current Serve setup are asking for a very small number of keys and are likely to only get one key update anyway. But, as I've been discussing with @edoakes and @zcin on Slack, I'd like to group up the DeploymentHandle listen_for_change RPCs under a single LongPollClient, which will be requesting many keys and is therefore more likely to hit this situation.

To complement this change, I also changed LongPollHost.notify_changed so that it takes multiple updates at the same time.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
@JoshKarpel JoshKarpel force-pushed the handle-multiple-updated-keys-in-long-poll-host branch from 2dcc0fb to 89cf841 Compare November 19, 2024 17:05
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
@JoshKarpel JoshKarpel marked this pull request as ready for review November 19, 2024 19:58
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Nov 19, 2024
Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@edoakes edoakes enabled auto-merge (squash) November 19, 2024 22:14
@edoakes edoakes merged commit c6d26fd into ray-project:master Nov 19, 2024
7 checks passed
@JoshKarpel JoshKarpel deleted the handle-multiple-updated-keys-in-long-poll-host branch November 20, 2024 02:36
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
…change` RPC (ray-project#48803)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

Currently, in the `LongPollHost`/`LongPollClient`, if multiple objects
are updated that a `listen_for_change` request is waiting for *before
the async task in the host can run again*, only one of those updated
objects will be returned. This is inefficient because the
`LongPollClient` will immediately do a `listen_for_change` RPC again,
and that will see outdated snapshot IDs for the updates that weren't
returned and get all of the missed updates.

This is because of an asymmetry between
https://github.com/ray-project/ray/blob/b75cb793e437aa617d61dcb13e5f5d2fcc83ee68/python/ray/serve/_private/long_poll.py#L252-L272
, which looks for *all* outdated keys, and
https://github.com/ray-project/ray/blob/b75cb793e437aa617d61dcb13e5f5d2fcc83ee68/python/ray/serve/_private/long_poll.py#L309
, which only looks at a single complete `Event`, even if multiple events
completed during the
[`wait`](https://github.com/ray-project/ray/blob/b75cb793e437aa617d61dcb13e5f5d2fcc83ee68/python/ray/serve/_private/long_poll.py#L289-L293).

To prove that the `wait` can indeed see multiple completed `Event`s, see
this example:
```python
from asyncio import wait, Event, run, create_task, FIRST_COMPLETED


async def main():
    a = Event()
    b = Event()

    wait_for_a = create_task(a.wait())
    wait_for_b = create_task(b.wait())

    a.set()
    b.set()

    done, pending = await wait([wait_for_a, wait_for_b], return_when=FIRST_COMPLETED)

    print(f"{len(done)=}")
    print(f"{len(pending)=}")

run(main())

# len(done)=2
# len(pending)=0
```

Generally this won't be a big issue because most `listen_for_change`
requests in the current Serve setup are asking for a very small number
of keys and are likely to only get one key update anyway. But, as I've
been discussing with @edoakes and @zcin on Slack, I'd like to group up
the `DeploymentHandle` `listen_for_change` RPCs under a single
`LongPollClient`, which will be requesting many keys and is therefore
more likely to hit this situation.

To complement this change, I also changed `LongPollHost.notify_changed`
so that it takes multiple updates at the same time.

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants