Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] [2/N] Skip GCS health check if possible #49230

Merged
merged 12 commits into from
Dec 20, 2024

Conversation

dentiny
Copy link
Contributor

@dentiny dentiny commented Dec 12, 2024

Followup PR for #49122
Resolves issue: #48837

This PR integrates grpc completion callback for ray syncer with health check manager, which is used to skip a few unnecessary rpcs.

dentiny and others added 4 commits December 6, 2024 05:30
Signed-off-by: hjiang <dentinyhao@gmail.com>
Signed-off-by: hjiang <hjiang@anyscale.com>
Signed-off-by: hjiang <hjiang@anyscale.com>
@dentiny dentiny requested a review from jjyao December 12, 2024 05:18
@dentiny dentiny requested a review from a team as a code owner December 12, 2024 05:18
@dentiny dentiny added the go add ONLY when ready to merge, run all tests label Dec 12, 2024
@dentiny dentiny changed the title [core] Skip GCS health check if possible [core] [2/N] Skip GCS health check if possible Dec 12, 2024
@@ -82,15 +82,39 @@ std::vector<NodeID> GcsHealthCheckManager::GetAllNodes() const {
return nodes;
}

void GcsHealthCheckManager::MarkNodeHealthy(const NodeID &node_id) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is called from another thread (ray syncer's thread). do we need a mutex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call comes from io context passed down to syner:

if (on_rpc_completion_) {
on_rpc_completion_(NodeID::FromBinary(remote_node_id_));
}

Aren't health check manager and syncer shared the same io context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I found they're actually different io contexts:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, I posted it to io context, which is the same implementation as AddNode.

void GcsHealthCheckManager::AddNode(const NodeID &node_id,
std::shared_ptr<grpc::Channel> channel) {
io_service_.dispatch(
[this, channel = std::move(channel), node_id]() {
thread_checker_.IsOnSameThread();
auto context = new HealthCheckContext(this, channel, node_id);
auto [_, is_new] = health_check_contexts_.emplace(node_id, context);
RAY_CHECK(is_new);
},
"GcsHealthCheckManager::AddNode");
}

@dentiny dentiny requested a review from rynewang December 18, 2024 22:03
Signed-off-by: dentiny <dentinyhao@gmail.com>
@rynewang
Copy link
Contributor

build failure

Signed-off-by: dentiny <dentinyhao@gmail.com>
@dentiny
Copy link
Contributor Author

dentiny commented Dec 19, 2024

build failure

Nice, it means we do have unchecked assertion. Should be fixed.

Copy link
Collaborator

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

src/ray/gcs/gcs_server/gcs_health_check_manager.h Outdated Show resolved Hide resolved
@dentiny dentiny requested a review from jjyao December 19, 2024 20:04
Signed-off-by: dentiny <dentinyhao@gmail.com>
@jjyao jjyao merged commit 47ae84e into ray-project:master Dec 20, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants