[core] [2/N] Skip GCS health check if possible #49230
Signed-off-by: hjiang <dentinyhao@gmail.com>
Signed-off-by: hjiang <hjiang@anyscale.com>
Signed-off-by: hjiang <hjiang@anyscale.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
@@ -82,15 +82,39 @@ std::vector<NodeID> GcsHealthCheckManager::GetAllNodes() const {
  return nodes;
}

void GcsHealthCheckManager::MarkNodeHealthy(const NodeID &node_id) {
This method is called from another thread (the ray syncer's thread). Do we need a mutex?
The call comes from the io context passed down to the syncer:
ray/src/ray/common/ray_syncer/ray_syncer_bidi_reactor_base.h
Lines 190 to 192 in 8ab5b2b
if (on_rpc_completion_) {
  on_rpc_completion_(NodeID::FromBinary(remote_node_id_));
}
Don't the health check manager and the syncer share the same io context?
OK, I found they're actually different io contexts:
- ray syncer:
ray/src/ray/gcs/gcs_server/gcs_server.cc
Lines 524 to 525 in 8ab5b2b
ray_syncer_ = std::make_unique<syncer::RaySyncer>(
    io_context_provider_.GetIOContext<syncer::RaySyncer>(), kGCSNodeID.Binary());
- health check manager:
ray/src/ray/gcs/gcs_server/gcs_server.cc
Lines 291 to 292 in 8ab5b2b
gcs_healthcheck_manager_ = std::make_unique<GcsHealthCheckManager>(
    io_context_provider_.GetDefaultIOContext(), node_death_callback);
Updated. I now post it to the io context, which is the same approach AddNode uses.
ray/src/ray/gcs/gcs_server/gcs_health_check_manager.cc
Lines 143 to 153 in 8ab5b2b
void GcsHealthCheckManager::AddNode(const NodeID &node_id,
                                    std::shared_ptr<grpc::Channel> channel) {
  io_service_.dispatch(
      [this, channel = std::move(channel), node_id]() {
        thread_checker_.IsOnSameThread();
        auto context = new HealthCheckContext(this, channel, node_id);
        auto [_, is_new] = health_check_contexts_.emplace(node_id, context);
        RAY_CHECK(is_new);
      },
      "GcsHealthCheckManager::AddNode");
}
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
build failure
Signed-off-by: dentiny <dentinyhao@gmail.com>
Nice, it means we do have an unchecked assertion. Should be fixed.
LG
Signed-off-by: dentiny <dentinyhao@gmail.com>
Followup PR for #49122
Resolves issue: #48837
This PR integrates the gRPC completion callback for the ray syncer with the health check manager, so that a few unnecessary health check RPCs can be skipped.
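As a rough illustration of the idea in the description, the decision to skip a health check could look like the sketch below. It assumes that a completed syncer RPC records a "last known healthy" timestamp per node and that a check is skipped when that timestamp falls within the current check period. The names `NodeHealthState`, `MarkHealthy`, and `ShouldSkipHealthCheck`, and the exact skip condition, are assumptions made for illustration; the PR's actual logic may differ.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;

// Hypothetical per-node state: when the node was last known to be healthy,
// or empty if no successful RPC has been observed yet.
struct NodeHealthState {
  std::optional<Clock::time_point> last_known_healthy;
};

// Would be invoked when an RPC to or from the node completes successfully.
inline void MarkHealthy(NodeHealthState &state, Clock::time_point now) {
  state.last_known_healthy = now;
}

// Returns true when the node was seen healthy less than one check period
// ago, in which case a dedicated health check RPC adds no information.
inline bool ShouldSkipHealthCheck(const NodeHealthState &state,
                                  Clock::time_point now,
                                  std::chrono::milliseconds period) {
  assert(period.count() > 0);
  if (!state.last_known_healthy.has_value()) {
    return false;  // Never seen healthy: the check must run.
  }
  return now - *state.last_known_healthy < period;
}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the decision easy to unit test.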