Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Posts CoreWorkerMemoryStore callbacks onto io_context. #47833

Merged
merged 20 commits into from
Oct 30, 2024

Conversation

rynewang
Copy link
Contributor

@rynewang rynewang commented Sep 27, 2024

CoreWorkerMemoryStore has GetAsync method which calls callbacks when the object is ready, immediately or on a Put() call. However despite the name, the call is not Async: it just calls on the GetAsync() or Put() call stack. If a mutex is held by any call frame in the callstack, it remains held in the callback, ripe for deadlocks.

Adds another ctor argument io_context and posts any callbacks to the context to avoid those deadlocks. Note many cpp tests creates CoreWorkerMemoryStore without a io_context so I kept the option of creating an owned io_context with thread.

Closes #47649

CoreWorkerMemoryStore has GetAsync method which calls callbacks when the object is ready, immediately or on a Put() call. However despite the name, the call is not `Async`: it just calls on the GetAsync() or Put() call stack. If a mutex is held by any call frame in the callstack, it remains held in the callback, ripe for deadlocks.

Adds another ctor argument io_context and posts any callbacks to the context to avoid those deadlocks. Note many cpp tests creates CoreWorkerMemoryStore without a io_context so I kept the option of creating an owned io_context with thread.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang added the go add ONLY when ready to merge, run all tests label Sep 27, 2024
…e task id

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@jjyao
Copy link
Collaborator

jjyao commented Oct 7, 2024

@MengjinYan could you take a first pass.

rynewang and others added 2 commits October 7, 2024 16:29
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted with @rynewang offline. Here the PR changes the task id to request id because the corresponding tests (in dependency_resolver_test.cc) fail after the async change in memory store.

From the current understanding, the tests happen to use Tasks with the same TaskID to submit dependency resolve requests. With the async replication, the task state will be overridden and the tests will fail. At the same time, in reality, we are thinking although task execution can have multiple attempts, because the task object will be the same, so it might work with the original implementation.

Therefore, @jjyao we would like your opinion here to see whether the change here makes sense or we should just update the tests to make it more like real scenarios.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang
Copy link
Contributor Author

rynewang commented Oct 9, 2024

@jjyao pls take a look since Mengjin already reviewed

@@ -59,11 +59,15 @@ void InlineDependencies(

void LocalDependencyResolver::CancelDependencyResolution(const TaskID &task_id) {
absl::MutexLock lock(&mu_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why changes in this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the change Mengjin mentioned. Problem is we used TaskID to track dependency resolution requests. But at least in cpp test, a same taskID may call ResolveDependencies multiple times, a new request overrides and deletes an old dependency resolution request. But the test still assumes the old request finishes with callback called.

This PR fixes it by tracing each request not only by TaskID but also by an auto increment id.

I am not sure if the current behavior (forgetting old reqs) is expected, so I want some input from you.

Comment on lines 102 to 103
// ID of the dependency resolution request. Used to index pending_tasks_ because a
// same TaskID may be requested multiple times and we treat each request independently.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a separate issue? Seems the PR description doesn't cover this change.

@rynewang
Copy link
Contributor Author

todo: core_worker.cc GetAsync post removal

@rynewang
Copy link
Contributor Author

Friendly ping @jjyao

Copy link
Collaborator

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

rynewang and others added 4 commits October 28, 2024 21:53
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang
Copy link
Contributor Author

@jjyao ready to merge

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang
Copy link
Contributor Author

@jjyao ready to review

@@ -26,8 +26,9 @@
namespace ray {
namespace internal {
LocalModeObjectStore::LocalModeObjectStore(LocalModeRayRuntime &local_mode_ray_tuntime)
: local_mode_ray_tuntime_(local_mode_ray_tuntime) {
memory_store_ = std::make_unique<CoreWorkerMemoryStore>();
: io_context_("LocalModeObjectStore"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually where this cause thread issue for cpp since now the callback is executed in a different thread?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an answer since I don't know ray-cpp that much

src/ray/core_worker/test/normal_task_submitter_test.cc Outdated Show resolved Hide resolved
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang merged commit 500b3b6 into ray-project:master Oct 30, 2024
5 checks passed
@rynewang rynewang deleted the async-mem-store branch October 30, 2024 17:48
std::unique_ptr<InstrumentedIOContextWithThread> io_context =
std::make_unique<InstrumentedIOContextWithThread>(
"DefaultCoreWorkerMemoryStoreWithThread");
return std::shared_ptr<DefaultCoreWorkerMemoryStoreWithThread>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not std::make_shared? Your implementation leads to two memory allocations instead of one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the ctor is private and thus not visible to make_shared. you can make a private tag type trick but for sake of simplicity I did not do that.

Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
…oject#47833)

CoreWorkerMemoryStore has GetAsync method which calls callbacks when the
object is ready, immediately or on a Put() call. However despite the
name, the call is not `Async`: it just calls on the GetAsync() or Put()
call stack. If a mutex is held by any call frame in the callstack, it
remains held in the callback, ripe for deadlocks.

Adds another ctor argument io_context and posts any callbacks to the
context to avoid those deadlocks. Note many cpp tests creates
CoreWorkerMemoryStore without a io_context so I kept the option of
creating an owned io_context with thread.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <56065503+rynewang@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] ray.cancel deadlock
6 participants