-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Minor cpp changes around core worker #48262
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@@ -103,6 +103,7 @@ bool GetRequest::Wait(int64_t timeout_ms) { | |||
auto remaining_timeout_ms = timeout_ms; | |||
auto timeout_timestamp = current_time_ms() + timeout_ms; | |||
while (!is_ready_) { | |||
// TODO (dayshah): see if using cv condition function instead of busy while helps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya i think it could still be relevant, pretty sure using cv.wait_for(lock, timeout, []() { return !is_ready; });
could lead to less cpu usage vs the busy while loop but want to look more into how it would affect performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let's do it in another PR, if needed
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
test failures in cpp and python |
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@@ -703,7 +690,7 @@ Status NormalTaskSubmitter::CancelTask(TaskSpecification task_spec, | |||
bool recursive) { | |||
RAY_LOG(INFO) << "Cancelling a task: " << task_spec.TaskId() | |||
<< " force_kill: " << force_kill << " recursive: " << recursive; | |||
const SchedulingKey scheduling_key( | |||
SchedulingKey scheduling_key( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why removed const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it can be moved into function capture below
if (!keep_executing) { | ||
RAY_UNUSED(task_finisher_->FailOrRetryPendingTask( | ||
task_spec.TaskId(), rpc::ErrorType::TASK_CANCELLED, nullptr)); | ||
RequestNewWorkerIfNeeded(scheduling_key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we removed a call to FailOrRetryPendingTask here. is this expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya this is the call we removed for when the task is removed and the task dependencies have just now been resolved, since now we're failing task before when cancel is actually called
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the PR description, could you add details on how the hang happens. Something like #47861
@@ -4036,7 +4036,7 @@ void CoreWorker::HandleCancelTask(rpc::CancelTaskRequest request, | |||
RAY_LOG(INFO).WithField(task_id).WithField(current_actor_id) | |||
<< "Cancel an actor task"; | |||
CancelActorTaskOnExecutor( | |||
caller_worker_id, task_id, force_kill, recursive, on_cancel_callback); | |||
caller_worker_id, task_id, force_kill, recursive, std::move(on_cancel_callback)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to move since the parameter is const &?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya, my bad left it in even after changing param
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait actually CancelActorTaskOnExecucutor takes by value, CancelTaskOnExecutor takes by const ref
src/ray/core_worker/task_manager.cc
Outdated
@@ -1043,22 +1043,27 @@ void TaskManager::FailPendingTask(const TaskID &task_id, | |||
auto it = submissible_tasks_.find(task_id); | |||
RAY_CHECK(it != submissible_tasks_.end()) | |||
<< "Tried to fail task that was not pending " << task_id; | |||
RAY_CHECK(it->second.IsPending()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why changing this? The function name indicating that it's a pending task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the issue was there was a ray data backpressure test and a train test that failed here, it seems that the task cancel would happen after the task finishes and by the time it acquires that mutex in FailPendingTask it's already at a point where the task status is finished. So change here is basically just to no-op on if the task is finished at this point.
and IsPending checks for status != fail and finish, only want to check for fail here to make sure we're not double failing
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Updated pr description to list the 3 main changes and why they're needed |
Signed-off-by: dayshah <dhyey2019@gmail.com>
@@ -53,10 +53,10 @@ namespace ray { | |||
namespace rpc { | |||
|
|||
/// The maximum number of requests in flight per client. | |||
const int64_t kMaxBytesInFlight = 16 * 1024 * 1024; | |||
constexpr int64_t kMaxBytesInFlight = 16L * 1024 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Randomly find this PR, two things:
- We should have a human-readable unit library, something like
16_MiB
; inline
is needed if we declare constants at header file- C++ spec doesn't guide the behavior on
constexpr
, when included by multiple translation units, whether the symbol appears once or multiple times, so it's a compiler-based UB
- C++ spec doesn't guide the behavior on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it looks ok to you? #48638
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but let's not block this pr. we can always update later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline constexpr
is needed, yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya good point inlined, can integrate into library after, there's other instances of stuff like this too can go at once
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
@@ -724,12 +721,9 @@ Status NormalTaskSubmitter::CancelTask(TaskSpecification task_spec, | |||
for (auto spec = scheduling_tasks.begin(); spec != scheduling_tasks.end(); spec++) { | |||
if (spec->TaskId() == task_spec.TaskId()) { | |||
scheduling_tasks.erase(spec); | |||
|
|||
if (scheduling_tasks.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't remove this if
check, we should only cancel worker lease if there is no scheduling task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually removed this because CancelWorkerLeaseIfNeeded already checks if its empty and early returns if it's not empty
As titled, I think having `MB` explicitly called out is more readable than `1024 * 1024` or `1<<20` Intended use case: #48262 (comment) Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
As titled, I think having `MB` explicitly called out is more readable than `1024 * 1024` or `1<<20` Intended use case: ray-project#48262 (comment) Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
As titled, I think having `MB` explicitly called out is more readable than `1024 * 1024` or `1<<20` Intended use case: ray-project#48262 (comment) Signed-off-by: dentiny <dentinyhao@gmail.com> Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Why are these changes needed?
Minor cpp changes around core worker, was part of #48661, but factored out those changes.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.