
[core] Minor cpp changes around core worker #48262

Merged: 23 commits into ray-project:master on Nov 9, 2024

Conversation

@dayshah (Contributor) commented Oct 24, 2024

Why are these changes needed?

Minor C++ changes around the core worker. These were originally part of #48661 but have been factored out into this PR.

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dayshah dayshah marked this pull request as ready for review October 28, 2024 22:26
@dayshah dayshah requested a review from a team as a code owner October 28, 2024 22:26
@dayshah dayshah changed the title [WIP][core] Fix hang when getting cancelled task with dependencies in progress [core] Fix hang when getting cancelled task with dependencies in progress Oct 28, 2024
```diff
@@ -103,6 +103,7 @@ bool GetRequest::Wait(int64_t timeout_ms) {
   auto remaining_timeout_ms = timeout_ms;
   auto timeout_timestamp = current_time_ms() + timeout_ms;
   while (!is_ready_) {
+    // TODO (dayshah): see if using cv condition function instead of busy while helps.
```
Contributor: Is this still relevant?

dayshah (Author): Yes, I think it's still relevant. I'm fairly sure that cv_.wait_for(lock, timeout, [this] { return is_ready_; }); would use less CPU than the busy while loop, but I want to look more into how it would affect performance.

Contributor: OK, let's do it in another PR, if needed.
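For reference, a minimal sketch of what the condition-variable approach could look like. The mutex_ and cv_ members are hypothetical names, not the actual GetRequest internals:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

class GetRequest {
 public:
  // Returns true if the request became ready within timeout_ms,
  // false if the wait timed out.
  bool Wait(int64_t timeout_ms) {
    std::unique_lock<std::mutex> lock(mutex_);
    // wait_for sleeps until notified or until the timeout expires,
    // re-checking the predicate on each wakeup; no busy spinning.
    return cv_.wait_for(lock,
                        std::chrono::milliseconds(timeout_ms),
                        [this] { return is_ready_; });
  }

  void MarkReady() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      is_ready_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;  // hypothetical members
  std::condition_variable cv_;
  bool is_ready_ = false;
};
```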

src/ray/core_worker/task_manager.cc: review thread resolved (outdated)
@rynewang (Contributor): There are test failures in C++ and Python.

@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Oct 29, 2024
```diff
@@ -703,7 +690,7 @@ Status NormalTaskSubmitter::CancelTask(TaskSpecification task_spec,
                                        bool recursive) {
   RAY_LOG(INFO) << "Cancelling a task: " << task_spec.TaskId()
                 << " force_kill: " << force_kill << " recursive: " << recursive;
-  const SchedulingKey scheduling_key(
+  SchedulingKey scheduling_key(
```
Contributor: Why was const removed?

dayshah (Author): So it can be moved into the lambda capture below.
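For context on why the const had to go: std::move on a const object silently selects the copy constructor, so a const local can never actually be moved from. A minimal sketch of the pattern, with SchedulingKey reduced to a stand-in and hypothetical names:

```cpp
#include <functional>
#include <string>
#include <utility>

struct SchedulingKey {
  std::string id;  // stand-in for the real key fields
};

std::function<void()> MakeCancelCallback() {
  SchedulingKey scheduling_key{"task-123"};
  // C++14 init-capture: the key is moved into the closure rather than
  // copied. Had scheduling_key been declared const, std::move here would
  // compile but quietly perform a copy instead.
  return [key = std::move(scheduling_key)] {
    // ... use key to request a new worker, cancel the lease, etc. ...
  };
}
```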

```cpp
  if (!keep_executing) {
    RAY_UNUSED(task_finisher_->FailOrRetryPendingTask(
        task_spec.TaskId(), rpc::ErrorType::TASK_CANCELLED, nullptr));
    RequestNewWorkerIfNeeded(scheduling_key);
```
Contributor: We removed a call to FailOrRetryPendingTask here. Is this expected?

dayshah (Author): Yes. This is the call we removed for the case where a task is cancelled while its dependencies are still being resolved. Since we now fail the task earlier, at the point where cancel is actually called, this call is redundant.

@jjyao (Collaborator) left a comment: In the PR description, could you add details on how the hang happens? Something like #47861.

python/ray/tests/test_cancel.py: review thread resolved (outdated)
```diff
@@ -4036,7 +4036,7 @@ void CoreWorker::HandleCancelTask(rpc::CancelTaskRequest request,
     RAY_LOG(INFO).WithField(task_id).WithField(current_actor_id)
         << "Cancel an actor task";
     CancelActorTaskOnExecutor(
-        caller_worker_id, task_id, force_kill, recursive, on_cancel_callback);
+        caller_worker_id, task_id, force_kill, recursive, std::move(on_cancel_callback));
```
Collaborator: Do we need to move, since the parameter is const&?

dayshah (Author): Yes, my bad; I left it in even after changing the parameter.

dayshah (Author): Wait, actually CancelActorTaskOnExecutor takes the callback by value, while CancelTaskOnExecutor takes it by const reference, so the move is correct here.
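To illustrate the distinction being discussed, a minimal sketch with hypothetical simplified signatures, not the actual CoreWorker API:

```cpp
#include <functional>
#include <iostream>
#include <utility>

using Callback = std::function<void()>;

// Takes the callback by value: std::move at the call site transfers
// ownership into the parameter instead of copying.
void CancelActorTaskOnExecutor(Callback on_cancel) { on_cancel(); }

// Takes the callback by const reference: std::move at the call site is
// useless, since a const lvalue reference binds to the rvalue without
// moving anything.
void CancelTaskOnExecutor(const Callback& on_cancel) { on_cancel(); }

int main() {
  Callback cb = [] { std::cout << "actor task cancelled\n"; };
  CancelActorTaskOnExecutor(std::move(cb));  // moved into the parameter

  Callback cb2 = [] { std::cout << "task cancelled\n"; };
  CancelTaskOnExecutor(cb2);  // no move needed or useful
}
```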

```diff
@@ -1043,22 +1043,27 @@ void TaskManager::FailPendingTask(const TaskID &task_id,
   auto it = submissible_tasks_.find(task_id);
   RAY_CHECK(it != submissible_tasks_.end())
       << "Tried to fail task that was not pending " << task_id;
-  RAY_CHECK(it->second.IsPending())
```
Collaborator: Why change this? The function name indicates that it operates on a pending task.

dayshah (Author): The issue was that a Ray Data backpressure test and a train test failed here. The task cancel could happen after the task finishes, and by the time FailPendingTask acquires the mutex the task status is already finished. The change here is basically to no-op if the task is finished at that point.

Also, IsPending checks that the status is neither failed nor finished; here we only want to check for failed, to make sure we're not double-failing.
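A minimal sketch of the guard being described, using a hypothetical status enum rather than Ray's actual task-entry types:

```cpp
#include <cassert>

enum class TaskStatus { PENDING, FINISHED, FAILED };

// Hypothetical simplification: cancel can race with completion, so if the
// task already finished by the time we hold the lock, do nothing. We only
// assert that we are not failing an already-failed task.
void FailPendingTask(TaskStatus& status) {
  if (status == TaskStatus::FINISHED) {
    return;  // Task completed before the cancel was processed; no-op.
  }
  assert(status != TaskStatus::FAILED);  // guard against double-failing
  status = TaskStatus::FAILED;
  // ... propagate the failure to the task's callers ...
}
```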

@dayshah (Author) commented Nov 7, 2024:

> In the PR description, could you add details on how the hang happens? Something like #47861.

Updated the PR description to list the three main changes and why they're needed.

```diff
@@ -53,10 +53,10 @@ namespace ray {
 namespace rpc {

 /// The maximum number of requests in flight per client.
-const int64_t kMaxBytesInFlight = 16 * 1024 * 1024;
+constexpr int64_t kMaxBytesInFlight = 16L * 1024 * 1024;
```
Contributor: Randomly found this PR; two things:

  • We should have a human-readable unit library, something like 16_MiB.
  • inline is needed when declaring constants in a header file: a plain constexpr variable at namespace scope has internal linkage, so every translation unit that includes the header gets its own copy of the symbol; inline constexpr guarantees a single definition across translation units.

Contributor: Does #48638 look OK to you?

Contributor: Yes, but let's not block this PR; we can always update later.

Contributor: inline constexpr is needed, yes.

dayshah (Author): Good point, inlined. We can integrate a unit library afterwards; there are other instances like this that can all be migrated at once.
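Putting both suggestions together, a minimal sketch; the _MiB literal is hypothetical and not necessarily what #48638 ended up with:

```cpp
#include <cstdint>

namespace ray {
namespace rpc {

// Hypothetical user-defined literal for mebibytes.
constexpr int64_t operator""_MiB(unsigned long long n) {
  return static_cast<int64_t>(n) * 1024 * 1024;
}

/// The maximum number of bytes in flight per client.
// `inline constexpr` gives the variable external linkage with exactly one
// definition program-wide, even when this header is included by many
// translation units.
inline constexpr int64_t kMaxBytesInFlight = 16_MiB;

}  // namespace rpc
}  // namespace ray
```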

@dayshah dayshah changed the title [core] Fix hang when getting cancelled task with dependencies in progress [core] Minor cpp changes around core worker Nov 8, 2024
```diff
@@ -724,12 +721,9 @@ Status NormalTaskSubmitter::CancelTask(TaskSpecification task_spec,
   for (auto spec = scheduling_tasks.begin(); spec != scheduling_tasks.end(); spec++) {
     if (spec->TaskId() == task_spec.TaskId()) {
       scheduling_tasks.erase(spec);

-      if (scheduling_tasks.empty()) {
```
Collaborator: We shouldn't remove this if check; we should only cancel the worker lease when there are no scheduling tasks left.

dayshah (Author): I actually removed this because CancelWorkerLeaseIfNeeded already checks whether the queue is empty and returns early if it isn't.
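A sketch of the early return that makes the caller-side check redundant, as a hypothetical simplification of the submitter internals:

```cpp
#include <deque>

struct TaskSpec { /* ... */ };

// Hypothetical simplification: one queue per scheduling key.
class SchedulingQueue {
 public:
  void CancelWorkerLeaseIfNeeded() {
    if (!scheduling_tasks_.empty()) {
      return;  // Tasks are still queued; keep the pending worker lease.
    }
    // ... otherwise cancel the pending worker lease ...
  }

 private:
  std::deque<TaskSpec> scheduling_tasks_;
};
```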

@jjyao jjyao merged commit e2dfcc8 into ray-project:master Nov 9, 2024
5 checks passed
@dayshah dayshah deleted the cancel-task-with-deps branch November 9, 2024 18:56
rynewang pushed a commit that referenced this pull request Nov 12, 2024:

As titled, I think having `MB` explicitly called out is more readable than `1024 * 1024` or `1 << 20`.
Intended use case: #48262 (comment)

Signed-off-by: dentiny <dentinyhao@gmail.com>