[Observability] Added object refs Task is dependent on to TaskInfoEntry #48234

Merged: 16 commits, Oct 28, 2024

alexeykudinkin
Contributor

Why are these changes needed?

Currently, when inspecting a task with ray get tasks <task-id>, the output does not show the args the task depends on, that is, the ones it may be waiting on to become available.

That substantially complicates troubleshooting when the task is blocked on an argument passed in as an object-ref.

This change adds the arguments passed by reference to the TaskInfoEntry, which makes them available to our StateHead APIs.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Oct 23, 2024
Collaborator

@jjyao jjyao left a comment

Thanks for doing that! Could you add some tests?

@@ -637,6 +637,7 @@ CoreWorker::CoreWorker(const CoreWorkerOptions &options, const WorkerID &worker_
[this] {
RAY_LOG(INFO) << "Event stats:\n\n"
<< io_service_.stats().StatsString() << "\n\n"
<< task_execution_service_.stats().StatsString() << "\n\n"
Collaborator

Haha, I have another PR that also adds this.

// this task is dependent on and does NOT contain
// - Args passed by value (inlined)
// - ObjectRefs of the args passed by value
repeated ObjectReference dependent_args_refs = 27;
Collaborator

I think we can record only the ObjectID instead of the entire ObjectRef?

Contributor Author

Sure, but why not the entire ObjectRef?

Collaborator

I think we only care about the ObjectID, so I'm trying to reduce memory usage: we need to store lots of events in GCS, and each task can potentially have an unlimited number of args.
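
To make the memory argument concrete, here is a minimal sketch of the difference between storing full references and storing IDs only. `ToyObjectRef` and its fields are hypothetical stand-ins chosen for illustration, not Ray's actual `ObjectReference` message, and the byte counts only sum string payloads rather than measuring real serialized sizes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a full object reference.
// The field set is an assumption for illustration only.
struct ToyObjectRef {
  std::string object_id;      // binary object ID
  std::string owner_address;  // assumed: serialized owner information
  std::string call_site;      // assumed: debug string
};

// Keeps only the IDs, dropping everything else carried by each reference.
std::vector<std::string> IdsOnly(const std::vector<ToyObjectRef> &refs) {
  std::vector<std::string> ids;
  ids.reserve(refs.size());
  for (const auto &ref : refs) {
    ids.push_back(ref.object_id);
  }
  return ids;
}

// Payload bytes when every field of every reference is stored.
std::size_t FullRefBytes(const std::vector<ToyObjectRef> &refs) {
  std::size_t total = 0;
  for (const auto &ref : refs) {
    total += ref.object_id.size() + ref.owner_address.size() +
             ref.call_site.size();
  }
  return total;
}

// Payload bytes when only the IDs are stored.
std::size_t IdsOnlyBytes(const std::vector<ToyObjectRef> &refs) {
  std::size_t total = 0;
  for (const auto &id : IdsOnly(refs)) {
    total += id.size();
  }
  return total;
}
```

Because the per-argument overhead is multiplied by the number of args and by the number of stored task events, dropping the extra fields matters most for tasks with many by-reference arguments.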

Collaborator

jjyao commented Oct 24, 2024

There is a test failure.

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Collaborator

@jjyao jjyao left a comment

LGTM

// NOTE: This list only contains `ObjectReference`s passed in as arguments
// this task is dependent on and does NOT contain
// - Args passed by value (inlined)
// - ObjectRefs of the args passed by value
Collaborator

What does this mean?

Contributor Author

@alexeykudinkin alexeykudinkin Oct 28, 2024

Inlined objects

python/ray/util/state/common.py (outdated)
"task_id": "31323334",
"parent_task_id": "",
"args_object_ids": [
arg_ref.hex(),
Collaborator

Don't we store the binary format of the object id instead of hex?

Contributor Author

We decode in the API
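
As a rough illustration of the binary-versus-hex distinction discussed here, the sketch below hex-encodes a raw byte string. `ToHex` is a hypothetical helper written for this example, not the function the State API actually uses. For instance, the ASCII bytes `1234` encode to `31323334`, which matches the `task_id` value in the fixture above.

```cpp
#include <cstddef>
#include <string>

// Encodes raw bytes as lowercase hex, two characters per input byte.
// Hypothetical helper for illustration only.
std::string ToHex(const std::string &binary) {
  static const char kDigits[] = "0123456789abcdef";
  std::string hex;
  hex.reserve(binary.size() * 2);
  for (unsigned char byte : binary) {
    hex.push_back(kDigits[byte >> 4]);    // high nibble
    hex.push_back(kDigits[byte & 0x0F]);  // low nibble
  }
  return hex;
}
```

Storing the compact binary form and converting to hex only at the API boundary keeps the stored events small while still giving users a printable ID.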

@alexeykudinkin alexeykudinkin force-pushed the ak/tsk-evts-ctx-fix branch 2 times, most recently from 0b64417 to ccd8b8c Compare October 28, 2024 18:20
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@jjyao jjyao enabled auto-merge (squash) October 28, 2024 19:07
@jjyao jjyao merged commit cacb54c into master Oct 28, 2024
6 checks passed
@jjyao jjyao deleted the ak/tsk-evts-ctx-fix branch October 28, 2024 21:47
kevin85421 added a commit to kevin85421/ray that referenced this pull request Nov 1, 2024
can-anyscale pushed a commit that referenced this pull request Nov 1, 2024
…skInfoEntry` (#48234)" (#48498)

This reverts commit cacb54c.

CoreWorker launches a thread to send events to GCS
([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/task_event_buffer.cc?L286:49-286:60)).
#48234 adds the following logic in `pb_util.h`:

```cpp
  // Fill in task args
  for (size_t i = 0; i < task_spec.NumArgs(); i++) {
    if (task_spec.ArgByRef(i)) {
      task_info->add_args_object_ids(task_spec.ArgRef(i).object_id());
    }
  }
```

However, it is possible for another thread to call
[clear_object_ref](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/transport/dependency_resolver.cc?L36-38)
between `if (task_spec.ArgByRef(i)) {` and `task_spec.ArgRef(i)`. In
this case, `RAY_CHECK`
([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/common/task/task_spec.cc?L279))
in `ArgRef` will fail.

Error message:

<img width="1864" alt="image"
src="https://github.com/user-attachments/assets/5401b6ef-959a-4afe-beea-daf4e1577b0d">
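
The failure described above is a check-then-act race: the check (`ArgByRef`) and the read (`ArgRef`) are two separate calls, so another thread can change the state in between. One common way to close such a window is to fold the check and the read into a single call made under one lock. The sketch below shows that pattern with a hypothetical `ToyTaskSpec`. It is not Ray's `TaskSpecification`, and it is not necessarily the fix Ray adopted; all names here are assumptions made for illustration.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a task spec whose by-reference args can be
// cleared (inlined) by another thread. Not Ray's TaskSpecification.
class ToyTaskSpec {
 public:
  explicit ToyTaskSpec(std::vector<std::optional<std::string>> args)
      : args_(std::move(args)) {}

  std::size_t NumArgs() const {
    std::lock_guard<std::mutex> lock(mu_);
    return args_.size();
  }

  // Checks and reads in one critical section, so there is no window
  // between "is it by reference?" and "give me the reference".
  std::optional<std::string> TryGetArgObjectId(std::size_t i) const {
    std::lock_guard<std::mutex> lock(mu_);
    if (i >= args_.size()) return std::nullopt;
    return args_[i];
  }

  // Simulates another thread inlining an argument, which drops its ref.
  void ClearObjectRef(std::size_t i) {
    std::lock_guard<std::mutex> lock(mu_);
    if (i < args_.size()) args_[i].reset();
  }

 private:
  mutable std::mutex mu_;
  std::vector<std::optional<std::string>> args_;
};

// Collects the IDs of all args that are still passed by reference.
// An arg cleared concurrently is skipped instead of tripping a check.
std::vector<std::string> CollectArgObjectIds(const ToyTaskSpec &spec) {
  std::vector<std::string> ids;
  const std::size_t n = spec.NumArgs();
  for (std::size_t i = 0; i < n; i++) {
    if (auto id = spec.TryGetArgObjectId(i)) {
      ids.push_back(std::move(*id));
    }
  }
  return ids;
}
```

With this shape, a concurrent `ClearObjectRef` can only make an argument disappear from the collected list. It cannot cause the reader to dereference a reference that no longer exists.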
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
…skInfoEntry` (ray-project#48234)" (ray-project#48498)
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…skInfoEntry` (ray-project#48234)" (ray-project#48498)
Signed-off-by: JP-sDEV <jon.pablo80@gmail.com>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…try` (ray-project#48234)

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…skInfoEntry` (ray-project#48234)" (ray-project#48498)
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>