-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Observability] Added object refs Task is dependent on to TaskInfoEntry
#48234
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing that! Could you add some tests?
@@ -637,6 +637,7 @@ CoreWorker::CoreWorker(const CoreWorkerOptions &options, const WorkerID &worker_ | |||
[this] { | |||
RAY_LOG(INFO) << "Event stats:\n\n" | |||
<< io_service_.stats().StatsString() << "\n\n" | |||
<< task_execution_service_.stats().StatsString() << "\n\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, I have another PR that also adds this.
src/ray/protobuf/common.proto
Outdated
// this task is dependent on and does NOT contain | ||
// - Args passed by value (inlined) | ||
// - ObjectRefs of the args passed by value | ||
repeated ObjectReference dependent_args_refs = 27; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can only record the ObjectID instead of the entire ObjectRef?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely, but why not entire ObjectRef
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we only care about ObjectID so trying to reduce memory usage: we need to store lots of events in GCS also each task can potentially have unlimited number of args.
There is test failure. |
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…ds on Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
… tasks API Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
42ff9d3
to
2f0c056
Compare
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// NOTE: This list only contains `ObjectReference`s passed in as arguments | ||
// this task is dependent on and does NOT contain | ||
// - Args passed by value (inlined) | ||
// - ObjectRefs of the args passed by value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inlined objects
"task_id": "31323334", | ||
"parent_task_id": "", | ||
"args_object_ids": [ | ||
arg_ref.hex(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we store the binary format of the object id instead of hex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decode in the API
0b64417
to
ccd8b8c
Compare
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
ccd8b8c
to
29659a6
Compare
…skInfoEntry` (ray-project#48234)" This reverts commit cacb54c.
…skInfoEntry` (#48234)" (#48498) This reverts commit cacb54c. CoreWorker launches a thread to send events to GCS ([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/task_event_buffer.cc?L286:49-286:60)). #48234 adds the following logic in `pb_util.h`: ```cpp // Fill in task args for (size_t i = 0; i < task_spec.NumArgs(); i++) { if (task_spec.ArgByRef(i)) { task_info->add_args_object_ids(task_spec.ArgRef(i).object_id()); } } ``` However, it is possible for another thread to call [clear_object_ref](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/transport/dependency_resolver.cc?L36-38) between `if (task_spec.ArgByRef(i)) {` and `task_spec.ArgRef(i)`. In this case, `RAY_CHECK` ([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/common/task/task_spec.cc?L279)) in `ArgRef` will fail. Error message: <img width="1864" alt="image" src="https://github.com/user-attachments/assets/5401b6ef-959a-4afe-beea-daf4e1577b0d">
…try` (ray-project#48234) Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…skInfoEntry` (ray-project#48234)" (ray-project#48498) This reverts commit cacb54c. CoreWorker launches a thread to send events to GCS ([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/task_event_buffer.cc?L286:49-286:60)). ray-project#48234 adds the following logic in `pb_util.h`: ```cpp // Fill in task args for (size_t i = 0; i < task_spec.NumArgs(); i++) { if (task_spec.ArgByRef(i)) { task_info->add_args_object_ids(task_spec.ArgRef(i).object_id()); } } ``` However, it is possible for another thread to call [clear_object_ref](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/transport/dependency_resolver.cc?L36-38) between `if (task_spec.ArgByRef(i)) {` and `task_spec.ArgRef(i)`. In this case, `RAY_CHECK` ([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/common/task/task_spec.cc?L279)) in `ArgRef` will fail. Error message: <img width="1864" alt="image" src="https://github.com/user-attachments/assets/5401b6ef-959a-4afe-beea-daf4e1577b0d">
…try` (ray-project#48234) Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…skInfoEntry` (ray-project#48234)" (ray-project#48498) This reverts commit cacb54c. CoreWorker launches a thread to send events to GCS ([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/task_event_buffer.cc?L286:49-286:60)). ray-project#48234 adds the following logic in `pb_util.h`: ```cpp // Fill in task args for (size_t i = 0; i < task_spec.NumArgs(); i++) { if (task_spec.ArgByRef(i)) { task_info->add_args_object_ids(task_spec.ArgRef(i).object_id()); } } ``` However, it is possible for another thread to call [clear_object_ref](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/transport/dependency_resolver.cc?L36-38) between `if (task_spec.ArgByRef(i)) {` and `task_spec.ArgRef(i)`. In this case, `RAY_CHECK` ([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/common/task/task_spec.cc?L279)) in `ArgRef` will fail. Error message: <img width="1864" alt="image" src="https://github.com/user-attachments/assets/5401b6ef-959a-4afe-beea-daf4e1577b0d"> Signed-off-by: JP-sDEV <jon.pablo80@gmail.com>
…try` (ray-project#48234) Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
…skInfoEntry` (ray-project#48234)" (ray-project#48498) This reverts commit cacb54c. CoreWorker launches a thread to send events to GCS ([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/task_event_buffer.cc?L286:49-286:60)). ray-project#48234 adds the following logic in `pb_util.h`: ```cpp // Fill in task args for (size_t i = 0; i < task_spec.NumArgs(); i++) { if (task_spec.ArgByRef(i)) { task_info->add_args_object_ids(task_spec.ArgRef(i).object_id()); } } ``` However, it is possible for another thread to call [clear_object_ref](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/core_worker/transport/dependency_resolver.cc?L36-38) between `if (task_spec.ArgByRef(i)) {` and `task_spec.ArgRef(i)`. In this case, `RAY_CHECK` ([code](https://sourcegraph.com/github.com/ray-project/ray@d0af8622f6b5d0668d521e20539b7d76426ceb5f/-/blob/src/ray/common/task/task_spec.cc?L279)) in `ArgRef` will fail. Error message: <img width="1864" alt="image" src="https://github.com/user-attachments/assets/5401b6ef-959a-4afe-beea-daf4e1577b0d"> Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Why are these changes needed?
Currently when listing current tasks using
ray get task <task-id>
we're not reflecting the args the task might be dependent on and therefore waiting to become available.That substantially complicates troubleshooting when the task is blocked on an argument passed in as an object-ref.
This change adds arguments passed in by reference to the
TaskInfoEntry
to subsequently make it available to our StateHead APIs.Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.