Collect separate statistics for failed tasks #11327

Merged: 3 commits, Mar 8, 2022

@losipiuk (Member) commented Mar 4, 2022

Description

Collect separate statistics for failed tasks.
Show aggregated CPU/Scheduled time and cumulative memory for failed tasks in Web UI.

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

query engine/UI

Related issues, pull requests, and links

fixes: #10734
improves on: #10754

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# UI
* Add CPU time, scheduled time, and cumulative memory statistics for failed tasks in a query to the web UI. ({issue}`10754`)

# General
* Add CPU time, scheduled time, and cumulative memory statistics for failed tasks in a query to the query-completion event. ({issue}`10734`)

@cla-bot cla-bot bot added the cla-signed label Mar 4, 2022
@losipiuk (Member, Author) commented Mar 4, 2022

This is initial work towards #10734.

An alternative approach is in #11317, but I like this one better. It is a bit more code, but it is more straightforward because we only add the fields we need. #11317 added a whole QueryStats/BasicQueryStats instance to hold stats for a subset of tasks, but many fields in that object only make sense in the context of the whole query.
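To illustrate the shape of this approach, here is a minimal, self-contained sketch. It is not the code from this PR: `TaskStats`, `Aggregate`, `TaskState`, and `aggregate` are hypothetical names, `java.time.Duration` stands in for whatever duration type the engine uses, and cumulative memory is modelled as a plain `double`. It also assumes that the `total*` fields cover all tasks and the `failed*` fields cover the failed subset.

```java
import java.time.Duration;
import java.util.List;

public class FailedTaskStatsSketch {
    // Hypothetical, simplified task states (assumption: the real engine has more).
    enum TaskState { FINISHED, FAILED }

    // Hypothetical per-task snapshot; field names are illustrative only.
    static final class TaskStats {
        final TaskState state;
        final Duration cpuTime;
        final Duration scheduledTime;
        final double cumulativeUserMemory;

        TaskStats(TaskState state, Duration cpuTime, Duration scheduledTime, double cumulativeUserMemory) {
            this.state = state;
            this.cpuTime = cpuTime;
            this.scheduledTime = scheduledTime;
            this.cumulativeUserMemory = cumulativeUserMemory;
        }
    }

    // Only the failed-task counters that are needed sit next to the totals,
    // rather than a second full stats object for the failed subset.
    static final class Aggregate {
        Duration totalCpuTime = Duration.ZERO;
        Duration failedCpuTime = Duration.ZERO;
        Duration totalScheduledTime = Duration.ZERO;
        Duration failedScheduledTime = Duration.ZERO;
        double cumulativeUserMemory;
        double failedCumulativeUserMemory;
    }

    static Aggregate aggregate(List<TaskStats> tasks) {
        Aggregate result = new Aggregate();
        for (TaskStats task : tasks) {
            // Totals include every task (assumption stated in the lead-in).
            result.totalCpuTime = result.totalCpuTime.plus(task.cpuTime);
            result.totalScheduledTime = result.totalScheduledTime.plus(task.scheduledTime);
            result.cumulativeUserMemory += task.cumulativeUserMemory;

            // Failed tasks are additionally counted in the failed* fields.
            if (task.state == TaskState.FAILED) {
                result.failedCpuTime = result.failedCpuTime.plus(task.cpuTime);
                result.failedScheduledTime = result.failedScheduledTime.plus(task.scheduledTime);
                result.failedCumulativeUserMemory += task.cumulativeUserMemory;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<TaskStats> tasks = List.of(
                new TaskStats(TaskState.FAILED, Duration.ofSeconds(3), Duration.ofSeconds(5), 100.0),
                new TaskStats(TaskState.FINISHED, Duration.ofSeconds(4), Duration.ofSeconds(6), 250.0));
        Aggregate stats = aggregate(tasks);
        System.out.println("total CPU: " + stats.totalCpuTime + ", failed CPU: " + stats.failedCpuTime);
    }
}
```

In this sketch, each failed counter is one extra field and one extra line in the loop, which is the "a bit more code but more straightforward" trade-off described above.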

@losipiuk (Member, Author) commented Mar 4, 2022

@arhimondr please skim this and let me know whether it looks like a good direction to you.

private final DataSize userMemoryReservation;
private final DataSize totalMemoryReservation;
private final Duration totalCpuTime;
private final Duration failedCpuTime;
@arhimondr (Contributor) commented Mar 4, 2022

nit: maybe "wasted" (here and in other places)?

@losipiuk (Member, Author) replied

I personally prefer "failed": it is more precise and does not attach emotional meaning to a JSON field. I do not feel strongly about it, though. @martint / @findepi, opinions?

Member replied

I'd prefer "failed" too.

By the way, do you plan anything like speculative execution, where work could become "wasted" without being "failed"?

Member replied

e.g. "cancelled"

@losipiuk (Member, Author) replied

Yeah, that will come up. But even then "wasted" is not really the right term. It has a negative connotation, and if we kill some tasks to speed up execution of the whole query, that should not be seen as doing something wrong; calling it "wasting" resources feels off. So maybe some other term, neither "failed" nor "wasted", but nothing comes to mind.

Member replied

FWIW, I think "failed" is ok. The failure could be intrinsic (cluster / resource / system issues) or induced (task was forced to abort). In either case, it failed to complete.

@losipiuk losipiuk marked this pull request as ready for review March 7, 2022 15:31
@losipiuk losipiuk force-pushed the lo/retry-stats-2 branch 2 times, most recently from eea094d to 288e251 Compare March 8, 2022 09:24
@losipiuk losipiuk merged commit e0d29a8 into trinodb:master Mar 8, 2022
@github-actions github-actions bot added this to the 373 milestone Mar 8, 2022
Development

Successfully merging this pull request may close these issues.

Collect and report task failure related statistics in QueryCompletedEvent