-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Calculate stage execution time in StageStatsSummary
from BlockMetadata
#37119
Conversation
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
python/ray/data/_internal/stats.py
Outdated
if exec_stats: | ||
# Calculate the total execution time of stage as | ||
# the maximum wall time from all blocks' stats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this give us the right stage time if there were multiple rounds of blocks?
I was thinking this would be something like max_block_finish_time - min_block_start_time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this give us the right stage time if there were multiple rounds of blocks?
@stephanie-wang would this be for the case where there are multiple substages? or what does "multiple rounds of blocks" mean in this case?
If the former, then I'm thinking we can include additional logic in StageStatsSummary.from_block_metadata()
(or a separate function) that can handle summing times across substages.
Also, what attribute(s) can I reference for calculating min/max_block_start_time
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I thought this number was the max duration of a single block/task. If all tasks in the stage run in parallel, then that will be the same as the stage time, but if we have to run multiple parallel rounds of tasks, then the actual stage time will be longer.
Also, what attribute(s) can I reference for calculating min/max_block_start_time?
Hmm it looks like we don't track this right now. I guess ideally each block would report its start and end time, in addition to max duration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, we should probably track the start/end time. It's a bit sketchy since clocks might not be quite in sync, but it is useful for doing rough calculations like in this case.
python/ray/data/tests/test_stats.py
Outdated
@@ -1062,9 +1099,10 @@ def test_summarize_blocks(ray_start_regular_shared, stage_two_block): | |||
calculated_stats = stats.to_summary() | |||
summarized_lines = calculated_stats.to_string().split("\n") | |||
|
|||
block_max_time = max(m.exec_stats.wall_time_s for m in block_meta_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to test against the actual output that we expect instead of internal block values (which may not be right either).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the test uses BlockMetadata
created with pre-determined parameters, and we are not reading/manipulating Datasets here, what would be the appropriate "actual output" to compare to in this case? Do you mean I should hardcode the expected value here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking something like checking it against how much wall time has passed (so it is more about matching user expectations). For example, we could do something like this:
- Record start time
- Run a Dataset with N tasks that sleep(T), using N / 2 CPUs
- Check reported stage time is >=2*T and <= actual time passed since start
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks! Much needed change :)
python/ray/data/tests/test_stats.py
Outdated
# Check that each map_batches operator has the corresponding execution time. | ||
map_batches_1_stats = ds._get_stats_summary().parents[0].stages_stats[0] | ||
map_batches_2_stats = ds._get_stats_summary().stages_stats[0] | ||
assert sleep_1 <= map_batches_1_stats.time_total_s <= sleep_1 + 0.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not necessary, but I think we could remove the last check for <= sleep_1 + 0.5
(to make it more robust to unexpected slowdown on CI)
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
…ockMetadata` (ray-project#37119) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com>
…ockMetadata` (ray-project#37119) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: Bhavpreet Singh <singh.bhavpreet00@gmail.com>
…ockMetadata` (#37119) (#37263) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com>
…ockMetadata` (ray-project#37119) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>
…ockMetadata` (ray-project#37119) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Why are these changes needed?
Currently, the stage execution time used in
StageStatsSummary
is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292Instead, we should calculate the execution time as the maximum wall time from the stage's
BlockMetadata
, so that this output is correct in cases with multiple stages.Related issue number
Closes #37105
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.