[Data] Calculate stage execution time in `StageStatsSummary` from `BlockMetadata` #37119

scottjlee · 2023-07-05T21:31:14Z

Why are these changes needed?

Currently, the stage execution time used in StageStatsSummary is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292

Instead, we should calculate the execution time as the maximum wall time from the stage's BlockMetadata, so that this output is correct in cases with multiple stages.

Related issue number

Closes #37105

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Scott Lee <sjl@anyscale.com>

stephanie-wang · 2023-07-06T15:01:49Z

python/ray/data/_internal/stats.py

            if exec_stats:
+                # Calculate the total execution time of stage as
+                # the maximum wall time from all blocks' stats.


Would this give us the right stage time if there were multiple rounds of blocks?

I was thinking this would be something like max_block_finish_time - min_block_start_time?

Would this give us the right stage time if there were multiple rounds of blocks?
@stephanie-wang would this be for the case where there are multiple substages? or what does "multiple rounds of blocks" mean in this case?

If the former, then I'm thinking we can include additional logic in StageStatsSummary.from_block_metadata() (or a separate function) that can handle summing times across substages.

Also, what attribute(s) can I reference for calculating min/max_block_start_time?

Ah, I thought this number was the max duration of a single block/task. If all tasks in the stage run in parallel, then that will be the same as the stage time, but if we have to run multiple parallel rounds of tasks, then the actual stage time will be longer.

Also, what attribute(s) can I reference for calculating min/max_block_start_time?

Hmm it looks like we don't track this right now. I guess ideally each block would report its start and end time, in addition to max duration.

Yea, we should probably track the start/end time. It's a bit sketchy since clocks might not be quite in sync, but it is useful for doing rough calculations like in this case.

stephanie-wang · 2023-07-06T15:02:43Z

python/ray/data/tests/test_stats.py

@@ -1062,9 +1099,10 @@ def test_summarize_blocks(ray_start_regular_shared, stage_two_block):
    calculated_stats = stats.to_summary()
    summarized_lines = calculated_stats.to_string().split("\n")

+    block_max_time = max(m.exec_stats.wall_time_s for m in block_meta_list)


It would be good to test against the actual output that we expect instead of internal block values (which may not be right either).

Since the test uses BlockMetadata created with pre-determined parameters, and we are not reading/manipulating Datasets here, what would be the appropriate "actual output" to compare to in this case? Do you mean I should hardcode the expected value here?

Yeah, I was thinking something like checking it against how much wall time has passed (so it is more about matching user expectations). For example, we could do something like this:

Record start time

Run a Dataset with N tasks that sleep(T), using N / 2 CPUs

Check reported stage time is >=2*T and <= actual time passed since start

Signed-off-by: Scott Lee <sjl@anyscale.com>

stephanie-wang

Looks great, thanks! Much needed change :)

stephanie-wang · 2023-07-07T14:41:43Z

python/ray/data/tests/test_stats.py

+    # Check that each map_batches operator has the corresponding execution time.
+    map_batches_1_stats = ds._get_stats_summary().parents[0].stages_stats[0]
+    map_batches_2_stats = ds._get_stats_summary().stages_stats[0]
+    assert sleep_1 <= map_batches_1_stats.time_total_s <= sleep_1 + 0.5


Maybe not necessary, but I think we could remove the last check for <= sleep_1 + 0.5

(to make it more robust to unexpected slowdown on CI)

Signed-off-by: Scott Lee <sjl@anyscale.com>

…ockMetadata` (ray-project#37119) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com>

…ockMetadata` (ray-project#37119) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: Bhavpreet Singh <singh.bhavpreet00@gmail.com>

…ockMetadata` (#37119) (#37263) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com>

…ockMetadata` (ray-project#37119) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: NripeshN <nn2012@hw.ac.uk>

…ockMetadata` (ray-project#37119) Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292 Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages. Signed-off-by: Scott Lee <sjl@anyscale.com> Signed-off-by: e428265 <arvind.chandramouli@lmco.com>

Scott Lee added 3 commits July 5, 2023 14:27

calculate stage execution time from metadata

5260c2d

Signed-off-by: Scott Lee <sjl@anyscale.com>

undo one way change to ctx.optimize_fuse_read_stages

dc2683a

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into dupe-stats

7ba85df

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee marked this pull request as ready for review July 5, 2023 23:43

scottjlee requested review from ericl, scv119, c21, amogkam, bveeramani and raulchen as code owners July 5, 2023 23:43

scottjlee assigned stephanie-wang Jul 5, 2023

stephanie-wang reviewed Jul 6, 2023

View reviewed changes

stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 6, 2023

Scott Lee added 3 commits July 6, 2023 15:34

progress

cecf3cd

Signed-off-by: Scott Lee <sjl@anyscale.com>

additional tests

c5ab5f5

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into dupe-stats

8e35893

Signed-off-by: Scott Lee <sjl@anyscale.com>

scottjlee added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jul 7, 2023

scottjlee requested a review from stephanie-wang July 7, 2023 02:46

stephanie-wang approved these changes Jul 7, 2023

View reviewed changes

Scott Lee added 2 commits July 7, 2023 10:30

remove unnecessary check

019c6f2

Signed-off-by: Scott Lee <sjl@anyscale.com>

Merge branch 'master' into dupe-stats

3d69383

Signed-off-by: Scott Lee <sjl@anyscale.com>

c21 approved these changes Jul 10, 2023

View reviewed changes

c21 merged commit a8c5097 into ray-project:master Jul 10, 2023

scottjlee mentioned this pull request Jul 10, 2023

[Data] Cherry-pick #37119 #37263

Merged

8 tasks

akshay-anyscale mentioned this pull request Jul 21, 2023

Add service deployment instructions to stable diffusion template #37645

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data] Calculate stage execution time in `StageStatsSummary` from `BlockMetadata` #37119

[Data] Calculate stage execution time in `StageStatsSummary` from `BlockMetadata` #37119

scottjlee commented Jul 5, 2023 •

edited

Loading

stephanie-wang Jul 6, 2023

scottjlee Jul 6, 2023

stephanie-wang Jul 6, 2023

ericl Jul 6, 2023

stephanie-wang Jul 6, 2023

scottjlee Jul 6, 2023

stephanie-wang Jul 6, 2023

stephanie-wang left a comment

stephanie-wang Jul 7, 2023

[Data] Calculate stage execution time in StageStatsSummary from BlockMetadata #37119

[Data] Calculate stage execution time in StageStatsSummary from BlockMetadata #37119

Conversation

scottjlee commented Jul 5, 2023 • edited Loading

Why are these changes needed?

Related issue number

Checks

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stephanie-wang left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

[Data] Calculate stage execution time in `StageStatsSummary` from `BlockMetadata` #37119

[Data] Calculate stage execution time in `StageStatsSummary` from `BlockMetadata` #37119

scottjlee commented Jul 5, 2023 •

edited

Loading