Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Calculate stage execution time in StageStatsSummary from BlockMetadata #37119

Merged
merged 8 commits into from
Jul 10, 2023

Conversation

scottjlee
Copy link
Contributor

@scottjlee scottjlee commented Jul 5, 2023

Why are these changes needed?

Currently, the stage execution time used in StageStatsSummary is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292

Instead, we should calculate the execution time as the maximum wall time from the stage's BlockMetadata, so that this output is correct in cases with multiple stages.

Related issue number

Closes #37105

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Scott Lee added 3 commits July 5, 2023 14:27
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee marked this pull request as ready for review July 5, 2023 23:43
if exec_stats:
# Calculate the total execution time of stage as
# the maximum wall time from all blocks' stats.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this give us the right stage time if there were multiple rounds of blocks?

I was thinking this would be something like max_block_finish_time - min_block_start_time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this give us the right stage time if there were multiple rounds of blocks?
@stephanie-wang would this be for the case where there are multiple substages? or what does "multiple rounds of blocks" mean in this case?

If the former, then I'm thinking we can include additional logic in StageStatsSummary.from_block_metadata() (or a separate function) that can handle summing times across substages.

Also, what attribute(s) can I reference for calculating min/max_block_start_time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought this number was the max duration of a single block/task. If all tasks in the stage run in parallel, then that will be the same as the stage time, but if we have to run multiple parallel rounds of tasks, then the actual stage time will be longer.

Also, what attribute(s) can I reference for calculating min/max_block_start_time?

Hmm it looks like we don't track this right now. I guess ideally each block would report its start and end time, in addition to max duration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we should probably track the start/end time. It's a bit sketchy since clocks might not be quite in sync, but it is useful for doing rough calculations like in this case.

@@ -1062,9 +1099,10 @@ def test_summarize_blocks(ray_start_regular_shared, stage_two_block):
calculated_stats = stats.to_summary()
summarized_lines = calculated_stats.to_string().split("\n")

block_max_time = max(m.exec_stats.wall_time_s for m in block_meta_list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to test against the actual output that we expect instead of internal block values (which may not be right either).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the test uses BlockMetadata created with pre-determined parameters, and we are not reading/manipulating Datasets here, what would be the appropriate "actual output" to compare to in this case? Do you mean I should hardcode the expected value here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking something like checking it against how much wall time has passed (so it is more about matching user expectations). For example, we could do something like this:

  • Record start time
  • Run a Dataset with N tasks that sleep(T), using N / 2 CPUs
  • Check reported stage time is >=2*T and <= actual time passed since start

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 6, 2023
Scott Lee added 3 commits July 6, 2023 15:34
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jul 7, 2023
@scottjlee scottjlee requested a review from stephanie-wang July 7, 2023 02:46
Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks! Much needed change :)

# Check that each map_batches operator has the corresponding execution time.
map_batches_1_stats = ds._get_stats_summary().parents[0].stages_stats[0]
map_batches_2_stats = ds._get_stats_summary().stages_stats[0]
assert sleep_1 <= map_batches_1_stats.time_total_s <= sleep_1 + 0.5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not necessary, but I think we could remove the last check for <= sleep_1 + 0.5

(to make it more robust to unexpected slowdown on CI)

Scott Lee added 2 commits July 7, 2023 10:30
Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Scott Lee <sjl@anyscale.com>
@c21 c21 merged commit a8c5097 into ray-project:master Jul 10, 2023
scottjlee added a commit to scottjlee/ray that referenced this pull request Jul 10, 2023
…ockMetadata` (ray-project#37119)

Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292

Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages.

Signed-off-by: Scott Lee <sjl@anyscale.com>
@scottjlee scottjlee mentioned this pull request Jul 10, 2023
8 tasks
Bhav00 pushed a commit to Bhav00/ray that referenced this pull request Jul 11, 2023
…ockMetadata` (ray-project#37119)

Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292

Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages.

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: Bhavpreet Singh <singh.bhavpreet00@gmail.com>
bveeramani pushed a commit that referenced this pull request Jul 12, 2023
…ockMetadata` (#37119) (#37263)

Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292

Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages.

Signed-off-by: Scott Lee <sjl@anyscale.com>
NripeshN pushed a commit to NripeshN/ray that referenced this pull request Aug 15, 2023
…ockMetadata` (ray-project#37119)

Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292

Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages.

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: NripeshN <nn2012@hw.ac.uk>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ockMetadata` (ray-project#37119)

Currently, the stage execution time used in `StageStatsSummary` is the Dataset's total execution time: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L292

Instead, we should calculate the execution time as the maximum wall time from the stage's `BlockMetadata`, so that this output is correct in cases with multiple stages.

Signed-off-by: Scott Lee <sjl@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] Incorrect StageSummaryStats execution time calculated
4 participants