[Data] Fix incorrect pending task size if outputs are empty #47604

Merged
merged 1 commit into from
Sep 11, 2024

Conversation

bveeramani
Member

Why are these changes needed?

If an operator outputs empty blocks, then Ray Data thinks that the operator has 256 MiB of pending task outputs, even though it should be 0. For example:

import pyarrow as pa
output = pa.Table.from_pydict({"data": [None] * 128})
assert output.nbytes == 0, output.nbytes

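The bug occurs because we check whether average_bytes_per_output is truthy rather than whether it is not None. An average of 0 bytes is falsy, so the estimate falls back to the target max block size instead of staying at 0.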

bytes_per_output = self.average_bytes_per_output
if bytes_per_output is None:
    bytes_per_output = context.target_max_block_size
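
To illustrate the difference between the two checks, here is a minimal standalone sketch. It does not use Ray's actual classes: the function names and the `FALLBACK_BLOCK_SIZE` constant are hypothetical stand-ins for `average_bytes_per_output` and `context.target_max_block_size`, and the fallback value is chosen only for illustration.

```python
from typing import Optional

# Hypothetical fallback, standing in for context.target_max_block_size.
FALLBACK_BLOCK_SIZE = 128 * 1024 * 1024


def estimate_with_truthy_check(average_bytes: Optional[float]) -> float:
    # Truthiness check: both None and 0 are falsy, so an average of 0 bytes
    # is silently replaced by the fallback.
    return average_bytes or FALLBACK_BLOCK_SIZE


def estimate_with_none_check(average_bytes: Optional[float]) -> float:
    # Explicit None check: only a missing average triggers the fallback,
    # so a measured average of 0 bytes is preserved.
    if average_bytes is None:
        return FALLBACK_BLOCK_SIZE
    return average_bytes


if __name__ == "__main__":
    for value in (None, 0, 1024):
        print(value, estimate_with_truthy_check(value), estimate_with_none_check(value))
```

The two functions agree when the average is missing or positive and differ only when it is exactly 0, which is the empty-block case described above.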

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) September 11, 2024 18:44
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 11, 2024
@bveeramani bveeramani merged commit 029ff4d into master Sep 11, 2024
6 checks passed
@bveeramani bveeramani deleted the fix-metrics-bug branch September 11, 2024 19:57
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
…ect#47604)

If an operator outputs empty blocks, then Ray Data thinks that the
operator has 256 MiB of pending task outputs, even though it should be
0. For example:
```python
import pyarrow as pa
output = pa.Table.from_pydict({"data": [None] * 128})
assert output.nbytes == 0, output.nbytes
```

The bug occurs because we check whether `average_bytes_per_output`
is truthy rather than whether it is not `None`.

https://github.com/ray-project/ray/blob/1f83fb44580e392ba6d39a9e79bbdd8cd5b7d916/python/ray/data/_internal/execution/interfaces/op_runtime_metrics.py#L369-L371
---

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>