[Data] Ray Data heap OOMs when using fused pandas #46785

bveeramani · 2024-07-25T02:42:58Z

What happened + What you expected to happen

Ray Data map tasks buffer outputs until the buffer contains at least 128 MiB of data. However, Ray Data doesn't correctly count the size of some pandas DataFrames, so if you use the pandas batch format, Ray Data might buffer too much data and your task will heap OOM.
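To illustrate why an undercounted size matters, here is a toy model of a size-triggered buffer. It is only a sketch: `count_buffered_blocks`, the 4 GiB memory limit, and the 16 KiB per-block estimate are hypothetical values chosen for illustration, not Ray Data internals. Only the 128 MiB threshold comes from the description above.

```python
TARGET_BYTES = 128 * 1024 * 1024  # flush threshold taken from the description above


def count_buffered_blocks(actual_block_bytes, estimated_block_bytes, memory_limit_bytes):
    """Hypothetical model: buffer blocks until the *estimated* size reaches the target.

    Returns the number of blocks buffered before a flush, or None if the
    *actual* footprint exceeds memory_limit_bytes first (a stand-in for an OOM).
    """
    estimated = actual = blocks = 0
    while estimated < TARGET_BYTES:
        estimated += estimated_block_bytes
        actual += actual_block_bytes
        blocks += 1
        if actual > memory_limit_bytes:
            return None
    return blocks


MIB = 1024 * 1024
GIB = 1024 * MIB

# Accurate estimate: a single 128 MiB block already triggers the flush.
print(count_buffered_blocks(128 * MIB, 128 * MIB, 4 * GIB))

# Undercounted estimate (illustrative 16 KiB per 128 MiB block): the buffer
# passes the assumed 4 GiB limit long before the estimate reaches 128 MiB.
print(count_buffered_blocks(128 * MIB, 16 * 1024, 4 * GIB))
```

In this model the threshold is only as reliable as the size estimate feeding it, which is the failure mode this issue describes.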

See the related #44577 for more information.

Versions / Dependencies

2.32

Reproduction script

This script doesn't work: it eventually heap OOMs.

import numpy as np

import ray

ray.init(num_cpus=1)


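def yield_blocks(batch):
    # Endlessly yield 128 MiB NumPy batches (128 rows of 1 MiB each).
    while True:
        yield {"data": np.zeros((128, 1024, 1024), dtype=np.uint8)}


def convert_to_dataframe(batch):
    # No-op: batch_format="pandas" below makes Ray Data pass this function a DataFrame.
    return batch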


ds = (
    ray.data.range(1, override_num_blocks=1)
    .map_batches(yield_blocks, batch_size=None)
    .map_batches(convert_to_dataframe, batch_size=None, batch_format="pandas")
)
for _ in ds.iter_batches(batch_size=None):  # This should eventually heap OOM
    pass

This script does work. For some reason, pandas' memory estimate is correct in this case but not in the previous one.

import numpy as np
import pandas as pd

import ray

ray.init(num_cpus=1)


def yield_blocks(block):
    while True:
        yield pd.DataFrame({"data": 128 * [np.zeros((1024, 1024), dtype=np.uint8)]})


ds = ray.data.range(1, override_num_blocks=1).map_batches(yield_blocks, batch_size=None)
for _ in ds.iter_batches(batch_size=None):
    pass
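One plausible explanation for the difference between the two scripts, not confirmed in this thread, is how deep memory accounting treats NumPy views. For object columns, `memory_usage(deep=True)` asks each element for its own size, and an `ndarray` includes its data buffer in that size only when it owns the buffer. The sketch below assumes rows sliced out of one large array are stored as views; `object_series` is a hypothetical helper, and the arrays are kept small so the example runs quickly.

```python
import numpy as np
import pandas as pd


def object_series(arrays):
    # Hypothetical helper: build a 1-D object Series explicitly so that each
    # element is exactly the ndarray object that was passed in.
    column = np.empty(len(arrays), dtype=object)
    for i, array in enumerate(arrays):
        column[i] = array
    return pd.Series(column)


big = np.zeros((4, 256, 256), dtype=np.uint8)  # 256 KiB held in one buffer

# Rows that are views into `big` (they do not own their data).
views = object_series([big[i] for i in range(len(big))])
# Rows that are independent copies (each owns its data).
copies = object_series([big[i].copy() for i in range(len(big))])

views_estimate = views.memory_usage(index=False, deep=True)
copies_estimate = copies.memory_usage(index=False, deep=True)

# The estimate for the views covers little more than the array headers,
# while the estimate for the copies includes the full data buffers.
print(big.nbytes, views_estimate, copies_estimate)
```

If this is what happens in the first script, the per-row estimate would be roughly the size of an array header, far below the 1 MiB each row actually references.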

Issue Severity

Medium: It is a significant difficulty but I can work around it.

## Why are these changes needed?

Closes ray-project#46785.

Currently, the memory usage estimate for pandas is not accurate for object-dtype columns, so this change computes it recursively to handle nested values.

## Related issue number

Closes ray-project#46785, closes ray-project#48506

## Checks

- [√] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [√] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [√] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
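As a rough illustration of what computing the size "recursively" could look like, here is a minimal sketch. It is not the implementation from the linked PR: `estimate_nbytes` and `estimate_dataframe_nbytes` are hypothetical names, shared objects are counted once per reference, and circular references are not handled.

```python
import sys

import numpy as np
import pandas as pd


def estimate_nbytes(value):
    """Hypothetical helper: recursively estimate the bytes a value refers to."""
    if isinstance(value, np.ndarray):
        if value.dtype == object:
            # Object arrays hold references, so descend into every element.
            return sys.getsizeof(value) + sum(estimate_nbytes(v) for v in value.flat)
        # Count the data the array refers to, whether or not it owns the buffer.
        return value.nbytes
    if isinstance(value, dict):
        return sys.getsizeof(value) + sum(
            estimate_nbytes(k) + estimate_nbytes(v) for k, v in value.items()
        )
    if isinstance(value, (list, tuple, set, frozenset)):
        return sys.getsizeof(value) + sum(estimate_nbytes(v) for v in value)
    return sys.getsizeof(value)


def estimate_dataframe_nbytes(df):
    """Hypothetical helper: sum per-column estimates, recursing into object columns."""
    total = 0
    for _, column in df.items():
        if pd.api.types.is_object_dtype(column.dtype):
            total += sum(estimate_nbytes(v) for v in column)
        else:
            total += column.memory_usage(index=False, deep=True)
    return total


# Usage: an object column whose rows are views into one larger array.
big = np.zeros((4, 256, 256), dtype=np.uint8)
rows = np.empty(len(big), dtype=object)
for i in range(len(big)):
    rows[i] = big[i]
df = pd.DataFrame({"data": rows, "id": np.arange(len(big))})

print(estimate_dataframe_nbytes(df), big.nbytes)
```

Counting `nbytes` for views trades possible double counting of shared buffers for never missing data that a block keeps alive.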


bveeramani added the bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order), and data (Ray Data-related issues) labels Jul 25, 2024

alexeykudinkin assigned c21 Aug 1, 2024

Bye-legumes mentioned this issue Aug 2, 2024

[Data] Fix pandas memory calculation. #46939

Merged

5 tasks

c21 assigned bveeramani and unassigned c21 Aug 15, 2024

bveeramani added the P2 (Important issue, but not time-critical) label and removed the P0 (Issues that should be fixed in short order) label Aug 26, 2024

richardliaw closed this as completed in #46939 Nov 21, 2024

richardliaw closed this as completed in 8a0f810 Nov 21, 2024

This was referenced Nov 27, 2024

[Data] Reimplement of fix memory pandas #48968

Closed

[Data] Reimplement of fix memory pandas #48970

Merged
