
[data] Data doesn't account for object store memory from pandas batch formats #48506

Closed
richardliaw opened this issue Nov 2, 2024 · 2 comments · Fixed by #46939
Labels
data Ray Data-related issues

Comments

@richardliaw
Contributor

While it's true that this is no longer an issue if the blocks are Arrow tables, you'll still run into it if the blocks are pandas DataFrames. This can happen if you use the "pandas" batch format, or if you use APIs like drop_columns that use the "pandas" batch format under the hood.

Here's a simple repro:

import ray


def generate_data(batch):
    for _ in range(8):
        yield {"data": [[b"\x00" * 128 * 1024 * 1024]]}


ds = (
    ray.data.range(1, override_num_blocks=1)
    .map_batches(generate_data, batch_size=1)
    .map_batches(lambda batch: batch, batch_format=...)  # "pandas" or "pyarrow"
)

for bundle in ds.iter_internal_ref_bundles():
    print(f"num_rows={bundle.num_rows()} size_bytes={bundle.size_bytes()}")

Output with pandas:

num_rows=8 size_bytes=192

Output with PyArrow:

num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748

Originally posted by @bveeramani in #44577 (comment)

@richardliaw
Copy link
Contributor Author

@raulchen did some debugging and identified some odd behavior in pandas:

import pandas as pd
import numpy as np
import pickle

df = pd.DataFrame({
    "data": [np.random.randint(size=1024, low=0, high=100, dtype=np.int8) for _ in range(1_000_000)]
})

print(df["data"].size, df["data"].dtype, df.memory_usage(index=True, deep=True).sum())
# 1000000 object 1144000132
df2 = pickle.loads(pickle.dumps(df))
print(df2["data"].size, df2["data"].dtype, df2.memory_usage(index=True, deep=True).sum())
# 1000000 object 120000132

Posted this on StackOverflow as well. https://stackoverflow.com/questions/79149716/pandas-memory-usage-inconsistent-for-in-line-numpy
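A quick way to localize the discrepancy (a minimal check, assuming the same object-dtype setup as above) is to size the column's elements directly. For object-dtype columns, `memory_usage(deep=True)` effectively sums each element's `__sizeof__()`, so per-element `sys.getsizeof` shows where the bytes go:

```python
import pickle
import sys

import numpy as np
import pandas as pd

df = pd.DataFrame({"data": [np.zeros(1024, dtype=np.int8) for _ in range(4)]})
df2 = pickle.loads(pickle.dumps(df))

# Before the round trip, each element's reported size includes its 1 KiB buffer;
# after, the arrays no longer own their data and only the header is counted.
before = [sys.getsizeof(a) for a in df["data"]]
after = [sys.getsizeof(a) for a in df2["data"]]
print(before, after)
```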

@richardliaw
Copy link
Contributor Author

The only known clue right now is that the "OWNDATA" flag for the numpy array is different.

Before pickle:
- dtype: int8
- flags:   C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

- strides: (1,)
- data pointer: 54063872

After pickle:
- dtype: int8
- flags:   C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

- strides: (1,)
- data pointer: 1022687168
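That flag difference plausibly explains the numbers: NumPy's `ndarray.__sizeof__` appears to count the data buffer only when the array owns it, and for object columns pandas' `memory_usage(deep=True)` sums per-element sizes. A minimal sketch of the effect with plain NumPy, no pandas involved:

```python
import pickle
import sys

import numpy as np

a = np.zeros(1024, dtype=np.int8)
b = pickle.loads(pickle.dumps(a))  # round trip reconstructs a non-owning array

print(a.flags["OWNDATA"], b.flags["OWNDATA"])  # True False
print(a.nbytes, b.nbytes)  # both 1024: no data was lost
# getsizeof includes the 1024-byte buffer for `a` but only the header for `b`.
print(sys.getsizeof(a), sys.getsizeof(b))
```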

@jcotant1 jcotant1 added the data Ray Data-related issues label Nov 2, 2024
MortalHappiness pushed a commit to MortalHappiness/ray that referenced this issue Nov 22, 2024
## Why are these changes needed?

Closes ray-project#46785.

Currently, the memory usage reported for pandas blocks is inaccurate for object-dtype columns, so we now compute it recursively to handle nested objects.
## Related issue number

closes ray-project#46785, closes
ray-project#48506

## Checks

- [√] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [√] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [√] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
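The recursive accounting the PR describes could be sketched roughly as follows (a hypothetical helper, not Ray's actual implementation): size NumPy arrays by `nbytes`, which is correct whether or not the array owns its buffer, and recurse into object-dtype containers:

```python
import sys

import numpy as np
import pandas as pd


def deep_size(obj) -> int:
    """Estimate the in-memory size of obj, recursing into nested containers."""
    if isinstance(obj, np.ndarray):
        if obj.dtype != object:
            # nbytes counts the buffer regardless of the OWNDATA flag.
            return obj.nbytes
        # Object arrays: pointer storage plus the referenced objects.
        return obj.itemsize * obj.size + sum(deep_size(x) for x in obj.flat)
    if isinstance(obj, (list, tuple)):
        return sys.getsizeof(obj) + sum(deep_size(x) for x in obj)
    return sys.getsizeof(obj)


def dataframe_size(df: pd.DataFrame) -> int:
    """Sum column sizes, handling object-dtype columns recursively."""
    return sum(deep_size(df[col].to_numpy()) for col in df.columns)
```

With an accounting like this, a pickle round trip no longer changes the reported size, because `nbytes` doesn't depend on buffer ownership.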
jecsand838 pushed a commit to jecsand838/ray that referenced this issue Dec 4, 2024
dentiny pushed a commit to dentiny/ray that referenced this issue Dec 7, 2024