
[data] Data doesn't account for object store memory from pandas batch formats #48506

Closed
richardliaw opened this issue Nov 2, 2024 · 2 comments · Fixed by #46939
Labels
data Ray Data-related issues

Comments

@richardliaw
Contributor

While it's true that this is no longer an issue if the blocks are Arrow tables, you'll still run into it if the blocks are pandas DataFrames. This can happen if you use the "pandas" batch format, or if you use APIs like drop_columns that use the "pandas" batch format under the hood.

Here's a simple repro:

import ray


def generate_data(batch):
    for _ in range(8):
        yield {"data": [[b"\x00" * 128 * 1024 * 1024]]}


ds = (
    ray.data.range(1, override_num_blocks=1)
    .map_batches(generate_data, batch_size=1)
    .map_batches(lambda batch: batch, batch_format=...)  # "pandas" or "pyarrow"
)

for bundle in ds.iter_internal_ref_bundles():
    print(f"num_rows={bundle.num_rows()} size_bytes={bundle.size_bytes()}")

Output with pandas:

num_rows=8 size_bytes=192

Output with PyArrow:

num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748

Originally posted by @bveeramani in #44577 (comment)

@richardliaw
Copy link
Contributor Author

@raulchen did some debugging and identified some odd behavior in pandas:

import pandas as pd
import numpy as np
import pickle

df = pd.DataFrame({
    "data": [np.random.randint(size=1024, low=0, high=100, dtype=np.int8) for _ in range(1_000_000)]
})

print(df["data"].size, df["data"].dtype, df.memory_usage(index=True, deep=True).sum())
# 1000000 object 1144000132
df2 = pickle.loads(pickle.dumps(df))
print(df2["data"].size, df2["data"].dtype, df2.memory_usage(index=True, deep=True).sum())
# 1000000 object 120000132

Posted this on StackOverflow as well. https://stackoverflow.com/questions/79149716/pandas-memory-usage-inconsistent-for-in-line-numpy
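A quick way to localize the discrepancy (a minimal check, assuming the same object-dtype setup as above) is to size the column's elements directly. For object-dtype columns, `memory_usage(deep=True)` effectively sums each element's `__sizeof__()`, so per-element `sys.getsizeof` shows where the bytes go:

```python
import pickle
import sys

import numpy as np
import pandas as pd

df = pd.DataFrame({"data": [np.zeros(1024, dtype=np.int8) for _ in range(4)]})
df2 = pickle.loads(pickle.dumps(df))

# Before the round trip, each element's reported size includes its 1 KiB buffer;
# after, the arrays no longer own their data and only the header is counted.
before = [sys.getsizeof(a) for a in df["data"]]
after = [sys.getsizeof(a) for a in df2["data"]]
print(before, after)
```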

@richardliaw
Copy link
Contributor Author

The only known clue right now is that the "OWNDATA" flag for the numpy array is different.

Before pickle:
- dtype: int8
- flags:   C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

- strides: (1,)
- data pointer: 54063872

After pickle:
- dtype: int8
- flags:   C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

- strides: (1,)
- data pointer: 1022687168
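That flag difference plausibly explains the numbers: NumPy's `ndarray.__sizeof__` appears to count the data buffer only when the array owns it, and for object columns pandas' `memory_usage(deep=True)` sums per-element sizes. A minimal sketch of the effect with plain NumPy, no pandas involved:

```python
import pickle
import sys

import numpy as np

a = np.zeros(1024, dtype=np.int8)
b = pickle.loads(pickle.dumps(a))  # round trip reconstructs a non-owning array

print(a.flags["OWNDATA"], b.flags["OWNDATA"])  # True False
print(a.nbytes, b.nbytes)  # both 1024: no data was lost
# getsizeof includes the 1024-byte buffer for `a` but only the header for `b`.
print(sys.getsizeof(a), sys.getsizeof(b))
```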

@jcotant1 jcotant1 added the data Ray Data-related issues label Nov 2, 2024
MortalHappiness pushed a commit to MortalHappiness/ray that referenced this issue Nov 22, 2024
## Why are these changes needed?

Closes ray-project#46785.

Currently, the memory usage reported for pandas blocks is inaccurate for object-dtype columns, so we now compute it recursively to handle nested objects.
## Related issue number

closes ray-project#46785, closes
ray-project#48506

## Checks

- [√] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [√] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [√] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
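The recursive accounting the PR describes could be sketched roughly as follows (a hypothetical helper, not Ray's actual implementation): size NumPy arrays by `nbytes`, which is correct whether or not the array owns its buffer, and recurse into object-dtype containers:

```python
import sys

import numpy as np
import pandas as pd


def deep_size(obj) -> int:
    """Estimate the in-memory size of obj, recursing into nested containers."""
    if isinstance(obj, np.ndarray):
        if obj.dtype != object:
            # nbytes counts the buffer regardless of the OWNDATA flag.
            return obj.nbytes
        # Object arrays: pointer storage plus the referenced objects.
        return obj.itemsize * obj.size + sum(deep_size(x) for x in obj.flat)
    if isinstance(obj, (list, tuple)):
        return sys.getsizeof(obj) + sum(deep_size(x) for x in obj)
    return sys.getsizeof(obj)


def dataframe_size(df: pd.DataFrame) -> int:
    """Sum column sizes, handling object-dtype columns recursively."""
    return sum(deep_size(df[col].to_numpy()) for col in df.columns)
```

With an accounting like this, a pickle round trip no longer changes the reported size, because `nbytes` doesn't depend on buffer ownership.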
jecsand838 pushed a commit to jecsand838/ray that referenced this issue Dec 4, 2024
dentiny pushed a commit to dentiny/ray that referenced this issue Dec 7, 2024