[Data] Fix pandas memory calculation. #46939

Bye-legumes · 2024-08-02T20:22:53Z

Why are these changes needed?

Closes #46785.
Currently, the memory usage reported for pandas blocks is inaccurate for object-dtype columns, so this PR calculates it recursively to handle nested objects.
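
A minimal sketch of the recursive idea, using only the standard library. This is not the PR's actual implementation: `deep_sizeof` is a hypothetical helper name, and it only descends into built-in containers (it does not handle tensors or extension arrays).

```python
import sys


def deep_sizeof(obj, _seen=None):
    """Recursively estimate the in-memory size of a possibly nested object."""
    if _seen is None:
        _seen = set()
    # Count each distinct object once so shared references are not double-counted.
    if id(obj) in _seen:
        return 0
    _seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(
            deep_sizeof(k, _seen) + deep_sizeof(v, _seen) for k, v in obj.items()
        )
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, _seen) for item in obj)
    return size


payload = b"\x00" * (128 * 1024)
nested = [[payload]]

shallow = sys.getsizeof(nested)  # only the outer list object
deep = deep_sizeof(nested)       # outer list + inner list + bytes payload
```

The shallow size ignores the payload entirely, which is the kind of undercount an object column holding nested lists can produce; the recursive version includes it.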

Related issue number

closes #46785, closes #48506

Checks

- [√] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [√] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [√] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

Bye-legumes · 2024-08-02T20:25:22Z

@bveeramani @c21 could you review this PR? The main idea is to calculate the size recursively. I ran some tests, and for the example in #46785 each calculation takes about 20 ms, which I think is acceptable?

Bye-legumes · 2024-08-02T20:29:35Z

Note that it may not yet support tensors or other object types.

Bye-legumes · 2024-11-07T21:47:29Z

Hi @bveeramani, can you check this?

richardliaw · 2024-11-13T19:37:21Z

Hi @Bye-legumes, I can take over the review here.

I implemented a couple of tests, and it seems like this one is failing:

___________________________ test_size_bytes_bytes_object ____________________________

ray_start_regular_shared = RayContext(dashboard_url='127.0.0.1:8265', python_version='3.10.15', ray_version='3.0.0.dev0', ray_commit='9e5c5bfba9d79d1d134719cf202c7150407acbc6')

    def test_size_bytes_bytes_object(ray_start_regular_shared):
        def generate_data(batch):
            for _ in range(8):
                yield {"data": [[b"\x00" * 128 * 1024 * 128]]}
    
        ds = (
            ray.data.range(1, override_num_blocks=1)
            .map_batches(generate_data, batch_size=1)
            .map_batches(lambda batch: batch, batch_format="pandas")
        )
    
        true_value = 128 * 1024 * 128 * 8
        for bundle in ds.iter_internal_ref_bundles():
            size = bundle.size_bytes()
            # assert that true_value is within 10% of bundle.size_bytes()
>           assert true_value * 0.9 <= size <= true_value * 1.1, (true_value, size)
E           AssertionError: (134217728, 192)
E           assert (134217728 * 0.9) <= 192

richardliaw · 2024-11-13T19:38:31Z

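A couple of tests I wrote:

import pickle
import random

import pandas as pd

import ray
from ray.data._internal.pandas_block import PandasBlockAccessor
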
def test_size_bytes_small(ray_start_regular_shared):
    animals = ["Flamingo", "Centipede"]
    block = pd.DataFrame({"animals": animals})
    block["animals"] = block["animals"].astype("string")

    block_accessor = PandasBlockAccessor.for_block(block)
    bytes_size = block_accessor.size_bytes()

    # generally strings are hard, so let's use what Pandas gives us.
    # get memory usage from pandas
    memory_usage = block.memory_usage(index=True, deep=True).sum()
    # check that memory usage is within 10% of the size_bytes
    assert memory_usage * 0.9 <= bytes_size <= memory_usage * 1.1, (
        bytes_size,
        memory_usage,
    )


def test_size_bytes_large_str(ray_start_regular_shared):
    animals = [
        random.choice(["alligator", "crocodile", "centipede", "flamingo"])
        for i in range(100_000)
    ]
    block = pd.DataFrame({"animals": animals})
    block["animals"] = block["animals"].astype("string")

    block_accessor = PandasBlockAccessor.for_block(block)
    bytes_size = block_accessor.size_bytes()

    # String disk usage is wildly different from in-process memory usage
    memory_usage = block.memory_usage(index=True, deep=True).sum()
    # check that memory usage is within 10% of the size_bytes
    assert memory_usage * 0.9 <= bytes_size <= memory_usage * 1.1, (
        bytes_size,
        memory_usage,
    )


def test_size_bytes_large_floats(ray_start_regular_shared):
    animals = [random.random() for i in range(100_000)]
    block = pd.DataFrame({"animals": animals})

    block_accessor = PandasBlockAccessor.for_block(block)
    bytes_size = block_accessor.size_bytes()

    memory_usage = pickle.dumps(block).__sizeof__()
    # check that memory usage is within 10% of the size_bytes
    assert memory_usage * 0.9 <= bytes_size <= memory_usage * 1.1, (
        bytes_size,
        memory_usage,
    )


def test_size_bytes_bytes_object(ray_start_regular_shared):
    def generate_data(batch):
        for _ in range(8):
            yield {"data": [[b"\x00" * 128 * 1024 * 128]]}

    ds = (
        ray.data.range(1, override_num_blocks=1)
        .map_batches(generate_data, batch_size=1)
        .map_batches(lambda batch: batch, batch_format="pandas")
    )

    true_value = 128 * 1024 * 128 * 8
    for bundle in ds.iter_internal_ref_bundles():
        size = bundle.size_bytes()
        # assert that true_value is within 10% of bundle.size_bytes()
        assert true_value * 0.9 <= size <= true_value * 1.1, (true_value, size)


def test_size_bytes_unowned_numpy(ray_start_regular_shared):
    import numpy as np

    df = pd.DataFrame(
        {
            "data": [
                np.random.randint(size=1024, low=0, high=100, dtype=np.int8)
                for _ in range(1_000)
            ],
        }
    )

    block_accessor = PandasBlockAccessor.for_block(df)
    block_size = block_accessor.size_bytes()
    true_value = 1024 * 1000
    assert true_value * 0.9 <= block_size <= true_value * 1.1
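
The tests above all repeat the same "within 10%" bound. As a sketch, that check could be written once in a small helper; `assert_within_tolerance` is a hypothetical name for illustration and is not part of Ray or of this PR.

```python
def assert_within_tolerance(actual, expected, rel_tol=0.1):
    """Assert that `actual` lies within +/- rel_tol of `expected`."""
    lower = expected * (1 - rel_tol)
    upper = expected * (1 + rel_tol)
    assert lower <= actual <= upper, (actual, expected)


# The failing case reported earlier: expected 134217728 bytes, measured 192.
try:
    assert_within_tolerance(192, 134217728)
    failed = False
except AssertionError:
    failed = True
```

With this helper, each test body would reduce to one call such as `assert_within_tolerance(bytes_size, memory_usage)`.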

Bye-legumes · 2024-11-14T21:48:43Z

Thanks so much! I think it's OK this time.

Bye-legumes · 2024-11-18T16:29:43Z

I updated the tests for readability and also made it a bit more extensive -- I found some areas that broke, could you take a look?

Hi, I just fixed the issues you mentioned. Thanks!

richardliaw

This looks great to me! Will ping others once tests pass.

Bye-legumes · 2024-11-20T16:26:24Z

Hi! I think it's OK now!

raulchen · 2024-11-21T10:35:03Z

python/ray/data/_internal/pandas_block.py

+        object_need_check = ["object", "python_object()"]
+        # Handle object columns separately
+        for column in self._table.columns:
+            if str(self._table[column].dtype) in object_need_check:


Nit: can we compare the dtype directly, without casting it to a string?

Just fixed now!
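
One way to do what the review asks, as a sketch rather than the change that was actually committed: pandas provides `pandas.api.types.is_object_dtype`, which inspects the dtype itself. The DataFrame and column names below are made up for illustration, and this does not cover Ray's own extension dtypes (such as the "python_object()" case in the diff above).

```python
import pandas as pd
from pandas.api.types import is_object_dtype

# Hypothetical example data: one numeric column, one column of nested lists.
df = pd.DataFrame({"ints": [1, 2, 3], "nested": [[b"a"], [b"bc"], [b"def"]]})

# Select object-dtype columns by checking the dtype, not its string form.
object_columns = [col for col in df.columns if is_object_dtype(df[col].dtype)]
```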

python/ray/data/tests/test_pandas_block.py

richardliaw · 2024-11-21T19:09:36Z

Some tests failing?

Bye-legumes · 2024-11-21T19:31:43Z

Yeah, I just fixed it: some of the inner blocks are now TensorArrayElement or TensorDtype. I made some changes and can now use a type check as well. Previously the dtype was just "python_object()" for extension arrays, but in the last two days the type changed to nd.array(object).

Bye-legumes · 2024-11-21T20:27:10Z

All the tests pass now.

richardliaw · 2024-11-21T21:42:04Z

Awesome, thank you so much @Bye-legumes !!!!

This reverts commit 8a0f810.

Reverts #46939 for #48865 #48864 #48863 #48862

aslonnie · 2024-11-22T10:36:21Z

This PR has been reverted; many data release tests are broken.

fix

9ce4500

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

Bye-legumes requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners August 2, 2024 20:22

Bye-legumes and others added 4 commits August 6, 2024 10:32

fix

898feec

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

Merge branch 'master' into fix_memory_pandas

94f518b

fix

99e2c64

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

e40e843

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

anyscalesam added the triage (Needs triage, e.g. priority, bug/not-bug, and owning component) and data (Ray Data-related issues) labels Aug 12, 2024

anyscalesam added the P0 (Issues that should be fixed in short order) label and removed the triage label Aug 26, 2024

Merge branch 'master' into fix_memory_pandas

6383510

Bye-legumes requested review from alexeykudinkin and srinathk10 as code owners November 7, 2024 21:47

Bye-legumes and others added 2 commits November 14, 2024 16:38

Merge branch 'master' into fix_memory_pandas

d760e6d

fix

16906a7

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

77ed9c8

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

Merge branch 'master' into fix_memory_pandas

6670d3b

update-tests

0fe309c

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw added the go (add ONLY when ready to merge, run all tests) label Nov 19, 2024

richardliaw approved these changes Nov 19, 2024

raulchen approved these changes Nov 21, 2024

Bye-legumes and others added 5 commits November 21, 2024 11:01

fix

0c3ea7e

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

318bfd5

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

1038e6a

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

Merge branch 'master' into fix_memory_pandas

0603948

fix

52bc996

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

fix

2d54084

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>

richardliaw merged commit 8a0f810 into ray-project:master Nov 21, 2024
5 checks passed

aslonnie added a commit that referenced this pull request Nov 22, 2024

Revert "[Data] Fix pandas memory calculation. (#46939)"

24ffa34

This reverts commit 8a0f810.

aslonnie mentioned this pull request Nov 22, 2024

Revert "[Data] Fix pandas memory calculation." #48866

Merged

aslonnie added a commit that referenced this pull request Nov 22, 2024

Revert "[Data] Fix pandas memory calculation." (#48866)

64454cc

Reverts #46939 for #48865 #48864 #48863 #48862

MortalHappiness pushed a commit to MortalHappiness/ray that referenced this pull request Nov 22, 2024

Revert "[Data] Fix pandas memory calculation." (ray-project#48866)

244ff9a

Reverts ray-project#46939 for ray-project#48865 ray-project#48864 ray-project#48863 ray-project#48862

This was referenced Nov 27, 2024

[Data] Reimplement of fix memory pandas #48968

Closed

[Data] Reimplement of fix memory pandas #48970

Merged
