[Data] Fix pandas memory calculation. #46939

Merged
merged 21 commits into from
Nov 21, 2024

Conversation

Bye-legumes
Contributor

@Bye-legumes Bye-legumes commented Aug 2, 2024

Why are these changes needed?

close #46785
Currently, the memory usage reported for pandas blocks is inaccurate for object-dtype columns, so this PR computes it recursively to handle nested objects.
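
A minimal sketch of the recursive idea, assuming plain Python containers and buffer-like objects that expose an integer `nbytes` attribute (as NumPy arrays do). `deep_size_bytes` is a hypothetical helper written for illustration, not the function this PR adds to Ray:

```python
import sys


def deep_size_bytes(obj, _seen=None):
    """Roughly estimate the bytes held by obj, following nested containers.

    Hypothetical helper for illustration only; not Ray's implementation.
    """
    if _seen is None:
        _seen = set()
    # Count each distinct object once; this also guards against reference cycles.
    if id(obj) in _seen:
        return 0
    _seen.add(id(obj))

    # Assumption: buffer-like objects (e.g. NumPy arrays, memoryview) report
    # their payload size through an integer `nbytes` attribute.
    nbytes = getattr(obj, "nbytes", None)
    if isinstance(nbytes, int):
        return nbytes

    # sys.getsizeof covers only the object itself, so recurse into containers.
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(
            deep_size_bytes(k, _seen) + deep_size_bytes(v, _seen)
            for k, v in obj.items()
        )
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size_bytes(item, _seen) for item in obj)
    return size
```

Applied to every element of an object column, a helper like this would count a nested payload such as `[[b"..."]]` instead of only the outer list object.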

Related issue number

closes #46785, closes #48506

Checks

  • [√] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [√] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [√] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Contributor Author

@bveeramani @c21 could you review this PR? The main idea is to compute the size recursively. I ran some tests, and for the example in #46785 each calculation takes about 20 ms. Is that acceptable?

@Bye-legumes
Contributor Author

Note, however, that it may not yet support tensors or other object types.

Bye-legumes and others added 4 commits August 6, 2024 10:32
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@anyscalesam anyscalesam added triage Needs triage (eg: priority, bug/not-bug, and owning component) data Ray Data-related issues labels Aug 12, 2024
@anyscalesam anyscalesam added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 26, 2024
@Bye-legumes
Contributor Author

hi, @bveeramani can you check this?

@richardliaw
Contributor

Hi @Bye-legumes , I can take over the review here.

I implemented a couple of tests, and one of them is failing:

___________________________ test_size_bytes_bytes_object ____________________________

ray_start_regular_shared = RayContext(dashboard_url='127.0.0.1:8265', python_version='3.10.15', ray_version='3.0.0.dev0', ray_commit='9e5c5bfba9d79d1d134719cf202c7150407acbc6')

    def test_size_bytes_bytes_object(ray_start_regular_shared):
        def generate_data(batch):
            for _ in range(8):
                yield {"data": [[b"\x00" * 128 * 1024 * 128]]}
    
        ds = (
            ray.data.range(1, override_num_blocks=1)
            .map_batches(generate_data, batch_size=1)
            .map_batches(lambda batch: batch, batch_format="pandas")
        )
    
        true_value = 128 * 1024 * 128 * 8
        for bundle in ds.iter_internal_ref_bundles():
            size = bundle.size_bytes()
            # assert that true_value is within 10% of bundle.size_bytes()
>           assert true_value * 0.9 <= size <= true_value * 1.1, (true_value, size)
E           AssertionError: (134217728, 192)
E           assert (134217728 * 0.9) <= 192
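
The numbers in this assertion can be checked by hand. The sketch below uses plain Python (no Ray needed) to recompute the expected size and the 10% tolerance window; `reported` is simply the value copied from the traceback:

```python
# Each of the 8 generated rows carries one bytes payload of 128 * 1024 * 128 bytes (16 MiB).
payload_bytes = 128 * 1024 * 128
true_value = payload_bytes * 8  # 134217728 bytes, i.e. 128 MiB

# Value that bundle.size_bytes() returned in the traceback.
reported = 192

# The test accepts any size within 10% of the true payload size.
lower, upper = true_value * 0.9, true_value * 1.1
within_tolerance = lower <= reported <= upper

print(true_value, within_tolerance)
```

A reported size of 192 bytes against 128 MiB of payload suggests that only container overhead was counted, not the bytes inside the nested lists.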

@richardliaw
Contributor

A couple of tests I wrote:


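import pickle
import random

import pandas as pd

import ray
from ray.data._internal.pandas_block import PandasBlockAccessor
from ray.tests.conftest import *  # noqa: F401,F403 (provides ray_start_regular_shared)


def test_size_bytes_small(ray_start_regular_shared):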
    animals = ["Flamingo", "Centipede"]
    block = pd.DataFrame({"animals": animals})
    block["animals"] = block["animals"].astype("string")

    block_accessor = PandasBlockAccessor.for_block(block)
    bytes_size = block_accessor.size_bytes()

    # generally strings are hard, so let's use what Pandas gives us.
    # get memory usage from pandas
    memory_usage = block.memory_usage(index=True, deep=True).sum()
    # check that memory usage is within 10% of the size_bytes
    assert memory_usage * 0.9 <= bytes_size <= memory_usage * 1.1, (
        bytes_size,
        memory_usage,
    )


def test_size_bytes_large_str(ray_start_regular_shared):
    animals = [
        random.choice(["alligator", "crocodile", "centipede", "flamingo"])
        for i in range(100_000)
    ]
    block = pd.DataFrame({"animals": animals})
    block["animals"] = block["animals"].astype("string")

    block_accessor = PandasBlockAccessor.for_block(block)
    bytes_size = block_accessor.size_bytes()

    # String disk usage is wildly different from in-process memory usage
    memory_usage = block.memory_usage(index=True, deep=True).sum()
    # check that memory usage is within 10% of the size_bytes
    assert memory_usage * 0.9 <= bytes_size <= memory_usage * 1.1, (
        bytes_size,
        memory_usage,
    )


def test_size_bytes_large_floats(ray_start_regular_shared):
    # Floats, not animal names, so name the column accordingly.
    values = [random.random() for _ in range(100_000)]
    block = pd.DataFrame({"values": values})

    block_accessor = PandasBlockAccessor.for_block(block)
    bytes_size = block_accessor.size_bytes()

    memory_usage = pickle.dumps(block).__sizeof__()
    # check that memory usage is within 10% of the size_bytes
    assert memory_usage * 0.9 <= bytes_size <= memory_usage * 1.1, (
        bytes_size,
        memory_usage,
    )


def test_size_bytes_bytes_object(ray_start_regular_shared):
    def generate_data(batch):
        for _ in range(8):
            yield {"data": [[b"\x00" * 128 * 1024 * 128]]}

    ds = (
        ray.data.range(1, override_num_blocks=1)
        .map_batches(generate_data, batch_size=1)
        .map_batches(lambda batch: batch, batch_format="pandas")
    )

    true_value = 128 * 1024 * 128 * 8
    for bundle in ds.iter_internal_ref_bundles():
        size = bundle.size_bytes()
        # assert that true_value is within 10% of bundle.size_bytes()
        assert true_value * 0.9 <= size <= true_value * 1.1, (true_value, size)


def test_size_bytes_unowned_numpy(ray_start_regular_shared):
    import numpy as np

    df = pd.DataFrame(
        {
            "data": [
                np.random.randint(size=1024, low=0, high=100, dtype=np.int8)
                for _ in range(1_000)
            ],
        }
    )

    block_accessor = PandasBlockAccessor.for_block(df)
    block_size = block_accessor.size_bytes()
    true_value = 1024 * 1000
    assert true_value * 0.9 <= block_size <= true_value * 1.1
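
The tests above use `memory_usage(deep=True)` as the reference for string columns but a hand-computed byte count for the nested bytes case. A small sketch of why that distinction matters, assuming (as far as I understand pandas' behavior) that `deep=True` sizes each element of an object column on its own and does not follow references into nested containers:

```python
import pandas as pd

# Eight rows, each a list wrapping one 1 MiB bytes payload (nested, as in the bytes test).
payload = 1024 * 1024
nested = pd.Series([[bytes(payload)] for _ in range(8)], dtype=object)

# With deep=True, only the small list wrappers are measured, not the bytes inside them.
deep_usage = int(nested.memory_usage(index=False, deep=True))
true_value = payload * 8

print(deep_usage < payload, true_value)
```

Under that assumption, pandas' own deep accounting stays far below the real payload size for nested data, so it cannot serve as the ground truth in `test_size_bytes_bytes_object`.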

Bye-legumes and others added 2 commits November 14, 2024 16:38
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Contributor Author

Hi @Bye-legumes , I can take over the review here.

Thanks so much! I think this time it's OK now!

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Contributor Author

I updated the tests for readability and also made it a bit more extensive -- I found some areas that broke, could you take a look?

Hi, I just fixed the issues that you mentioned! Thx!

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Nov 19, 2024
Contributor

@richardliaw richardliaw left a comment

This looks great to me! Will ping others once tests pass.

@Bye-legumes
Contributor Author

This looks great to me! Will ping others once tests pass.

Hi! I think it's OK now!

object_need_check = ["object", "python_object()"]
# Handle object columns separately
for column in self._table.columns:
    if str(self._table[column].dtype) in object_need_check:
Contributor

Nit: can we compare the dtype directly instead of casting it to a string?

Contributor Author

@Bye-legumes Bye-legumes Nov 21, 2024

Just fixed now!
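
For illustration, a sketch of what a direct dtype comparison can look like in pandas. The DataFrame and variable names here are made up for the example, and the thread does not show how the PR actually resolved the nit:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one numeric column and one object column holding bytes and a list.
df = pd.DataFrame({"ints": [1, 2], "payload": [b"ab", [b"cd"]]})

# String-based check, in the style of the snippet under review.
by_name = [c for c in df.columns if str(df[c].dtype) == "object"]

# Direct checks: compare against the NumPy object dtype, or use the pandas helper.
by_dtype = [c for c in df.columns if df[c].dtype == np.dtype("object")]
by_helper = [c for c in df.columns if pd.api.types.is_object_dtype(df[c].dtype)]

print(by_name, by_dtype, by_helper)
```

All three lists select the same column here; the direct forms avoid depending on how a dtype happens to print.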

python/ray/data/tests/test_pandas_block.py Outdated Show resolved Hide resolved
Bye-legumes and others added 5 commits November 21, 2024 11:01
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@richardliaw
Contributor

Some tests failing?

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Contributor Author

Bye-legumes commented Nov 21, 2024

Some tests failing?

Yes, I just fixed it: some of the inner blocks are now TensorArrayElement or TensorDtype. I made some changes, and I can now use a type check as well. Previously the dtype was just "python_object()" for extension arrays, but in the last two days it changed to nd.array(object).

@Bye-legumes
Contributor Author

Some tests failing?

All the tests pass now.

@richardliaw richardliaw merged commit 8a0f810 into ray-project:master Nov 21, 2024
5 checks passed
@richardliaw
Contributor

Awesome, thank you so much @Bye-legumes !!!!

@aslonnie
Collaborator

This PR has been reverted; many data release tests are broken.

MortalHappiness pushed a commit to MortalHappiness/ray that referenced this pull request Nov 22, 2024
## Why are these changes needed?

close ray-project#46785
Currently, the memory usage reported for pandas blocks is inaccurate for
object-dtype columns, so this PR computes it recursively to handle nested objects.
## Related issue number

closes ray-project#46785, closes
ray-project#48506

## Checks

- [√] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [√] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [√] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
jecsand838 pushed a commit to jecsand838/ray that referenced this pull request Dec 4, 2024
Signed-off-by: Connor Sanders <connor@elastiflow.com>
jecsand838 pushed a commit to jecsand838/ray that referenced this pull request Dec 4, 2024
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
Signed-off-by: hjiang <dentinyhao@gmail.com>
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
Labels
data Ray Data-related issues go add ONLY when ready to merge, run all tests P0 Issues that should be fixed in short order
5 participants