-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Fix pandas memory calculation. #46939
[Data] Fix pandas memory calculation. #46939
Conversation
@bveeramani @c21 can you check this PR. I think the main idea is to calculate it in recursion.. I just did some test and for the example in #46785, each time the time is 20ms for calculation, I think it's acceptable? |
Although it may not support tensor or other object. |
hi, @bveeramani can you check this? |
Hi @Bye-legumes , I can take over the review here. I actually implemented a couple tests, but seems like this is failing here: ___________________________ test_size_bytes_bytes_object ____________________________
ray_start_regular_shared = RayContext(dashboard_url='127.0.0.1:8265', python_version='3.10.15', ray_version='3.0.0.dev0', ray_commit='9e5c5bfba9d79d1d134719cf202c7150407acbc6')
def test_size_bytes_bytes_object(ray_start_regular_shared):
def generate_data(batch):
for _ in range(8):
yield {"data": [[b"\x00" * 128 * 1024 * 128]]}
ds = (
ray.data.range(1, override_num_blocks=1)
.map_batches(generate_data, batch_size=1)
.map_batches(lambda batch: batch, batch_format="pandas")
)
true_value = 128 * 1024 * 128 * 8
for bundle in ds.iter_internal_ref_bundles():
size = bundle.size_bytes()
# assert that true_value is within 10% of bundle.size_bytes()
> assert true_value * 0.9 <= size <= true_value * 1.1, (true_value, size)
E AssertionError: (134217728, 192)
E assert (134217728 * 0.9) <= 192 |
A couple tests I wrote --
|
Thanks so much! I think this time it's OK now! |
Hi, I just fixed the issues that you mentioned! Thx! |
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me! Will ping others once tests pass.
Hi! I think it's OK now! |
object_need_check = ["object", "python_object()"] | ||
# Handle object columns separately | ||
for column in self._table.columns: | ||
if str(self._table[column].dtype) in object_need_check: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, can we directly compare the dtype without casting it to strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just fixed now!
Some tests failing? |
Yeah, just fix as the some of the inner blocks are TensorArrayElement, TensorDtype now. I just make some changes and also I can use type check now. In the previous it's just "python_object()" for extension array. And recent 2 days the type changed to nd.array(object) |
All the test are OK now. |
Awesome, thank you so much @Bye-legumes !!!! |
This PR is reverted. many data release tests are broken. |
## Why are these changes needed? close ray-project#46785 Current the memory usage for pandas is not accurate when it's object, so we just implement to calculated it in recursion in case of nested. ## Related issue number closes ray-project#46785, closes ray-project#48506 ## Checks - [√] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [√] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [√] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
## Why are these changes needed? close ray-project#46785 Current the memory usage for pandas is not accurate when it's object, so we just implement to calculated it in recursion in case of nested. ## Related issue number closes ray-project#46785, closes ray-project#48506 ## Checks - [√] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [√] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [√] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Connor Sanders <connor@elastiflow.com>
Reverts ray-project#46939 for ray-project#48865 ray-project#48864 ray-project#48863 ray-project#48862 Signed-off-by: Connor Sanders <connor@elastiflow.com>
## Why are these changes needed? close ray-project#46785 Current the memory usage for pandas is not accurate when it's object, so we just implement to calculated it in recursion in case of nested. ## Related issue number closes ray-project#46785, closes ray-project#48506 ## Checks - [√] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [√] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [√] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: hjiang <dentinyhao@gmail.com>
Reverts ray-project#46939 for ray-project#48865 ray-project#48864 ray-project#48863 ray-project#48862 Signed-off-by: hjiang <dentinyhao@gmail.com>
Why are these changes needed?
close #46785
Current the memory usage for pandas is not accurate when it's object, so we just implement to calculated it in recursion in case of nested.
Related issue number
closes #46785, closes #48506
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.