Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Reimplement of fix memory pandas #48970

Merged
merged 12 commits into from
Dec 3, 2024

Conversation

Bye-legumes
Copy link
Contributor

@Bye-legumes Bye-legumes commented Nov 27, 2024

Why are these changes needed?

fix of #46939 with sampling to reduce the overhead.
Key improvement:

  1. Use sampling to reduce the computation overhead
  2. Use pandas sample then use df.values to convert it to numpy.
  3. Use numpy vectorize to speed up the loop speed.
  4. Add check for tensor with number type as the nbytes can be use directly.

Related issue number

#46785

Checks

  • [√] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [√] I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [√] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [√] Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Nov 27, 2024
sampled_indices = np.random.choice(
total_size, sample_size, replace=False
)
sampled_data = sampled_column[sampled_indices]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's see if the tests work -- i remember last time i had to implement def take in PythonObjectArray, but maybe you get around it more easily

@richardliaw
Copy link
Contributor

relevant tests failing it looks like

Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Copy link
Contributor Author

relevant tests failing it looks like

It can pass this time but two tests are retried. In fact these two test has no problem locally so I am not sure it's related to the memory problem. i.e. when run full test there will OOM but retry can succeed. Let me try to make it stable first. But currently I fixed many test by checking numeric tensor or numpy. I think in these case it's safe to use nbytes will object string need future check,

Bye-legumes and others added 6 commits November 28, 2024 15:20
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@jcotant1 jcotant1 added the data Ray Data-related issues label Dec 2, 2024

# TensorDtype for ray.air.util.tensor_extensions.pandas.TensorDtype
object_need_check = (TensorDtype,)
min_sample_size = _PANDAS_SIZE_BYTES_MIN_COUNT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like it should be max_sample_size? as in "max number of items to sample"

python/ray/data/_internal/pandas_block.py Outdated Show resolved Hide resolved
Bye-legumes and others added 2 commits December 2, 2024 21:38
Co-authored-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: zhilong <121425509+Bye-legumes@users.noreply.github.com>
Signed-off-by: zhilong <zhilong.chen@mail.mcgill.ca>
@Bye-legumes
Copy link
Contributor Author

@richardliaw @raulchen I think it OK now.

@richardliaw richardliaw merged commit c3afcc3 into ray-project:master Dec 3, 2024
5 checks passed
jecsand838 pushed a commit to jecsand838/ray that referenced this pull request Dec 4, 2024
Signed-off-by: Connor Sanders <connor@elastiflow.com>
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024
Signed-off-by: hjiang <dentinyhao@gmail.com>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Dec 17, 2024
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants