Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ray Data] PythonObjectArray missing methods causing serialization failures #48737

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

akavalar
Copy link

@akavalar akavalar commented Nov 14, 2024

PythonObjectArray is a subclass of ExtensionArray, but as such it is missing some of the mandatory methods described here, specifically isna, take, copy, and _concat_same_type (the last method interpolate does not seem to be required). Such storage of Python objects in Arrow blocks was added to Ray Data in #45272 and was released as part of Ray 2.33.

Custom Python objects currently cannot be wrapped/serialized due to pandas.errors.AbstractMethodError: This method must be defined in the concrete class PythonObjectArray failures, as shown by the contrived example here: #48748

This PR adds the missing methods.

Why are these changes needed?

See above.

Related issue number

N/A

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: akavalar <akavalar@ucla.edu>
@richardliaw
Copy link
Contributor

This is great! Think the failure is unrelated. Could you add some tests as well?

@richardliaw richardliaw self-assigned this Nov 15, 2024
@xingyu-long
Copy link
Contributor

Hey, I'd like to write some tests for this PR!

@akavalar
Copy link
Author

akavalar commented Nov 16, 2024

@xingyu-long I started working on them but then stopped when I saw the rllib failures. If you intend to work on this right away then go for it, otherwise I'll probably want to finish the PR cause it's blocking us at work and we'd like to see it merge soon. Let me know, thanks!

Btw pandas has an entire section on how to test extension arrays.

The original PR also added a small number of tests here.

(forgot: let me know if you need help!)

@xingyu-long
Copy link
Contributor

xingyu-long commented Nov 16, 2024

@akavalar thanks for quick response! Yeah I will work on it, will post some updates tomorrow.

@xingyu-long
Copy link
Contributor

def test_pandas_isna():
    arr = np.array([1, np.nan, 3, 4, 5, np.nan, 7, 8, 9], dtype=object)
    ta = PythonObjectArray(arr)
    np.testing.assert_array_equal(ta.isna(), pd.isna(arr))


def test_pandas_take():
    arr = np.array([1, 2, 3, 4, 5], dtype=object)
    ta = PythonObjectArray(arr)
    indices = [1, 2, 3]
    np.testing.assert_array_equal(ta.take(indices).to_numpy(), arr[indices])
    indices = [1, 2, -1]
    np.testing.assert_array_equal(
        ta.take(indices, allow_fill=True, fill_value=100).to_numpy(),
        np.array([2, 3, 100]),
    )


def test_pandas_concat():
    arr1 = np.array([1, 2, 3, 4, 5], dtype=object)
    arr2 = np.array([6, 7, 8, 9, 10], dtype=object)
    ta1 = PythonObjectArray(arr1)
    ta2 = PythonObjectArray(arr2)
    concat_arr = PythonObjectArray._concat_same_type([ta1, ta2])
    assert len(concat_arr) == arr1.shape[0] + arr2.shape[0]
    np.testing.assert_array_equal(concat_arr.to_numpy(), np.concatenate([arr1, arr2]))

^ I came up with these tests.

maybe you can add them or have something similar into your PR @akavalar

cc @richardliaw

@akavalar
Copy link
Author

@xingyu-long thanks a lot! will have a look later today

@richardliaw
Copy link
Contributor

hey @akavalar - did you have a chance to take a look at the tests?

@akavalar
Copy link
Author

hey guys, I had to pivot to something else at work and haven't yet found the time for this - should be able to do it today though. thanks for your patience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants