Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ray Data] PythonObjectArray missing methods causing serialization failures #48748

Open
akavalar opened this issue Nov 14, 2024 · 0 comments
Open
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

@akavalar
Copy link

What happened + What you expected to happen

PythonObjectArray is a subclass of ExtensionArray, but as such it is missing some of the mandatory methods described here, specifically isna, take, copy, and _concat_same_type (the last method interpolate does not seem to be required). Such storage of Python objects in Arrow blocks was added to Ray Data in #45272 and was released as part of Ray 2.33.

Custom Python objects currently cannot be wrapped/serialized due to pandas.errors.AbstractMethodError: This method must be defined in the concrete class PythonObjectArray failures, as shown by the contrived example below.

I have a draft PR that adds the missing methods (#48737), but it's currently failing one of the rllib tests (rllib: examples). I find that somewhat surprising but I haven't yet looked at why that could be happening, so I'd appreciate any pointers in the mean time. Thanks!

Versions / Dependencies

Ray    2.38.0
Pandas 2.1.4
Python 3.10.15 (GCC 11.4.0)
Ubuntu 20.04.6 LTS (5.15.0-105-generic)

Reproduction script

from dataclasses import dataclass, field

import pandas as pd
import ray


@dataclass
class Message:
    data: dict = field(default_factory=dict)


class Stage:
    def __init__(self, data_in, data_out):
        self.data_in = data_in
        self.data_out = data_out

    def __call__(self, input_data):
        df = pd.DataFrame(input_data[self.data_in])  # <--- point of failure when copy (one of the missing methods) gets called
        dicts = df.to_dict("records")
        data = self.run_stage(dicts)
        return pd.DataFrame(data)

    def run_stage(self, dicts):
        raise NotImplementedError


class FirstStage(Stage):
    def run_stage(self, dicts):
        for entry in dicts:
            value = entry[self.data_in].data["value"]
            entry[self.data_out] = Message(data={"value": 2 * value})
        return dicts


class SecondStage(Stage):
    def run_stage(self, dicts):
        for entry in dicts:
            value = entry[self.data_in].data["value"]
            entry[self.data_out] = Message(data={"value": 5 * value})
        return dicts


ray.init()

dataset = ray.data.from_items(
    [
        {"input_data": Message(data={"value": 8})},
        {"input_data": Message(data={"value": 10})},
    ]
)
for stage_class, data_in, data_out in zip(
    [FirstStage, SecondStage],
    ["input_data", "intermediate_data"],
    ["intermediate_data", "final_data"],
):
    map_args = {
        "fn": stage_class,
        "fn_constructor_args": (data_in, data_out),
        "concurrency": 2,
        "batch_format": "pandas",
    }
    dataset = dataset.map_batches(**map_args)
print(dataset.to_pandas()["final_data"])

ray.shutdown()

Issue Severity

High: It blocks me from completing my task.

@akavalar akavalar added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 14, 2024
@jcotant1 jcotant1 added the data Ray Data-related issues label Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)
Projects
None yet
Development

No branches or pull requests

2 participants