
[Data] Add support for objects to Arrow blocks #45272

Conversation

terraflops1048576
Contributor

Why are these changes needed?

Currently, Ray does not support blocks/batches with objects and multi-dimensional arrays in different columns. This causes Ray Data to throw exceptions when these are provided because:

  1. Since there's an arbitrary object in the batch, the Arrow block format fails with an ArrowNotImplementedError (dtype 17), which triggers the fallback to return pd.DataFrame(dict(batch)) in BlockAccessor.batch_to_block.
  2. However, this particular DataFrame constructor does not support columns backed by multi-dimensional numpy.ndarray values, so it throws the exception listed in the linked issue (a minimal repro is sketched below).
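For illustration, a minimal sketch of the kind of batch that hits this path (the UDF and the column names here are made up; the actual report is in the linked issue):

```python
import numpy as np
import ray


class Custom:
    """Stand-in for an arbitrary Python object that Arrow can't convert natively."""


def add_columns(batch):
    n = len(batch["id"])
    # One column of arbitrary Python objects...
    batch["obj"] = np.array([Custom() for _ in range(n)], dtype=object)
    # ...next to a multi-dimensional array column in the same batch.
    batch["image"] = np.zeros((n, 32, 32, 3))
    return batch


ds = ray.data.range(8).map_batches(add_columns)
# Without this change, materializing fails: the Arrow conversion rejects the
# object column, and the pandas fallback rejects the multi-dimensional column.
ds.materialize()
```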

This change enables Python object storage in the Arrow blocks by defining an Arrow extension type that simply represents the Python objects as a variable-sized large binary. I suppose the alleged performance benefits listed in the comments are an extra benefit.

I'm not sure that this is the correct approach or that I've properly patched all of the places, so some help would be appreciated!
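For readers unfamiliar with PyArrow extension types, here is a minimal sketch of the idea (not the exact implementation in this PR; the class and helper names are illustrative): pickle each object and store the bytes in a large_binary-backed extension array.

```python
import pickle

import pyarrow as pa


class PythonObjectType(pa.ExtensionType):
    """Extension type whose storage is large_binary; each value is a pickled object."""

    def __init__(self):
        super().__init__(pa.large_binary(), "example.python_object")

    def __arrow_ext_serialize__(self) -> bytes:
        return b""  # the type has no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


def objects_to_array(objs) -> pa.ExtensionArray:
    """Pickle each object and wrap the binary storage in the extension type."""
    storage = pa.array([pickle.dumps(o) for o in objs], type=pa.large_binary())
    return pa.ExtensionArray.from_storage(PythonObjectType(), storage)


def array_to_objects(arr: pa.ExtensionArray) -> list:
    """Unpickle the stored bytes back into Python objects."""
    return [pickle.loads(buf.as_py()) for buf in arr.storage]
```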

Related issue number

Resolves #45235

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch from abe7c58 to 516ff57 on May 12, 2024 05:46
…fix type annnotation

@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch from 516ff57 to 3b38131 on May 12, 2024 05:59
@terraflops1048576 terraflops1048576 changed the title from "Add support for objects to Arrow blocks" to "[Data] Add support for objects to Arrow blocks" on May 12, 2024
@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch 2 times, most recently from 2d82297 to 978b30e on May 12, 2024 19:23
@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch from 978b30e to 014949f on May 12, 2024 20:26
@raulchen raulchen self-assigned this May 13, 2024
@anyscalesam
Contributor

@terraflops1048576 this could be a big contribution, but we want to think a little deeper about this with you to properly co-develop it; can you reach out to me on Ray Slack so we can set up some time to discuss further?

My handle is "Sam (Ray Team)"

We should get a formal REP in ray-enhancements for this as well: https://github.com/ray-project/enhancements

cc @bveeramani

@anyscalesam anyscalesam added the triage (Needs triage: priority, bug/not-bug, and owning component) and data (Ray Data-related issues) labels May 29, 2024
@anyscalesam anyscalesam reopened this May 30, 2024
@anyscalesam anyscalesam removed the triage label May 30, 2024
@raulchen
Contributor

Sorry, I had some disruptions last week; doing the 2nd pass of review now.
Also, could you resolve the conflicts when you have a chance? Thanks.

Contributor

@raulchen raulchen left a comment

Looks good to me at a high level.
Apologies for my late review. Also, none of the active Ray Data maintainers are familiar with the pyarrow extension API, so it'd be appreciated if you could help explain the questions and add more comments.
Thanks.


def object_extension_type_allowed() -> bool:
    return (
        PYARROW_VERSION is None
Contributor

In which case will this be None? And why should this return True when it is None?

Contributor Author

Actually, I'm not sure. I just borrowed these version checks from the existing ArrowTensorArray implementation, which also passes if the PYARROW_VERSION is None. It turns out the way I've done it is buggy, so I'll revise it. I can also defensively just make this check fail if PYARROW_VERSION is None.

The more general question is whether we would be open to bumping the minimum PyArrow version to 9, which is almost 2 years old at this point. Currently, it is at 6.0.1, which appears to be almost 3 years old.
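For reference, a sketch of the defensive variant discussed here (the constant and its value are illustrative, not Ray's actual minimum):

```python
from typing import Optional

from packaging.version import Version

# Assumed minimum for the object extension type; the real requirement may differ.
MIN_PYARROW_VERSION_FOR_OBJECTS = Version("9.0.0")


def object_extension_type_allowed(pyarrow_version: Optional[str]) -> bool:
    # Fail closed: if the installed pyarrow version can't be determined,
    # don't enable the object extension type.
    if pyarrow_version is None:
        return False
    return Version(pyarrow_version) >= MIN_PYARROW_VERSION_FOR_OBJECTS
```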



@PublicAPI(stability="alpha")
class ArrowPythonObjectType(pa.ExtensionType):
Contributor

Could you document these classes and methods in this file?
Not many people are familiar with the pyarrow extension API, so adding some comments would help improve readability.

Contributor Author

Will do

except (
    pyarrow.ArrowInvalid,
    pyarrow.ArrowNotImplementedError,
    pyarrow.ArrowTypeError,
Contributor

  • Is it possible to detect whether the input data contains unsupported types, instead of using try-except?
  • If not, can the above exception types be caused by reasons other than unsupported types?

Contributor Author

I believe this is the way that the fallback to Pandas is calculated; in any case, this code is outdated after the merge conflict because they have now all been coalesced into ArrowConversionError.

It is possible to check that all of the objects inside are of some Arrow type, but I believe this is quite cumbersome -- you have to manually check whether it is convertible to any PyArrow type, and you have to check all of the elements. It's easier to try conversion and see if it works.

In any case, the conversion into the ArrowPythonObjectArray should never fail if everything is pickleable, and I assume this is better than falling back to Pandas.
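A sketch of that "try the native conversion first, then fall back" flow (the function is a placeholder, not the actual Ray internals; objects_to_array refers to the object-array helper sketched in the PR description above):

```python
import pyarrow as pa


def column_to_arrow(values):
    try:
        # Let pyarrow infer a native type (ints, floats, strings, ...).
        return pa.array(values)
    except (pa.ArrowInvalid, pa.ArrowNotImplementedError, pa.ArrowTypeError):
        # Anything pickleable can still be stored via the object extension type,
        # which should be preferable to falling back to a pandas block.
        return objects_to_array(values)
```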

if log_once(f"arrow_object_pickle_{col_name}"):
    logger.warning(
        f"Failed to interpret {col_name} as "
        "multi-dimensional arrays. It will be pickled."
Contributor

in which case would this happen?

Contributor Author

The current code looks at any column which has a numpy dtype of object and checks if it can be converted into a list of tensors. If it can't be, then we fall back to ArrowPythonObjectArray. This happens when the numpy array has a dtype of object because you have custom objects in there.
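Roughly, the decision described above looks like this (a sketch; to_tensor_array and to_object_array are stand-ins for Ray's actual converters):

```python
import numpy as np
import pyarrow as pa


def convert_column(values: np.ndarray, to_tensor_array, to_object_array):
    if values.dtype != object:
        # Plain numeric/string columns go through the normal conversion path.
        return pa.array(values)
    try:
        # Object-dtype columns whose elements are all ndarrays become tensor columns.
        return to_tensor_array(values)
    except Exception:
        # Custom objects (or anything else that isn't tensor-like) get pickled.
        return to_object_array(list(values))
```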

ds = ds.materialize()
block = ray.get(ds.get_internal_block_refs()[0])
# TODO: Once we support converting dict to a supported arrow type,
# the block type should be Arrow.
Contributor

what does this comment mean? isn't the block already pyarrow?

Contributor Author

Ah yes, this used to be a test that checked the fallback behavior to Pandas blocks, but these changes are supposed to remove that fallback behavior for the case given. I cut the test from where it originally was and pasted it here, but I must've forgotten to delete the comment.

python/ray/data/tests/test_arrow_block.py (resolved review thread)
f"but schema is not ArrayTensorType: {tensor_array_types[0]}"
)
schema_field_overrides[col_name] = new_type
arrow_tensor_types = (ArrowVariableShapedTensorType, ArrowTensorType)
Contributor

can you explain what this change is doing?

Contributor Author

Basically this is a refactor of the unify_schemas function, which serves a similar purpose to the PyArrow unify_schemas. Its job is to take the schemas of the tables to be concatenated and produce a "unified" schema which contains the correct types for all of the columns of the constituent tables.

The previous code checked only the first table being concatenated, so it would behave weirdly if the first table didn't have a tensor type but a later one did: it would create a schema where the tensor types were simply erased (though the error would be caught elsewhere).

However, for the object array, we need to support the case where the first table has a column of type, say, int64, and the second table has a column of type SomeObject. The old code would simply set the type to int64, which would cause an error. The refactored code ensures that the int64s get pickled as well (yes, this is not very efficient, but it is the only way to support this mix). So the new code scans all of the table schemas, finds which columns contain tensors or objects in any table, and handles them accordingly.

This is basically type promotion for the ArrowTensorArray and the ArrowVariableShapedTensorArray and likewise for the ArrowPythonObjectArray and other types. Should we set the promotion options differently too?
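A rough sketch of the "scan every schema, then override the promoted columns" shape of this refactor (needs_promotion and promote are placeholders for the tensor/object promotion rules described above, not the actual implementation):

```python
import pyarrow as pa


def unify_schemas_sketch(schemas, needs_promotion, promote):
    # Collect the per-column types across *all* schemas, not just the first one.
    overrides = {}
    for name in {n for s in schemas for n in s.names}:
        col_types = [s.field(name).type for s in schemas if name in s.names]
        if needs_promotion(col_types):
            overrides[name] = promote(col_types)
    # Rewrite each schema with the promoted types, then let pyarrow unify the rest.
    patched = []
    for s in schemas:
        for name, new_type in overrides.items():
            if name in s.names:
                s = s.set(s.get_field_index(name), pa.field(name, new_type))
        patched.append(s)
    return pa.unify_schemas(patched)
```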

Contributor

Thanks for the detailed explanation! Can you also update the docstring of this method to reflect this change?
also do you mind explaining again what you mean by "set the promotion options differently"?

…cts_to_arrow_blocks

@@ -181,6 +185,32 @@ def test_arrow_concat_tensor_extension_uniform_but_different():
# fails for this case.


@pytest.mark.skipif(
Contributor Author

I'm not sure if these actually get run in CI if I put the skipif and require PyArrow version >= 9.0.0. Can someone check?
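The marker in question is along these lines (illustrative, not the exact test); whether the test body actually runs in CI then depends only on the pyarrow version installed in the CI environment:

```python
import pyarrow as pa
import pytest
from packaging.version import Version


@pytest.mark.skipif(
    Version(pa.__version__) < Version("9.0.0"),
    reason="object extension type requires pyarrow >= 9.0.0",
)
def test_arrow_python_object_roundtrip():
    ...
```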

@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch from f41298e to 7854e67 on June 26, 2024 04:55
@terraflops1048576
Contributor Author

terraflops1048576 commented Jun 26, 2024

After trying to fix the test failures that relate to the new exception info, I realized that this is actually rather problematic -- it fixes way too many errors that would otherwise be reported to the user, like the overflow error resulting from [1, 2**100], trying to concatenate np.array([2**100]) and np.array([1]), etc. After all, it is perfectly legal to pickle all of these things and store them in a column, but we obviously don't want to do this for performance reasons. We would rather error and tell the user to adjust the problematic values.

I'm not sure what we should do with the np.array case, as they are technically objects, but maybe we shouldn't fall back if anything related to ArrowTensorArray fails.
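For context, this is the kind of conversion error being referred to (the exact exception class varies across pyarrow versions, hence the broad except here):

```python
import pyarrow as pa

try:
    pa.array([1, 2**100])  # 2**100 does not fit in int64
except (OverflowError, pa.ArrowException) as e:
    # Pickling [1, 2**100] as objects would "work", but it silently hides a
    # problem the user probably wants to know about.
    print(type(e).__name__, e)
```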

@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch from 9c131f1 to 6564543 on June 26, 2024 08:22
@terraflops1048576 terraflops1048576 force-pushed the terraflops/add_objects_to_arrow_blocks branch from 2edc1bb to 924f653 on June 26, 2024 16:36
Contributor

@raulchen raulchen left a comment

Thanks for the new comments and explanation.

it fixes way too many errors that would otherwise be reported to the user, like the overflow error resulting from [1, 2**100], trying to concatenate np.array([2**100]) and np.array([1]), etc. After all, it is perfectly legal to pickle all of these things and store them in a column, but we obviously don't want to do this for performance reasons.

To confirm my understanding: in the UDF return, if any value overflows, all values will fall back to objects; if nothing overflows, the type will still be integers. Right?

This seems fine, because we don't know whether the user intends to return overflowed integers either. Maybe we can introduce a flag to allow disabling the fallback.

I'm not sure what we should do with the np.array case, as they are technically objects, but maybe we shouldn't fall back if anything related to ArrowTensorArray fails.

Are you concerned about np.arrays with object dtype? I think it's okay to not handle that. Users can just return a python list instead.

def is_object_fixable_error(e: ArrowConversionError) -> bool:
    """Returns whether this error can be fixed by using an ArrowPythonObjectArray"""
    return any(
        err in "".join(traceback.format_exception(type(e), e, e.__traceback__))
Contributor

Nit: it's not efficient to format the entire error as a string.
Can we iterate over the causes and check their types?
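A sketch of what that suggestion could look like (the set of "object-fixable" exception types here is illustrative; the PR's actual check matches specific error messages):

```python
import pyarrow as pa

_OBJECT_FIXABLE_TYPES = (
    pa.ArrowInvalid,
    pa.ArrowNotImplementedError,
    pa.ArrowTypeError,
)


def is_object_fixable_error(exc: BaseException) -> bool:
    """Walk the __cause__/__context__ chain instead of string-matching a traceback."""
    seen = set()
    cur = exc
    while cur is not None and id(cur) not in seen:
        seen.add(id(cur))
        if isinstance(cur, _OBJECT_FIXABLE_TYPES):
            return True
        cur = cur.__cause__ or cur.__context__
    return False
```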


@raulchen
Contributor

Some tests are failing. Can you take a look? https://buildkite.com/ray-project/microcheck/builds/2214

@terraflops1048576
Contributor Author

I've tried to separate out overflows and some other errors, like mismatched float/int types, and make them continue to report errors as in the tests I've fixed, instead of saving the values as objects in the PyArrow table, primarily to avoid unintended consequences.

You are correct that, if we enabled catching all ArrowConversionErrors and turning things into objects, then as long as a block contained only non-overflowed integers it would still be stored with an integer type. Currently, np.ndarrays with object dtype (such as when a user tries to put in overflowed integers) are just pickled as objects, because it's really hard to detect whether that's the result of overflowed integers or of plain objects being in there.

Contributor

@raulchen raulchen left a comment

Thanks for the updates. LGTM

python/ray/data/_internal/arrow_block.py (outdated; resolved review thread)
@raulchen raulchen enabled auto-merge (squash) July 11, 2024 02:34
@github-actions github-actions bot added the go (add ONLY when ready to merge, run all tests) label Jul 11, 2024
@raulchen
Contributor

@terraflops1048576 the CI failures are unrelated and already fixed in master. Can you merge the latest master again?

@github-actions github-actions bot disabled auto-merge July 12, 2024 15:29
@terraflops1048576
Contributor Author

terraflops1048576 commented Jul 13, 2024

I'm not sure why the tests that I fixed here weren't broken before, but I included a fix.

Interestingly enough, the documentation for ray.data.Dataset.union says

The datasets must have the same schema as this dataset, otherwise the behavior is undefined.

However, this isn't actually the case anymore, and tests like python/ray/data/tests/test_consumption.py::test_union actually depend on different behavior. Nevertheless, I fixed unify_schemas to reflect this behavior.

Also, I realized today while doing this fix that PyArrow tables can actually have duplicate column names -- this is an edge case that I don't really see treated anywhere (in particular, converting to pandas, which does not allow this, seems broken). I added something that just throws an exception if it sees it in this particular case.
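A quick illustration of that edge case (pyarrow itself is happy to build such a table):

```python
import pyarrow as pa

t = pa.table([pa.array([1, 2]), pa.array([3, 4])], names=["a", "a"])
print(t.column_names)  # ['a', 'a'] -- duplicate names are allowed by pyarrow
```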

Successfully merging this pull request may close these issues.

[Data] Can't return array-like data from UDF if batch contains unsupported type