
[Arrow] Adding ArrowTensorTypeV2 to support tensors larger than 2Gb #47832

Merged: 34 commits into ray-project:master on Oct 1, 2024

Conversation

@alexeykudinkin (Contributor) commented Sep 26, 2024

Why are these changes needed?

Currently, when using the tensor type in Ray Data, if a single tensor in a block grows above 2Gb (due to the use of signed int32 offsets), it results in the following failure:

pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Consequently, this change adds support for tensors larger than 2Gb, while maintaining compatibility with existing datasets already using tensors.

This is done by forking ArrowTensorType in two:

  • ArrowTensorType (v1) remains intact
  • ArrowTensorTypeV2 is rebased on Arrow's LargeListType and now uses int64 offsets (see the sketch below)
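
For illustration, here is a minimal sketch (not from this PR) of the underlying PyArrow behavior: int32 list offsets overflow once a single array's flattened values exceed 2**31 - 1 elements, while int64 large_list offsets do not. The sizes below are arbitrary, the exact error text may vary by PyArrow version, and running this needs a few GiB of free RAM.

```
import numpy as np
import pyarrow as pa

# One ListArray chunk holding 1024 tensors of 2**20 uint8 values each,
# i.e. 2**30 flattened values (~1Gb) behind int32 offsets.
chunk = pa.array([np.ones(2**20, dtype=np.uint8)] * 1024)
chunked = pa.chunked_array([chunk, chunk, chunk])  # ~3Gb of values in total

try:
    # Combining re-offsets all values into a single int32 ListArray,
    # and 3 * 2**30 exceeds 2**31 - 1, so the offsets overflow.
    chunked.combine_chunks()
except pa.ArrowInvalid as e:
    print(e)  # e.g. "offset overflow while concatenating arrays"

# The same data behind int64 offsets combines fine.
chunked.cast(pa.large_list(pa.uint8())).combine_chunks()
```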

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@bveeramani (Member) commented Sep 27, 2024

@alexeykudinkin is my understanding correct that v2 is off by default? Are we planning on removing v1 at some point? If not, how do we avoid maintaining both v1 and v2 indefinitely?

Also, what's the motivation for introducing a new version rather than changing the implementation? To avoid making a breaking change?

@alexeykudinkin (Contributor, Author):

@alexeykudinkin is my understanding correct that v2 is off by default? Are we planning on removing v1 at some point? If not, how do we avoid maintaining both v1 and v2 indefinitely?

Correct.

V1 is going to stay around for a while for BWC (backward compatibility) reasons, hence we have to offer a migration path:

  1. Users switch to ArrowTensorTypeV2 by setting RAY_DATA_USE_ARROW_TENSOR_V2 (or updating DataContext.use_arrow_tensor_v2), as sketched below
  2. Reading the V1 path will still be supported, but all new datasets will be written in the V2 format
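
A minimal sketch of the two opt-in paths named in step 1 (the accepted string values for the environment variable are an assumption here, and it must be set before Ray Data is initialized):

```
import os

# Option 1: opt in via the environment variable ("1" is an assumed value).
os.environ["RAY_DATA_USE_ARROW_TENSOR_V2"] = "1"

# Option 2: opt in programmatically on the current DataContext.
from ray.data import DataContext

DataContext.get_current().use_arrow_tensor_v2 = True
```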

Also, what's the motivation for introducing a new version rather than changing the implementation? To avoid making a breaking change?

Correct.

@bveeramani (Member) left a comment:

LGTM.

@alexeykudinkin do we need to make an analogous change to ArrowVariableShapedTensorType?

Reading the V1 path will still be supported, but all new datasets will be written in the V2 format

Not sure if I missed it, but will new datasets be written with v2 after this PR?

@@ -1204,6 +1205,92 @@ def test_partitioning_in_dataset_kwargs_raises_error(ray_start_regular_shared):
)


def test_tensors_in_tables_parquet(
Member:

Any reason not to parameterize this test?

Contributor (Author):

No need -- this test verifies BWC

Comment on lines +255 to +266
@classmethod
def __arrow_ext_deserialize__(cls, storage_type, serialized):
    shape = tuple(json.loads(serialized))
    return cls(shape, storage_type.value_type)
Member:

Dumb question -- why can't we put this in _BaseArrowTensorType?

Comment on lines +782 to +801
@pytest.mark.parametrize("tensor_format", ["v1", "v2"])
def test_large_arrow_tensor_array(restore_data_context, tensor_format):
    DataContext.get_current().use_arrow_tensor_v2 = tensor_format == "v2"

    test_arr = np.ones((1000, 550), dtype=np.uint8)

    if tensor_format == "v1":
        with pytest.raises(ArrowConversionError) as exc_info:
            ta = ArrowTensorArray.from_numpy([test_arr] * 4000)

        assert (
            repr(exc_info.value.__cause__)
            == "ArrowInvalid('Negative offsets in list array')"
        )
    else:
        ta = ArrowTensorArray.from_numpy([test_arr] * 4000)
        assert len(ta) == 4000
        for arr in ta:
            assert np.asarray(arr).shape == (1000, 550)

Member:

Nit: IMO this would be more readable as two separate tests, since separate tests would avoid the conditionals and there's only one shared line between the two parameterizations.

Contributor (Author):

I definitely see your point; however, splitting it into 2 tests obscures its intention, which is to show that the same input fails with V1 and works with V2.



@pytest.mark.parametrize("tensor_format", ["v1", "v2"])
Contributor:

Nit: define an autouse fixture to avoid updating each test case, e.g.:

@pytest.fixture(autouse=True, scope="module", params=["v1", "v2"])
def tensor_format(request):
    # Hypothetical completion: mirrors the inline DataContext toggle in the tests above.
    DataContext.get_current().use_arrow_tensor_v2 = request.param == "v2"
    yield request.param

Contributor (Author):

Oh, interesting, didn't know about this trick.

Would prefer not to unwind all 40 tests now just to remove this parametrize, though.

python/ray/air/_internal/tensorflow_utils.py (review comment outdated; resolved)
@@ -286,6 +288,7 @@ class DataContext:
min_parallelism: int = DEFAULT_MIN_PARALLELISM
read_op_min_num_blocks: int = DEFAULT_READ_OP_MIN_NUM_BLOCKS
enable_tensor_extension_casting: bool = DEFAULT_ENABLE_TENSOR_EXTENSION_CASTING
use_arrow_tensor_v2: bool = DEFAULT_USE_ARROW_TENSOR_V2
Contributor:

Can you add a comment in the docstring explaining the differences between v1 and v2, as well as the backward-compatibility story?

Contributor (Author):

What docstring are you referring to (attached to what)?

Contributor (Author):

Added to the DEFAULT_USE_ARROW_TENSOR_V2 definition

Contributor:

There is a long docstring below class DataContext covering all fields; I was referring to it. But I think we should break it down into per-field definitions.

python/ray/data/dataset.py (review comment outdated; resolved)
@raulchen (Contributor) left a comment:

Can you also add a comment explaining what exactly is not compatible? I assume the issue is that data written in v1 cannot be read using v2.

Reading the V1 path will still be supported, but all new datasets will be written in the V2 format

This sounds like a good idea. We can default to V2 and print a warning prompting users to switch when reading V1 data. This can be done in a follow-up PR, though.

@alexeykudinkin (Contributor, Author):

This sounds like a good idea. We can default to V2 and print a warning prompting users to switch when reading V1 data. This can be done in a follow-up PR, though.

I'd rather decouple switching the default from supporting it: keeping V2 opt-in in this PR gets us more mileage to validate V2 before flipping it on by default.

@alexeykudinkin (Contributor, Author):

Can you also add a comment explaining what exactly is not compatible? I assume the issue is that data written in v1 cannot be read using v2.

The wire format is not compatible:

  • V1 uses int32 offsets, while V2 uses int64
  • V1 uses PA's ListType, while V2 uses LargeListType (see the sketch below)
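
A short sketch of that storage difference; the uint8 value type is illustrative, not the PR's actual tensor storage schema:

```
import pyarrow as pa

# V1-style storage: ListType offsets are int32, capping one array's
# flattened values at 2**31 - 1 elements (~2Gb for uint8).
v1_storage = pa.list_(pa.uint8())

# V2-style storage: LargeListType offsets are int64, removing that cap.
v2_storage = pa.large_list(pa.uint8())

print(v1_storage)  # list<item: uint8>
print(v2_storage)  # large_list<item: uint8>
```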

Peter Wang and others added 6 commits September 30, 2024 15:39

…Array` type def;

Forked off `ArrowTensorTypeV2` from `ArrowTensorType`, now using int64 offsets

…rrowTensorType`)

Signed-off-by: Peter Wang <peter.wang9812@gmail.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@raulchen merged commit 1bab09b into ray-project:master on Oct 1, 2024. 5 checks passed.
ujjawal-khare pushed several commits to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024, each repeating the PR description above (…ray-project#47832):

Signed-off-by: Peter Wang <peter.wang9812@gmail.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Co-authored-by: Peter Wang <peter.wang9812@gmail.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
4 participants