[Data] Re-implement APIs like select_columns with PyArrow batch format #48140

ArturNiederfahrenhorst · 2024-10-21T22:09:34Z

Related issue number

Closes #48090

Prerequisite: #48575

python/ray/data/dataset.py

ArturNiederfahrenhorst · 2024-10-22T17:00:00Z

Looking at the failed test...

python/ray/data/dataset.py

ArturNiederfahrenhorst · 2024-11-05T19:58:34Z

I'll rebase once the fix is in and mongodb test should pass

alexeykudinkin

@ArturNiederfahrenhorst please hold on landing this one

python/ray/data/dataset.py

python/ray/data/tests/test_map.py

ArturNiederfahrenhorst · 2024-11-13T19:11:13Z

python/ray/data/tests/test_mongo.py

-    )
+
+    assert ds.count() == 5
+    assert ds.schema().names == ["_id", "float_field", "int_field"]


Made these changes to decouple them from the string representation which may vary over versions. On my local environment, it was different then here/CI.

@ArturNiederfahrenhorst that's a nice change.

Let's however not reduce strictness of the check itself -- let's keep asserting on the full schema, not just the column names

ArturNiederfahrenhorst · 2024-11-13T19:12:27Z

python/ray/data/tests/test_map.py

@@ -362,7 +383,7 @@ def test_drop_columns(ray_start_regular_shared, tmp_path):
        assert ds.drop_columns(["col2"]).take(1) == [{"col1": 1, "col3": 3}]
        assert ds.drop_columns(["col1", "col3"]).take(1) == [{"col2": 2}]
        assert ds.drop_columns([]).take(1) == [{"col1": 1, "col2": 2, "col3": 3}]
-        assert ds.drop_columns(["col1", "col2", "col3"]).take(1) == [{}]
+        assert ds.drop_columns(["col1", "col2", "col3"]).take(1) == []


As discussed offline, this behavior is arbitrary and probably has little practical relevance.
Since our pyarrow implementation of the drop operation returns an empty list, we decided to just change the test in this case.

richardliaw · 2024-11-14T17:55:06Z

python/ray/data/dataset.py

-        def add_column(batch: "pandas.DataFrame") -> "pandas.DataFrame":
-            batch.loc[:, col] = fn(batch)
-            return batch
+        def add_column(batch: "pyarrow.Table") -> "pyarrow.Table":


the typing here is off - batch is DataBatch type right? for example if it is pandas

richardliaw · 2024-11-14T17:57:32Z

python/ray/data/dataset.py

+        if batch_format not in [
+            "pandas",
+            "pyarrow",
+        ]:
+            raise ValueError(
+                f"batch_format argument must be 'pandas' or 'pyarrow', "
+                f"got: {batch_format}"
+            )
+


I don't think you need to validate here, should happen in map_batches

richardliaw · 2024-11-14T17:58:29Z

python/ray/data/dataset.py

+        # Historically, we have also accepted lists with duplicate column names.
+        # This is not tolerated by the underlying pyarrow.Table.drop_columns method.
+        cols_without_duplicates = list(set(cols))
+


i think we should just enforce this via validation / raise an error

This is a breaking change then!
Still?

I think it's fine, yes.

python/ray/data/dataset.py

bveeramani · 2024-11-14T17:58:27Z

python/ray/data/dataset.py

+        if batch_format not in [
+            "pandas",
+            "pyarrow",
+        ]:
+            raise ValueError(
+                f"batch_format argument must be 'pandas' or 'pyarrow', "
+                f"got: {batch_format}"
+            )


Any reason we can't support the numpy batch format?

bveeramani · 2024-11-14T17:58:46Z

python/ray/data/dataset.py

+                    # Create a new table with the updated column
+                    return batch.set_column(column_idx, col, column)


Should we either error or emit a warning here? Overriding a column might be unexpected

@bveeramani Does Ray Data have existing helpers to log this without spamming?
I'd do the same for numpy, pandas and arrow then.

+1

Since API is called add_column, i think we should assert that the column does not exist

Since @bveeramani is ok with an error or a warning and @alexeykudinkin prefers an error, I've made this case an error.

bveeramani · 2024-11-14T17:59:14Z

python/ray/data/tests/test_mongo.py

-    )
+
+    assert ds.count() == 5
+    assert ds.schema().names == ["_id", "float_field", "int_field"]


python/ray/data/dataset.py

alexeykudinkin · 2024-11-19T19:38:51Z

python/ray/data/dataset.py

+                    # Create a new table with the updated column
+                    return batch.set_column(column_idx, col, column)
+            else:
+                # batch format is assumed to be numpy


Let's not assume and instead add explicit conditional (for unsupported format throw an exception)

While I'm fine with that, this collides with #48140 (comment)

I'll revert the change I made for Richard then, assuming that I should follow the recommendation of the Ray Data team here.

Yeah, let's not assume the format -- UDF has to match the format and hence we need to be careful with an assumptions like that

alexeykudinkin · 2024-11-19T19:39:41Z

python/ray/data/dataset.py

+                    # Create a new table with the updated column
+                    return batch.set_column(column_idx, col, column)


+1

Since API is called add_column, i think we should assert that the column does not exist

alexeykudinkin · 2024-11-19T19:40:48Z

python/ray/data/tests/test_map.py

+
+    # Test with pyarrow batch format
+    ds = ray.data.range(5).add_column(
+        "foo", lambda x: pa.array([1] * x.num_rows), batch_format="pyarrow"


Let's also test with pa.chunked_array

alexeykudinkin · 2024-11-19T19:42:23Z

python/ray/data/tests/test_mongo.py

-    )
+
+    assert ds.count() == 5
+    assert ds.schema().names == ["_id", "float_field", "int_field"]


@ArturNiederfahrenhorst that's a nice change.

Let's however not reduce strictness of the check itself -- let's keep asserting on the full schema, not just the column names

python/ray/data/dataset.py

alexeykudinkin · 2024-11-21T00:01:38Z

python/ray/data/dataset.py

+
+        def add_column(
+            batch: DataBatch,
+        ) -> Union["pyarrow.Array", "pandas.Series", Dict[str, "np.ndarray"]]:


This should also return DataBatch

Ooof, good one!

Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

ray-project#48140) ## Related issue number Closes ray-project#48090 Prerequisite: ray-project#48575 --------- Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>

Previously, you could add a column with a list like this: ``` ds.add_column("zeros", lambda batch: [0] * len(batch)) ``` However, after #48140, this behavior isn't supported. To avoid breaking tests and user code, this PR re-adds support for lists. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

ray-project#48140) ## Related issue number Closes ray-project#48090 Prerequisite: ray-project#48575 --------- Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Connor Sanders <connor@elastiflow.com>

Previously, you could add a column with a list like this: ``` ds.add_column("zeros", lambda batch: [0] * len(batch)) ``` However, after ray-project#48140, this behavior isn't supported. To avoid breaking tests and user code, this PR re-adds support for lists. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Connor Sanders <connor@elastiflow.com>

ray-project#48140) ## Related issue number Closes ray-project#48090 Prerequisite: ray-project#48575 --------- Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: hjiang <dentinyhao@gmail.com>

Previously, you could add a column with a list like this: ``` ds.add_column("zeros", lambda batch: [0] * len(batch)) ``` However, after ray-project#48140, this behavior isn't supported. To avoid breaking tests and user code, this PR re-adds support for lists. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: hjiang <dentinyhao@gmail.com>

ArturNiederfahrenhorst marked this pull request as ready for review October 22, 2024 15:32

ArturNiederfahrenhorst requested review from scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners October 22, 2024 15:32

ArturNiederfahrenhorst commented Oct 22, 2024

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

ArturNiederfahrenhorst commented Oct 22, 2024

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

ArturNiederfahrenhorst force-pushed the pyarrowbatches branch 2 times, most recently from 53ef980 to bab3632 Compare October 23, 2024 15:05

ArturNiederfahrenhorst assigned bveeramani Oct 25, 2024

bveeramani approved these changes Oct 28, 2024

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

ArturNiederfahrenhorst requested a review from alexeykudinkin as a code owner November 4, 2024 23:04

ArturNiederfahrenhorst requested a review from srinathk10 as a code owner November 5, 2024 22:16

alexeykudinkin reviewed Nov 6, 2024

View reviewed changes

python/ray/data/dataset.py Outdated Show resolved Hide resolved

python/ray/data/dataset.py Outdated Show resolved Hide resolved

richardliaw reviewed Nov 7, 2024

View reviewed changes

python/ray/data/tests/test_map.py Outdated Show resolved Hide resolved

ArturNiederfahrenhorst commented Nov 13, 2024

View reviewed changes

richardliaw reviewed Nov 14, 2024

View reviewed changes

bveeramani reviewed Nov 14, 2024

View reviewed changes

ruisearch42 mentioned this pull request Nov 18, 2024

[core][compiled graphs] Fix do_profile_tasks #48782

Merged

8 tasks

richardliaw added the go add ONLY when ready to merge, run all tests label Nov 19, 2024

alexeykudinkin reviewed Nov 19, 2024

View reviewed changes

alexeykudinkin approved these changes Nov 21, 2024

View reviewed changes

ArturNiederfahrenhorst enabled auto-merge (squash) November 21, 2024 00:21

github-actions bot disabled auto-merge November 21, 2024 15:38

ArturNiederfahrenhorst and others added 18 commits November 21, 2024 23:46

replace kwarg by arg

35bbfb1

cleanup

f234862

cleanup after merge

f3cbf08

Alexey's comment

55bc8f6

minor change

9252d0d

lint

0846f80

richard's comments

9fe088d

Update python/ray/data/dataset.py

a79db22

Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

add numpy

aae2a7a

add tests

ace4348

Richard's comment + testing

84bb1b5

lint

11c75a2

Update python/ray/data/dataset.py

a306be3

Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

Alexey's comments

cfe5900

fix doctests

ddf22e4

Update python/ray/data/dataset.py

4e11ff9

Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>

Remove pymongoarrow datatype testing in test_mongo

783e6ca

fix doctests

bcd0ba6

ArturNiederfahrenhorst force-pushed the pyarrowbatches branch from 691e963 to bcd0ba6 Compare November 21, 2024 22:49

ArturNiederfahrenhorst added 2 commits November 22, 2024 00:57

lint

149a5cf

fix gc test

9fea594

ArturNiederfahrenhorst enabled auto-merge (squash) November 22, 2024 00:01

ArturNiederfahrenhorst merged commit 335bd66 into ray-project:master Nov 22, 2024
6 checks passed

bveeramani mentioned this pull request Nov 25, 2024

[Data] Relax type check in add_column #48918

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data] Re-implement APIs like select_columns with PyArrow batch format #48140

[Data] Re-implement APIs like select_columns with PyArrow batch format #48140

ArturNiederfahrenhorst commented Oct 21, 2024 •

edited

Loading

ArturNiederfahrenhorst commented Oct 22, 2024

ArturNiederfahrenhorst commented Nov 5, 2024

alexeykudinkin left a comment

ArturNiederfahrenhorst Nov 13, 2024

richardliaw Nov 14, 2024

bveeramani Nov 14, 2024

alexeykudinkin Nov 19, 2024

ArturNiederfahrenhorst Nov 13, 2024

richardliaw Nov 14, 2024

ArturNiederfahrenhorst Nov 18, 2024

richardliaw Nov 14, 2024

richardliaw Nov 14, 2024

ArturNiederfahrenhorst Nov 18, 2024

richardliaw Nov 19, 2024

bveeramani Nov 14, 2024

bveeramani Nov 14, 2024

ArturNiederfahrenhorst Nov 18, 2024

alexeykudinkin Nov 19, 2024

ArturNiederfahrenhorst Nov 20, 2024

bveeramani Nov 14, 2024

alexeykudinkin Nov 19, 2024

ArturNiederfahrenhorst Nov 20, 2024

ArturNiederfahrenhorst Nov 20, 2024

alexeykudinkin Nov 20, 2024

alexeykudinkin Nov 19, 2024

alexeykudinkin Nov 19, 2024

alexeykudinkin Nov 19, 2024

alexeykudinkin Nov 21, 2024

ArturNiederfahrenhorst Nov 21, 2024

		# Create a new table with the updated column
		return batch.set_column(column_idx, col, column)

[Data] Re-implement APIs like select_columns with PyArrow batch format #48140

[Data] Re-implement APIs like select_columns with PyArrow batch format #48140

Conversation

ArturNiederfahrenhorst commented Oct 21, 2024 • edited Loading

Related issue number

ArturNiederfahrenhorst commented Oct 22, 2024

ArturNiederfahrenhorst commented Nov 5, 2024

alexeykudinkin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ArturNiederfahrenhorst commented Oct 21, 2024 •

edited

Loading