[Data] Re-implement APIs like select_columns with PyArrow batch format #48140
Conversation
Looking at the failed test...
force-pushed from 53ef980 to bab3632 (Compare)
I'll rebase once the fix is in, and the MongoDB test should pass.
@ArturNiederfahrenhorst please hold off on landing this one
)

assert ds.count() == 5
assert ds.schema().names == ["_id", "float_field", "int_field"]
Made these changes to decouple them from the string representation, which may vary across versions. On my local environment, it was different than here/CI.
great!
Nice
@ArturNiederfahrenhorst that's a nice change.
Let's, however, not reduce the strictness of the check itself -- let's keep asserting on the full schema, not just the column names.
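For illustration, a minimal sketch of a full-schema assertion, using ray.data.range(5) as a stand-in dataset (the Schema.names/Schema.types accessors and the stand-in dataset are assumptions here; the MongoDB dataset from the test would be asserted on the same way):

```
import pyarrow as pa
import ray

# Sketch: assert on the full schema (names and types), not just the names,
# while staying independent of the schema's string representation.
ds = ray.data.range(5)
schema = ds.schema()
assert schema.names == ["id"]
assert schema.types == [pa.int64()]
```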
@@ -362,7 +383,7 @@ def test_drop_columns(ray_start_regular_shared, tmp_path):
    assert ds.drop_columns(["col2"]).take(1) == [{"col1": 1, "col3": 3}]
    assert ds.drop_columns(["col1", "col3"]).take(1) == [{"col2": 2}]
    assert ds.drop_columns([]).take(1) == [{"col1": 1, "col2": 2, "col3": 3}]
-   assert ds.drop_columns(["col1", "col2", "col3"]).take(1) == [{}]
+   assert ds.drop_columns(["col1", "col2", "col3"]).take(1) == []
As discussed offline, this behavior is arbitrary and probably has little practical relevance.
Since our pyarrow implementation of the drop operation returns an empty list, we decided to just change the test in this case.
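As a minimal sketch of the behavior being described, assuming it mirrors the updated test above:

```
import ray

# Sketch: with the Arrow-based drop implementation, dropping every column
# leaves nothing to take, so the result is [] rather than the pandas-era [{}].
ds = ray.data.range(1)  # single "id" column
assert ds.drop_columns(["id"]).take(1) == []
```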
python/ray/data/dataset.py
Outdated
def add_column(batch: "pandas.DataFrame") -> "pandas.DataFrame":
    batch.loc[:, col] = fn(batch)
    return batch

def add_column(batch: "pyarrow.Table") -> "pyarrow.Table":
The typing here is off: batch is of the DataBatch type, right? For example, it could be pandas.
Thanks!
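A rough sketch of what the corrected annotation could look like; the DataBatch alias below is a local stand-in for the DataBatch type referenced later in the diff, written out here only for illustration:

```
from typing import Dict, Union

import numpy as np

# Local stand-in for Ray Data's DataBatch union (assumption for illustration).
DataBatch = Union["pyarrow.Table", "pandas.DataFrame", Dict[str, np.ndarray]]

def add_column(batch: DataBatch) -> DataBatch:
    # Placeholder body: a real implementation would add the column here in a
    # format-appropriate way before returning the batch.
    return batch
```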
python/ray/data/dataset.py
Outdated
if batch_format not in [
    "pandas",
    "pyarrow",
]:
    raise ValueError(
        f"batch_format argument must be 'pandas' or 'pyarrow', "
        f"got: {batch_format}"
    )
I don't think you need to validate here; this should happen in map_batches.
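A hedged sketch of that suggestion: let map_batches be the single place that rejects unsupported formats (the exact exception type and the point at which it surfaces are assumptions here):

```
import pytest
import ray

# Sketch: an invalid batch_format already fails inside map_batches, so
# add_column does not need its own duplicate check.
ds = ray.data.range(5)
with pytest.raises(Exception):
    ds.map_batches(lambda b: b, batch_format="not-a-format").materialize()
```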
python/ray/data/dataset.py
Outdated
# Historically, we have also accepted lists with duplicate column names.
# This is not tolerated by the underlying pyarrow.Table.drop_columns method.
cols_without_duplicates = list(set(cols))
I think we should just enforce this via validation and raise an error.
This is a breaking change then!
Still?
I think it's fine, yes.
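A minimal sketch of the validation being discussed (hypothetical helper, not the code that landed):

```
from typing import List

def _check_no_duplicate_columns(cols: List[str]) -> None:
    # Reject duplicates up front instead of silently de-duplicating before
    # handing the list to pyarrow.Table.drop_columns.
    duplicates = sorted({c for c in cols if cols.count(c) > 1})
    if duplicates:
        raise ValueError(f"Expected unique column names, got duplicates: {duplicates}")

_check_no_duplicate_columns(["col1", "col2"])    # passes
# _check_no_duplicate_columns(["col1", "col1"])  # would raise ValueError
```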
python/ray/data/dataset.py
Outdated
if batch_format not in [
    "pandas",
    "pyarrow",
]:
    raise ValueError(
        f"batch_format argument must be 'pandas' or 'pyarrow', "
        f"got: {batch_format}"
    )
Any reason we can't support the numpy batch format?
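For context, a sketch of what the numpy path could look like if it were supported: with batch_format="numpy" a batch is a dict of column name to np.ndarray (illustrative only, not the code in this PR):

```
from typing import Dict

import numpy as np
import ray

def add_zeros(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    # "id" is the single column produced by ray.data.range().
    batch["zeros"] = np.zeros(len(batch["id"]))
    return batch

ds = ray.data.range(5).map_batches(add_zeros, batch_format="numpy")
assert ds.schema().names == ["id", "zeros"]
```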
python/ray/data/dataset.py
Outdated
# Create a new table with the updated column
return batch.set_column(column_idx, col, column)
Should we either error or emit a warning here? Overriding a column might be unexpected
@bveeramani Does Ray Data have existing helpers to log this without spamming?
I'd do the same for numpy, pandas and arrow then.
+1. Since the API is called add_column, I think we should assert that the column does not exist.
Since @bveeramani is ok with an error or a warning and @alexeykudinkin prefers an error, I've made this case an error.
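A sketch of the resulting behavior at the pyarrow level (hypothetical helper name; the real change lives inside add_column):

```
import pyarrow as pa

def append_new_column(table: pa.Table, name: str, values: pa.Array) -> pa.Table:
    # Fail loudly instead of silently overwriting an existing column.
    if name in table.column_names:
        raise ValueError(f"Column {name!r} already exists in the batch.")
    return table.append_column(name, values)

t = pa.table({"id": [0, 1, 2]})
t = append_new_column(t, "foo", pa.array([1, 1, 1]))
# append_new_column(t, "id", pa.array([9, 9, 9]))  # would raise ValueError
```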
python/ray/data/dataset.py
Outdated
    # Create a new table with the updated column
    return batch.set_column(column_idx, col, column)
else:
    # batch format is assumed to be numpy
Let's not assume and instead add an explicit conditional (for an unsupported format, throw an exception).
While I'm fine with that, this collides with #48140 (comment)
I'll revert the change I made for Richard then, assuming that I should follow the recommendation of the Ray Data team here.
Yeah, let's not assume the format -- the UDF has to match the format, and hence we need to be careful with assumptions like that.
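A sketch of the explicit dispatch being asked for (illustrative helper, not the landed code):

```
from typing import Any

import numpy as np
import pandas as pd
import pyarrow as pa

def set_column_for_batch(batch: Any, name: str, values: Any) -> Any:
    # Handle each supported batch type explicitly; raise for anything else
    # instead of assuming the remaining case is numpy.
    if isinstance(batch, pa.Table):
        return batch.append_column(name, pa.array(values))
    elif isinstance(batch, pd.DataFrame):
        batch[name] = values
        return batch
    elif isinstance(batch, dict):  # numpy batches are Dict[str, np.ndarray]
        batch[name] = np.asarray(values)
        return batch
    raise ValueError(f"Unsupported batch type: {type(batch)}")
```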
# Test with pyarrow batch format
ds = ray.data.range(5).add_column(
    "foo", lambda x: pa.array([1] * x.num_rows), batch_format="pyarrow"
)
Let's also test with pa.chunked_array
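A sketch of that additional test, assuming add_column forwards the returned ChunkedArray to pyarrow unchanged:

```
import pyarrow as pa
import ray

ds = ray.data.range(5).add_column(
    "foo",
    lambda batch: pa.chunked_array([[1] * batch.num_rows]),
    batch_format="pyarrow",
)
assert ds.schema().names == ["id", "foo"]
```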
python/ray/data/dataset.py
Outdated
def add_column(
    batch: DataBatch,
) -> Union["pyarrow.Array", "pandas.Series", Dict[str, "np.ndarray"]]:
This should also return DataBatch
Ooof, good one!
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com> Signed-off-by: Artur Niederfahrenhorst <attaismyname@googlemail.com>
force-pushed from 691e963 to bcd0ba6 (Compare)
Previously, you could add a column with a list like this:

```
ds.add_column("zeros", lambda batch: [0] * len(batch))
```

However, after #48140, this behavior isn't supported. To avoid breaking tests and user code, this PR re-adds support for lists.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Related issue number
Closes #48090
Prerequisite: #48575