[data] cleanup: use SortKey instead of mixed typing in aggregation #48697

richardliaw · 2024-11-12T04:07:56Z

Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the SortKey object, similar to how SortTaskSpec is implemented.

This is related to #42776 and #42142

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

alexeykudinkin

LGTM

@richardliaw please add a test verifying that we've addressed the regression from #42142

alexeykudinkin · 2024-11-12T18:39:58Z

python/ray/data/_internal/arrow_block.py

+            if sort_key is not None:
+                return tuple(r[k] for k in keys if k in r)


Please add a comment explaining that we leverage the semantics of lexicographic ordering: when columns are missing, the resulting shorter sequence compares as "smaller" than a longer one (which sidesteps the problem of comparing None to other types)

This is just a refactor, we're not actually doing any None comparisons. In fact here the main thing is just improving the readability (previously we'd rely on the ordering of the names, which is really odd)

I'm specifically referring to the part where you're filtering non-existent values (if k in r)

On second thought, though, this filtering is incorrect: if I have the key (A, B) where A has a null value, this will produce a tuple of just (B), which is incorrect
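To make the concern in this thread concrete, here is a minimal sketch in plain Python. It assumes rows behave like dicts and that a key column can be absent from a row; the helper names `filtered_key` and `strict_key` are hypothetical and are not Ray APIs. It shows that a shorter tuple does sort before a longer one with the same prefix, but also that dropping absent columns can make two different rows collapse onto the same key.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def filtered_key(row: Row, keys: List[str]) -> Tuple[Any, ...]:
    # Mirrors the snippet under review: silently skip columns absent from the row.
    return tuple(row[k] for k in keys if k in row)


def strict_key(row: Row, keys: List[str]) -> Tuple[Any, ...]:
    # Hypothetical alternative: fail loudly (KeyError) when a key column is absent.
    return tuple(row[k] for k in keys)


keys = ["A", "B"]
full = {"A": 1, "B": 2}
only_a = {"A": 1}
only_b = {"B": 1}

# Lexicographic ordering: a shorter tuple sorts before a longer one sharing its prefix.
assert filtered_key(only_a, keys) < filtered_key(full, keys)

# The problem: position is lost, so a row with only A=1 and a row with only B=1
# produce the same key even though they differ.
assert filtered_key(only_a, keys) == filtered_key(only_b, keys) == (1,)
```

For a plain dict, `k in row` only tests whether the column exists; a column that is present with a None value is still included in the tuple.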

alexeykudinkin · 2024-11-12T18:42:40Z

python/ray/data/_internal/planner/aggregate.py

@@ -22,7 +22,7 @@


 def generate_aggregate_fn(
-    key: Optional[str],
+    key: Optional[Union[str, List[str]]],


Why not make this API accept SortKey as well?

Technically there is no need for the aggregate function to take a sortkey; we just happen to use it as an implementation detail (our aggregations are sort-based).
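One way to reconcile the two positions is to keep the public signature as plain column names and convert to a key object internally. The following is a minimal sketch only: `SimpleSortKey` and `to_sort_key` are hypothetical stand-ins introduced for illustration and do not reflect Ray's actual SortKey class or its constructor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class SimpleSortKey:
    # Hypothetical stand-in for a sort-key object; not Ray's SortKey class.
    columns: Tuple[str, ...]
    descending: bool = False


def to_sort_key(key: Optional[Union[str, List[str]]]) -> SimpleSortKey:
    # Accept None (no grouping key), a single column name, or a list of names,
    # and normalize all three into one internal representation.
    if key is None:
        columns: Tuple[str, ...] = ()
    elif isinstance(key, str):
        columns = (key,)
    else:
        columns = tuple(key)
    return SimpleSortKey(columns=columns)
```

With this shape, callers keep passing `"A"` or `["A", "B"]`, while downstream code only ever handles one type.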

richardliaw · 2024-11-12T22:40:19Z

@alexeykudinkin this doesn't fully address the regression from #42142, but I plan to do so in a follow-up PR

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

scottjlee · 2024-11-13T18:24:35Z

python/ray/data/_internal/arrow_block.py

@@ -502,7 +503,7 @@ def sort_and_partition(

        return find_partitions(table, boundaries, sort_key)

-    def combine(self, key: Union[str, List[str]], aggs: Tuple["AggregateFn"]) -> Block:
+    def combine(self, sort_key: "SortKey", aggs: Tuple["AggregateFn"]) -> Block:
        """Combine rows with the same key into an accumulator.

        This assumes the block is already sorted by key in ascending order.


nit: the docstring still refers to key instead of sort_key. Same with the other methods.

updated, thanks!

The SortKey type is kind of a misnomer. It's just the key(s) on which we happen to do things like groupby, sort, join, windowing, etc.
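The `combine` docstring in this thread relies on the block being sorted by key, so rows with equal keys are adjacent. A minimal sketch of that sort-based idea in plain Python follows; `combine_sorted` is a hypothetical helper operating on a list of dicts, not the Arrow or pandas block implementation, and it hard-codes a sum as the aggregation.

```python
from itertools import groupby
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def combine_sorted(rows: Iterable[Row], keys: List[str], value_col: str) -> List[Row]:
    # Assumes `rows` is already sorted by `keys` in ascending order, so equal
    # keys are adjacent and a single pass with groupby is enough.
    def key_fn(row: Row) -> Tuple[Any, ...]:
        return tuple(row[k] for k in keys)

    out = []
    for key, group in groupby(rows, key=key_fn):
        acc = 0  # accumulator for a sum aggregation
        for row in group:
            acc += row[value_col]
        out.append({**dict(zip(keys, key)), f"sum({value_col})": acc})
    return out


rows = [{"A": x % 3, "B": x} for x in range(6)]
rows.sort(key=lambda r: r["A"])  # establish the sortedness precondition
result = combine_sorted(rows, ["A"], "B")
```

If the input were not sorted, `groupby` would emit the same key more than once, which is why the precondition matters.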

alexeykudinkin · 2024-11-13T18:30:39Z

python/ray/data/tests/test_all_to_all.py

+@pytest.mark.parametrize("keys", ["A", ["A", "B"]])
+def test_agg_inputs(ray_start_regular_shared, keys):
+    xs = list(range(100))
+    ds = ray.data.from_items([{"A": (x % 3), "B": x, "C": (x % 2)} for x in xs])


Please add a test with None values (like in the original issue)

Let's also make sure we cover this part #48697 (comment)
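For context on why None values deserve their own test: Python 3 refuses to order None against other values. The sketch below is independent of Ray and only illustrates the failure mode plus one common workaround; `none_safe_key` is a hypothetical helper, and nothing here is claimed to be what this PR or its follow-up does.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def none_safe_key(row: Row, keys: List[str]) -> Tuple[Tuple[bool, Any], ...]:
    # Pair each value with an "is not None" flag. The flag decides the order
    # whenever exactly one side is None, so None is never compared with `<`.
    return tuple((row[k] is not None, row[k]) for k in keys)


rows = [{"A": 2}, {"A": None}, {"A": 1}]

# Sorting on the raw values raises TypeError because None < 2 is undefined.
try:
    sorted(rows, key=lambda r: r["A"])
    naive_sort_failed = False
except TypeError:
    naive_sort_failed = True

# With the flag in front, None values sort first and the rest keep their order.
ordered = sorted(rows, key=lambda r: none_safe_key(r, ["A"]))
```

Unlike filtering out missing values, this keeps every key position in place, so multi-column keys stay aligned.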

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw added the go add ONLY when ready to merge, run all tests label Nov 12, 2024

richardliaw force-pushed the sortkey-update branch from a8eebb6 to 11f7e7e Compare November 12, 2024 07:15

richardliaw marked this pull request as ready for review November 12, 2024 16:36

richardliaw requested review from scottjlee, bveeramani, raulchen, stephanie-wang, omatthew98, alexeykudinkin and srinathk10 as code owners November 12, 2024 16:36

alexeykudinkin reviewed Nov 12, 2024

scottjlee reviewed Nov 12, 2024

scottjlee reviewed Nov 12, 2024

richardliaw added 3 commits November 12, 2024 17:37

Sort-key Update

a3301e6

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

address comments

d5597f1

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

rename-more-intuitive

6fa628a

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw force-pushed the sortkey-update branch from e71ec38 to 6fa628a Compare November 13, 2024 01:37

richardliaw added 4 commits November 12, 2024 18:09

lint

dd356cf

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

fix

87dc7c5

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

typing

eeb8e79

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

remove-tuple

f1e3774

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

scottjlee approved these changes Nov 13, 2024

alexeykudinkin reviewed Nov 13, 2024

richardliaw added 2 commits November 13, 2024 14:34

update-docs

6fc6763

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

fix-fail-loudly

54417ba

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

alexeykudinkin approved these changes Nov 13, 2024

richardliaw enabled auto-merge (squash) November 14, 2024 00:07

richardliaw merged commit 510686f into ray-project:master Nov 14, 2024
6 checks passed

richardliaw deleted the sortkey-update branch November 14, 2024 00:07

srinathk10 Nov 14, 2024
