Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] cleanup: use SortKey instead of mixed typing in aggregation #48697

Merged
merged 9 commits into from
Nov 14, 2024

Conversation

richardliaw
Copy link
Contributor

@richardliaw richardliaw commented Nov 12, 2024

Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the SortKey object, similar to how SortTaskSpec is implemented.

This is related to #42776 and #42142

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@richardliaw richardliaw added the go add ONLY when ready to merge, run all tests label Nov 12, 2024
@richardliaw richardliaw marked this pull request as ready for review November 12, 2024 16:36
Copy link
Contributor

@alexeykudinkin alexeykudinkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@richardliaw please add the test verifying we've addressed regression from #42142

Comment on lines 627 to 632
if sort_key is not None:
return tuple(r[k] for k in keys if k in r)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment that we leverage semantic of lexicographic ordering where missing cols, will yield a sequence that is "smaller" than the longer one (to sidestep the problem of comparing Nones to other types)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a refactor, we're not actually doing any None comparisons. In fact here the main thing is just improving the readability (previously we'd rely on the ordering of the names, which is really odd)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm specifically referring to the part where you're filtering non-existent values (if k in r)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought though, this filtering is incorrect -- if i have key as (A, B) where A has null value this will produce tuple as just (b) which is incorrect

@@ -22,7 +22,7 @@


def generate_aggregate_fn(
key: Optional[str],
key: Optional[Union[str, List[str]]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not making this API accept SortKey as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically there is no need for the aggregate function to take a sortkey; we just happen to use it as an implementation detail (our aggregations are sort-based).

python/ray/data/_internal/arrow_block.py Outdated Show resolved Hide resolved
python/ray/data/_internal/arrow_block.py Outdated Show resolved Hide resolved
python/ray/data/_internal/arrow_block.py Outdated Show resolved Hide resolved
python/ray/data/_internal/pandas_block.py Outdated Show resolved Hide resolved
python/ray/data/_internal/arrow_block.py Show resolved Hide resolved
@richardliaw
Copy link
Contributor Author

@alexeykudinkin this doesn't fully address regression from #42142, but I plan to do so in a followup PR

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@@ -502,7 +503,7 @@ def sort_and_partition(

return find_partitions(table, boundaries, sort_key)

def combine(self, key: Union[str, List[str]], aggs: Tuple["AggregateFn"]) -> Block:
def combine(self, sort_key: "SortKey", aggs: Tuple["AggregateFn"]) -> Block:
"""Combine rows with the same key into an accumulator.

This assumes the block is already sorted by key in ascending order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: docstring contains key instead of sort_key. same with other methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SortKey type is kind of a misnomer. It's just the key(s) on which we happen to do things like groupby, sort, join, windowing etc.

python/ray/data/_internal/arrow_block.py Show resolved Hide resolved
Comment on lines +217 to +220
@pytest.mark.parametrize("keys", ["A", ["A", "B"]])
def test_agg_inputs(ray_start_regular_shared, keys):
xs = list(range(100))
ds = ray.data.from_items([{"A": (x % 3), "B": x, "C": (x % 2)} for x in xs])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the test with None values (like in the original issue)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also make sure we cover this part #48697 (comment)

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw enabled auto-merge (squash) November 14, 2024 00:07
@richardliaw richardliaw merged commit 510686f into ray-project:master Nov 14, 2024
6 checks passed
@richardliaw richardliaw deleted the sortkey-update branch November 14, 2024 00:07
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…ay-project#48697)

## Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the
SortKey object, similar to how SortTaskSpec is implemented.


## Related issue number

This is related to ray-project#42776 and
ray-project#42142


Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…ay-project#48697)

## Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the
SortKey object, similar to how SortTaskSpec is implemented.

## Related issue number

This is related to ray-project#42776 and
ray-project#42142

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants