-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data] cleanup: use SortKey instead of mixed typing in aggregation #48697
Conversation
a8eebb6
to
11f7e7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@richardliaw please add the test verifying we've addressed regression from #42142
if sort_key is not None: | ||
return tuple(r[k] for k in keys if k in r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment that we leverage semantic of lexicographic ordering where missing cols, will yield a sequence that is "smaller" than the longer one (to sidestep the problem of comparing Nones to other types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a refactor, we're not actually doing any None comparisons. In fact here the main thing is just improving the readability (previously we'd rely on the ordering of the names, which is really odd)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm specifically referring to the part where you're filtering non-existent values (if k in r
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a second thought though, this filtering is incorrect -- if i have key as (A, B) where A has null value this will produce tuple as just (b)
which is incorrect
@@ -22,7 +22,7 @@ | |||
|
|||
|
|||
def generate_aggregate_fn( | |||
key: Optional[str], | |||
key: Optional[Union[str, List[str]]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not making this API accept SortKey as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically there is no need for the aggregate function to take a sortkey; we just happen to use it as an implementation detail (our aggregations are sort-based).
@alexeykudinkin this doesn't fully address regression from #42142, but I plan to do so in a followup PR |
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
e71ec38
to
6fa628a
Compare
@@ -502,7 +503,7 @@ def sort_and_partition( | |||
|
|||
return find_partitions(table, boundaries, sort_key) | |||
|
|||
def combine(self, key: Union[str, List[str]], aggs: Tuple["AggregateFn"]) -> Block: | |||
def combine(self, sort_key: "SortKey", aggs: Tuple["AggregateFn"]) -> Block: | |||
"""Combine rows with the same key into an accumulator. | |||
|
|||
This assumes the block is already sorted by key in ascending order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: docstring contains key
instead of sort_key
. same with other methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SortKey type is kind of a misnomer. It's just the key(s) on which we happen to do things like groupby, sort, join, windowing etc.
@pytest.mark.parametrize("keys", ["A", ["A", "B"]]) | ||
def test_agg_inputs(ray_start_regular_shared, keys): | ||
xs = list(range(100)) | ||
ds = ray.data.from_items([{"A": (x % 3), "B": x, "C": (x % 2)} for x in xs]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the test with None values (like in the original issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also make sure we cover this part #48697 (comment)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
…ay-project#48697) ## Why are these changes needed? This makes SortAggregate more consistent by unifying the API on the SortKey object, similar to how SortTaskSpec is implemented. ## Related issue number This is related to ray-project#42776 and ray-project#42142 Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
…ay-project#48697) ## Why are these changes needed? This makes SortAggregate more consistent by unifying the API on the SortKey object, similar to how SortTaskSpec is implemented. ## Related issue number This is related to ray-project#42776 and ray-project#42142 Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>
Why are these changes needed?
This makes SortAggregate more consistent by unifying the API on the SortKey object, similar to how SortTaskSpec is implemented.
This is related to #42776 and #42142
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.