-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-44952][SQL][PYTHON] Support named arguments in aggregate Pandas UDFs #42663
Conversation
case e: NamedArgumentExpression => | ||
// For NamedArgumentExpression, we extract the value and replace it with | ||
// an AttributeReference (with an internal column name, e.g. "_w0"). | ||
NamedArgumentExpression( | ||
e.key, | ||
extractedExprMap.getOrElseUpdate(e.canonicalized, | ||
Alias(e.value, s"_w${extractedExprMap.size}")()).toAttribute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dtenedor Does this change make sense when the named arguments are specified to window functions?
This is trying to keep the named arguments in the window functions side; otherwise, the named arguments will go to Project
that will be generated here and it can't be processed because NamedArgumentExpression
is marked as Unevaluable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, the reason the NamedArgumentExpression
is marked as Unevaluable
is because the intention is for the analyzer to match the provided argument (name, value) pairs for a function call and compare them against the expected ordered parameter list (including parameter names and types) of the function signature.
By the end of analysis, these expressions should be gone as a result of rearranging the provided function arguments to match the expected order, if necessary.
We might just want to deduplicate the logic from L3034-3035 into one place that we can reuse here, but the general idea looks right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this implementation LGTM. I tried to think of test cases, I left a couple more ideas in there, but this looks good to go once we add them.
with self.subTest(query_no=i): | ||
assertDataFrameEqual(aggregated, df.groupby("id").agg(mean(df.v).alias("wm"))) | ||
|
||
def test_named_arguments_negative(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please also port these negative tests to the 'kwargs' Pandas UDF case as well? It would be good to make sure we do the same checks there. Same for the window functions testing.
"DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.DOUBLE_NAMED_ARGUMENT_REFERENCE", | ||
): | ||
self.spark.sql( | ||
"SELECT id, weighted_mean(v => v, v => w) as wm FROM v GROUP BY id" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for this test! In addition could we add a negative test case where we provide a positional argument matching the first parameter of the function, then a named argument with the same name as that first parameter? It should return an error in that case. Same for the window functions testing.
Thanks! merging to master. |
What changes were proposed in this pull request?
Supports named arguments in aggregate Pandas UDFs.
For example:
or with window:
Why are the changes needed?
Now that named arguments support was added (#41796, #42020).
Aggregate Pandas UDFs can support it.
Does this PR introduce any user-facing change?
Yes, named arguments will be available for aggregate Pandas UDFs.
How was this patch tested?
Added related tests.
Was this patch authored or co-authored using generative AI tooling?
No.