[SPARK-43922][SQL] Add named parameter support in parser for function calls #41796
Conversation
@MaxGekk Please let me know if the end-to-end tests were what you had in mind.
These seem reasonable to me for this PR that only covers parser changes so far.
LGTM since this was ported from #41429 which I already approved.
Waiting for CI.
+1, LGTM. Merging to master.
### What changes were proposed in this pull request?

Supports named arguments in Python UDTFs. For example:

```py
>>> @udtf(returnType="a: int")
... class TestUDTF:
...     def eval(self, a, b):
...         yield a,
...
>>> spark.udtf.register("test_udtf", TestUDTF)
>>> TestUDTF(a=lit(10), b=lit("x")).show()
+---+
|  a|
+---+
| 10|
+---+
>>> TestUDTF(b=lit("x"), a=lit(10)).show()
+---+
|  a|
+---+
| 10|
+---+
>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x')").show()
+---+
|  a|
+---+
| 10|
+---+
>>> spark.sql("SELECT * FROM test_udtf(b=>'x', a=>10)").show()
+---+
|  a|
+---+
| 10|
+---+
```

or:

```py
>>> @udtf
... class TestUDTF:
...     @staticmethod
...     def analyze(**kwargs: AnalyzeArgument) -> AnalyzeResult:
...         return AnalyzeResult(
...             StructType(
...                 [StructField(key, arg.data_type) for key, arg in sorted(kwargs.items())]
...             )
...         )
...     def eval(self, **kwargs):
...         yield tuple(value for _, value in sorted(kwargs.items()))
...
>>> spark.udtf.register("test_udtf", TestUDTF)
>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x', x=>100.0)").show()
+---+---+-----+
|  a|  b|    x|
+---+---+-----+
| 10|  x|100.0|
+---+---+-----+
>>> spark.sql("SELECT * FROM test_udtf(x=>10, a=>'x', z=>100.0)").show()
+---+---+-----+
|  a|  x|    z|
+---+---+-----+
|  x| 10|100.0|
+---+---+-----+
```

### Why are the changes needed?

Now that named arguments are supported (#41796, #42020), they should also be supported in Python UDTFs.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for Python UDTFs.

### How was this patch tested?

Added related tests.

Closes #42422 from ueshin/issues/SPARK-44749/kwargs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
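The sorted-kwargs pattern in the second example can be exercised outside Spark; the sketch below is just the plain-Python core of the class above (no Spark required), showing why the order of named arguments at the call site does not matter:

```python
# Plain-Python sketch of the dynamic-kwargs UDTF pattern: eval accepts
# arbitrary named arguments and yields their values in a deterministic,
# sorted-by-name order, regardless of the order they were passed in.
class TestUDTF:
    def eval(self, **kwargs):
        yield tuple(value for _, value in sorted(kwargs.items()))

rows = list(TestUDTF().eval(x=10, a="x", z=100.0))
print(rows)  # sorted keys a, x, z -> [('x', 10, 100.0)]
```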
…andas UDFs

### What changes were proposed in this pull request?

Supports named arguments in scalar Python/Pandas UDFs. For example:

```py
>>> @udf("int")
... def test_udf(a, b):
...     return a + 10 * b
...
>>> spark.udf.register("test_udf", test_udf)
>>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
>>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
```

or:

```py
>>> @pandas_udf("int")
... def test_udf(a, b):
...     return a + 10 * b
...
>>> spark.udf.register("test_udf", test_udf)
>>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
>>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
```

### Why are the changes needed?

Now that named argument support has been added (#41796, #42020), scalar Python/Pandas UDFs can support it.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for scalar Python/Pandas UDFs.

### How was this patch tested?

Added related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42617 from ueshin/issues/SPARK-44918/kwargs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
…s UDFs

### What changes were proposed in this pull request?

Supports named arguments in aggregate Pandas UDFs. For example:

```py
>>> @pandas_udf("double")
... def weighted_mean(v: pd.Series, w: pd.Series) -> float:
...     import numpy as np
...     return np.average(v, weights=w)
...
>>> df = spark.createDataFrame(
...     [(1, 1.0, 1.0), (1, 2.0, 2.0), (2, 3.0, 1.0), (2, 5.0, 2.0), (2, 10.0, 3.0)],
...     ("id", "v", "w"))
>>> df.groupby("id").agg(weighted_mean(v=df["v"], w=df["w"])).show()
+---+-----------------------------+
| id|weighted_mean(v => v, w => w)|
+---+-----------------------------+
|  1|           1.6666666666666667|
|  2|            7.166666666666667|
+---+-----------------------------+
>>> df.groupby("id").agg(weighted_mean(w=df["w"], v=df["v"])).show()
+---+-----------------------------+
| id|weighted_mean(w => w, v => v)|
+---+-----------------------------+
|  1|           1.6666666666666667|
|  2|            7.166666666666667|
+---+-----------------------------+
```

or with window:

```py
>>> w = Window.partitionBy("id").orderBy("v").rowsBetween(-2, 1)
>>> df.withColumn("wm", weighted_mean(v=df.v, w=df.w).over(w)).show()
+---+----+---+------------------+
| id|   v|  w|                wm|
+---+----+---+------------------+
|  1| 1.0|1.0|1.6666666666666667|
|  1| 2.0|2.0|1.6666666666666667|
|  2| 3.0|1.0| 4.333333333333333|
|  2| 5.0|2.0| 7.166666666666667|
|  2|10.0|3.0| 7.166666666666667|
+---+----+---+------------------+
>>> df.withColumn("wm", weighted_mean(w=df.w, v=df.v).over(w)).show()
+---+----+---+------------------+
| id|   v|  w|                wm|
+---+----+---+------------------+
|  1| 1.0|1.0|1.6666666666666667|
|  1| 2.0|2.0|1.6666666666666667|
|  2| 3.0|1.0| 4.333333333333333|
|  2| 5.0|2.0| 7.166666666666667|
|  2|10.0|3.0| 7.166666666666667|
+---+----+---+------------------+
```

### Why are the changes needed?

Now that named argument support has been added (#41796, #42020), aggregate Pandas UDFs can support it.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for aggregate Pandas UDFs.

### How was this patch tested?

Added related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42663 from ueshin/issues/SPARK-44952/kwargs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
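The reordering behavior above (named arguments may be supplied in any order) can be illustrated in plain Python with the standard library's `inspect.signature`; `bind_named_args` is a hypothetical helper for illustration only, not part of Spark's API:

```python
import inspect

def bind_named_args(func, *args, **kwargs):
    # Hypothetical helper: bind positional and named arguments against the
    # function's signature and return the values in declaration order,
    # mirroring how a named-argument call site is normalized.
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    return tuple(bound.arguments.values())

def weighted_mean(v, w):
    return sum(x * y for x, y in zip(v, w)) / sum(w)

# Named arguments in either order bind to the same parameters (v, w).
assert bind_named_args(weighted_mean, w=[1.0, 2.0], v=[3.0, 5.0]) == \
    bind_named_args(weighted_mean, v=[3.0, 5.0], w=[1.0, 2.0])
```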
### What changes were proposed in this pull request?
We plan on adding two new grammar rules, `namedArgumentExpression` and `functionArgument`, which enable this feature. While parsing this syntax, we also make changes to `AstBuilder` so that it can detect whether each argument passed is a named argument or a positional one.

### Why are the changes needed?
This is part of a larger project to implement named parameter support for user defined functions, built-in functions, and table valued functions.
### Does this PR introduce _any_ user-facing change?
Yes, the user would be able to call functions with argument lists that contain named arguments.
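As a toy illustration only (not Spark's actual ANTLR grammar or `AstBuilder` logic), the distinction the parser draws between positional and named arguments in a call's argument list can be sketched in Python:

```python
import re

# Toy sketch: a named argument has the shape `identifier => expression`;
# anything else in the argument list is treated as positional. Splitting on
# commas is a simplification that breaks on nested expressions containing
# commas, which a real parser handles via its grammar.
NAMED_ARG = re.compile(r"^\s*([A-Za-z_][A-Za-z_0-9]*)\s*=>\s*(.+?)\s*$")

def classify_arguments(arg_list: str):
    args = []
    for raw in arg_list.split(","):
        m = NAMED_ARG.match(raw)
        if m:
            args.append(("named", m.group(1), m.group(2)))
        else:
            args.append(("positional", None, raw.strip()))
    return args

print(classify_arguments("10, b => 'x'"))
# -> [('positional', None, '10'), ('named', 'b', "'x'")]
```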
### How was this patch tested?
We add tests in `PlanParserSuite` that verify the parsed plan is as intended.