
[SPARK-43922][SQL] Add named parameter support in parser for function calls #41796

Closed · wants to merge 3 commits

Conversation

@learningchess2003 (Contributor) commented Jun 29, 2023

What changes were proposed in this pull request?

We add two new parser rules, namedArgumentExpression and functionArgument, to enable this feature. We also update AstBuilder so that it can detect whether each argument passed in a function call is named or positional.
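
As a rough illustration (a hypothetical Python sketch — the real implementation lives in the ANTLR grammar and the Scala AstBuilder), the named-versus-positional distinction the parser makes could be modeled like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionArgument:
    """A parsed function-call argument: positional when `name` is None."""
    value: str
    name: Optional[str] = None


def parse_argument(text: str) -> FunctionArgument:
    """Classify one argument as named (`name => value`) or positional.

    Hypothetical helper mirroring the grammar's namedArgumentExpression
    vs. plain expression alternatives; names and structure here are
    illustrative, not Spark's actual classes.
    """
    if "=>" in text:
        name, value = (part.strip() for part in text.split("=>", 1))
        return FunctionArgument(value=value, name=name)
    return FunctionArgument(value=text.strip())


# A mixed argument list like f(10, b => 'x'):
args = [parse_argument(a) for a in ["10", "b => 'x'"]]
```
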

Why are the changes needed?

This is part of a larger project to implement named parameter support for user-defined functions, built-in functions, and table-valued functions.

Does this PR introduce _any_ user-facing change?

Yes: users can now call functions with argument lists that contain named arguments, written as name => value.
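
Once the parser distinguishes named from positional arguments, a resolver can bind them to a function's declared parameters in any order. A minimal sketch of that binding logic in plain Python (hypothetical — Spark's actual binding happens later, during analysis):

```python
def bind_arguments(param_names, positional, named):
    """Bind positional then named arguments to declared parameters.

    Illustrative sketch of name-based binding, not Spark's implementation.
    Returns values in declared parameter order.
    """
    if len(positional) > len(param_names):
        raise ValueError("too many positional arguments")
    # Positional arguments fill the leading parameters in order.
    bound = dict(zip(param_names, positional))
    # Named arguments fill the rest, regardless of call-site order.
    for name, value in named.items():
        if name not in param_names:
            raise ValueError(f"unknown parameter: {name}")
        if name in bound:
            raise ValueError(f"duplicate value for parameter: {name}")
        bound[name] = value
    missing = [p for p in param_names if p not in bound]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return [bound[p] for p in param_names]
```

For example, `bind_arguments(["a", "b"], [], {"b": "x", "a": 10})` yields the values in declared order even though the call site named `b` first.
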

How was this patch tested?

We add tests in PlanParserSuite that verify the parsed plan is as intended.

@github-actions bot added the SQL label Jun 29, 2023
@learningchess2003 (Contributor, Author):

@MaxGekk Please let me know if the end-to-end tests were what you had in mind.

@dtenedor (Contributor):

> @MaxGekk Please let me know if the end-to-end tests were what you had in mind.

These seem reasonable to me for this PR that only covers parser changes so far.

@dtenedor (Contributor) left a comment:

LGTM since this was ported from #41429 which I already approved.

@MaxGekk (Member) left a comment:

Waiting for CI.

@MaxGekk changed the title [SPARK-43922] Add named parameter support in parser for function calls → [SPARK-43922][SQL] Add named parameter support in parser for function calls Jun 30, 2023
@MaxGekk (Member) commented Jun 30, 2023

+1, LGTM. Merging to master.
Thank you, @learningchess2003 and @dtenedor, for the review.

@MaxGekk closed this in 91c4581 Jun 30, 2023
ueshin added a commit that referenced this pull request Aug 14, 2023
### What changes were proposed in this pull request?

Supports named arguments in Python UDTF.

For example:

```py
>>> @udtf(returnType="a: int")
... class TestUDTF:
...     def eval(self, a, b):
...         yield a,
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> TestUDTF(a=lit(10), b=lit("x")).show()
+---+
|  a|
+---+
| 10|
+---+

>>> TestUDTF(b=lit("x"), a=lit(10)).show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x')").show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(b=>'x', a=>10)").show()
+---+
|  a|
+---+
| 10|
+---+
```

or:

```py
>>> @udtf
... class TestUDTF:
...     @staticmethod
...     def analyze(**kwargs: AnalyzeArgument) -> AnalyzeResult:
...         return AnalyzeResult(
...             StructType(
...                 [StructField(key, arg.data_type) for key, arg in sorted(kwargs.items())]
...             )
...         )
...     def eval(self, **kwargs):
...         yield tuple(value for _, value in sorted(kwargs.items()))
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x', x=>100.0)").show()
+---+---+-----+
|  a|  b|    x|
+---+---+-----+
| 10|  x|100.0|
+---+---+-----+

>>> spark.sql("SELECT * FROM test_udtf(x=>10, a=>'x', z=>100.0)").show()
+---+---+-----+
|  a|  x|    z|
+---+---+-----+
|  x| 10|100.0|
+---+---+-----+
```

### Why are the changes needed?

Now that named arguments are supported (#41796, #42020), they should be supported in Python UDTFs as well.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for Python UDTF.

### How was this patch tested?

Added related tests.

Closes #42422 from ueshin/issues/SPARK-44749/kwargs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
(duplicate of the commit message above)
ueshin added a commit that referenced this pull request Aug 24, 2023
…andas UDFs

### What changes were proposed in this pull request?

Supports named arguments in scalar Python/Pandas UDF.

For example:

```py
>>> @udf("int")
... def test_udf(a, b):
...     return a + 10 * b
...
>>> spark.udf.register("test_udf", test_udf)

>>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+

>>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
```

or:

```py
>>> @pandas_udf("int")
... def test_udf(a, b):
...     return a + 10 * b
...
>>> spark.udf.register("test_udf", test_udf)

>>> spark.range(2).select(test_udf(b=col("id") * 10, a=col("id"))).show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+

>>> spark.sql("SELECT test_udf(b => id * 10, a => id) FROM range(2)").show()
+---------------------------------+
|test_udf(b => (id * 10), a => id)|
+---------------------------------+
|                                0|
|                              101|
+---------------------------------+
```

### Why are the changes needed?

Now that named argument support was added (#41796, #42020), scalar Python/Pandas UDFs can support it as well.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for scalar Python/Pandas UDFs.

### How was this patch tested?

Added related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42617 from ueshin/issues/SPARK-44918/kwargs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
ueshin added a commit that referenced this pull request Sep 1, 2023
…s UDFs

### What changes were proposed in this pull request?

Supports named arguments in aggregate Pandas UDFs.

For example:

```py
>>> @pandas_udf("double")
... def weighted_mean(v: pd.Series, w: pd.Series) -> float:
...     import numpy as np
...     return np.average(v, weights=w)
...
>>> df = spark.createDataFrame(
...     [(1, 1.0, 1.0), (1, 2.0, 2.0), (2, 3.0, 1.0), (2, 5.0, 2.0), (2, 10.0, 3.0)],
...     ("id", "v", "w"))

>>> df.groupby("id").agg(weighted_mean(v=df["v"], w=df["w"])).show()
+---+-----------------------------+
| id|weighted_mean(v => v, w => w)|
+---+-----------------------------+
|  1|           1.6666666666666667|
|  2|            7.166666666666667|
+---+-----------------------------+

>>> df.groupby("id").agg(weighted_mean(w=df["w"], v=df["v"])).show()
+---+-----------------------------+
| id|weighted_mean(w => w, v => v)|
+---+-----------------------------+
|  1|           1.6666666666666667|
|  2|            7.166666666666667|
+---+-----------------------------+
```

or with window:

```py
>>> w = Window.partitionBy("id").orderBy("v").rowsBetween(-2, 1)

>>> df.withColumn("wm", weighted_mean(v=df.v, w=df.w).over(w)).show()
+---+----+---+------------------+
| id|   v|  w|                wm|
+---+----+---+------------------+
|  1| 1.0|1.0|1.6666666666666667|
|  1| 2.0|2.0|1.6666666666666667|
|  2| 3.0|1.0| 4.333333333333333|
|  2| 5.0|2.0| 7.166666666666667|
|  2|10.0|3.0| 7.166666666666667|
+---+----+---+------------------+

>>> df.withColumn("wm", weighted_mean(w=df.w, v=df.v).over(w)).show()
+---+----+---+------------------+
| id|   v|  w|                wm|
+---+----+---+------------------+
|  1| 1.0|1.0|1.6666666666666667|
|  1| 2.0|2.0|1.6666666666666667|
|  2| 3.0|1.0| 4.333333333333333|
|  2| 5.0|2.0| 7.166666666666667|
|  2|10.0|3.0| 7.166666666666667|
+---+----+---+------------------+
```

### Why are the changes needed?

Now that named argument support was added (#41796, #42020), aggregate Pandas UDFs can support it as well.

### Does this PR introduce _any_ user-facing change?

Yes, named arguments will be available for aggregate Pandas UDFs.

### How was this patch tested?

Added related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42663 from ueshin/issues/SPARK-44952/kwargs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Sep 1, 2023
(duplicate of the commit message above)
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
(duplicate of the commit message above)