[SPARK-44749][SQL][PYTHON] Support named arguments in Python UDTF #42422

ueshin · 2023-08-09T23:52:14Z

What changes were proposed in this pull request?

Supports named arguments in Python UDTF.

For example:

>>> @udtf(returnType="a: int")
... class TestUDTF:
...     def eval(self, a, b):
...         yield a,
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> TestUDTF(a=lit(10), b=lit("x")).show()
+---+
|  a|
+---+
| 10|
+---+

>>> TestUDTF(b=lit("x"), a=lit(10)).show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x')").show()
+---+
|  a|
+---+
| 10|
+---+

>>> spark.sql("SELECT * FROM test_udtf(b=>'x', a=>10)").show()
+---+
|  a|
+---+
| 10|
+---+

or:

>>> @udtf
... class TestUDTF:
...     @staticmethod
...     def analyze(**kwargs: AnalyzeArgument) -> AnalyzeResult:
...         return AnalyzeResult(
...             StructType(
...                 [StructField(key, arg.data_type) for key, arg in sorted(kwargs.items())]
...             )
...         )
...     def eval(self, **kwargs):
...         yield tuple(value for _, value in sorted(kwargs.items()))
...
>>> spark.udtf.register("test_udtf", TestUDTF)

>>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x', x=>100.0)").show()
+---+---+-----+
|  a|  b|    x|
+---+---+-----+
| 10|  x|100.0|
+---+---+-----+

>>> spark.sql("SELECT * FROM test_udtf(x=>10, a=>'x', z=>100.0)").show()
+---+---+-----+
|  a|  x|    z|
+---+---+-----+
|  x| 10|100.0|
+---+---+-----+
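
The column order in the second example comes from sorting the keyword names in both `analyze` and `eval`. Here is a minimal plain-Python sketch of that behavior with no Spark dependency. `SortedKwargsUDTF` is a hypothetical stand-in, and the type labels passed to `analyze` are placeholder strings rather than `AnalyzeArgument` objects or the types Spark would infer.

```python
from typing import Any, Iterator, List, Tuple


class SortedKwargsUDTF:
    """Plain-Python stand-in for the second TestUDTF above (illustration only)."""

    @staticmethod
    def analyze(**kwargs: str) -> List[Tuple[str, str]]:
        # Simplification: the "schema" is a list of (column name, type label)
        # pairs, ordered by argument name like the StructType in the example.
        return [(key, type_label) for key, type_label in sorted(kwargs.items())]

    def eval(self, **kwargs: Any) -> Iterator[Tuple[Any, ...]]:
        # Same body as the example: values are emitted in argument-name order.
        yield tuple(value for _, value in sorted(kwargs.items()))


# Arguments are passed as x, a, z but come out ordered as a, x, z.
schema = SortedKwargsUDTF.analyze(x="int", a="string", z="number")
rows = list(SortedKwargsUDTF().eval(x=10, a="x", z=100.0))
print([name for name, _ in schema])
print(rows)
```

Because both methods sort by name, the output columns line up with the values regardless of the order in which the caller writes the named arguments.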

Why are the changes needed?

Now that named arguments are supported (#41796, #42020), they should also be supported in Python UDTFs.

Does this PR introduce any user-facing change?

Yes, named arguments will be available for Python UDTF.

How was this patch tested?

Added related tests.

allisonwang-db · 2023-08-10T00:37:35Z

python/pyspark/sql/tests/test_udtf.py

+        for i, df in enumerate(
+            [
+                self.spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x')"),
+                self.spark.sql("SELECT * FROM test_udtf(b=>'x', a=>10)"),


What would the error message be if named arguments are used incorrectly? For example:

duplicated input argument names: a => 10, a => 10

non-existing argument name: c => 10

incorrect combination of positional and named arguments: test_udtf(a => 10, 'x')

I am afraid that if we directly leverage Python's kwargs, the error messages wouldn't be as user-friendly as the SQL function ones.

That's a good point. So far this just relies on Python's own errors.

@dtenedor What are the error messages like when named arguments are applied to other functions in the above cases? Are there any examples we can follow here?

Yeah I believe @learningchess2003 added these checks in [1]. They are currently in the FunctionBuilderBase.scala file in [2]. If we want to reuse those checks, we could be consistent between error messages for Python UDTFs and other Spark functions.

[1] #42020
[2] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/FunctionBuilderBase.scala#L107-L128

Updated to raise the following errors:

duplicated input argument names: a => 10, a => 10

It will be checked in the analysis phase and an error with the error class DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.DOUBLE_NAMED_ARGUMENT_REFERENCE will be raised.

non-existing argument name: c => 10

It will be handled in the Python runtime, and an error like the following will be raised:

...PySparkRuntimeError: [UDTF_EXEC_ERROR] User defined table function encountered an error in the 'eval' method: eval() got an unexpected keyword argument 'c'

incorrect combination of positional and named arguments: test_udtf(a => 10, 'x')

It will be checked in the analysis phase and an error with the error class UNEXPECTED_POSITIONAL_ARGUMENT will be raised.
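
The three cases above can be sketched in plain Python. This is only an illustration of where each error surfaces, not Spark's implementation. `check_named_arguments`, `call_eval`, and `ExampleUDTF` are hypothetical names, and the sketch assumes the error class names quoted above are simply carried in a `ValueError` message.

```python
from typing import Any, List, Optional, Tuple

# (name, value); name is None for a positional argument.
Argument = Tuple[Optional[str], Any]


def check_named_arguments(args: List[Argument]) -> None:
    """Hypothetical stand-in for the analysis-phase checks."""
    seen = set()
    named_seen = False
    for name, _ in args:
        if name is None:
            if named_seen:
                # A positional argument after a named one, e.g. (a => 10, 'x').
                raise ValueError("UNEXPECTED_POSITIONAL_ARGUMENT")
            continue
        named_seen = True
        if name in seen:
            # The same name assigned twice, e.g. (a => 10, a => 10).
            raise ValueError(
                "DUPLICATE_ROUTINE_PARAMETER_ASSIGNMENT.DOUBLE_NAMED_ARGUMENT_REFERENCE"
            )
        seen.add(name)


class ExampleUDTF:
    def eval(self, a, b):
        yield a,


def call_eval(args: List[Argument]) -> list:
    check_named_arguments(args)
    positional = [value for name, value in args if name is None]
    named = {name: value for name, value in args if name is not None}
    # An unknown name such as c is not caught above; Python itself raises
    # TypeError when the keyword is bound to eval's signature.
    return list(ExampleUDTF().eval(*positional, **named))
```

In this sketch the first two mistakes are rejected before `eval` is ever called, while an unknown argument name only fails once Python tries to bind it, which matches the split between analysis-phase and runtime errors described above.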

dtenedor

The approach looks good! Besides @allisonwang-db's suggestion, I just have a couple comments.

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonUDTFRunner.scala

python/pyspark/worker.py

dtenedor · 2023-08-11T21:01:09Z

python/pyspark/sql/tests/test_udtf.py

+        for i, df in enumerate(
+            [
+                self.spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x')"),
+                self.spark.sql("SELECT * FROM test_udtf(b=>'x', a=>10)"),


ueshin · 2023-08-14T15:56:34Z

Let me merge this now to unblock the issue #42385 (comment).

ueshin · 2023-08-14T15:56:51Z

Thanks! merging to master.

allisonwang-db

Late LGTM! Left a few comments.

python/pyspark/sql/functions.py

allisonwang-db · 2023-08-14T17:57:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala

+      try {
+        bufferStream.close()
+      } finally {
+        if (!releasedOrClosed) {
+          // An error happened. Force to close the worker.
+          env.destroyPythonWorker(pythonExec, workerModule, envVars.asScala.toMap, worker)
+        }


Just curious, why do we need to change this part?

#42385 changed this code to use bufferStream = new DirectByteBufferOutputStream(), but the stream was never closed.
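
The cleanup pattern in the Scala diff above can be sketched in Python as follows. This is a hedged analogue, not the Spark code: `FakeWorkerPool`, `destroy_worker`, and `finish` are hypothetical names, and `io.BytesIO` stands in for the buffer stream.

```python
import io
from typing import List


class FakeWorkerPool:
    """Hypothetical stand-in for the environment that owns Python workers."""

    def __init__(self) -> None:
        self.destroyed: List[str] = []

    def destroy_worker(self, worker: str) -> None:
        self.destroyed.append(worker)


def finish(buffer_stream: io.BytesIO, pool: FakeWorkerPool,
           worker: str, released_or_closed: bool) -> None:
    try:
        # Always close the buffer so it is not leaked.
        buffer_stream.close()
    finally:
        if not released_or_closed:
            # An error happened earlier, so force the worker to close
            # instead of returning it for reuse.
            pool.destroy_worker(worker)


pool = FakeWorkerPool()
stream = io.BytesIO()
finish(stream, pool, "worker-1", released_or_closed=False)
print(stream.closed, pool.destroyed)
```

Putting the worker cleanup in `finally` means it still runs even if closing the stream itself raises.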

python/pyspark/sql/tests/test_udtf.py

ueshin · 2023-08-14T19:16:00Z

I submitted another PR, #42490, to address the above comments.

…ents in Python UDTF

### What changes were proposed in this pull request?

This is a follow-up of #42422. Adds more tests for named arguments in Python UDTF.

### Why are the changes needed?

There are more cases to test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added related tests.

Closes #42490 from ueshin/issues/SPARK-44749/tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>


Support named arguments in Python UDTF.

cfa7de4

github-actions bot added SQL CORE PYTHON labels Aug 9, 2023

allisonwang-db reviewed Aug 10, 2023

dtenedor reviewed Aug 10, 2023

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonUDTFRunner.scala

python/pyspark/worker.py

ueshin added 3 commits August 10, 2023 14:37

Fix.

df5c7fa

Fix.

cdc1452

Fix.

b9eab18

github-actions bot added the CONNECT label Aug 11, 2023

ueshin added 2 commits August 11, 2023 12:17

Fix.

442c934

Fix.

bf79e46

dtenedor approved these changes Aug 11, 2023

ueshin marked this pull request as ready for review August 11, 2023 21:48

ueshin requested review from allisonwang-db and dtenedor August 11, 2023 21:58

dtenedor approved these changes Aug 11, 2023

ueshin added 5 commits August 11, 2023 20:04

Merge branch 'master' into issues/SPARK-44749/kwargs

f1f8594

Fix.

8700ab5

Fix.

f10c90c

Fix.

587a970

Fix.

eb4a2dd

ueshin mentioned this pull request Aug 14, 2023

[SPARK-44705][PYTHON] Make PythonRunner single-threaded #42385

Closed

ueshin closed this in d462956 Aug 14, 2023

allisonwang-db reviewed Aug 14, 2023

ueshin mentioned this pull request Aug 14, 2023

[SPARK-44749][PYTHON][FOLLOWUP][TESTS] Add more tests for named arguments in Python UDTF #42490

Closed
