[SPARK-27658][SQL] Add FunctionCatalog API #24559
Conversation
@jzhuge, @mccheah, @cloud-fan, and @marmbrus, this PR adds a FunctionCatalog API for catalog plugins to expose functions to Spark.
* @param input an input row
* @return a result value
*/
R produceResult(InternalRow input);
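For readers following the thread, here is a minimal sketch of what a user-defined scalar function could look like against this interface. Only produceResult(InternalRow) comes from the snippet above; the metadata methods (inputTypes, resultType, name), the class name, and the package are assumptions based on later snippets and the PR description, not a final form of the API.

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

// Hypothetical example: adds two int arguments. Spark is assumed to pass the
// arguments packed into an InternalRow, as discussed below.
public class IntAdd implements ScalarFunction<Integer> {
  @Override
  public DataType[] inputTypes() {
    return new DataType[] { DataTypes.IntegerType, DataTypes.IntegerType };
  }

  @Override
  public DataType resultType() {
    return DataTypes.IntegerType;
  }

  @Override
  public String name() {
    return "int_add";
  }

  @Override
  public Integer produceResult(InternalRow input) {
    // The two arguments arrive as columns 0 and 1 of the input row.
    return input.getInt(0) + input.getInt(1);
  }
}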
A UDF doesn't take an entire row as its input, but some columns, e.g. SELECT substring(strCol, 3).
This assumes that Spark will create an InternalRow to pass to the function. That's the easiest way to pass an arbitrary number of arguments that correspond to a struct schema.
Note that this doesn't need to be expensive. We can build an InternalRow that exposes a projection of another and that can be reused for all of the UDF calls.
Think about a UDF that adds 2 ints.
val row = InternalRow(i, j)
udf.call(row)
// inside udf.call
return row.getInt(0) + row.getInt(1)
is much slower than
udf.call(i, j)
// inside udf.call
return i + j
We need to think about the tradeoffs and pick between perf and ease-of-use.
I don't think that's a relevant comparison. Clearly, it's a bad idea to copy data into a new InternalRow to pass it into a UDF. But InternalRow is an interface, so we can change how it works. We have an InternalRow that exposes data from a ColumnarBatch and one that joins partition values; we could similarly have an InternalRow that wraps another InternalRow for this access.
class ProjectingRow(wrappedPositions: Array[Int]) extends InternalRow {
  var wrapped: InternalRow = null
  def set(row: InternalRow): Unit = this.wrapped = row
  def getInt(pos: Int): Int = wrapped.getInt(wrappedPositions(pos))
  ...
}
Then each UDF call becomes:
udfRow.set(inputRow)
val result = udf.call(udfRow)
And call could be implemented as you'd expect:
public int call(InternalRow row) {
  return row.getInt(0) + row.getInt(1);
}
I think that the overhead of set is much better than using reflection or object inspectors like Hive.
I think we need a design doc for the UDF API. We need to think about ease-of-use and performance.
@cloud-fan, I agree that we will eventually want a doc. This is intended to get everyone thinking about what it will look like and what the performance would be.
@AndrewKL you might be interested in this.
Any progress or unaddressed issues here?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
I like the idea of having a function catalog API, left some initial questions/suggestions.
* @param input an input row
* @return updated aggregation state
*/
S update(S state, InternalRow input);
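To make the aggregate side concrete, here is a sketch of a simple sum written against this interface. Only update(S state, InternalRow input) is taken from the snippet above; newAggregationState, merge, produceResult, and the metadata methods are assumptions about the rest of the interface and may differ from the final API.

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.catalog.functions.AggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

// Hypothetical example: sums an int column into a long state.
public class IntSum implements AggregateFunction<Long, Long> {
  @Override
  public Long newAggregationState() {
    return 0L;
  }

  @Override
  public Long update(Long state, InternalRow input) {
    // Skip nulls; otherwise add the single int argument to the running sum.
    return input.isNullAt(0) ? state : state + input.getInt(0);
  }

  @Override
  public Long merge(Long leftState, Long rightState) {
    return leftState + rightState;
  }

  @Override
  public Long produceResult(Long state) {
    return state;
  }

  @Override
  public DataType[] inputTypes() {
    return new DataType[] { DataTypes.IntegerType };
  }

  @Override
  public DataType resultType() {
    return DataTypes.LongType;
  }

  @Override
  public String name() {
    return "int_sum";
  }
}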
Would it make sense to have a default implementation taking an iterator of InternalRows? Just thinking out loud.
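As a sketch of what such a default could look like (the interface fragment and the name updateAll are hypothetical, purely to illustrate the suggestion):

import java.util.Iterator;
import org.apache.spark.sql.catalyst.InternalRow;

// Hypothetical interface fragment, only to illustrate the suggested default method.
interface AggregateFunctionSketch<S> {
  S update(S state, InternalRow input);

  // Default implementation: fold the per-row update over an iterator of rows.
  default S updateAll(S state, Iterator<InternalRow> inputs) {
    S current = state;
    while (inputs.hasNext()) {
      current = update(current, inputs.next());
    }
    return current;
  }
}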
Is there value for a function in being able to control iteration? And can Spark support it if there is?
I think there could be value for a function like limit because the source could stop iteration early. But I'm not sure what effect an iterator that is not exhausted would have on Spark. Overall, I think there aren't very many cases where controlling iteration in the function has enough value to warrant the additional complexity in the API.
import org.apache.spark.sql.types.DataType;

/**
* Interface for a function that produces a result value by aggregating over multiple input rows.
Just thinking back to our initial groupByKey impl, can we add a warning here that if folks do not implement AssociativeAggregateFunction, they are going to force all values for a key onto a single node?
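One possible wording for that warning, as a sketch against the javadoc fragment quoted above (AssociativeAggregateFunction is the name used in this comment; whether such an interface ends up in the API is not settled here):

/**
 * Interface for a function that produces a result value by aggregating over multiple input rows.
 * <p>
 * Note: implementations that do not also implement AssociativeAggregateFunction cannot be
 * partially aggregated, so Spark must ship all values for a grouping key to a single node
 * before calling the function.
 */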
@rdblue @cloud-fan What do you think of the Transport API? It is simple, wraps InternalRows in the case of Spark, and is portable between Spark, Presto, Hive, and Avro (and potentially other data formats, so UDFs could probably be pushed to the format layer).
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StructType}

class AggregateFunctionSuite extends SparkFunSuite {
Could you add the missing import, @rdblue?
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala:23:38: not found: type SparkFunSuite
[error] class AggregateFunctionSuite extends SparkFunSuite {
Will do.
I think it's great to have people working on APIs for maintaining UDF libraries across projects. You may be wondering whether I think we should use that to call UDFs. I don't think that we would want to build support for a generic framework into Spark itself. I think Spark's API should be specific to Spark, just like the data source APIs are specific to Spark. That avoids complications like converting to Row or another representation for Hive. It should be possible to build a library using Transport that plugs in through this API, though. And it is great to have you looking at this and thinking about how it may be limited by the choices here.
I think there are 2 types of APIs: Function Catalog APIs and UDF expression APIs (e.g., Generic UDFs). I mentioned the Transport API as a way to do the latter, and wanted to get your thoughts on how friendly the Function Catalog APIs are to UDF expression APIs like Transport. To the user, Transport provides tools to make type validation and inference user-friendly (declarative, using type signatures), and Java types that map to their SQL counterparts. To Spark, it is just an Expression API processing InternalRows.
BoundFunction bind(StructType inputType);

/**
* Returns Function documentation.
Minor nit -- the method is called "description", should we say that this returns a description of the function (as opposed to "documentation")?
I think they're synonymous in this context.
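For context, an UnboundFunction for the hypothetical int_add scalar function sketched earlier in this conversation might look roughly like this: bind validates the argument types and returns the bound implementation. The method set shown here (bind, description, name) follows the snippets quoted in this PR and may differ from the final API.

import org.apache.spark.sql.connector.catalog.functions.BoundFunction;
import org.apache.spark.sql.connector.catalog.functions.UnboundFunction;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical example: binds the int_add function to exactly two int arguments.
public class UnboundIntAdd implements UnboundFunction {
  @Override
  public BoundFunction bind(StructType inputType) {
    if (inputType.fields().length != 2 ||
        !inputType.fields()[0].dataType().equals(DataTypes.IntegerType) ||
        !inputType.fields()[1].dataType().equals(DataTypes.IntegerType)) {
      throw new UnsupportedOperationException("int_add requires two int arguments");
    }
    return new IntAdd(); // the ScalarFunction sketched earlier
  }

  @Override
  public String description() {
    return "int_add(a int, b int): returns a + b";
  }

  @Override
  public String name() {
    return "int_add";
  }
}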
We may also need to update the PR description. For instance, it still mentions PartialAggregateFunction.
Kubernetes integration test unable to build dist, exiting with code: 1
Retest this please.
Kubernetes integration test starting
Kubernetes integration test status failure
I saw this on other PRs as well. @maropu do you have any clue about it?
Looks like there is something wrong with the GA's DNS. We encountered the same binding issue in Kyuubi too - apache/kyuubi#489
Kubernetes integration test starting
Kubernetes integration test status failure
Yea, it looks like a GA env issue. I saw the same error message in other test suites (e.g., RateStreamProviderSuite). I think the workaround for now is just to re-run the GA job (but I will try to find a solution). FYI: @dongjoon-hyun @HyukjinKwon
Jenkins has passed and the GA failure is unrelated. I'm merging it to master, thanks!
Thank you, @rdblue and all!
🎉
Not sure if this is the best place for this, but we've encountered the binding failure multiple times in our own containerized environments and found it to occur in containers that ended up with entirely numeric hostnames.
In our case, setting a numeric hostname was our fault, and Docker explicitly rejects numeric hostnames, it seems for the same reason. I'm not very familiar with GA and from a quick browsing am unsure if this could ever happen there, but thought it might be good to keep in mind if this continues to be a sporadic failure, and to consider whether or not Spark should be aware of this failure mode.
What changes were proposed in this pull request?
This adds a new API for catalog plugins that exposes functions to Spark. The API can list and load functions. This does not include create, delete, or alter operations.
There are 3 types of functions defined:
- ScalarFunction, which produces a value for every call
- AggregateFunction, which produces a value after updates for a group of rows

Functions are loaded from the catalog by name as UnboundFunction. Once input arguments are determined, bind is called on the unbound function to get a BoundFunction implementation that is one of the 3 types above. Binding can fail if the function doesn't support the input type. BoundFunction returns the result type produced by the function.

How was this patch tested?
This includes a test that demonstrates the new API.
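To tie the description together, here is a rough sketch of how a caller (for example, a test) might exercise the flow end to end. Only the bind/produceResult flow is taken from this PR's snippets; the FunctionCatalog method name loadFunction, the Identifier lookup, the "default" namespace, and the int_add function are assumptions used purely for illustration.

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.sql.connector.catalog.FunctionCatalog;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.functions.BoundFunction;
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction;
import org.apache.spark.sql.connector.catalog.functions.UnboundFunction;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class FunctionCatalogFlowSketch {
  static Object callIntAdd(FunctionCatalog catalog, int a, int b) throws Exception {
    // Load the function by name, then bind it to the argument schema.
    UnboundFunction unbound =
        catalog.loadFunction(Identifier.of(new String[] {"default"}, "int_add"));
    StructType inputType = new StructType()
        .add("a", DataTypes.IntegerType)
        .add("b", DataTypes.IntegerType);
    BoundFunction bound = unbound.bind(inputType);

    // For a scalar function, pack the arguments into an InternalRow and call it.
    ScalarFunction<?> scalar = (ScalarFunction<?>) bound;
    InternalRow args = new GenericInternalRow(new Object[] {a, b});
    return scalar.produceResult(args);
  }
}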