Spark - Implement FunctionCatalog and Truncate #5305

kbendick · 2022-07-19T06:50:16Z

Implements FunctionCatalog for Spark 3.3 and implements all variants of Truncate.

FunctionCatalog

This allows users of SparkCatalog and SparkSessionCatalog to use truncate without having to register it as a UDF.

All Iceberg functions registered in the function catalog are accessible through an Iceberg Spark catalog when either:

No namespace is referenced. The storage partitioned joins implementation requires this.
e.g. my_catalog.truncate(width, value).
Note: a bare truncate(width, value) typically does not work, because Spark adds the namespace to the call. Prefer system.truncate.
The system namespace is referenced, matching the syntax used to call procedures. Note that this currently works only with SparkCatalog, because Spark has logic for SparkSessionCatalog that verifies the namespace exists.
e.g. my_catalog.system.truncate(width, value) or system.truncate(6, column)

Truncate

The truncate function also accepts a dynamic width, including a width that comes from a column. In practice the width will usually be static within a given call, since the function is mostly intended to match partition transforms (for example in joins, or on non-partition columns to derive a new column without having to partition on it).
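To make the intended semantics concrete, here is a minimal sketch of truncation as the Iceberg partition transform defines it for ints and strings. The class and method names (TruncateSketch, truncateInt, truncateString) are hypothetical and are not the utility class introduced in this PR; the width check is also an assumption about how invalid input would be handled.

```java
// Hypothetical illustration only; not the utility class from this PR.
final class TruncateSketch {
  private TruncateSketch() {}

  // Rounds value down to the nearest multiple of width.
  // The double modulo keeps the remainder non-negative for negative values.
  static int truncateInt(int width, int value) {
    requirePositive(width);
    return value - (((value % width) + width) % width);
  }

  // Keeps at most `width` Unicode code points from the start of the string.
  static String truncateString(int width, String value) {
    requirePositive(width);
    int codePoints = value.codePointCount(0, value.length());
    int end = value.offsetByCodePoints(0, Math.min(width, codePoints));
    return value.substring(0, end);
  }

  // Assumption: a non-positive width is rejected rather than passed through.
  private static void requirePositive(int width) {
    if (width <= 0) {
      throw new IllegalArgumentException("Width must be positive: " + width);
    }
  }

  public static void main(String[] args) {
    System.out.println(truncateInt(10, 15));
    System.out.println(truncateInt(10, -1));
    System.out.println(truncateString(3, "iceberg"));
  }
}
```

Because width is an ordinary argument, nothing in this shape prevents it from varying per row, which is what allows the width to come from a column.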

This PR refactors the definition of the transform functions into a utility class where needed so that Spark’s magic functions can call them via the static invoke function and not duplicate logic. This allows Spark to include the functions in codegen.

Special Considerations for Using Function Catalog Efficiently via Magic Functions and Code Gen

The requirements for magic functions to be used with codegen include that:

invoke is a static function
invoke takes in the primitive types / native Spark types corresponding to each of Spark's input DataTypes (e.g. int for IntegerType and UTF8String for StringType).

Further documentation on the magic functions is available in the ScalarFunction JavaDoc.
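The two requirements above can be sketched without a Spark dependency. TruncateIntMagicSketch is a hypothetical stand-in: in a real implementation the class would implement Spark's ScalarFunction and declare its input types and result type, which is omitted here so the example compiles on its own. The reflection check is only an illustration of what "static, with primitive parameters" means; it is not how Spark itself resolves the method.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical stand-in for a magic function bound to (IntegerType, IntegerType).
// A real version would implement Spark's ScalarFunction interface.
final class TruncateIntMagicSketch {
  private TruncateIntMagicSketch() {}

  // Static, and takes primitive ints rather than boxed Integers or an InternalRow.
  public static int invoke(int width, int value) {
    return value - (((value % width) + width) % width);
  }

  // Illustrative check that a method named "invoke" with primitive int
  // parameters exists, is static, and returns a primitive int.
  static boolean hasStaticPrimitiveInvoke() {
    try {
      Method m = TruncateIntMagicSketch.class.getMethod("invoke", int.class, int.class);
      return Modifier.isStatic(m.getModifiers()) && m.getReturnType() == int.class;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(hasStaticPrimitiveInvoke());
    System.out.println(invoke(10, 27));
  }
}
```

For a StringType input, the corresponding parameter would be Spark's UTF8String rather than java.lang.String, as noted in the list above.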

This partially closes #5349

kbendick · 2022-07-19T06:51:23Z

After this, I'll add bucket and zorder as well.

This is to facilitate the usage of the various transforms from PySpark as well as SQL.

Additionally, having a zorder function will make it possible for people to pre-sort their data on input, vs having to zorder sort it when running a data compaction job.

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/TruncateFunction.java

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/BaseCatalog.java

spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/functions/SparkFunctions.java

spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkTruncateFunction.java

kbendick · 2022-07-26T21:25:42Z

@rdblue PTAL. I was going to tag other people as well.

kbendick · 2022-07-26T21:28:31Z

@rdblue Do you think I should break this into 2 PRs?

kbendick · 2022-07-26T23:45:12Z

Looking at the storage partitioned joins implementation, it looks like the function bucket will need to be resolvable on the empty namespace. I can update that here or in a follow-up:

https://github.com/apache/spark/blob/e9eb28e27d10497c8b36774609823f4bbd2c8500/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala#L84-L100

aokolnychyi · 2022-07-27T02:27:40Z

I'd love to take a look as well. I should have some time in a day. We also had some progress on bucketed joins internally.

kbendick · 2022-07-27T16:25:41Z

> I'd love to take a look as well. I should have some time in a day. We also had some progress on bucketed joins internally.

Thanks Anton. Was going to tag you today now that it's cleaned up. Also cc @huaxingao @flyrain @nastra @Fokko

kbendick · 2022-07-27T17:06:09Z

Right now we require the call to use the system namespace, but the storage partitioned join implementation looks up a function called bucket in the FunctionCatalog using an empty array for the namespace, per the merged Spark PR for Storage Partitioned Joins (inside V2ExpressionUtils#toCatalystTransform).

So we should probably allow the empty namespace to resolve functions as well.
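A minimal sketch of what that check could look like, assuming function namespaces arrive as a String array. FunctionNamespaceSketch and isFunctionNamespace are hypothetical names, and the case-insensitive match on "system" is an assumption rather than something stated in this thread.

```java
// Hypothetical helper; not the code from this PR.
final class FunctionNamespaceSketch {
  private FunctionNamespaceSketch() {}

  // Accepts the empty namespace (which the storage partitioned join lookup uses)
  // and the single-level "system" namespace (which matches procedure call syntax).
  static boolean isFunctionNamespace(String[] namespace) {
    if (namespace.length == 0) {
      return true;
    }
    // Assumption: "system" is matched case-insensitively.
    return namespace.length == 1 && "system".equalsIgnoreCase(namespace[0]);
  }

  public static void main(String[] args) {
    System.out.println(isFunctionNamespace(new String[0]));
    System.out.println(isFunctionNamespace(new String[] {"system"}));
    System.out.println(isFunctionNamespace(new String[] {"default"}));
  }
}
```

A catalog's function lookup could then reject any identifier whose namespace fails this check before looking up the function name.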

kbendick · 2022-07-27T17:07:51Z

Link to the code that resolves our own bucket implementation in case the link above doesn't resolve: https://github.com/apache/spark/blob/47f0303944abb11d3018186bc125113772eff8ef/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala#L84-L100

…on't verify for perf reasons

…ade in storage partitioned joins implementation

kbendick · 2022-07-28T22:08:39Z

Because this PR is so big, I'm going to separate out the FunctionCatalog implementation from Truncate.

I'm going to add a very simple function to be able to test it but keep the code to review a lot smaller. 👍

kbendick · 2022-07-28T22:31:47Z

I've opened #5377 to cover just the FunctionCatalog.

This PR is too big, and this way we can focus on just the FunctionCatalog business without having to worry about the details of truncate.

I've added an iceberg_version function in the other PR to assist with testing.

kbendick · 2022-08-02T16:07:51Z

This PR is closed in favor of #5377 and #5411.

I'll open a PR for Truncate and link it shortly.

github-actions bot added the spark label Jul 19, 2022

kbendick force-pushed the kb-add-spark-function-catalog branch 4 times, most recently from b6f983d to ac78639 Compare July 19, 2022 23:32

sunchao mentioned this pull request Jul 20, 2022

Support bucket table for Iceberg #430

Closed

kbendick force-pushed the kb-add-spark-function-catalog branch from bd734c6 to fa5ad6e Compare July 20, 2022 19:36