Spark - Implement FunctionCatalog and Truncate #5305
After this, I'll add This is to facilitate the usage of the various transforms from PySpark as well as SQL. Additionally, having a
@rdblue PTAL. I was going to tag other people as well.
@rdblue Do you think I should break this into 2 PRs?
Looking at the storage partition joins, it looks like the function
I'd love to take a look as well. I should have some time in a day. We also had some progress on bucketed joins internally. |
Thanks Anton. Was going to tag you today now that it's cleaned up. Also cc @huaxingao @flyrain @nastra @Fokko |
Right now we're requiring the call be to the So we should probably allow the empty namespace to resolve functions as well.
Link to the code that resolves our own
…on't verify for perf reasons
…ade in storage partitioned joins implementation
Because this PR is so big, I'm going to separate out the I'm going to add a very simple function to be able to test it but keep the code to review a lot smaller.
I've opened #5377 to cover just the This PR is too big, and this way we can focus on just the I've added an
Implements FunctionCatalog for Spark 3.3 and implements all variants of Truncate.
FunctionCatalog
This allows users of SparkCatalog and SparkSessionCatalog to use truncate without having to register it as a UDF.
All Iceberg functions that we register into the function catalog are accessible when used with an Iceberg Spark catalog and:
- no namespace is referenced, e.g. my_catalog.truncate(width, value). Note: using truncate(width, value) typically does not work, as Spark adds the namespace to the call; system.truncate should be preferred.
- the system namespace is referenced, to match the syntax used to call procedures, e.g. my_catalog.system.truncate(width, value) or system.truncate(6, column). Note: this currently works only with the SparkCatalog, as the SparkSessionCatalog has logic in Spark to verify that the namespace exists.
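The two accepted forms above (no namespace, or the system namespace) amount to a small check on the namespace part of the function identifier. The following is a minimal sketch of that rule only; the class and method names are hypothetical, the case-insensitive match on "system" is an assumption, and it is not the code from this PR.

```java
// Hypothetical helper illustrating the namespace rule described above.
// It is NOT the PR's implementation; names are made up for this sketch.
final class FunctionNamespaces {
  private FunctionNamespaces() {}

  // Returns true when built-in Iceberg functions should be resolvable for
  // the given namespace: either the empty namespace (my_catalog.truncate)
  // or the single-level "system" namespace (my_catalog.system.truncate).
  static boolean canResolveFunctions(String[] namespace) {
    if (namespace.length == 0) {
      return true;
    }
    // Assumption: "system" is matched case-insensitively.
    return namespace.length == 1 && "system".equalsIgnoreCase(namespace[0]);
  }
}
```

Any other namespace, including a multi-level one that merely starts with system, is rejected by this sketch.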
Truncate
The truncate function also allows for a dynamic width or the width to come from a column - though typically the width will likely be static for one given call as it's mostly intended to be used to match partition transforms (specifically with joins or on non-partition columns to create a new column in the data without needing to partition on it).
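For readers unfamiliar with the transform, here is a rough sketch of what truncate computes for integer and string inputs, following the truncate transform as defined in the Iceberg table spec (integers round down to a multiple of the width, strings keep at most width Unicode code points). The class and method names are hypothetical and this is not the code from this PR; decimal and binary variants are omitted.

```java
// Sketch of truncate semantics per the Iceberg spec; names are hypothetical.
final class TruncateSketch {
  private TruncateSketch() {}

  // Rounds value down to the nearest multiple of width. The double modulo
  // keeps the remainder non-negative so negative values round toward
  // negative infinity rather than toward zero.
  static int truncateInt(int width, int value) {
    requirePositive(width);
    return value - (((value % width) + width) % width);
  }

  static long truncateLong(int width, long value) {
    requirePositive(width);
    return value - (((value % width) + width) % width);
  }

  // Keeps at most width Unicode code points (not UTF-16 chars).
  static String truncateString(int width, String value) {
    requirePositive(width);
    if (value.codePointCount(0, value.length()) <= width) {
      return value;
    }
    return value.substring(0, value.offsetByCodePoints(0, width));
  }

  private static void requirePositive(int width) {
    if (width <= 0) {
      throw new IllegalArgumentException("width must be positive: " + width);
    }
  }
}
```

Because the width is an ordinary argument, nothing in this shape prevents it from varying per row, which is what allows a width to come from a column.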
This PR refactors the definition of the transform functions into a utility class where needed so that Spark's magic functions can call them via the static invoke function and not duplicate logic. This allows Spark to include the functions in codegen.
Special Considerations for Using Function Catalog Efficiently via Magic Functions and Code Gen
The requirements for magic functions to be used with codegen include that:
- invoke is a static function
- invoke takes in the primitive types / native Spark types corresponding to each of Spark's input DataTypes (e.g. int for IntegerType and UTF8String for StringType).
Further documentation on the magic functions is found in the ScalarFunction JavaDoc.
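The two requirements can be illustrated without Spark on the classpath. The sketch below uses a hypothetical stand-in class that does not implement Spark's ScalarFunction interface; it only shows the shape of a static invoke with primitive parameters, plus a reflective check of that shape. The check is an illustration of the requirements, not Spark's actual lookup logic.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical stand-in for an integer truncate function. A real function
// would implement Spark's ScalarFunction; that is left out here so the
// sketch compiles without Spark as a dependency.
final class TruncateIntMagic {
  private TruncateIntMagic() {}

  // The "magic" method: static, with primitive int parameters matching
  // IntegerType inputs (width first, then value).
  public static int invoke(int width, int value) {
    return value - (((value % width) + width) % width);
  }
}

// Illustrative check only; this is an assumption about what the
// requirements imply, not a copy of how Spark resolves the method.
final class MagicMethodCheck {
  private MagicMethodCheck() {}

  // True when fn declares a static method named "invoke" with exactly
  // the given parameter types.
  static boolean hasStaticInvoke(Class<?> fn, Class<?>... paramTypes) {
    try {
      Method method = fn.getDeclaredMethod("invoke", paramTypes);
      return Modifier.isStatic(method.getModifiers());
    } catch (NoSuchMethodException e) {
      return false;
    }
  }
}
```

Note that the lookup with boxed types (Integer.class) fails in this sketch, which mirrors the point that the parameter types should be the primitive or native Spark types rather than boxed ones.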
This partially closes #5349