Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade Transport to use API from Trino v406 #128

Merged

Conversation

yiqiangin
Copy link
Contributor

@yiqiangin yiqiangin commented Apr 20, 2023

  • Code Changes
    There are three major parts in code changes:
  1. replace APIs from trino-main v352 with APIs trino-spi v406 in the classes of the module of transportable-udfs-trino like StdUdfWrapper and TrinoFactory
    The main change is the class of StdUdfWrapper does not extend the class of SqlScalarFuntion from trino-main any more.
  2. create a new module named transportable-udfs-trino-plugin which implements the deployment of Transport UDFs via a Trino plugin named TransportPlugin following the new Transport UDF deployment approach as shown
    in https://docs.google.com/document/d/1MgruaEc6Flwfr6KnDT0gGI_vCsCCNe_y3vXlGV-1dNQ/edit
    The major classes including:
    • TransportPlugin, TransportModule, TransportConnectorFactory and TransportConnector for a plugin to to loaded by PluginManager in Trino server to build a catalog with a connector to provide Transport UDFs
    • TransportConnectorMetadata, TransportFunctionProvider and TransportUDFClassLoader to dynamically load the implementation of UDF classes and build a map for function resolution during the initialization of the plugin
  3. Some related changes in the module of transportable-udfs-test-trino
    • Using TransportPlugin, TransportConnectorFactory and TransportConnector to load the target UDF class for test
    • Using 'QueryAssertions` from Trino to execute the query with UDF and verify the outputs
  4. As the original url of Conjar maven repo is not reachable now which results in dependency resolution issue, replace it with a new url shown in https://conjars.wensel.net/
  • Tests
  1. ./gradlew build is successful from the project's root directory and from transportable-udfs-example-udfs subdirectory
  2. unit tests in the module of transportable-udfs-trino and transportable-udfs-test-trino
  3. unit test on example UDFs in the module of transportable-udfs-example-udfs
  4. manual tests on example UDFs in the module of transportable-udfs-example-udfs which are deployed in a local Trino server

Yiqiang Ding added 15 commits March 16, 2023 11:48
@weijiii
Copy link
Contributor

weijiii commented Apr 20, 2023

Thanks for working on this!

@aastha25
Copy link
Contributor

could you also add in the testing section that ./gradlew build is successful from the project's root directory and from transportable-udfs-example-udfs subdirectory.

@yiqiangin
Copy link
Contributor Author

could you also add in the testing section that ./gradlew build is successful from the project's root directory and from transportable-udfs-example-udfs subdirectory.

Done.

Copy link
Contributor

@ljfgem ljfgem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yiqiangin for this PR!

Comment on lines +19 to +27
// Temporarily disable the tests for Trino. As the test infrastructure from Trino named QueryAssertions is used to
// run these test for Trino, QueryAssertions mandatory execute the function with the query in two formats: one with
// is the normal query (e.g. SELECT "binary_duplicate"(a0) FROM (VALUES ROW(from_base64('YmFy'))) t(a0);), the other
// is with "where RAND()>0" clause (e.g. SELECT "binary_duplicate"(a0) FROM (VALUES ROW(from_base64('YmFy'))) t(a0) where RAND()>0;)
// QueryAssertions verifies the output from both queries are equal otherwise the test fail.
// However, the execution of the query with where clause triggers the code of VariableWidthBlockBuilder.writeByte() to create
// the input byte array in Slice with an initial 32 byes capacity, while the execution of the query without where clause does not trigger
// the code of VariableWidthBlockBuilder.writeByte() and create the input byte array in Slice with the actual capacity of the content.
// Therefore, the outputs from both queries are different.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the next step for it? Sounds like it's an issue of Trino QueryAssertions? Maybe add a TODO item here and other places?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 disabling the tests does not seem like the right course of action here

Copy link
Contributor

@weijiii weijiii Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you able to run the failed test with QueryAssertion now? It was not yet clear why QueryAssertion failed previously with the old way using SqlScalarFunction. Annotation-driven functions had no issue with QueryAssertion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test still fails with QueryAssertions. Briefly QueryAssertions requires to compare the outputs from two queries with the same UDF, however in case of binary UDF, two queries results in the execution of different code paths in Trino which makes the inputs to UDF are different, so the outputs of the queries are different. The details are added in the comments of the code.
I have created a TODO issue to reenable them after the root cause is found in Trino and fixed as shown in #131

transportable-udfs-trino-plugin/build.gradle Show resolved Hide resolved
Comment on lines 35 to 43
@Test
public void testFileLookupFailNull() {
StdTester tester = getTester();
tester.check(functionCall("file_lookup", resource("file_lookup_function/sample"), null), null, "boolean");
try {
StdTester tester = getTester();
tester.check(functionCall("file_lookup", resource("file_lookup_function/sample"), null), null, "boolean");
} catch (NullPointerException ex) {
Assert.assertFalse(Boolean.valueOf(System.getProperty("trinoTest")));
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change. Previously we asserted that NPE was thrown. Now you seem to be asserting that NPE is thrown only when trinoTest is false? But when trinoTest is true we do expect the exception to be thrown, don't we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code of this test are executed for Trino, Hive and Spark. Originally it expects a NullPointerException is thrown during the execution of looking up a null value in a file by all three query engines. However, Trino v406 does not throw NullPointerException but returns a null value in this case. Therefore the code change removes the expected exception from @test annotation, and only checks if a NullPointerException is caught if it is not in case of Trino.
Also I have already added some comments to explain the reason in the code.

@@ -12,6 +12,7 @@ dependencies {
implementation('com.google.guava:guava:24.1-jre')
implementation('org.apache.commons:commons-io:1.3.2')
testImplementation('io.airlift:aircompressor:0.21')
testImplementation('org.junit.jupiter:junit-jupiter-api:5.9.2')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this module use TestNG? why is JUnit 5 added here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this module use TestNG. However, as the tests for Trino use the class of QueryAssertions from Trino, if this dependency is not added, the tests for Trino will fail because of the following error:

java.lang.NoClassDefFoundError: org/junit/jupiter/api/Assertions
	at io.trino.sql.query.QueryAssertions$ExpressionAssertProvider.evaluate(QueryAssertions.java:675)
	at io.trino.sql.query.QueryAssertions$ExpressionAssertProvider.assertThat(QueryAssertions.java:691)
	at io.trino.sql.query.QueryAssertions$ExpressionAssertProvider.assertThat(QueryAssertions.java:602)

defaultEnvironment.gradle Outdated Show resolved Hide resolved

public StdUdfWrapper(StdUDF stdUDF) {
this.functionMetadata = FunctionMetadata.builder(FunctionKind.SCALAR)
.nullable()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why always nullable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this code is simply migrating the existing code with the APIs of FunctionMetadata from Trino v352 to Trino v406 as shown in https://github.com/linkedin/transport/blob/master/transportable-udfs-trino/src/main/java/com/linkedin/transport/trino/StdUdfWrapper.java#L74 , the original code has already set nullable as true

Copy link
Contributor

@weijiii weijiii Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yiqiangin The link you pasted is using StdUDF#getNullableArguments() to decide which arguments can receive a null value; it'd be better to mention that it is moved here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weijiii Here is the definition of FunctionMetadata in v352

    public FunctionMetadata(
            Signature signature,
            boolean nullable,
            List<FunctionArgumentDefinition> argumentDefinitions,
            boolean hidden,
            boolean deterministic,
            String description,
            FunctionKind kind)
    {
        this(FunctionId.toFunctionId(signature), signature, nullable, argumentDefinitions, hidden, deterministic, description, kind, false);
    }

As shown in this line https://github.com/linkedin/transport/blob/master/transportable-udfs-trino/src/main/java/com/linkedin/transport/trino/StdUdfWrapper.java#L82
true is set as the second argument in new FunctionMetadata() which is nullable, Booleans.asList(stdUDF.getNullableArguments() is the third argument

transportable-udfs-trino-plugin/build.gradle Outdated Show resolved Hide resolved
private final ClassLoader parent;

public TransportUDFClassLoader(ClassLoader parent, List<URL> urls) {
super(urls.toArray(new URL[0]));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the modern way to do this is:

Suggested change
super(urls.toArray(new URL[0]));
super(urls.toArray(URL[]::new));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Transport is still using Java 8, this modern approach seems not allowed.

@weijiii
Copy link
Contributor

weijiii commented Apr 26, 2023

Can you also update StdUdfUtils#RESERVED_KEYWORDS ? This is within scope of upgrading to Trino 406. The new ones should be

private static final Set[String](https://app.slack.com/client/T06BYN8F7/String) RESERVED_KEYWORDS =
      ImmutableSet.of("ALTER", "AND", "AS", "BETWEEN", "BY", "CASE", "CAST", "CONSTRAINT", "CREATE", "CROSS", "CUBE",
          "CURRENT_CATALOG", "CURRENT_DATE", "CURRENT_PATH", "CURRENT_ROLE", "CURRENT_SCHEMA", "CURRENT_TIME",
          "CURRENT_TIMESTAMP", "CURRENT_USER", "DEALLOCATE", "DELETE", "DESCRIBE", "DISTINCT", "DROP", "ELSE", "END",
          "ESCAPE", "EXCEPT", "EXECUTE", "EXISTS", "EXTRACT", "FALSE", "FOR", "FROM", "FULL", "GROUP", "GROUPING",
          "HAVING", "IN", "INNER", "INSERT", "INTERSECT", "INTO", "IS", "JOIN", "JSON_ARRAY", "JSON_EXISTS", "JSON_OBJECT", 
          "JSON_QUERY", "JSON_VALUE", "LEFT", "LIKE", "LISTAGG", "LOCALTIME", "LOCALTIMESTAMP", "NATURAL", "NORMALIZE", "NOT", 
          "NULL", "ON", "OR", "ORDER", "OUTER", "PREPARE", "RECURSIVE", "RIGHT", "ROLLUP", "SELECT", "SKIP", "TABLE", "THEN", 
          "TRIM", "TRUE", "UESCAPE", "UNION", "UNNEST", "USING", "VALUES", "WHEN", "WHERE", "WITH");

I think we should expedite this release so the subsequent testing of our production UDFs can be initiated. (cc Dali reviewers @ljfgem @aastha25 ) Thanks.

@yiqiangin
Copy link
Contributor Author

Can you also update StdUdfUtils#RESERVED_KEYWORDS ? This is within scope of upgrading to Trino 406. The new ones should be

private static final Set[String](https://app.slack.com/client/T06BYN8F7/String) RESERVED_KEYWORDS =
      ImmutableSet.of("ALTER", "AND", "AS", "BETWEEN", "BY", "CASE", "CAST", "CONSTRAINT", "CREATE", "CROSS", "CUBE",
          "CURRENT_CATALOG", "CURRENT_DATE", "CURRENT_PATH", "CURRENT_ROLE", "CURRENT_SCHEMA", "CURRENT_TIME",
          "CURRENT_TIMESTAMP", "CURRENT_USER", "DEALLOCATE", "DELETE", "DESCRIBE", "DISTINCT", "DROP", "ELSE", "END",
          "ESCAPE", "EXCEPT", "EXECUTE", "EXISTS", "EXTRACT", "FALSE", "FOR", "FROM", "FULL", "GROUP", "GROUPING",
          "HAVING", "IN", "INNER", "INSERT", "INTERSECT", "INTO", "IS", "JOIN", "JSON_ARRAY", "JSON_EXISTS", "JSON_OBJECT", 
          "JSON_QUERY", "JSON_VALUE", "LEFT", "LIKE", "LISTAGG", "LOCALTIME", "LOCALTIMESTAMP", "NATURAL", "NORMALIZE", "NOT", 
          "NULL", "ON", "OR", "ORDER", "OUTER", "PREPARE", "RECURSIVE", "RIGHT", "ROLLUP", "SELECT", "SKIP", "TABLE", "THEN", 
          "TRIM", "TRUE", "UESCAPE", "UNION", "UNNEST", "USING", "VALUES", "WHEN", "WHERE", "WITH");

I think that we should expedite this release so the subsequent testing of our production UDFs can be initiated. (cc Dali reviewers @ljfgem @aastha25 ) Thanks.

I have updated the list per Erik's review before. Just double check the list in the code is the same as the list shown here.

Copy link
Contributor

@ljfgem ljfgem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yiqiangin for this PR!
@wmoustafa @xkrogen: Do you want to take a look before merging?

transportable-udfs-trino-plugin/build.gradle Show resolved Hide resolved
@yiqiangin yiqiangin merged commit 4651eba into yiqiangin/trino-406-upgrade Apr 27, 2023
Copy link

@xkrogen xkrogen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have 4 JAR files checked in? (e.g. transport-udf-1-trino-dist-thin.jar)

Comment on lines 39 to 44
// in case of Trino, the execution of a query with UDF to check a null value in a file
// does not result in a NullPointerException, but returns a null value
tester.check(functionCall("file_lookup", resource("file_lookup_function/sample"), null), null, "boolean");
} catch (NullPointerException ex) {
Assert.assertFalse(Boolean.valueOf(System.getProperty("trinoTest")));
// in case of Hive and Spark, the execution of a query with UDF to check a null value in a file results in a NullPointerException
Assert.assertFalse(isTrinoTest());
Copy link

@xkrogen xkrogen Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this difference in behavior expected? do we need to modify the behavior to be consistent across engines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trino v352 has the same behavior with other engines, however, Trino v406 does not have.

transportable-udfs-trino-plugin/build.gradle Show resolved Hide resolved
Comment on lines 19 to 23
public class TransportModule implements Module {
@Override
public void configure(Binder binder) {
binder.bind(TransportConnector.class).in(Scopes.SINGLETON);
configBinder(binder).bindConfig(TransportConfig.class);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we extend AbstractModule as recommended in the Javadoc of Module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +18 to +26
@Test
public void testTransportPluginInitialization() {
TestingTrinoServer server = TestingTrinoServer.create();
Plugin plugin = new TransportPlugin();
server.installPlugin(plugin);
server.createCatalog("LINKEDIN", "TRANSPORT");
Assert.assertTrue(getOnlyElement(plugin.getConnectorFactories()) instanceof TransportConnectorFactory);
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we assert that the UDFs are actually able to be loaded?

it would be nice also to have a test which can assert that each UDF is in a separate classloader -- maybe load two UDFs and grab the classloader for each and assert that they are different?

Copy link
Contributor

@weijiii weijiii Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 @yiqiangin

LocalQueryRunner _runner = ...;
_runner.installPlugin(new TransportPlugin());
_runner.createCatalog("LINKEDIN", "TRANSPORT", ImmutableMap.of(<some config>...));

MaterializedResult result = _runner("SELECT array_element_at(array[1,2,3], 2)");
// assert result

MaterializedResult.java

Then we can run some query to verify the functions exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I change this test to initialize TransportPlugin with a LocalQueryRunner. So the success of loading of UDFs are verified by successfully executing a query with UDF via LocalQueryRunner as shown in #133

For the test which can assert that each UDF is in a separate classloader, I am still trying to find out a good way to implement this in Transport side.

@yiqiangin
Copy link
Contributor Author

why do we have 4 JAR files checked in? (e.g. transport-udf-1-trino-dist-thin.jar)

Those JARs contains UDF classes which are actually loaded in TransportPluginTest to verify the loading of UDF classes by TransportUDFClassLoader during the initialization of TransportPlugin

@xkrogen
Copy link

xkrogen commented Apr 27, 2023

why do we have 4 JAR files checked in? (e.g. transport-udf-1-trino-dist-thin.jar)

Those JARs contains UDF classes which are actually loaded in TransportPluginTest to verify the loading of UDF classes by TransportUDFClassLoader during the initialization of TransportPlugin

Can we put them inside of a test/resources folder to make it more clear that these are testing resources?

@yiqiangin
Copy link
Contributor Author

why do we have 4 JAR files checked in? (e.g. transport-udf-1-trino-dist-thin.jar)

Those JARs contains UDF classes which are actually loaded in TransportPluginTest to verify the loading of UDF classes by TransportUDFClassLoader during the initialization of TransportPlugin

Can we put them inside of a test/resources folder to make it more clear that these are testing resources?

It has been done in #133

yiqiangin added a commit that referenced this pull request Apr 28, 2023
* Upgrade Gradle from 6.7 to 7.5 (#122)

* Upgrade to Gradle 7.5

* Upgrade to Gradle 7.5

* remove a unnecessary classpath

* upgrade gradle wrapper in transportable-udfs-examples

* address comments

* add comment

* using api() to avoid adding dependencies

* Simplify dependencies further

---------

Co-authored-by: Yiqiang Ding <yiqding@linkedin.com>
Co-authored-by: Jiefan Li <jiefli@linkedin.com>

* Upgrade Transport to compile Trino UDF with Java 17 (#125)

Co-authored-by: Yiqiang Ding <yiqding@linkedin.com>

* Upgrade Transport to use API from Trino v406 (#128)

* Upgrade Transport to use API from Trino v406

* Upgrade Transport to use API from Trino v406

* PoC of Trino Connector

* PoC of Trino Connector

* address comments and remove unreachable artifactory repo

* address comments

* fix the issue of Maven repo

* fixing conjar repo issue in transportable-udfs-examples

* address comments

* address comments

* address comments

* address comments

* address comments

* address comments

* address comments

* add some comments

* remove conjar repo

* address comments

---------

Co-authored-by: Yiqiang Ding <yiqding@linkedin.com>

* Improvements on transport-trino-plugin (#133)

* Improvments on transportable-udfs-trino-plugin

* Improvments on transportable-udfs-trino-plugin

* fix a type

* address comments

* address comments

---------

Co-authored-by: Yiqiang Ding <yiqding@linkedin.com>

---------

Co-authored-by: Yiqiang Ding <yiqding@linkedin.com>
Co-authored-by: Jiefan Li <jiefli@linkedin.com>
@yiqiangin yiqiangin deleted the yiqiangin/transport-trino-plugin branch May 2, 2023 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants