Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Delta Lake product tests #11565

Merged
merged 7 commits into from
May 24, 2022

Conversation

findinpath
Copy link
Contributor

@findinpath findinpath commented Mar 18, 2022

Description

Expose Delta Lake product tests

TODOs :

Figure out how to setup Databricks environment to be used the Delta Lake connector tests.
Open issues:

  • Databricks spark-jdbc42 is not available on Maven Central so it can't be directly referenced as a dependency in Maven for product tests (needed to connect to Databricks cluster)
  • Databricks Community Edition has a few limitations for which need to be found appropriate workarounds:
    - no auto-restart when querying a terminated cluster - can be solved by creating a new cluster via Databricks Clusters API v2
    - no Instance profiles functionality available on community clusters - this is a serious limitation because the Delta Lake connector tests create tables backed by AWS S3 buckets
    - per Community account only one cluster can be created - can be solved by creating multiple accounts in order to test Databricks 9.1 LTS, 7.3 LTS

Is this change a fix, improvement, new feature, refactoring, or other?

Tests

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Delta Lake connector

How would you describe this change to a non-technical end user or system administrator?

This change contributes to ensuring accuracy of the functionality exposed by the Delta Lake connector.

Related issues, pull requests, and links

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Mar 18, 2022
@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch from 94c2923 to 6c00390 Compare March 21, 2022 11:21
@findepi findepi changed the title Open source Delta Lake product tests Add Delta Lake product tests Mar 21, 2022
@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch 2 times, most recently from d00b9a8 to 6e3e356 Compare March 23, 2022 13:52
@findinpath findinpath marked this pull request as ready for review March 25, 2022 12:59
@findinpath findinpath marked this pull request as draft March 25, 2022 12:59
@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch from 6e3e356 to 8781e2a Compare April 6, 2022 21:08
@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch 6 times, most recently from 0740a29 to e77f81f Compare April 27, 2022 10:51
@findinpath findinpath marked this pull request as ready for review April 27, 2022 14:50
@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch 2 times, most recently from 303427c to 06bfb9a Compare April 27, 2022 19:48
@alexjo2144
Copy link
Member

Does Minio need it's own module, or can it go into one of the existing shared ones, like trino-testing?

@findinpath
Copy link
Contributor Author

Does Minio need it's own module, or can it go into one of the existing shared ones, like trino-testing?

trino-testing is about testing infrastructure with a focus on test utilities.
Ideally would sit the MinioClient somewhere outside of Trino.
I haven't found (so far) a better solution for where to place this client class.

@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch from 06bfb9a to bd9e5b5 Compare April 29, 2022 14:15
testing/trino-testing-minio/pom.xml Outdated Show resolved Hide resolved
testing/trino-testing-minio/pom.xml Outdated Show resolved Hide resolved
plugin/trino-delta-lake/pom.xml Outdated Show resolved Hide resolved
@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch 2 times, most recently from 310fb13 to 5ba4bd2 Compare May 9, 2022 14:53
@findinpath
Copy link
Contributor Author

There's file rename you can take out testing/trino-product-tests-launcher/src/main/resources/docker/presto-product-tests/conf/environment/singlenode-delta-lake-oss/minio-files/data/presto-ci-test/dummy

@alexjo2144 Adressed the issue by creating the minio bucket directory on the fly while configuring MinIO via testcontainers.

@findinpath
Copy link
Contributor Author

This branch has conflicts that must be resolved

I am currently intentionally not fixing the Git conflicts in order to keep the basis of the current PR which is used in a private downstream project.

throw new UncheckedIOException(e);
}
builder.configureContainer(MINIO_CONTAINER_NAME, container ->
container.withCopyFileToContainer(forHostPath(minioBucketDirectory), "/data/" + S3_BUCKET_NAME));
Copy link
Member

@alexjo2144 alexjo2144 May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go in the common/Minio file. It's not clear here what's special about the /data/ directory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't /data where MinIO docker container stores buckets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/data is the argument given to minio server to point the directory where the buckets are physically stored.

The Docker container process is started with:

                .withCommand("server", "--address", format("0.0.0.0:%d", MINIO_PORT), "/data")

I think that this customisation is specific for EnvSinglenodeDeltaLakeOss and not a common setting for MinioContainer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should at least have /data/ stored in a static class variable on Minio then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's follow up on this

@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch 2 times, most recently from f4b70bd to 97b8ca5 Compare May 16, 2022 21:08
@findinpath
Copy link
Contributor Author

I am currently intentionally not fixing the Git conflicts in order to keep the basis of the current PR which is used in a private downstream project.

Change of plan. Fixing the conflicts.

@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch 5 times, most recently from 5a74dc8 to 3d0e6e0 Compare May 17, 2022 09:31
@findepi
Copy link
Member

findepi commented May 17, 2022

@findepi what do you think about taking out all the test resources once we have the Databricks compatibility env set up?

Seems like once we have that we can create all the table's we need on the fly

Some of the ones in the integration test suite do test specific edge cases, but the ones here are just copies of TPCH

Good question.
There are some advantages of static test resources

  • support tests that can be run directly from an IDE, by a contributor.
  • ensure we support tables created by an old version of Dabaricks runtime, not only the one we're currently testing against. While new data needs to be read only by modern software version, data created by old software version needs to remain readable.

@findepi findepi requested a review from alexjo2144 May 17, 2022 09:52
@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch from 3d0e6e0 to 684e998 Compare May 17, 2022 11:20
@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch from 95d0a5a to 44591e1 Compare May 19, 2022 18:58
@findinpath
Copy link
Contributor Author

Rebased on master.

@findepi
Copy link
Member

findepi commented May 19, 2022

@alexjo2144 ptal

@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch from 79e7376 to 032704c Compare May 19, 2022 20:45
The Delta Lake product tests can be all executed with SuiteDeltaLake
suite class.

The following test product test environments are exposed:

- single-node-delta-lake-oss: used to test the compatibility of
  the Trino Delta Lake connector with Apache Spark with Delta OSS
- single-node-delta-lake-databricks: used to test the compatibility
  of the Trino Delta Lake connector with Delta Lake Databricks
- single-node-delta-lake-kerberized-hdfs: used to test Delta Lake
  connector on top of kerberized Hadoop environment
- single-node-minio-data-lake: lightweight environment that can
  be used to test the Lakehouse connectors with HMS & MinIO

The aim of the Delta Lake product tests is to ensure compatibility
with both implementations of Delta Lake:

- Delta OSS
- Databricks Delta

These product tests were originally written for the Starburst Enterprise
Delta Lake connector.

Co-authored by various engineers at Starburst Data:

Co-authored-by: Piotr Findeisen <piotr.findeisen@gmail.com>
Co-authored-by: Alex Jo <alex.jo@starburstdata.com>
Co-authored-by: Łukasz Osipiuk <lukasz@osipiuk.net>
Co-authored-by: Konrad Dziedzic <konraddziedzic@gmail.com>
Co-authored-by: Adam J. Shook <shook@datacatessen.com>
Co-authored-by: Mateusz Gajewski <mateusz.gajewski@gmail.com>
Co-authored-by: Gaurav Sehgal <gaurav.sehgal8297@gmail.com>
Co-authored-by: Raunaq Morarka <raunaqmorarka@gmail.com>
Co-authored-by: Ashhar Hasan <ashhar.hasan@starburstdata.com>
Co-authored-by: Michał Ślizak <michal.slizak+github@gmail.com>
Co-authored-by: Grzegorz Kokosiński <grzegorz@starburstdata.com>
Co-authored-by: Arkadiusz Czajkowski <arek@starburstdata.com>
Co-authored-by: Jacob I. Komissar <jacob.komissar@starburstdata.com>
Co-authored-by: Krzysztof Sobolewski <krzysztof.sobolewski@starburstdata.com>
Co-authored-by: Krzysztof Skrzypczynski <krzysztof.skrzypczynski@starburstdata.com>
Co-authored-by: Yuya Ebihara <yuya.ebihara@starburstdata.com>
Co-authored-by: Praveen Krishna <praveenkrishna@tutanota.com>
Co-authored-by: Karol Sobczak <napewnotrafi@gmail.com>
Co-authored-by: Sasha Sheikin <myminitrue@gmail.com>
Co-authored-by: Szymon Homa <szymon.homa@starburstdata.com>
@findinpath findinpath force-pushed the oss-delta-lake-product-tests branch from 2020ab3 to 4a81969 Compare May 20, 2022 09:24
@findepi findepi merged commit 4338264 into trinodb:master May 24, 2022
@github-actions github-actions bot added this to the 382 milestone May 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants