Delta lake connector - Reader #16843

Merged
merged 8 commits into from
Jan 7, 2022

Contributor

@vkorukanti vkorukanti commented Oct 7, 2021

New connector for reading Delta Lake tables natively in Presto. The connector is based on the Delta Standalone Reader API. Design doc is here.

== RELEASE NOTES ==

General Changes
* New connector for reading [Delta Lake tables](https://delta.io/) natively in Presto

linux-foundation-easycla bot commented Oct 7, 2021

CLA Signed

The committers are authorized under a signed CLA.

  • ✅ Venki Korukanti (ce353870b5331e01a44a355d5ab3fd149db40fc9, 7db9cadbfefa33b534f9d4e903af657bf5f250e6, 0416ca8fbd48ebad5a5001d9615fe7aa342a58fd, dfba41034c75ee576f0398a8d38a50b04180e33c, 87d557b657a945cf1b864a68bca7028d8730e622, 04333a866bc037c1a8a6fd58e21958089facf8b8, fe89f08d89fde4e3610f08ac55d50acc8303eb40, 53f0edc930c5ba1e7d471e9a289bcad8528ccf96)
  • ✅ George Chow (aed89e191772c4c655e832bbe1187fb2239e57c1)

@vkorukanti
Contributor Author

Initial connector. More features and tests are coming in next commits.

@arun11299

@vkorukanti Thanks a lot for this effort. Looking forward to using it!

@vkorukanti changed the title from "[WIP] Delta lake connector" to "Delta lake connector - Reader" on Nov 5, 2021
@imjalpreet
Member

@vkorukanti Thank you for your contribution! This will help in simplifying reading Delta Tables from Presto. I am planning to review this PR over the next few days. I see that there are test failures currently due to java.lang.NoClassDefFoundError while running the tests for presto-delta. Can you please look into fixing them? Meanwhile, I will continue with my review.

@vkorukanti
Contributor Author

Thanks @imjalpreet for reviewing. Seems like there is a dependency issue on the CI. I am able to run the tests locally. Will check out what is causing the issue.

@imjalpreet
Member

@vkorukanti If possible, can you also share the design document?

@vkorukanti
Contributor Author

https://docs.google.com/document/d/16S7xoAmXpSax7W1OWYYHo5nZ71t5NvrQ-F79pZF6yb8/edit?usp=sharing

Member

@nmahadevuni left a comment

@vkorukanti Thanks for the PR. I reviewed a few commits. Have added some comments. Will continue to review the rest of them.

schemaTableName.getSchemaName(),
schemaTableName.getTableName(),
tableLocation,
Optional.of(snapshot.getVersion()), // lock the snapshot version
Member

Would storing the snapshot itself instead of the version be better, so that listFiles need not call getSnapshotForVersionAsOf again?

Contributor Author

I will be addressing this in the next patch. I am also planning to add a Guava cache so that we don't load the DeltaLog again when listing the files in the Snapshot.
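The caching idea in this reply can be illustrated with a rough sketch: load the table log once per table location and reuse it when files are listed. This is not the connector's code. `TableLogCache` and the loader function are hypothetical names, and a plain `ConcurrentHashMap` is swapped in for the Guava cache mentioned above (so there is no eviction or expiry) to keep the example JDK-only.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: memoize an expensive "load the table log" call per table location.
// A ConcurrentHashMap stands in for the Guava cache discussed in the thread.
class TableLogCache<L>
{
    private final ConcurrentMap<String, L> logsByLocation = new ConcurrentHashMap<>();
    private final Function<String, L> loader;

    TableLogCache(Function<String, L> loader)
    {
        this.loader = loader;
    }

    // Runs the loader at most once per location; later calls reuse the cached value.
    L get(String tableLocation)
    {
        return logsByLocation.computeIfAbsent(tableLocation, loader);
    }

    public static void main(String[] args)
    {
        AtomicInteger loads = new AtomicInteger();
        TableLogCache<String> cache = new TableLogCache<>(location -> {
            loads.incrementAndGet(); // count how often the expensive load runs
            return "log for " + location;
        });
        cache.get("/warehouse/orders"); // first lookup, e.g. while resolving the table
        cache.get("/warehouse/orders"); // later lookup, e.g. while listing files
        System.out.println("loads = " + loads.get());
    }
}
```

A real cache would also need a size bound and an expiry policy so that new commits to the table become visible, which is what a Guava cache would provide.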

@Test
public void filterOnRegularColumn()
{
String tableName = "data-reader-primitives";
Member

Can you add the table definition as a comment for all these tests?

Contributor Author

Sorry, not clear. Are you asking for the table schema or data in the table?

Member

The table schema

Member

@imjalpreet left a comment

Thank you for the contribution Venki! The code looks good overall, I have added some minor comments. I am still going over the design doc and since the tests have still not run in CI, I will run them locally for now. As I move forward with the design doc and tests, will add review comments if any.

boolean changed = false;
for (PlanNode child : node.getSources()) {
PlanNode newChild = child.accept(this, null);
if (newChild != child) {
Member

It is preferable to use equals() to compare object references.

Contributor Author

Not sure. This code has existed for some time; I am just refactoring here. Not sure if it is worth modifying it to use .equals().
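For context on the `!=` question: when a rewriting visitor returns the very same instance for an unchanged child, a reference comparison answers "was this child replaced?" directly. The sketch below illustrates that pattern with a hypothetical `PlanTreeNode` type. It is not Presto's `PlanNode` API, only an assumption-laden miniature of the loop quoted above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not Presto's PlanNode API: a tiny immutable tree and a rewrite
// helper that relies on reference identity to detect whether any child was replaced.
final class PlanTreeNode
{
    final String name;
    final List<PlanTreeNode> children;

    PlanTreeNode(String name, List<PlanTreeNode> children)
    {
        this.name = name;
        this.children = Collections.unmodifiableList(new ArrayList<>(children));
    }

    // Applies the rewriter to each child. Returns this same instance when the rewriter
    // handed back every child unchanged, so callers can use != on the result as well.
    PlanTreeNode rewriteChildren(UnaryOperator<PlanTreeNode> rewriter)
    {
        boolean changed = false;
        List<PlanTreeNode> newChildren = new ArrayList<>();
        for (PlanTreeNode child : children) {
            PlanTreeNode newChild = rewriter.apply(child);
            if (newChild != child) { // identity check: did the rewriter return a new object?
                changed = true;
            }
            newChildren.add(newChild);
        }
        return changed ? new PlanTreeNode(name, newChildren) : this;
    }

    public static void main(String[] args)
    {
        PlanTreeNode scan = new PlanTreeNode("scan", Collections.emptyList());
        PlanTreeNode project = new PlanTreeNode("project", Collections.singletonList(scan));

        // No-op rewrite: the same instance comes back.
        System.out.println(project.rewriteChildren(child -> child) == project);

        // Real rewrite: a new parent is built around the replaced child.
        PlanTreeNode rewritten = project.rewriteChildren(
                child -> new PlanTreeNode("filter", Collections.singletonList(child)));
        System.out.println(rewritten == project);
    }
}
```

With this convention, `equals()` would only give the same answer if it were cheap and identity-based, which is a design decision for the node classes rather than for the loop.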

{
public static final String PARTITION_PRUNING_ENABLED = "partition_pruning_enabled";
public static final String FILTER_PUSHDOWN_ENABLED = "filter_pushdown_enabled";
private static final String CACHE_ENABLED = "cache_enabled";
Member

Do we need the cache_enabled session property, since its value is not used as of now?

Contributor Author

Currently it is not enabled yet. It requires some work. Removing this property for now.

Contributor Author

Had to put it back because of the tests.

Contributor

@vkorukanti what tests use it? It's better to add properties back when caching is actually supported. Can you modify the tests not to use it?

Contributor Author

Without this property, the integration tests fail with the error call stack below. The problem is that the shared Hive file system code always looks up the CACHE_ENABLED property (HiveSessionProperties.isCacheEnabled in the trace below). If the property is not registered, it throws an unknown session property error. This is not related to Delta. In any case, caching will be added once this PR lands in master.


	at com.facebook.presto.metadata.SessionPropertyManager.lambda$decodeCatalogPropertyValue$1(SessionPropertyManager.java:181)
	at java.util.Optional.orElseThrow(Optional.java:290)
	at com.facebook.presto.metadata.SessionPropertyManager.decodeCatalogPropertyValue(SessionPropertyManager.java:181)
	at com.facebook.presto.FullConnectorSession.getProperty(FullConnectorSession.java:150)
	at com.facebook.presto.hive.HiveSessionProperties.isCacheEnabled(HiveSessionProperties.java:1087)
	at java.util.Optional.map(Optional.java:215)
	at com.facebook.presto.hive.cache.HiveCachingHdfsConfiguration.lambda$getConfiguration$0(HiveCachingHdfsConfiguration.java:76)
	at com.facebook.presto.hive.cache.HiveCachingHdfsConfiguration$CachingJobConf.createFileSystem(HiveCachingHdfsConfiguration.java:105)
	at org.apache.hadoop.fs.PrestoFileSystemCache.get(PrestoFileSystemCache.java:59)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at com.facebook.presto.hive.HdfsEnvironment.lambda$getFileSystem$0(HdfsEnvironment.java:71)
	at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
	at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:70)
	at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:64)
	at com.facebook.presto.delta.DeltaClient.loadDeltaTableLog(DeltaClient.java:148)
	at com.facebook.presto.delta.DeltaClient.getTable(DeltaClient.java:79)
	at com.facebook.presto.delta.DeltaMetadata.getTableHandle(DeltaMetadata.java:136)
	at com.facebook.presto.delta.DeltaMetadata.getTableHandle(DeltaMetadata.java:57)
	at com.facebook.presto.metadata.MetadataManager.getTableHandle(MetadataManager.java:330)
	at com.facebook.presto.sql.analyzer.StatementAnalyzer$Visitor.lambda$visitTable$17(StatementAnalyzer.java:1244)
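The failure mode described in this reply can be reduced to a small sketch: shared code asks for a session property by name, and the lookup fails unless the connector registered that name, even if the feature behind it is unused. The class below is hypothetical and only mirrors the shape of the problem; it is not Presto's `SessionPropertyManager`, and the error text is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the failure mode, not Presto's SessionPropertyManager.
class SessionPropertyRegistry
{
    private final Map<String, Object> defaults = new HashMap<>();

    SessionPropertyRegistry register(String name, Object defaultValue)
    {
        defaults.put(name, defaultValue);
        return this;
    }

    // Fails for any name the connector did not register.
    Object get(String name)
    {
        if (!defaults.containsKey(name)) {
            throw new IllegalArgumentException("Unknown session property " + name);
        }
        return defaults.get(name);
    }

    // Stand-in for shared file system code that always reads cache_enabled.
    static boolean isCacheEnabled(SessionPropertyRegistry properties)
    {
        return (Boolean) properties.get("cache_enabled");
    }

    public static void main(String[] args)
    {
        SessionPropertyRegistry withProperty = new SessionPropertyRegistry()
                .register("partition_pruning_enabled", true)
                .register("cache_enabled", false);
        System.out.println(isCacheEnabled(withProperty));

        SessionPropertyRegistry withoutProperty = new SessionPropertyRegistry()
                .register("partition_pruning_enabled", true);
        try {
            isCacheEnabled(withoutProperty);
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // the shared code path fails, not the connector
        }
    }
}
```

This is why the property has to stay registered for now, even with a default of "off": the caller that reads it lives outside the Delta connector.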

@nmahadevuni
Member

nmahadevuni commented Nov 18, 2021

TestDeltaIntegration is failing in the PR because of a dependency issue. Locally, it also fails with two errors:
[ERROR] TestDeltaIntegration.readCheckpointedDeltaTable:138->AbstractDeltaDistributedQueryTestBase.assertDeltaQueryFails:74->AbstractTestQueryFramework.assertQueryFails:235 Expected exception message 'No reproducible commits found at file:.../presto-delta/target/test-classes/checkpointed-delta-table/_delta_log' to match 'Can not find snapshot (3) in Delta table 'deltatables.checkpointed-delta-table@v3': No reproducible commits found at .*' for query: SELECT * FROM "checkpointed-delta-table@v3" WHERE col1 in (0, 10, 15)
[ERROR] TestDeltaIntegration.readSpecificSnapshotVersion:123->AbstractDeltaDistributedQueryTestBase.assertDeltaQuery:58->AbstractTestQueryFramework.assertQuery:143 Execution of 'actual' query failed: SELECT * FROM "snapshot-data3@t2020-10-26 02:52:48" WHERE col1 = 0
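Both failing queries address a snapshot through a suffix on the table name: `@v3` for a version and `@t2020-10-26 02:52:48` for a timestamp. The sketch below shows one way such a name could be split into its parts. The class name and the regular expressions are assumptions based only on the two queries above, not the connector's actual parsing code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split "name@v<version>" or "name@t<timestamp>" into parts.
// The patterns are assumptions based only on the two query examples in this thread.
class SnapshotTableName
{
    private static final Pattern VERSION = Pattern.compile("(.+)@v(\\d+)");
    private static final Pattern TIMESTAMP =
            Pattern.compile("(.+)@t(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})");

    final String tableName;
    final Optional<Long> version;
    final Optional<String> timestamp;

    private SnapshotTableName(String tableName, Optional<Long> version, Optional<String> timestamp)
    {
        this.tableName = tableName;
        this.version = version;
        this.timestamp = timestamp;
    }

    static SnapshotTableName parse(String name)
    {
        Matcher versionMatcher = VERSION.matcher(name);
        if (versionMatcher.matches()) {
            long version = Long.parseLong(versionMatcher.group(2));
            return new SnapshotTableName(versionMatcher.group(1), Optional.of(version), Optional.empty());
        }
        Matcher timestampMatcher = TIMESTAMP.matcher(name);
        if (timestampMatcher.matches()) {
            return new SnapshotTableName(
                    timestampMatcher.group(1), Optional.empty(), Optional.of(timestampMatcher.group(2)));
        }
        // No recognized suffix: treat the whole string as the table name (latest snapshot).
        return new SnapshotTableName(name, Optional.empty(), Optional.empty());
    }

    public static void main(String[] args)
    {
        SnapshotTableName byVersion = parse("checkpointed-delta-table@v3");
        System.out.println(byVersion.tableName + " " + byVersion.version);

        SnapshotTableName byTime = parse("snapshot-data3@t2020-10-26 02:52:48");
        System.out.println(byTime.tableName + " " + byTime.timestamp);
    }
}
```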

@nmahadevuni
Member

After your latest commits too, I see some errors when I ran the TestDelta* tests.

Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) on project presto-delta: Compilation failure: Compilation failure:
[ERROR].../presto-delta/src/main/java/com/facebook/presto/delta/DeltaPageSourceProvider.java:[85,1] cannot find symbol
[ERROR] symbol: static checkSchemaMatch
[ERROR] location: class com.facebook.presto.hive.parquet.ParquetPageSourceFactory
[ERROR].../presto-delta/src/main/java/com/facebook/presto/delta/rule/DeltaParquetDereferencePushDown.java:[20,40] package com.facebook.presto.parquet.rule does not exist
[ERROR].../presto-delta/src/main/java/com/facebook/presto/delta/rule/DeltaParquetDereferencePushDown.java:[33,17] cannot find symbol
[ERROR] symbol: class ParquetDereferencePushDown
[ERROR].../presto-delta/src/main/java/com/facebook/presto/delta/DeltaPageSourceProvider.java:[363,14] cannot find symbol
[ERROR] symbol: method checkSchemaMatch(org.apache.parquet.schema.Type,com.facebook.presto.common.type.Type)
[ERROR] location: class com.facebook.presto.delta.DeltaPageSourceProvider
[ERROR].../presto-delta/src/main/java/com/facebook/presto/delta/rule/DeltaPlanOptimizerProvider.java:[33,41] incompatible types: inference variable E has incompatible bounds
[ERROR] equality constraints: com.facebook.presto.spi.ConnectorPlanOptimizer
[ERROR] lower bounds: com.facebook.presto.delta.rule.DeltaParquetDereferencePushDown
[ERROR].../presto-delta/src/main/java/com/facebook/presto/delta/rule/DeltaParquetDereferencePushDown.java:[40,5] method does not override or implement a method from a supertype
[ERROR].../presto-delta/src/main/java/com/facebook/presto/delta/rule/DeltaParquetDereferencePushDown.java:[47,5] method does not override or implement a method from a supertype
[ERROR].../presto-delta/src/main/java/com/facebook/presto/delta/rule/DeltaParquetDereferencePushDown.java:[55,5] method does not override or implement a method from a supertype

@vkorukanti
Contributor Author

It looks like the presto-parquet module is not rebuilt in your env for some reason. Can you try on a clean build? Also I see the tests are passing in CI (here)

@nmahadevuni
Member

Thank you. Even though I did a clean build from a newly cloned directory, I had to build presto-parquet and presto-hive again; only then did the tests work.

@@ -0,0 +1,2 @@
{"commitInfo":{"timestamp":1636316599141,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":10,"isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputBytes":"687","numOutputRows":"1"}}}
Contributor

@yingsu00 Dec 23, 2021

Can we reformat the JSON files? You can right-click the "_delta_log" package in the IntelliJ Project view, then click "Reformat code" to do a batch code cleanup. I understand these files were program generated, but formatting them for the tests would make it easier for developers to understand the logic.

Contributor Author

That's a good idea. Reformatted these files. Changes should be available in the next push.

Contributor Author

Had to revert the formatting. It looks like the formatting is introducing special characters that are causing the JSON parser in Delta client library to fail.
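The thread does not say which characters caused the failure. One property of the format worth keeping in mind when touching these files is that a Delta commit file holds one complete JSON action per line, so a formatter that spreads an object over several lines changes what a line-oriented reader sees. The sketch below is a hypothetical, simplified check for that property (brace balance per line, no real JSON parsing); it may or may not match what went wrong here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: check that every line of a commit file is a self-contained
// JSON object. This only balances braces outside string literals; it is not a parser.
class DeltaLogLineCheck
{
    static boolean isCompleteJsonObject(String line)
    {
        String s = line.trim();
        if (!s.startsWith("{")) {
            return false;
        }
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                }
                else if (c == '"') {
                    inString = false;
                }
            }
            else if (c == '"') {
                inString = true;
            }
            else if (c == '{') {
                depth++;
            }
            else if (c == '}') {
                depth--;
                if (depth == 0 && i != s.length() - 1) {
                    return false; // trailing content after the object closed
                }
            }
        }
        return depth == 0 && !inString;
    }

    static boolean allLinesComplete(List<String> lines)
    {
        return lines.stream().allMatch(DeltaLogLineCheck::isCompleteJsonObject);
    }

    public static void main(String[] args)
    {
        List<String> compact = Arrays.asList(
                "{\"commitInfo\":{\"operation\":\"WRITE\"}}");
        List<String> prettyPrinted = Arrays.asList(
                "{",
                "  \"commitInfo\": {",
                "    \"operation\": \"WRITE\"",
                "  }",
                "}");
        System.out.println(allLinesComplete(compact));
        System.out.println(allLinesComplete(prettyPrinted));
    }
}
```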

return sessionProperties;
}

public static boolean isCacheEnabled(ConnectorSession session)
Contributor

This is not used. Shall we remove it? It's better to introduce the properties when the real feature is enabled.

Contributor Author

Without this property, the integration tests fail with the error call stack below. The problem is that the shared Hive file system code always looks up the CACHE_ENABLED property (HiveSessionProperties.isCacheEnabled in the trace below). If the property is not registered, it throws an unknown session property error. This is not related to Delta. In any case, caching will be added once this PR lands in master.


	at com.facebook.presto.metadata.SessionPropertyManager.lambda$decodeCatalogPropertyValue$1(SessionPropertyManager.java:181)
	at java.util.Optional.orElseThrow(Optional.java:290)
	at com.facebook.presto.metadata.SessionPropertyManager.decodeCatalogPropertyValue(SessionPropertyManager.java:181)
	at com.facebook.presto.FullConnectorSession.getProperty(FullConnectorSession.java:150)
	at com.facebook.presto.hive.HiveSessionProperties.isCacheEnabled(HiveSessionProperties.java:1087)
	at java.util.Optional.map(Optional.java:215)
	at com.facebook.presto.hive.cache.HiveCachingHdfsConfiguration.lambda$getConfiguration$0(HiveCachingHdfsConfiguration.java:76)
	at com.facebook.presto.hive.cache.HiveCachingHdfsConfiguration$CachingJobConf.createFileSystem(HiveCachingHdfsConfiguration.java:105)
	at org.apache.hadoop.fs.PrestoFileSystemCache.get(PrestoFileSystemCache.java:59)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at com.facebook.presto.hive.HdfsEnvironment.lambda$getFileSystem$0(HdfsEnvironment.java:71)
	at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
	at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:70)
	at com.facebook.presto.hive.HdfsEnvironment.getFileSystem(HdfsEnvironment.java:64)
	at com.facebook.presto.delta.DeltaClient.loadDeltaTableLog(DeltaClient.java:148)
	at com.facebook.presto.delta.DeltaClient.getTable(DeltaClient.java:79)
	at com.facebook.presto.delta.DeltaMetadata.getTableHandle(DeltaMetadata.java:136)
	at com.facebook.presto.delta.DeltaMetadata.getTableHandle(DeltaMetadata.java:57)
	at com.facebook.presto.metadata.MetadataManager.getTableHandle(MetadataManager.java:330)
	at com.facebook.presto.sql.analyzer.StatementAnalyzer$Visitor.lambda$visitTable$17(StatementAnalyzer.java:1244)

Collaborator

@zhenxiao left a comment

nice work, @vkorukanti
looks good to me. Just a few minor things
I would suggest we try to merge this soon. Then we could add more optimizations in follow-up PRs.

String typeBase = columnHandle.getDataType().getBase();
try {
switch (typeBase) {
case StandardTypes.TINYINT:
Collaborator

static import TINYINT, SMALLINT, INTEGER, etc?

Contributor Author

These are used only here. I could add, but the imports become too verbose as there are too many types.

Some of them are copied from github.com/delta-io/connectors/golden-tables

Currently these golden tables are not part of any test jars that Delta
provides. Make a PR in Delta to publish these golden tables as a Maven
artifact and in Presto add a Maven dependency in tests.

Currently the rule is focused on Hive Parquet tables. It doesn't work
in any connector other than Hive. Parquet is a common module between
Hive and Delta connectors. Most of the code in the Hive Parquet dereference
pushdown rule is common between Delta and Hive connectors.

Refactor the rule and create pluggable extension points so that Hive and other
connectors that work with Parquet data (such as Delta) can extend it and
avoid rewriting the same code.
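The refactoring described in this commit message follows the shape of a template method: shared logic sits in a base class and each connector fills in a few hooks. The sketch below only illustrates that shape. Every class and method name in it is hypothetical, plain strings stand in for plan and column objects, and it is not the actual `ParquetDereferencePushDown` API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the refactoring idea (template method). Strings stand in
// for table handles, expressions, and column handles.
abstract class DereferencePushDownSketch
{
    // Connector-specific hook: is this table backed by Parquet data the rule understands?
    protected abstract boolean isParquetTable(String tableHandle);

    // Connector-specific hook: build the connector's own handle for a pushed-down subfield.
    protected abstract String createSubfieldColumn(String subfield);

    // Shared logic: turn dereferences such as "address.city" into pushed-down columns.
    final List<String> pushDown(String tableHandle, List<String> dereferences)
    {
        if (!isParquetTable(tableHandle)) {
            return dereferences; // leave the plan untouched
        }
        List<String> columns = new ArrayList<>();
        for (String dereference : dereferences) {
            columns.add(createSubfieldColumn(dereference));
        }
        return columns;
    }

    public static void main(String[] args)
    {
        DereferencePushDownSketch deltaRule = new DeltaRuleSketch();
        System.out.println(deltaRule.pushDown("delta:orders", Arrays.asList("address.city")));
        System.out.println(deltaRule.pushDown("other:orders", Arrays.asList("address.city")));
    }
}

// A connector plugs in by overriding only the hooks.
class DeltaRuleSketch extends DereferencePushDownSketch
{
    @Override
    protected boolean isParquetTable(String tableHandle)
    {
        return tableHandle.startsWith("delta:");
    }

    @Override
    protected String createSubfieldColumn(String subfield)
    {
        return "delta-subfield(" + subfield + ")";
    }
}
```

Keeping `pushDown` final in the base class means a connector cannot accidentally fork the shared rewrite logic; it can only change the decisions exposed as hooks.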
Currently if two tests use the same Delta table, the deregistration
of the table in HMS is sometimes not applied immediately, resulting in
"table already exists" exceptions.
@rohanpednekar
Contributor

Thanks @vkorukanti and everyone who helped to get this PR reviewed and merged. Really appreciated 🙏

9 participants