Support views in Delta lake Connector #11763

Merged: 3 commits into trinodb:master on Dec 8, 2022

Conversation

mdesmet
Contributor

@mdesmet mdesmet commented Apr 3, 2022

Description

Support views in the Delta Lake connector

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Delta Lake connector

How would you describe this change to a non-technical end user or system administrator?

Related issues, pull requests, and links

Release notes

(x) Release notes entries required with the following suggested text:

# Delta Lake
* Add support for views. ({issue}`11763`)

@cla-bot cla-bot bot added the cla-signed label Apr 3, 2022
@findepi
Member

findepi commented Apr 4, 2022

@mdesmet the PR seems to cover AC changes that are unrelated to views.
I guess it wasn't intentional, please make sure they go into separate PRs.

@findepi
Member

findepi commented Apr 4, 2022

cc @alexjo2144 @findinpath @homar

@mdesmet
Contributor Author

mdesmet commented Apr 4, 2022

> @mdesmet the PR seems to cover AC changes that are unrelated to views. I guess it wasn't intentional, please make sure they go into separate PRs.

Some view tests within BaseConnectorTest failed because of the existing AC setup. I looked at how it's done in Iceberg and based my changes on that, but I haven't really tested all AC use cases or the implications of this change. I will set up a separate PR.

Also, for clarity: the original bug report mentioned Spark views. I have not investigated that for now. I think if we want to support that, we might need a translation layer similar to what LinkedIn did for Hive views (Coral). My focus has only been on supporting creation of views as in Iceberg/Hive.

@@ -112,7 +112,7 @@ public void testNoColumnStats()
assertQuery("SELECT c_str FROM no_column_stats WHERE c_int = 42", "VALUES 'foo'");
}

-@Test
+@Test(enabled = false)
Contributor

why?

Contributor Author

This is related to supporting all access modes. See #11782

I have added delta.security = system, and that causes the metadata to be retrieved in MetadataManager (before the access control was set to SYSTEM):

https://github.com/mdesmet/trino/blob/1c1fb3021648803d5138948e13b93c93ec2cf93d/core/trino-main/src/main/java/io/trino/metadata/MetadataManager.java#L2257-L2264

Member

please reenable the test.

then let's think how to make it pass. maybe we don't need delta.security = system here.
or maybe we need to merge the other PR first.
IDK yet, but let's have the test remind us (ie be failing until we can fix it)

Contributor Author

This change is related to the other PR. I will remove that commit here, but as I said before it seems that more tests are failing because of the AC setup in the BaseConnectorTest.

@@ -118,7 +157,9 @@ public HiveMetastoreBackedDeltaLakeMetastore(
{
Optional<Table> candidate = delegate.getTable(databaseName, tableName);
candidate.ifPresent(table -> {
-if (!TABLE_PROVIDER_VALUE.equalsIgnoreCase(table.getParameters().get(TABLE_PROVIDER_PROPERTY))) {
+if (((table.getTableType().equals(EXTERNAL_TABLE.name()) || table.getTableType().equals(MANAGED_TABLE.name())) &&
Contributor

one check per line pls (for readability's sake)

Member

Might be better to split these checks into two static util methods so that this line reads !isDeltaDataTable(table) && !isDeltaView(table)
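
A minimal sketch of what that split could look like, not the code that was merged. The two property constants come from the snippets in this PR; the class name, the `isUnsupported` wrapper, the plain-string table types, and the rule that a view is identified by the `VIRTUAL_VIEW` table type alone are assumptions made for illustration.

```java
import java.util.Map;

final class DeltaTableChecks
{
    // Constants as they appear in HiveMetastoreBackedDeltaLakeMetastore in this PR
    static final String TABLE_PROVIDER_PROPERTY = "spark.sql.sources.provider";
    static final String TABLE_PROVIDER_VALUE = "DELTA";

    private DeltaTableChecks() {}

    // True for an external or managed table whose provider property says Delta
    static boolean isDeltaDataTable(String tableType, Map<String, String> tableParameters)
    {
        boolean isDataTable = tableType.equals("EXTERNAL_TABLE") || tableType.equals("MANAGED_TABLE");
        boolean isDeltaProvider = TABLE_PROVIDER_VALUE.equalsIgnoreCase(tableParameters.get(TABLE_PROVIDER_PROPERTY));
        return isDataTable && isDeltaProvider;
    }

    // Assumption: a view is recognised by its table type only
    static boolean isDeltaView(String tableType, Map<String, String> tableParameters)
    {
        return tableType.equals("VIRTUAL_VIEW");
    }

    // Hypothetical wrapper: the call site then reads as a single condition
    static boolean isUnsupported(String tableType, Map<String, String> tableParameters)
    {
        return !isDeltaDataTable(tableType, tableParameters) && !isDeltaView(tableType, tableParameters);
    }
}
```

With this shape each predicate can be unit-tested on its own, and the condition in the lambda stays on one short line.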

}

private static boolean isView(String tableType, Map<String, String> tableParameters)

Contributor

remove empty line

@findepi
Copy link
Member

findepi commented Apr 4, 2022

Delta, Hive and Iceberg can use the same catalog (eg Glue or HMS).
A view created by one connector should be accessible by all others.
Let's have a test for that.

@findepi
Copy link
Member

findepi commented Apr 4, 2022

> Some view tests within BaseConnectorTest failed because of the existing AC setup. I looked at how it's done in Iceberg and based my changes on that, but I haven't really tested all AC use cases or the implications of this change. I will set up a separate PR.

Understood. Let's separate and iterate from there

> Also, for clarity: the original bug report mentioned Spark views. I have not investigated that for now. I think if we want to support that, we might need a translation layer similar to what LinkedIn did for Hive views (Coral). My focus has only been on supporting creation of views as in Iceberg/Hive.

This is a totally different story. I don't know what the level of support for Spark SQL in Coral is. So far we have only been reading HiveQL-based views with Coral, and that was really challenging to get right-ish. We can invest in SparkSQL-based views for Delta Lake, but that's certainly a project on its own. Let's have Trino views supported first, just as we did for Iceberg.

@mdesmet mdesmet force-pushed the feature/delta_views branch from 02a785a to 2737614 Compare April 5, 2022 19:16
@mdesmet
Contributor Author

mdesmet commented Apr 5, 2022

So basically the issue is with the following check in MetadataManager. The isCatalogManagedSecurity method returns false as we set up SYSTEM as the default AC mode in Delta.

    @Override
    public Optional<ViewDefinition> getView(Session session, QualifiedObjectName viewName)
    {
        Optional<ConnectorViewDefinition> connectorView = getViewInternal(session, viewName);
        if (connectorView.isEmpty() || connectorView.get().isRunAsInvoker() || isCatalogManagedSecurity(session, viewName.getCatalogName())) {
            return connectorView.map(view -> new ViewDefinition(viewName, view));
        }

        Identity runAsIdentity = systemSecurityMetadata.getViewRunAsIdentity(session, viewName.asCatalogSchemaTableName())
                .or(() -> connectorView.get().getOwner().map(Identity::ofUser))
                .orElseThrow(() -> new TrinoException(NOT_SUPPORTED, "Catalog does not support run-as DEFINER views: " + viewName));
        return Optional.of(new ViewDefinition(viewName, connectorView.get(), runAsIdentity));
    }

Then in the DeltaLakeAccessControlFactory we set up isUsingSystemSecurity to return true. However, in Hive and Iceberg we don't store the owner of the view in that case. The Iceberg BaseConnectorTest also fails with the same error if I set iceberg.security = system.

That's why I implemented the access control modes first: to be able to leverage the existing BaseConnectorTest tests, as the system AC doesn't support DEFINER views.
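
To make the failure mode concrete, here is a simplified model of the branch in getView quoted above. It is only a sketch: the class and method names are hypothetical, identities are plain strings, and UnsupportedOperationException stands in for the TrinoException with NOT_SUPPORTED.

```java
import java.util.Optional;

final class ViewRunAsSketch
{
    private ViewRunAsSketch() {}

    // Returns the identity a DEFINER view runs as.
    // Optional.empty() means the view is returned without a run-as identity.
    static Optional<String> resolveRunAsIdentity(
            boolean runAsInvoker,
            boolean catalogManagedSecurity,
            Optional<String> systemRunAsIdentity,
            Optional<String> viewOwner)
    {
        if (runAsInvoker || catalogManagedSecurity) {
            return Optional.empty();
        }
        // System security: prefer the system-provided identity, then the stored owner
        return Optional.of(systemRunAsIdentity
                .or(() -> viewOwner)
                .orElseThrow(() -> new UnsupportedOperationException("Catalog does not support run-as DEFINER views")));
    }
}
```

In this model, the case described above corresponds to `resolveRunAsIdentity(false, false, Optional.empty(), Optional.empty())`: system security is in use, no owner was stored with the view, and the call throws.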

getView(session, name).ifPresent(view -> views.put(name, view));
}
catch (TrinoException e) {
if (e.getErrorCode().equals(HIVE_VIEW_TRANSLATION_ERROR.toErrorCode())) {
Member

Wouldn't it be simpler to have a Set with these 3 error codes (you can put comments in the place where the set is created) and only one if here, something like:
if (errorCodes.contains(e.getErrorCode())) { /* do nothing */ } else { throw e; }
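
A self-contained sketch of that suggestion, under stated assumptions: the enum and exception class below are hypothetical stand-ins for the real error codes and TrinoException, and only HIVE_VIEW_TRANSLATION_ERROR is a name taken from this thread. The other two ignored codes are placeholders, since the snippet does not show them.

```java
import java.util.Set;

final class IgnoredViewErrors
{
    // Hypothetical stand-in for the real error codes; PLACEHOLDER_* are not real names
    enum ErrorCode { HIVE_VIEW_TRANSLATION_ERROR, PLACEHOLDER_ERROR_A, PLACEHOLDER_ERROR_B, OTHER_ERROR }

    // Hypothetical stand-in for an exception that carries an error code
    static final class CodedException extends RuntimeException
    {
        private final ErrorCode errorCode;

        CodedException(ErrorCode errorCode)
        {
            super(errorCode.name());
            this.errorCode = errorCode;
        }

        ErrorCode getErrorCode()
        {
            return errorCode;
        }
    }

    // Single place to document why each code is safe to skip
    static final Set<ErrorCode> IGNORED_ERROR_CODES = Set.of(
            ErrorCode.HIVE_VIEW_TRANSLATION_ERROR,
            ErrorCode.PLACEHOLDER_ERROR_A,
            ErrorCode.PLACEHOLDER_ERROR_B);

    private IgnoredViewErrors() {}

    // Swallows ignorable errors so the caller can skip the view; rethrows everything else
    static void rethrowUnlessIgnored(CodedException e)
    {
        if (!IGNORED_ERROR_CODES.contains(e.getErrorCode())) {
            throw e;
        }
    }
}
```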

Member

HIVE_VIEW_TRANSLATION_ERROR should never be thrown here anyway, since we're not translating views.

@findinpath
Contributor

@mdesmet please rebase on top of the changes from #11782

@mdesmet mdesmet force-pushed the feature/delta_views branch from 2737614 to 1701125 Compare September 9, 2022 06:35
@mdesmet mdesmet force-pushed the feature/delta_views branch 2 times, most recently from 2708f08 to 284cbe7 Compare September 11, 2022 22:18
@mdesmet mdesmet force-pushed the feature/delta_views branch 3 times, most recently from 955b6bf to d8c4da0 Compare October 9, 2022 20:50

public class HiveMetastoreBackedDeltaLakeMetastore
implements DeltaLakeMetastore
{
public static final String TABLE_PROVIDER_PROPERTY = "spark.sql.sources.provider";
public static final String TABLE_PROVIDER_VALUE = "DELTA";

// Be compatible with views defined by the Hive connector, which can be useful under certain conditions.
Contributor

, which can be useful under certain conditions. -> this part of the comment adds a bit of mystery to the statement, but offers no real gain in understanding for the reader.

private List<String> listDatabases(Optional<String> database)
{
if (database.isPresent()) {
// TODO: should this decision logic go into DeltaLakeMetadata or here
Contributor

Is this a question for the review?

@@ -188,6 +224,106 @@ public void renameTable(ConnectorSession session, SchemaTableName from, SchemaTa
delegate.renameTable(from.getSchemaName(), from.getTableName(), to.getSchemaName(), to.getTableName());
}

@Override
public void createView(ConnectorSession session, SchemaTableName schemaViewName, ConnectorViewDefinition definition, boolean replace)
Contributor

The logic of this method seems to be copied one-to-one from io.trino.plugin.iceberg.catalog.hms.TrinoHiveCatalog#createView in trino-iceberg.

This also applies to the other methods of the class newly introduced in this PR.

Can you think of a way to avoid this kind of code duplication?

Contributor Author

The code in Delta and Iceberg is indeed similar, but it differs from the Hive connector, as the Hive connector acts upon a SemiTransactionalHiveMetastore with slightly different method calls. In Iceberg and Delta we don't have the SemiTransactionalHiveMetastore.

To share the code we would need to create a helper class in the Hive connector (the easier part) with all view-related operations. This helper class would need access to the HiveMetastore implementation (which will relay to either the actual Hive Metastore or GlueHiveMetastore (Glue API calls), depending on Guice wiring). This works well for Delta. However, on Iceberg the implementation is different. It has two factories: one creates a TrinoCatalog for the Hive metastore, which has access to the HiveMetastore implementation; the other creates the Glue catalog, which calls the Glue API directly.

We could fix the latter by also using Hive's GlueHiveMetastore in Iceberg's TrinoGlueCatalog in order for the helper class to be used in both cases. There are, however, also differences between the TrinoGlueCatalog and GlueHiveMetastore that will need to be investigated and properly tested.

Code reuse is possible, but definitely at a much higher cost; it would probably be better handled in a separate PR. The question is what we do first: provide the views feature and do the internal refactoring afterwards, or refactor first and delay the views feature?

Contributor

As discussed offline, I'd try grouping together the code that would otherwise be repetitive after landing your PR. This seems rather cheap to do and shouldn't add complexity to the code of the Delta and Iceberg connectors.

Regarding TrinoGlueCatalog I'd suggest to leave it as it is (at least for now).

The main purpose of my comment was DRY (Don't repeat yourself) and not a more complex refactoring (which in any case would not be in the scope of this PR).

Contributor Author

I have followed your advice and added a ViewsMetastoreHelper class. Potentially it could also be used for Hive, but then we would have to use callbacks. I didn't do that for now.
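
For illustration only, the callback idea could look roughly like the sketch below. This is not the ViewsMetastoreHelper from the PR: the class name, the string-based view representation, and the two callbacks are all hypothetical. The point is just that the shared logic never touches a concrete metastore type, so a connector with a different metastore API could supply its own callbacks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiConsumer;
import java.util.function.Function;

final class ViewHelperSketch
{
    // Callbacks supplied by the connector (hypothetical shapes)
    private final Function<String, Optional<String>> loadView; // view name -> stored definition
    private final BiConsumer<String, String> storeView;        // (view name, definition)

    ViewHelperSketch(Function<String, Optional<String>> loadView, BiConsumer<String, String> storeView)
    {
        this.loadView = loadView;
        this.storeView = storeView;
    }

    // Shared create-or-replace logic, independent of how views are persisted
    void createView(String viewName, String definition, boolean replace)
    {
        if (!replace && loadView.apply(viewName).isPresent()) {
            throw new IllegalStateException("View already exists: " + viewName);
        }
        storeView.accept(viewName, definition);
    }

    // Usage example: an in-memory map plays the role of the metastore
    static Map<String, String> demo()
    {
        Map<String, String> store = new HashMap<>();
        ViewHelperSketch helper = new ViewHelperSketch(name -> Optional.ofNullable(store.get(name)), store::put);
        helper.createView("v", "SELECT 1", false);
        helper.createView("v", "SELECT 2", true);
        return store;
    }
}
```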

@@ -170,6 +170,9 @@ protected QueryRunner createQueryRunner()
protected boolean hasBehavior(TestingConnectorBehavior connectorBehavior)
{
switch (connectorBehavior) {
case SUPPORTS_CREATE_VIEW:
Contributor

As a follow-up to this PR, could you please create an issue (with the good-first-issue label) to add support for comments on Delta views?

@@ -170,6 +170,9 @@ protected QueryRunner createQueryRunner()
protected boolean hasBehavior(TestingConnectorBehavior connectorBehavior)
{
Contributor

The PR is missing a test where Trino views for Delta get created on a Glue-backed metastore.
Consider adding such a test in trino-delta-lake.

Contributor Author

Test added

@mdesmet mdesmet marked this pull request as draft October 28, 2022 18:50
@mdesmet mdesmet force-pushed the feature/delta_views branch 5 times, most recently from c99bf35 to a016c9c Compare October 30, 2022 19:10
@mdesmet mdesmet marked this pull request as ready for review October 30, 2022 19:13
@mdesmet mdesmet force-pushed the feature/delta_views branch from a016c9c to e00f037 Compare November 5, 2022 21:56
@@ -76,7 +76,6 @@
{
public static final String TABLE_PROVIDER_PROPERTY = "spark.sql.sources.provider";
public static final String TABLE_PROVIDER_VALUE = "DELTA";

Contributor

unrelated change - please restore.

String view = "test_glue_view_" + randomTableSuffix();
try {
assertUpdate(format("CREATE VIEW %s AS SELECT 1 AS val ", view), 1);
assertQuery("SELECT val FROM " + view, "VALUES (1)");
Contributor

nit: The parentheses in the expression VALUES (1) are not necessary.

public class TestDeltaLakeViewsGlueMetastore
extends AbstractTestQueryFramework
{
protected static final String SCHEMA = "test_delta_lake_glue_views_" + randomTableSuffix();
Contributor

Why protected? I think private will do just fine.

return Optional.of(definition);
}

public static boolean isView(String tableType, Map<String, String> tableParameters)
Contributor

This method can stay private

}
}

public List<SchemaTableName> listViews(ConnectorSession session, Optional<String> database)
Contributor

ConnectorSession session parameter seems unused.

.map(table -> new SchemaTableName(schema, table));
}

public Optional<ConnectorViewDefinition> getView(ConnectorSession session, SchemaTableName viewName)
Contributor

The ConnectorSession session parameter seems to be unused.
Make sure that the other methods calling this method drop it as well: dropView, getViews.

view.getOwner()));
}

private Map<String, String> createViewProperties(ConnectorSession session)
Contributor

Can we use the createViewProperties(ConnectorSession session, String trinoVersion, String connectorName) call instead (to avoid code duplication)?

}
}

@AfterClass
Contributor

@AfterClass(alwaysRun = true)

protected static final String SCHEMA = "test_delta_lake_glue_views_" + randomTableSuffix();
protected static final String CATALOG_NAME = "test_delta_lake_glue_views";

private File schemaLocation;
Contributor

Can be converted to local variable.

@mdesmet
Contributor Author

mdesmet commented Nov 23, 2022

> Where's the test for #11763 (comment)?
>
> > Delta, Hive and Iceberg can use the same catalog (eg Glue or HMS).
> > A view created by one connector should be accessible by all others.

@ebyhr: The only way to test all three together is through a product test. It seems we currently don't have any configuration that has Hive, Iceberg and Delta together. Is that what we want, or do you see another way to test it?

@mdesmet mdesmet force-pushed the feature/delta_views branch from b8fc01a to 5f8597e Compare November 23, 2022 23:15
@findinpath
Contributor

findinpath commented Nov 24, 2022

> Delta, Hive and Iceberg can use the same catalog (eg Glue or HMS).
> A view created by one connector should be accessible by all others.

I think you can use BaseSharedMetastoreTest from trino-iceberg as a PoC in the trino-delta-lake project, where you could showcase the above-mentioned functionality between the Delta Lake and Hive connectors.
It is enough to showcase it for those two and not for all three.

Side note:

#11763 (comment)

@mdesmet the product test environment singlenode-delta-lake-oss has both Hive and Delta with table redirection support already configured.

@mdesmet mdesmet force-pushed the feature/delta_views branch 3 times, most recently from d335361 to e27141c Compare November 24, 2022 08:24
@mdesmet mdesmet self-assigned this Nov 24, 2022
@mdesmet mdesmet force-pushed the feature/delta_views branch from e27141c to 2f14321 Compare November 25, 2022 20:13
@mdesmet mdesmet force-pushed the feature/delta_views branch 3 times, most recently from 8792f4c to 0cfc576 Compare November 29, 2022 10:41
@mdesmet
Contributor Author

mdesmet commented Nov 29, 2022

@ebyhr : PTAL

@ebyhr
Member

ebyhr commented Nov 30, 2022

/test-with-secrets sha=0cfc5761b4c02ec5137f9dfb751137f637bf42d0

@mdesmet mdesmet force-pushed the feature/delta_views branch from 0cfc576 to cd69957 Compare December 3, 2022 11:44
@ebyhr
Member

ebyhr commented Dec 5, 2022

/test-with-secrets sha=cd69957b50ee03ac83f57a86fa77017c47d65c4f

https://github.com/trinodb/trino/actions/runs/3615689927

@mdesmet mdesmet force-pushed the feature/delta_views branch from cd69957 to 94226a3 Compare December 7, 2022 14:33
@ebyhr
Member

ebyhr commented Dec 8, 2022

/test-with-secrets sha=94226a3573dde2ac87ba7cadb5cf607174f4bf8f

https://github.com/trinodb/trino/actions/runs/3644481861

@ebyhr ebyhr merged commit 6283fef into trinodb:master Dec 8, 2022
@github-actions github-actions bot added this to the 404 milestone Dec 8, 2022
@ebyhr ebyhr mentioned this pull request Dec 8, 2022

if (!isPrestoView(tableParameters)) {
// Hive views are not compatible
throw new HiveViewNotSupportedException(viewName);
Member

In what situations does isView above return true (the method is defined as isHiveOrPrestoView(tableType) && PRESTO_VIEW_COMMENT.equals(tableParameters.get(TABLE_COMMENT))) while isPrestoView returns false?

Intuitively, this line is not reachable?

Member

(i am changing this in #18570)

Development

Successfully merging this pull request may close these issues.

Delta Lake connector shows hive views as tables
Delta connector does support View creation
8 participants