-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support ALTER TABLE RENAME TO in Delta #11401
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the storage location for the table seems wrong, as any existing data for the table would be lost. Try changing the test to insert into the table before renaming, then verify the data can be read after the rename.
I think the only thing needed to implement rename is to call metastore.renameTable()
.
That's what I have tried first. What I saw is that "location" in the table properties still refers to the old location after running
In my test I'm already using a CTAS, so data will be inserted before the rename and is sucessfully retrieved after the rename. |
In table creation, if location is set, the table is considered as external (unmanaged table), if not the location is set based on the schema location if avaiable and the table becomes a managed table. Trino throws an exception if the schema location is not set. This PR will also throw an exception if the schema location is not set, however there may be an issue if the schema contains both managed and unmanaged tables, the code would convert a before unmanaged table to a managed table by simply renaming it. |
@electrum depends whether Delta's transaction log contains relative or absolute paths. For this reason i think we shouldn't rename the location.
|
In spark it works as follows: If managed -> update to the new storage location and send the table to hms, hms will do the location rename operation and rename the table I remember that's why in iceberg the table is set to unmanaged and the storage location is not updated when renaming the table. Maybe different from Iceberg is that Delta has both managed and unmanaged tables. Why not support both cases, similar to Spark and add a session property as @alexjo2144 mentioned on Slack just as in Databricks that needs to be enabled to rename managed tables? Simply keeping the existing location would require us to convert a managed into a unmanaged table.
Indeed the log entries could be made with both relative path or absolute path. However on first sight this not handled in Spark/Delta as well. Maybe Databricks does something additional there? this would only be a real issue when it's done for managed tables. |
What about Glue? |
Glue at this moment doesn't support table renames. There is only an UpdateTable API but that doesn't perimit renaming the table. See AWS glue docs It's also blocked within Trino in the GlueHiveMetastore class
|
I also looked in the Hive Metastore code if we are able to do a rename without changing the location for a managed table. The data is moved when
This means that we can't stop HMS to change the location unless we (temporarely?) convert it to a external table by setting the EXTERNAL parameter to "TRUE". So it would be up to 2 metadata operations, The first operation is to change the external property to TRUE for managed tables and rename the table, the second operation to revert the external property for managed tables. Would that be an acceptable behaviour? Or if not, are there any other suggestions on what would be an acceptable behaviour? |
414881d
to
ec37cc8
Compare
QueryAssertions.assertUpdate(queryRunner, queryRunner.getDefaultSession(), "CREATE TABLE " + tableName + " AS SELECT 123 x", OptionalLong.of(1), Optional.empty()); | ||
|
||
String renamedTable = "test_rename_new_" + randomTableSuffix(); | ||
QueryAssertions.assertQueryFails(queryRunner, getSession(), "ALTER TABLE " + tableName + " RENAME TO " + renamedTable, "Renaming managed tables will move the data on the underlying storage system. Set 'delta.rename-managed-tables.enabled' to enable this feature."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use here assertThatThrownBy
instead of changing the visibility of the method QueryAssertions.assertQueryFails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would work, but isn't changing the visibility the better option here? Why is public static void assertUpdate(QueryRunner queryRunner, Session session, @Language("SQL") String sql, OptionalLong count, Optional<Consumer<Plan>> planAssertion)
public? Changing the visibility would allow tests to be colocated to test features that require multiple queryrunners and have a similar recognizable API. Other alternative would be to move it to a separate test class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertThatThrownBy(() -> query)
or leverage inherited AbstractTestQueryFramework#assertQueryFails
.
(the fact that you couldn't make a static import could ring a bell:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I'm doing this is because I create another QueryRunner in this specific test without the setting delta-rename-managed-tables.enabled=true
. That's why i can't simply call the standard methods.
I will move this test to a separate class.
} | ||
|
||
@Config("delta.rename-managed-tables.enabled") | ||
public DeltaLakeConfig setRenameManagedTablesEnabled(boolean renameManagedTablesEnabled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an @ConfigDescription
for the setting for explaining the reasoning for this setting.
I have changed the implementation to work as follow:
To make it work as in Iceberg (not moving the data) we should make every table external, but then data won't be removed on DROP (which is different from how it's done in Iceberg). I think Delta in Trino should mimic how it works within the Spark/Databricks ecosystem, hence I consider supporting both managed tables and external tables the better option Test cases have been added for both external and managed tables. |
ec37cc8
to
01c656f
Compare
@@ -1504,6 +1507,43 @@ public void dropTable(ConnectorSession session, ConnectorTableHandle tableHandle | |||
metastore.dropTable(session, handle.getSchemaName(), handle.getTableName()); | |||
} | |||
|
|||
@Override | |||
public void renameTable(ConnectorSession session, ConnectorTableHandle tableHandle, SchemaTableName newTable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the names in line with what's in ConnectorMetadata
public void renameTable(ConnectorSession session, ConnectorTableHandle tableHandle, SchemaTableName newTable) | |
public void renameTable(ConnectorSession session, ConnectorTableHandle tableHandle, SchemaTableName newTableName) |
Database schema = metastore.getDatabase(newTable.getSchemaName()).orElseThrow(() -> new SchemaNotFoundException(newTable.getSchemaName())); | ||
Optional<String> schemaLocation = getSchemaLocation(schema); | ||
if (schemaLocation.isEmpty()) { | ||
throw new TrinoException(NOT_SUPPORTED, "The 'location' property must be specified either for the table or the schema"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "Target schema must have a location when moving managed tables"
String location = new Path(schemaLocation.get(), newTable.getTableName()).toString(); | ||
Path targetPath = new Path(location); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd flip these to avoid the double Path construction, alternatively inline the toString
calls
String location = new Path(schemaLocation.get(), newTable.getTableName()).toString(); | |
Path targetPath = new Path(location); | |
Path targetPath = new Path(schemaLocation.get(), newTable.getTableName()); | |
String location = targetPath.toString(); |
} | ||
|
||
@Test | ||
public void testRenameManagedTableIsNotSupportedWhenRenameManagedTablesNotEnabled() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a test that shows the underlying files are actually moved when renaming managed tables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default public void testRenameTable()
from BaseConnectorSmokeTest
tests renaming of a managed table. Now we don't actually verify if the files are moved, do you think that's useful? In the end you'll be testing what hive metastore is doing. If hive metastore starts doing something different, the read query would fail. We depend on the SERDE location in the storage descriptor for the reads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I forgot about the base class tests. That's fine with me
Overall looks really good, thanks for working on this! |
@findepi : on renaming Glue tables, actually I think it should be doable similar to how materialized views are replaced with 2 metadata operations (create and drop of the storage table). If you think that's interesting we can create another issue to tackle that. There is definitely user demand as in dbt-trino they just introduced a PR for glue tables where they work around the rename swap of the production table with the new production table by dropping the production table and then running the production table CTAS (effectively having no access to the production table during the rebuild process). |
3f34c3b
to
c1bc7b6
Compare
Can you add something to the Delta documentation page, including why managed table renames require an opt-in? |
c1bc7b6
to
2877fe9
Compare
2877fe9
to
6bb959d
Compare
Renaming tables | ||
^^^^^^^^^^^^^^^ | ||
|
||
When using Hive metastore, you can use the connector to :doc:`rename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can rename Delta Lakes tables with ALTER TABLE.
|
||
* Tables created without explicit location | ||
|
||
Renaming will move the data on the underlying storage system. As this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming moves
Renaming will move the data on the underlying storage system. As this | ||
can be an expensive operation, an explicit opt-in is required. Set | ||
``delta.rename-managed-tables.enabled`` to ``true`` in your catalog | ||
properties and restart your Trino server to activate this behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an explicit opt-in is required with setting delta.rename-managed-tables.enabled
to true
.
Drop the rest
|
||
* Tables created with explicit location | ||
|
||
Renaming is only a metadata operation and requires no additional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
operation, and requires
Renaming is only a metadata operation and requires no additional | ||
configuration. | ||
|
||
Rename operations on Delta Lake tables backed by Glue metastore are not supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove .. we do no document what we do not support.
3c9f1b9
to
67f138a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs look good.. no input from my side about the rest..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay.
I am still unsure what the rename semantics for tables with implicit locations should be.
there is a bunch of options.
The RENAME of tables with explicit location is a clearly defined problem (location mustn't change).
@mdesmet I don't want to block this awesome enhancements for "external" tables. This is a very needed feature. I also don't want to jump the gun on the handing of "managed" tables (with implicit locations). Will add some considerations in a bit.
Can you please split this PR into two PRs? Here, keep support for renaming "external" tables only (with explicit location) and move the "managed" to a new one?
Renaming moves the data on the underlying storage system. As this | ||
can be an expensive operation, an explicit opt-in is required with | ||
setting ``delta.rename-managed-tables.enabled`` to ``true``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true for all metastores, or only for HMS?
* - ``delta.rename-managed-tables.enabled`` | ||
- Enable :ref:`renaming tables <renaming-tables>` created without | ||
explicit location. A rename moves the data on the underlying storage | ||
system. As moving data on the underlying storage system is a potentially |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apologies, but i am still unsure what the rename semantics for tables with implicit locations should be.
there is a bunch of options, and sending rename to metastore is one of them, but i am not yet convinced it's the best one.
Added Options as a comment here: #11400 (comment) |
6b121b2
to
57eaaaa
Compare
57eaaaa
to
e3919f1
Compare
03d7f57
to
562979c
Compare
562979c
to
e83abac
Compare
CI #12950 |
Did the docs changes get deleted from this PR / moved into a different one? We'll still need to document this. |
Description
New feature
Change of the Delta connector and potentially all connectors based on the Hive connector
Implement support for table renames in the Delta connector
Related issues, pull requests, and links
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: