Iceberg DROP table not removing data from s3 #5616
Comments
A possible solution would be to add a new configurable IcebergTableProperty (type), something like https://github.com/prestosql/presto/pull/5656/files |
… properties on table. Possible solution for trinodb#5616
@electrum, why would the Hive table type matter for whether the data for an Iceberg table gets deleted? Iceberg provides utilities for cleaning up the data if locations are mixed, and if the table owns a location then Presto should be able to simply delete recursively.
I can understand wanting to control this behavior. We would turn it off, for example, because we have a service that cleans up old data. But by default I would expect drop table to remove the data.
Another solution is to stop checking whether the path exists. Iceberg tables won't conflict with other uses of a prefix and can clean up the data referenced by the table, so other data can exist in the location without correctness issues for Iceberg. I would probably relax that constraint. |
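To illustrate the point about mixed locations, here is a minimal sketch on a local temporary directory rather than S3. It only models the idea of deleting the files a table references; `drop_referenced_files` and the file names are hypothetical and are not Iceberg or Presto APIs.

```python
import tempfile
from pathlib import Path


def drop_referenced_files(referenced):
    """Delete only the files that the table's metadata references."""
    for path in referenced:
        Path(path).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as prefix:
    root = Path(prefix)
    # One file belongs to the table, the other is unrelated data
    # that happens to live under the same prefix.
    table_file = root / "data-0001.parquet"
    other_file = root / "unrelated.csv"
    table_file.write_text("table data")
    other_file.write_text("someone else's data")

    drop_referenced_files([table_file])

    table_file_left = table_file.exists()
    other_file_left = other_file.exists()
```

With this approach the unrelated file survives the drop, which is why other data under the same prefix would not be a correctness problem.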
Metastore will delete files from s3 but only for tables with type |
@sshkvar, that is the correct behavior for Hive tables, but not for Iceberg tables. I think Iceberg tables should continue to use external and the Iceberg connector should clean up the files. |
@rdblue @electrum based on our discussion I have created an additional PR, #6108. |
I'm going to put an item on the agenda for the next Iceberg community sync to talk about prefix ownership. There are some good questions here:
|
@rdblue I have created another PR, #6063, which adds the ability to have a unique location for each table, so a table's location can be deleted when the table is removed. Also, PR #6108 adds the ability to recursively delete table data on drop (disabled by default, but it can be enabled in configuration). |
Curious, what was decided on this? Users from the Hive and Spark SQL world are used to the notion that dropping a managed table drops the data, while dropping an external table does not. |
We decided there are valid use cases where the table owns its location and where it doesn't own its location. So the consensus was to create either a table flag or a catalog flag to control it. I think we should add a flag to the Iceberg catalog to determine whether it will drop the data location recursively. |
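As a rough sketch of what such a flag could look like, the snippet below gates a recursive delete on a boolean. It uses a local temporary directory and a plain dict in place of a real catalog; the name `purge_location` and the `drop_table` helper are assumptions for illustration, not the flag or API that was agreed on.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(catalog: dict, name: str, purge_location: bool) -> Path:
    """Remove the table entry; delete its location only if the flag is set."""
    location = Path(catalog.pop(name))
    if purge_location:
        # Recursive delete: only safe when the table owns its location.
        shutil.rmtree(location, ignore_errors=True)
    return location


base = Path(tempfile.mkdtemp())
catalog = {}
for name in ("owned", "shared"):
    loc = base / name
    loc.mkdir()
    (loc / "data.parquet").write_text("rows")
    catalog[name] = loc

owned_loc = drop_table(catalog, "owned", purge_location=True)
shared_loc = drop_table(catalog, "shared", purge_location=False)
owned_exists = owned_loc.exists()
shared_exists = shared_loc.exists()

shutil.rmtree(base, ignore_errors=True)  # clean up the sketch's temp data
```

Both tables disappear from the catalog, but only the one dropped with the flag enabled loses its data directory.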
@rdblue was it decided where the flag is stored and what its name is? |
Hi,
I am using Presto (version 343) with the Iceberg connector and found an issue.
Tables that were created with
CREATE TABLE ...
or
CREATE TABLE AS SELECT ...
and then dropped are removed only from the Hive Metastore, not from S3. If we then try to create the same table again, we get a
table already exists
exception.
Example:
Note:
In the connector config we have
hive.metastore.thrift.delete-files-on-drop=true
Issue investigation results
2.1. The first place is the Hive Metastore: it deletes the files of an EXTERNAL table only if the
external.table.purge=TRUE
parameter is passed to it.
2.2. The second place is ThriftHiveMetastore, but because TableType=EXTERNAL_TABLE is hardcoded, this code is skipped.
As a result, the data is not deleted from S3.
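The two checks above can be modelled roughly as follows. This is a simplified sketch of the behaviour described in this issue, not the actual Hive or Presto code: the `Table` class and function names are hypothetical, and it assumes that managed tables are cleaned up by the metastore and that the connector-side delete applies only to managed tables.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    table_type: str
    parameters: dict = field(default_factory=dict)


def metastore_deletes_data(table: Table) -> bool:
    # 2.1: for EXTERNAL tables the metastore deletes files only when
    # external.table.purge=TRUE is set (assumption: managed tables are deleted).
    if table.table_type == "EXTERNAL_TABLE":
        return table.parameters.get("external.table.purge", "").lower() == "true"
    return True


def connector_deletes_data(table: Table, delete_files_on_drop: bool) -> bool:
    # 2.2: the connector-side delete is skipped for EXTERNAL_TABLE
    # (assumption: it runs only for MANAGED_TABLE).
    return delete_files_on_drop and table.table_type == "MANAGED_TABLE"


def data_removed_on_drop(table: Table, delete_files_on_drop: bool) -> bool:
    return metastore_deletes_data(table) or connector_deletes_data(
        table, delete_files_on_drop
    )


# An Iceberg table as currently created: hardcoded EXTERNAL_TABLE, no purge parameter.
iceberg_table = Table("EXTERNAL_TABLE")
removed = data_removed_on_drop(iceberg_table, delete_files_on_drop=True)

# The same table with the purge parameter set.
purged_table = Table("EXTERNAL_TABLE", {"external.table.purge": "TRUE"})
removed_with_purge = data_removed_on_drop(purged_table, delete_files_on_drop=True)
```

In this model neither path deletes the data for the first table, even with hive.metastore.thrift.delete-files-on-drop=true, which matches what we observe.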