Delta Lake connector doesn't remove content of `MANAGED_TABLE` Glue tables created by Databricks #13017

findinpath · 2022-06-28T12:32:55Z

Scenario:

Databricks SQL

CREATE SCHEMA my_schema LOCATION "s3://trino-ci-test/my_schema/";


CREATE TABLE my_schema.my_table (a, b) %s AS VALUES (1, 2), (2, 3), (3, 4)

Trino

DROP TABLE delta.my_schema.my_table;

The operation succeeds in Trino. However, the table's content is still present on AWS S3 after the operation completes.
This means that a new table with the same name as the dropped table cannot be created, because the directory corresponding to the table already exists.

Context:

aws s3api head-object --bucket trino-ci-test --key my_schema/my_table/_delta_log/

{
    "AcceptRanges": "bytes",
    "LastModified": "2022-06-28T11:44:47+00:00",
    "ContentLength": 0,
    "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
    "ContentType": "application/octet-stream",
    "Metadata": {}
}

The lookup above, with the trailing slash in the key, corresponds to what happens in

trino/plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java

Lines 735 to 745 in fe608f2

    
    ObjectMetadata getS3ObjectMetadata(Path path)
            throws IOException
    {
        String bucketName = getBucketName(uri);
        String key = keyFromPath(path);
        ObjectMetadata s3ObjectMetadata = getS3ObjectMetadata(path, bucketName, key);
        if (s3ObjectMetadata == null && !key.isEmpty()) {
            return getS3ObjectMetadata(path, bucketName, key + PATH_SEPARATOR);
        }
        return s3ObjectMetadata;
    }

As seen in the head-object output above, the ContentType of this "object" is set to application/octet-stream, which is not the same as the application/x-directory value checked in

trino/plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java

Line 425 in fe608f2

MediaType.parse(metadata.getContentType()).is(DIRECTORY_MEDIA_TYPE),

Because of this mismatch, the marker object is not recognized as a directory, which breaks the logic used for deleting Delta Lake table directories in AWS S3.
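To make the mismatch concrete, here is a minimal, self-contained sketch. It does not use the AWS SDK or Trino code: `ObjectInfo`, `resolveKey`, `isDirectoryByContentType`, and `isDirectoryLenient` are hypothetical names introduced only for illustration, and the in-memory map stands in for the bucket. The lenient check (zero-byte object whose key ends in a slash) is an assumption about one possible way to recognize such markers, not necessarily how the issue was eventually fixed.

```java
import java.util.HashMap;
import java.util.Map;

public class DirectoryMarkerSketch {
    static final String PATH_SEPARATOR = "/";
    static final String DIRECTORY_CONTENT_TYPE = "application/x-directory";

    // Hypothetical stand-in for the two metadata fields that matter here.
    static final class ObjectInfo {
        final String contentType;
        final long contentLength;

        ObjectInfo(String contentType, long contentLength) {
            this.contentType = contentType;
            this.contentLength = contentLength;
        }
    }

    // Same fallback idea as the snippet quoted above: try the key as-is,
    // then retry with a trailing path separator. Returns null if neither exists.
    static String resolveKey(Map<String, ObjectInfo> bucket, String key) {
        if (bucket.containsKey(key)) {
            return key;
        }
        if (!key.isEmpty() && bucket.containsKey(key + PATH_SEPARATOR)) {
            return key + PATH_SEPARATOR;
        }
        return null;
    }

    // Strict check: only the content type decides whether this is a directory.
    static boolean isDirectoryByContentType(ObjectInfo info) {
        return DIRECTORY_CONTENT_TYPE.equals(info.contentType);
    }

    // Lenient check (assumption): also accept a zero-byte object whose key ends in "/".
    static boolean isDirectoryLenient(String key, ObjectInfo info) {
        return isDirectoryByContentType(info)
                || (key.endsWith(PATH_SEPARATOR) && info.contentLength == 0);
    }

    public static void main(String[] args) {
        Map<String, ObjectInfo> bucket = new HashMap<>();
        // Marker as reported by head-object in this issue.
        bucket.put("my_schema/my_table/_delta_log/",
                new ObjectInfo("application/octet-stream", 0));

        String resolved = resolveKey(bucket, "my_schema/my_table/_delta_log");
        ObjectInfo info = bucket.get(resolved);

        System.out.println("resolved key: " + resolved);
        System.out.println("strict directory check: " + isDirectoryByContentType(info));
        System.out.println("lenient directory check: " + isDirectoryLenient(resolved, info));
    }
}
```

With the marker from this issue, the strict content-type check rejects the object while the lenient check accepts it, which is the gap described above.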

ebyhr · 2022-08-08T03:05:15Z

Filed https://community.databricks.com/s/question/0D58Y000093AR43SAG/how-to-identify-s3-object-type-directory-or-file-created-by-databricks. I will update this issue once I get a response from the Databricks community.

findinpath added the bug Something isn't working label Jun 28, 2022

ebyhr pushed a commit that referenced this issue Jun 29, 2022

Exclude test from TestDeltaLakeDropTableCompatibility class

63023ad

Avoid running into a test failure when dealing with a test depending on Databricks AWS S3 and AWS Glue environment. Consult for further details: #13017

ebyhr pushed a commit that referenced this issue Jul 4, 2022

Exclude test from TestDeltaLakeDropTableCompatibility class

f89f14c

Avoid running into a test failure when dealing with a test depending on Databricks AWS S3 and AWS Glue environment. Consult for further details: #13017

ebyhr self-assigned this Jul 13, 2022

ebyhr removed their assignment Aug 8, 2022

findinpath mentioned this issue Aug 25, 2022

Respect hive.metastore.thrift.delete-files-on-drop config property for dropping partitions #13822

Merged

findinpath self-assigned this Sep 1, 2022

findinpath mentioned this issue Sep 2, 2022

Perform S3 directory deletion with batch requests #13974

Merged

findepi closed this as completed in #13974 Dec 9, 2022

findepi mentioned this issue Dec 9, 2022

Release notes for 405 #15058

Closed

findepi added this to the 404 milestone Dec 9, 2022

anusudarsan mentioned this issue Mar 15, 2024

Delta lake connector leaves behind directory when dropping managed tables created by Databricks #21111

Closed
