Delta Lake connector doesn't remove content of MANAGED_TABLE Glue tables created by Databricks #13017

Closed
findinpath opened this issue Jun 28, 2022 · 1 comment · Fixed by #13974

Scenario:

Databricks SQL

CREATE SCHEMA my_schema LOCATION "s3://trino-ci-test/my_schema/";


CREATE TABLE my_schema.my_table (a, b) %s AS VALUES (1, 2), (2, 3), (3, 4)

Trino

DROP TABLE delta.my_schema.my_table;

The operation succeeds in Trino. However, the table's content is still present on AWS S3 after the operation completes.
As a result, a new table with the same name as the dropped table cannot be created, because the directory corresponding to the table already exists.

Context:

aws s3api head-object --bucket trino-ci-test --key my_schema/my_table/_delta_log/

{
    "AcceptRanges": "bytes",
    "LastModified": "2022-06-28T11:44:47+00:00",
    "ContentLength": 0,
    "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
    "ContentType": "application/octet-stream",
    "Metadata": {}
}

The call above corresponds to what happens in

ObjectMetadata getS3ObjectMetadata(Path path)
        throws IOException
{
    String bucketName = getBucketName(uri);
    String key = keyFromPath(path);
    ObjectMetadata s3ObjectMetadata = getS3ObjectMetadata(path, bucketName, key);
    if (s3ObjectMetadata == null && !key.isEmpty()) {
        return getS3ObjectMetadata(path, bucketName, key + PATH_SEPARATOR);
    }
    return s3ObjectMetadata;
}
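
To make the fallback in this method easier to follow, here is a minimal, self-contained sketch of the same lookup order: try the key as given, then retry with a trailing slash. The class name `TrailingSlashLookup`, the `headObject` helper, and the in-memory `Map` standing in for the bucket are all assumptions made for illustration; the real method issues the request through the AWS SDK and returns `ObjectMetadata`.

```java
import java.util.Map;
import java.util.Optional;

public class TrailingSlashLookup
{
    private static final String PATH_SEPARATOR = "/";

    // Hypothetical stand-in for an S3 HEAD request:
    // returns the content type stored under the key, if the key exists.
    static Optional<String> headObject(Map<String, String> bucket, String key)
    {
        return Optional.ofNullable(bucket.get(key));
    }

    // Same order as the method above: the key as-is first,
    // then the key with a trailing slash appended.
    static Optional<String> lookup(Map<String, String> bucket, String key)
    {
        Optional<String> metadata = headObject(bucket, key);
        if (metadata.isEmpty() && !key.isEmpty()) {
            return headObject(bucket, key + PATH_SEPARATOR);
        }
        return metadata;
    }

    public static void main(String[] args)
    {
        // Only the slash-terminated key exists, as in the head-object call above.
        Map<String, String> bucket = Map.of("my_schema/my_table/_delta_log/", "application/octet-stream");
        System.out.println(lookup(bucket, "my_schema/my_table/_delta_log"));
    }
}
```

In this sketch the lookup without the trailing slash finds nothing, so the second attempt returns the metadata of the slash-terminated "object", which is the one whose ContentType matters for the directory check.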

As the listing above shows, the ContentType of this "object" is application/octet-stream, not the application/x-directory that the following check expects:

MediaType.parse(metadata.getContentType()).is(DIRECTORY_MEDIA_TYPE),

This interesting detail breaks the logic used for deleting Delta Lake table directories in AWS S3.
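
The mismatch can be reproduced in isolation. The sketch below is a simplified illustration, not the Trino code: it compares plain strings instead of using Guava's `MediaType`, and the class `DirectoryMarkerCheck` and both method names are hypothetical. The lenient variant is only one conceivable heuristic (an empty object whose key ends with a slash), shown for contrast; it is not claimed to be what the eventual fix does.

```java
public class DirectoryMarkerCheck
{
    private static final String DIRECTORY_CONTENT_TYPE = "application/x-directory";

    // Strict check: only the content type decides whether the object is a directory marker.
    static boolean isDirectoryStrict(String contentType)
    {
        return DIRECTORY_CONTENT_TYPE.equalsIgnoreCase(contentType);
    }

    // Hypothetical lenient heuristic: also accept a zero-length object whose key ends with "/".
    static boolean isDirectoryLenient(String key, long contentLength, String contentType)
    {
        return isDirectoryStrict(contentType) || (contentLength == 0 && key.endsWith("/"));
    }

    public static void main(String[] args)
    {
        // Values taken from the head-object output above.
        String key = "my_schema/my_table/_delta_log/";
        String contentType = "application/octet-stream";
        long contentLength = 0;

        System.out.println(isDirectoryStrict(contentType)); // false
        System.out.println(isDirectoryLenient(key, contentLength, contentType)); // true
    }
}
```

With the metadata from this issue, the strict check reports that the path is not a directory, which would explain why the table directory is left behind.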

@findinpath findinpath added the bug Something isn't working label Jun 28, 2022
findinpath added a commit to findinpath/trino that referenced this issue Jun 28, 2022
Avoid running into a test failure when dealing with a
test depending on Databricks AWS S3 and AWS Glue environment.

Consult for further details:

trinodb#13017
ebyhr pushed a commit that referenced this issue Jun 29, 2022
Avoid running into a test failure when dealing with a
test depending on Databricks AWS S3 and AWS Glue environment.

Consult for further details:

#13017
@ebyhr ebyhr self-assigned this Jul 13, 2022
ebyhr commented Aug 8, 2022

Filed https://community.databricks.com/s/question/0D58Y000093AR43SAG/how-to-identify-s3-object-type-directory-or-file-created-by-databricks. I will update this issue once I get a response from the Databricks community.
