Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore empty folder placeholder files on S3 #4552

Merged
merged 1 commit into from
Jul 27, 2020

Conversation

alexjo2144
Copy link
Member

These files are created by the Hadoop file system to indicate empty folders.

@cla-bot cla-bot bot added the cla-signed label Jul 23, 2020
Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think we need config for this.

@alexjo2144 alexjo2144 force-pushed the s3-ignore-hadoop-empty-folder branch from beb927f to 9d4fa82 Compare July 23, 2020 17:46
@alexjo2144
Copy link
Member Author

Thanks, removed the config.

@alexjo2144
Copy link
Member Author

@rohangarg messaged me on slack with the suggestion we move this change to https://github.com/prestosql/presto/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/util/HiveFileIterator.java#L86 Do you have a preference @electrum ?

@alexjo2144 alexjo2144 force-pushed the s3-ignore-hadoop-empty-folder branch from 9d4fa82 to 0098476 Compare July 23, 2020 18:04
@rohangarg
Copy link
Member

messaged me on slack with the suggestion we move this change to https://github.com/prestosql/presto/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/util/HiveFileIterator.java#L86 Do you have a preference?

One benefit of the split generation change approach could be that it handles custom S3FileSystemType as well..

@pettyjamesm
Copy link
Member

@rohangarg messaged me on slack with the suggestion we move this change to https://github.com/prestosql/presto/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/util/HiveFileIterator.java#L86 Do you have a preference @electrum ?

Not electrum here, but I think it would be a good idea not to .filter those out here, but use it to set the isdir property here instead.

That way it can be up to the caller to decide whether they want to filter those out or not (maybe via Hadoop's PathFilter API), but doing this directly for now in HiveFileIterator like you mentioned makes sense.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i propsoe a naming change

@@ -591,6 +592,11 @@ private static boolean isGlacierObject(S3ObjectSummary object)
return Glacier.toString().equals(object.getStorageClass());
}

private static boolean isHadoopEmptyFolderObject(S3ObjectSummary object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static boolean isHadoopEmptyFolderObject(S3ObjectSummary object)
private static boolean isHadoopFolderMarker(S3ObjectSummary object)

@@ -39,6 +39,7 @@
private GetObjectMetadataRequest getObjectMetadataRequest;
private CannedAccessControlList acl;
private boolean hasGlacierObjects;
private boolean hasHadoopEmptyFolderObjects;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private boolean hasHadoopEmptyFolderObjects;
private boolean hasHadoopFolderMarkers;

@@ -570,6 +570,22 @@ private static void assertSkipGlacierObjects(boolean skipGlacierObjects)
}
}

@Test
public void testSkipHadoopEmptyFolderObjectsEnabled()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename appropriately

@findepi
Copy link
Member

findepi commented Jul 24, 2020

I think it would be a good idea not to .filter those out here, but use it to set the isdir property here instead.

@pettyjamesm the xxx_$folder$ itself is not a directory. It is supposed not to exist and to cause xxx be treated as a folder.
I'd rather ignore them for now and reiterate when we see a need for that. This would help us provide better test cases too.

@findepi
Copy link
Member

findepi commented Jul 24, 2020

A follow-up: #4566

@alexjo2144
Copy link
Member Author

Renamed, thanks @findepi

@findepi
Copy link
Member

findepi commented Jul 24, 2020

A note on filtering in HiveFileIterator -- ive does not ignore those empty files when running on hdfs://
so if we do ignore them, this may lead to some problems for end users.
This is file system specific so let's treat this on file system level.
I've created a follow up for Hive w/caching -- #4566

@alexjo2144 alexjo2144 force-pushed the s3-ignore-hadoop-empty-folder branch from 7ee0e22 to 4f03197 Compare July 24, 2020 17:24
These files are created by the Hadoop file system to indicate empty folders.
@alexjo2144 alexjo2144 force-pushed the s3-ignore-hadoop-empty-folder branch from 4f03197 to c1ec76d Compare July 24, 2020 18:16
@findepi findepi merged commit d72622d into trinodb:master Jul 27, 2020
@findepi
Copy link
Member

findepi commented Jul 27, 2020

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants