Verify the file presence for cached directory lister and retry #20414
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -513,24 +513,45 @@ private List<TrinoFileStatus> listBucketFiles(TrinoFileSystem fs, Location locat | |
|
||
@VisibleForTesting | ||
Iterator<InternalHiveSplit> buildManifestFileIterator(InternalHiveSplitFactory splitFactory, Location location, List<Location> paths, boolean splittable) | ||
{ | ||
return createInternalHiveSplitIterator(splitFactory, splittable, Optional.empty(), verifiedFileStatusesStream(location, paths)); | ||
} | ||
|
||
private Stream<TrinoFileStatus> verifiedFileStatusesStream(Location location, List<Location> paths) | ||
{ | ||
TrinoFileSystem trinoFileSystem = fileSystemFactory.create(session); | ||
// Check if location is cached BEFORE using the directoryLister | ||
boolean isCached = directoryLister.isCached(location); | ||
|
||
Map<String, TrinoFileStatus> fileStatuses = new HashMap<>(); | ||
Iterator<TrinoFileStatus> fileStatusIterator = new HiveFileIterator(table, location, trinoFileSystem, directoryLister, RECURSE); | ||
if (!fileStatusIterator.hasNext()) { | ||
checkPartitionLocationExists(trinoFileSystem, location); | ||
} | ||
fileStatusIterator.forEachRemaining(status -> fileStatuses.put(Location.of(status.getPath()).path(), status)); | ||
Stream<TrinoFileStatus> fileStream = paths.stream() | ||
|
||
// If file statuses came from cache verify that all are present | ||
if (isCached) { | ||
boolean missing = paths.stream() | ||
Review comment: Rather than fully reloading the whole cache it'd be nice if we could just check any missing paths directly. |
Reply: Yes, that would be great. Unfortunately, the directory lister can't list files individually; it lists by folder and caches the same way. We invalidate the cache for the parent folder (not the whole cache!), which causes its contents to be reloaded. |
||
.anyMatch(path -> !fileStatuses.containsKey(path.path())); | ||
// Invalidate the cache and reload | ||
if (missing) { | ||
directoryLister.invalidate(location); | ||
Review comment: I think the general approach here is fine, I would suggest changing the interfaces in a slightly different way though. Exposing What I'd do differently is rather than exposing My reason for that is you're assuming here that the cache key is on a Location, but that's an internal to the caching DirectoryListers that's not guaranteed to be stable. |
Reply: Hi @alexjo2144. I looked into your proposal to remove invalidate(Location) from the interface. That doesn't look right to me. Directory listers are chained using the delegate model; some of them cache (by Location), some do not (they have a no-op invalidate). If we remove invalidate from the interface, we won't be able to push it down the chain. We only call invalidate(location) if isCached(location) is true, which in a way verifies that the particular directory lister supports caching by location. What do you think? |
||
|
||
fileStatuses.clear(); | ||
fileStatusIterator = new HiveFileIterator(table, location, trinoFileSystem, directoryLister, RECURSE); | ||
fileStatusIterator.forEachRemaining(status -> fileStatuses.put(Location.of(status.getPath()).path(), status)); | ||
} | ||
} | ||
|
||
return paths.stream() | ||
.map(path -> { | ||
TrinoFileStatus status = fileStatuses.get(path.path()); | ||
if (status == null) { | ||
throw new TrinoException(HIVE_FILE_NOT_FOUND, "Manifest file from the location [%s] contains non-existent path: %s".formatted(location, path)); | ||
} | ||
return status; | ||
}); | ||
return createInternalHiveSplitIterator(splitFactory, splittable, Optional.empty(), fileStream); | ||
} | ||
|
||
private ListenableFuture<Void> getTransactionalSplits(Location path, boolean splittable, Optional<BucketConversion> bucketConversion, InternalHiveSplitFactory splitFactory) | ||
|
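The verify-then-retry flow in verifiedFileStatusesStream can be looked at in isolation. The following is a minimal sketch under stated assumptions, not the PR's code: ToyCachingLister and ManifestVerifier are hypothetical stand-ins, plain strings replace Location and TrinoFileStatus, and IllegalStateException stands in for the TrinoException with HIVE_FILE_NOT_FOUND.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a caching directory lister; not Trino's DirectoryLister API.
class ToyCachingLister
{
    private final Map<String, Set<String>> storage;                 // simulated file system: directory -> files
    private final Map<String, Set<String>> cache = new HashMap<>(); // per-directory listing cache
    int storageListings;                                            // number of listings that reached storage

    ToyCachingLister(Map<String, Set<String>> storage)
    {
        this.storage = storage;
    }

    boolean isCached(String directory)
    {
        return cache.containsKey(directory);
    }

    void invalidate(String directory)
    {
        cache.remove(directory);
    }

    Set<String> list(String directory)
    {
        // On a cache miss, take a snapshot of the directory contents from storage
        return cache.computeIfAbsent(directory, dir -> {
            storageListings++;
            return new HashSet<>(storage.getOrDefault(dir, Set.of()));
        });
    }
}

class ManifestVerifier
{
    // Returns the manifest paths once every one of them is confirmed present in the listing.
    static List<String> verifiedPaths(ToyCachingLister lister, String directory, List<String> manifestPaths)
    {
        // Ask BEFORE listing, because listing itself populates the cache
        boolean wasCached = lister.isCached(directory);
        Set<String> listed = lister.list(directory);

        // A missing path may only mean the cached listing is stale: invalidate and list once more
        if (wasCached && !listed.containsAll(manifestPaths)) {
            lister.invalidate(directory);
            listed = lister.list(directory);
        }

        // Anything still missing is reported as an error
        for (String path : manifestPaths) {
            if (!listed.contains(path)) {
                throw new IllegalStateException("Manifest contains non-existent path: " + path);
            }
        }
        return new ArrayList<>(manifestPaths);
    }
}
```

In this sketch the retry happens at most once per call, and only when the first listing was served from the cache, so an uncached directory is listed exactly once before a missing path is reported.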
It doesn't feel right to me to ask each time whether the location is cached.
You are adding handling for a corner case in the happy flow this way.
Maybe it would be better to add a procedure to clear the directory listing cache for a specified location.
The code under this 'if' verifies that all the files listed in the manifest are present in the directory listing. We know that a discrepancy could be caused by a stale cache, and in that case there is a way to handle it (invalidate the cache and retry). There is no point doing this if the location is not cached: invalidation is a no-op and a retry would produce the same results.
I did add an invalidate(Location) call to the directory lister, so the conditional code would work in any case. The check is just a performance optimization: it avoids verifying and retrying when those are not going to change anything anyway.
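The delegate-chain argument from the thread (invalidate is pushed down the chain, non-caching listers treat it as a no-op, and isCached guards the retry) can be sketched as follows. This is an illustration only: ToyLister, DirectLister and CachingDelegateLister are hypothetical names, and the interface is a simplification rather than Trino's actual DirectoryLister.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified lister interface; not Trino's DirectoryLister.
interface ToyLister
{
    List<String> list(String location);

    boolean isCached(String location);

    void invalidate(String location);
}

// Non-caching lister at the end of the chain: always lists, never caches.
class DirectLister
        implements ToyLister
{
    int listings; // number of listings actually performed

    @Override
    public List<String> list(String location)
    {
        listings++;
        return List.of(location + "/file");
    }

    @Override
    public boolean isCached(String location)
    {
        return false;
    }

    @Override
    public void invalidate(String location)
    {
        // No-op: there is nothing to invalidate
    }
}

// Caching lister that wraps a delegate and keys its cache by location.
class CachingDelegateLister
        implements ToyLister
{
    private final ToyLister delegate;
    private final Map<String, List<String>> cache = new HashMap<>();

    CachingDelegateLister(ToyLister delegate)
    {
        this.delegate = delegate;
    }

    @Override
    public List<String> list(String location)
    {
        List<String> cached = cache.get(location);
        if (cached == null) {
            cached = delegate.list(location);
            cache.put(location, cached);
        }
        return cached;
    }

    @Override
    public boolean isCached(String location)
    {
        return cache.containsKey(location) || delegate.isCached(location);
    }

    @Override
    public void invalidate(String location)
    {
        // Drop the local entry, then push the invalidation down the chain
        cache.remove(location);
        delegate.invalidate(location);
    }
}
```

With this shape, a caller that checks isCached(location) before invalidating never pays for a second listing on a chain that contains no caching lister, which is the optimization described in the reply above.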