Improve async queue #12775
1830b51 to d3fe71f
@@ -116,6 +116,21 @@ public synchronized ListenableFuture<Void> offer(T element)
        return immediateVoidFuture();
    }

    // Used to insert all elements returned from borrowBatchAsync
Drop the comment (not super helpful) and rename the method to offerAll.
Or maybe we can just drop the method altogether and do:
synchronized (this) {
    for (T element : borrowResult.getElementsToInsert()) {
        offer(element);
    }
}
WDYT?
Comment removed, method renamed to offerAll.
We still need to check the finishing && borrowCount == 0 condition as well as the notEmptySignal notification criteria, so a private method seems warranted to me instead of inlining the logic. Additionally, ArrayDeque does specialize addAll to ensure that the capacity grows to accommodate the entire collection of newly inserted elements, which individual separate offer calls do not.
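To make the trade-off above concrete, here is a minimal sketch of a batch insert that takes the lock once, checks the finishing condition once, and fires the not-empty notification at most once. It is not the actual AsyncQueue code: the class name BatchQueueSketch, the use of CompletableFuture in place of the real future type, and the helper methods finish(), size() and notificationsFired() are assumptions made only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified stand-in for the queue discussed above; not Trino code.
final class BatchQueueSketch<T>
{
    private final ArrayDeque<T> elements = new ArrayDeque<>();
    private CompletableFuture<Void> notEmptySignal = new CompletableFuture<>();
    private boolean finishing;
    private int borrowCount; // stays 0 in this sketch; borrowing is not modeled
    private int notificationsFired;

    // Batch insert: one monitor acquisition, one bulk add, at most one notification.
    synchronized void offerAll(Collection<? extends T> batch)
    {
        // Same guard a single-element offer would need, evaluated once per batch.
        if (batch.isEmpty() || (finishing && borrowCount == 0)) {
            return;
        }
        boolean wasEmpty = elements.isEmpty();
        // ArrayDeque.addAll can size the backing array for the whole batch up front.
        elements.addAll(batch);
        if (wasEmpty) {
            // Swap in a fresh signal, then complete the old one.
            // (Assumption: a real implementation may complete it outside the lock.)
            CompletableFuture<Void> signal = notEmptySignal;
            notEmptySignal = new CompletableFuture<>();
            notificationsFired++;
            signal.complete(null);
        }
    }

    synchronized void finish()
    {
        finishing = true;
    }

    synchronized int size()
    {
        return elements.size();
    }

    synchronized int notificationsFired()
    {
        return notificationsFired;
    }
}
```

Inlining a loop of offer calls inside synchronized (this) would also work because Java monitors are reentrant, but each call would repeat the guard and the notification check per element.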
if (recordScannedFiles) {
    splits.forEach(split -> scannedFilePaths.add(((HiveSplit) split).getPath()));
    hiveSplits.stream()
            .filter(split -> split.getStart() == 0)
Is there a chance that the split starting at 0 could be filtered out altogether while other splits for the same file are still present? Can this happen with DF, for example? cc: @raunaqmorarka
It should not be possible, no: the first split generated for a file will always have start == 0. Now, it is possible that a file will get eliminated by dynamic filters after the first split is produced, which could then eliminate the remainder of the file from being scanned, but that could happen even with the previous implementation. As far as I can tell, this logic is only necessary for "optimizing small files" and collecting the files scanned to delete them after the contents are rewritten, which should not have dynamic filters in the first place because it would be unsafe for this reason with or without this improvement.
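As a rough illustration of the start == 0 filter discussed here, the sketch below records a path only for the split that begins at offset 0, so a file that produces several splits is added once. The types ScannedPathsSketch and SplitSketch and the method recordScannedPaths are hypothetical stand-ins, not the real HiveSplit or HiveSplitSource API, and the sketch assumes every file's first split starts at offset 0, as stated in the reply.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the HiveSplitSource implementation.
final class ScannedPathsSketch
{
    // Stand-in for a split: a file path plus the byte offset where the split starts.
    static final class SplitSketch
    {
        final String path;
        final long start;

        SplitSketch(String path, long start)
        {
            this.path = path;
            this.start = start;
        }
    }

    // Adds a path only for splits starting at offset 0, i.e. once per file.
    static Set<String> recordScannedPaths(List<SplitSketch> splits)
    {
        Set<String> scannedFilePaths = new LinkedHashSet<>();
        splits.stream()
                .filter(split -> split.start == 0)
                .map(split -> split.path)
                .forEach(scannedFilePaths::add);
        return scannedFilePaths;
    }
}
```

The resulting set is the same as when adding the path of every split; the filter only skips the redundant add calls for the later splits of each file.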
d3fe71f to 65165ee
Adds a method to handle batch insertion of all elements to insert returned from AsyncQueue.BorrowResult. This simplifies the implementation and avoids threads contending to synchronize on the queue for each individual element inserted.
Only records the Path for scanned files in HiveSplitSource when the split start point is at the beginning of the file. The only consumer of the scanned files list is interested only in the unique paths scanned, so enqueuing the same path repeatedly for the same InternalHiveSplit is unnecessarily expensive.
65165ee to e9bfb3d
Do we want release notes for this? Seems like a niche change, but it could be worth mentioning.
Probably not, it's a fairly specific internal implementation detail. I would vote "no", but @losipiuk can weigh in if he disagrees.
Description
Addresses two minor inefficiencies in the hive split generation path:
- Avoids repeatedly acquiring the AsyncQueue lock when inserting new elements returned from AsyncQueue.BorrowResult by acquiring the lock, enqueueing all elements to be inserted, and firing any required notifications once.
- Avoids recording the HiveSplit path scanned (when enabled) multiple times and instead only records it once for each file, since all consumers only care about the unique paths scanned.
Refactoring performance improvement
Refactoring of some core utilities in trino-hive split generation.
A non-technical user should not need these changes explained to them.
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: