Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve async queue #12775

Merged
merged 2 commits into from
Jun 16, 2022
Merged

Improve async queue #12775

merged 2 commits into from
Jun 16, 2022

Conversation

pettyjamesm
Copy link
Member

Description

Addresses two minor inefficiencies in hive split generation path:

  • Avoids contention on the AsyncQueue lock when inserting new elements returned from AsyncQueue.BorrowResult by acquiring the lock, enqueueing all elements to be inserted, and firing any required notifications once.
  • Avoids recording the path of HiveSplit scanned (when enabled) multiple times and instead only records it once for each file since all consumers only care about the unique paths scanned.

Is this change a fix, improvement, new feature, refactoring, or other?

Refactoring performance improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Refactoring of some core utilities in trino-hive split generation.

How would you describe this change to a non-technical end user or system administrator?

A non-technical user should not need these changes explained to them.

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jun 9, 2022
@pettyjamesm pettyjamesm requested a review from losipiuk June 9, 2022 21:02
@pettyjamesm pettyjamesm force-pushed the improve-async-queue branch from 1830b51 to d3fe71f Compare June 13, 2022 18:37
@@ -116,6 +116,21 @@ public synchronized ListenableFuture<Void> offer(T element)
return immediateVoidFuture();
}

// Used to insert all elements returned from borrowBatchAsync
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop comment (not super helpful) and rename method to offerAll.

Or maybe we can just drop method at all and do:

                        synchronized (this) {
                            for (T element : borrowResult.getElementsToInsert()) {
                                offer(element);
                            }
                        }

WDYT?

Copy link
Member Author

@pettyjamesm pettyjamesm Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment removed, method ranmed to offerAll.

We still need to check the finishing && borrowCount == 0 condition as well as the notEmptySignal notification criteria, so a private method seems warranted to me instead of inlining the logic. Additionall, ArrayDeque does specialize addAll to ensure that the capacity grows to accomodate the entire collection of new inserted elements which individual separate offer calls does not.

if (recordScannedFiles) {
splits.forEach(split -> scannedFilePaths.add(((HiveSplit) split).getPath()));
hiveSplits.stream()
.filter(split -> split.getStart() == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a chance that split startin at 0 could be filtered out alltogether? While still another splits for same file are present? Can this happen with DF for example? cc: @raunaqmorarka ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not be possible, no- the first split generated for a file will always have start == 0. Now, it is possible that a file will get eliminated by dynamic filters after the first split is produced which could then eliminate the remainder of the file from being scanned, but that could happen even with the previous implementation. As far as I can tell, this logic is only necessary for "optimizing small files" and collecting the files scanned to delete them after the contents are rewritten- which should not have dynamic filters in the first place because it would be unsafe for this reason with or without this improvement.

@pettyjamesm pettyjamesm force-pushed the improve-async-queue branch from d3fe71f to 65165ee Compare June 14, 2022 14:16
Adds a method to handle batch insertion of all elements to
insert returned from AsyncQueue.BorrowResult. This simplifies the
implementation and threads contending to synchronize on the queue
for each individual element inserted.
Only records the Path for scanned files in HiveSplitSource when
the split start point is at the beginning of the file. The only
consumer of the scanned files list is only interested in unique
paths scanned, so enqueuing the same path repeatedly for the same
InternalHiveSplit is unnecessarily expensive.
@pettyjamesm pettyjamesm force-pushed the improve-async-queue branch from 65165ee to e9bfb3d Compare June 14, 2022 18:07
@pettyjamesm pettyjamesm requested a review from losipiuk June 15, 2022 19:46
@losipiuk losipiuk merged commit dc7f526 into trinodb:master Jun 16, 2022
@github-actions github-actions bot added this to the 387 milestone Jun 16, 2022
@pettyjamesm pettyjamesm deleted the improve-async-queue branch June 16, 2022 14:48
@colebow
Copy link
Member

colebow commented Jun 16, 2022

Do we want release notes for this? Seems like a niche change, but it could be worth mentioning.

@pettyjamesm
Copy link
Member Author

Do we want release notes for this? Seems like a niche change, but it could be worth mentioning.

Probably not, it’s a fairly specific internal implementation detail. I would vote “no”, but @losipiuk can weigh in if he disagrees

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants