Ensure PartitionedOutputOperator is run with fixed local distribution #13834
Conversation
Force-pushed from dce8c8d to 2c1b725
To the extent of my limited expertise with this code, it looks good. @sopel39 please take a look.
Also, we should run benchmarks on this PR before merging, IMO.
This change will introduce a performance regression (wall time) due to the extra buffer and thread switching. I experimented with something similar for partial aggregation. Due to this issue, we also disabled spill in source stage joins (#315).
Usually the source stage loop is super efficient from table scan to output because it lacks any context switching.
If anything, IMO we should move merging of pages out of PartitionedOutputOperator into the buffering layer (maybe with some opportunistic locking); see the sketch below.
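For illustration, a minimal sketch of what "merging in the buffering layer with opportunistic locking" could look like (my reading of the suggestion, not an actual implementation; Page and concat() are simplified stand-ins):

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of opportunistic page merging in the buffering layer. Not Trino
// code: Page is modeled by its size only, and concat() stands in for
// copying two small pages into a single one.
final class MergingBufferSketch
{
    record Page(int sizeInBytes) {}

    private static final int SMALL_PAGE_BYTES = 64 * 1024;

    private final ConcurrentLinkedDeque<Page> pages = new ConcurrentLinkedDeque<>();
    private final ReentrantLock mergeLock = new ReentrantLock();

    void enqueue(Page page)
    {
        // Merge only when the page is small and the lock is free; never
        // block the producing driver thread just to compact pages.
        if (page.sizeInBytes() < SMALL_PAGE_BYTES && mergeLock.tryLock()) {
            try {
                Page tail = pages.pollLast();
                if (tail != null && tail.sizeInBytes() < SMALL_PAGE_BYTES) {
                    pages.addLast(concat(tail, page));
                    return;
                }
                if (tail != null) {
                    // Tail was already large: put it back untouched.
                    pages.addLast(tail);
                }
            }
            finally {
                mergeLock.unlock();
            }
        }
        pages.addLast(page);
    }

    Page poll()
    {
        return pages.pollFirst();
    }

    private static Page concat(Page left, Page right)
    {
        return new Page(left.sizeInBytes() + right.sizeInBytes());
    }
}
```

The point of tryLock over lock is that the producer keeps the tight source-stage loop: if another thread holds the merge lock, the page is simply appended unmerged.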
Attaching benchmarks for this change.
Force-pushed from b88f4c0 to 3400d4e
Updated. The biggest regressions (…). To address the small discrepancy that came out during the last benchmark I optimized the local exchange a bit, by removing (…). Currently the benchmark results look nearly identical. I also tried to simulate an alleged worst case locally (when splits are fully consumed and the exchange is not necessary), and I didn't find any difference in latency and CPU utilization with and without an additional local exchange. The profiler also doesn't show any significant differences, neither in CPU utilization nor in lock contention.
Since there's no observable difference in the common case, I think we should consider merging the change, as it helps provide better efficiency for corner cases (high-cardinality partitioning + selective predicates).
@@ -104,25 +104,33 @@
typeOf(ExchangeNode.class)
        .with(scope().equalTo(REMOTE))
        .with(source().matching(
                // PushPartialAggregationThroughExchange adds a projection. However, it can be removed if RemoveRedundantIdentityProjections is run in the mean-time.
                typeOf(ProjectNode.class).capturedAs(PROJECTION)
                typeOf(ProjectNode.class)
any ProjectNode is fine?
My understanding is that one of the rules adds identity projections that are later removed. This rule is trying to capture these projections. @findepi could you please take a look and validate whether my understanding is correct?
        .with(step().equalTo(AggregationNode.Step.PARTIAL))
        .with(nonEmpty(groupingColumns()))
typeOf(ExchangeNode.class)
        .with(scope().equalTo(LOCAL))
any Local exchange is fine?
In my understanding this rule is trying to capture a specific plan shape that got changed with the introduction of an additional local exchange, so other types of local exchange shouldn't be expected in this plan shape. But it would be better to wait for @findepi to take a look to make sure my understanding is correct.
@arhimondr sorry, I don't remember. It feels like I wrote this code like 5 years ago.
    return forceFixedDistributionForPartitionedOutputOperatorEnabled;
}

@Config("experimental.force-fixed-distribution-for-partitioned-output-operator-enabled")
nit: this is super long. Also, why experimental?
Maybe optimizer.force-partitioned-output-fixed-distribution
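For reference, the suggested rename in an Airlift-style config class would look roughly like this (a sketch; the class name and description text are assumptions):

```java
import io.airlift.configuration.Config;
import io.airlift.configuration.ConfigDescription;

public class FeaturesConfig
{
    private boolean forcePartitionedOutputFixedDistribution;

    public boolean isForcePartitionedOutputFixedDistribution()
    {
        return forcePartitionedOutputFixedDistribution;
    }

    // Airlift convention: @Config goes on the setter, which returns `this`
    // so config values can be chained in tests.
    @Config("optimizer.force-partitioned-output-fixed-distribution")
    @ConfigDescription("Run PartitionedOutputOperator with a fixed local distribution")
    public FeaturesConfig setForcePartitionedOutputFixedDistribution(boolean value)
    {
        this.forcePartitionedOutputFixedDistribution = value;
        return this;
    }
}
```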
Also add some documentation here, and maybe in some .rst if we have any.
I intend to remove this property after there's enough confidence that no regression is introduced. End users shouldn't be changing this property, as the plan shape alterations may result in some planner rules (such as AddExchangesBelowPartialAggregationOverGroupIdRuleSet) no longer kicking in.
LGTM
I wonder if this approach will cause limit queries to run for longer than needed, e.g. output rows will now be buffered until a full page is collected.
Do we have limit queries in the benchmark? Should we extend the benchmark with those?
Actually most of the TPCDS queries have LIMIT.
These are on aggregation output, so the entire subquery must execute.
We should probably benchmark with a query which just does a super selective table scan with a limit.
@sopel39 It is not necessarily true. The …
@losipiuk Yeah, given the complex interactions involved in short-circuiting limit and the fact that we had problems with it in the past, I don't think it's a bad idea to validate. Let me check.
@arhimondr could you add a test that …
Consumer won't be able to fetch a page until …
@sopel39 I think I now understand the problem. Great catch. Discussed with @losipiuk offline and it looks like …
Force-pushed from 3400d4e to a43b3c5
@sopel39 @losipiuk It feels like a classic throughput vs latency problem that is intrinsic to buffering. It is hard to say what the right flush strategy is. I tried to approach this problem by flushing after a certain delay (see the last commit: a43b3c5). However, I'm not sure what the best strategy is. CC: @martint
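The shape of the delay-based approach is roughly the following (an illustrative sketch, not the code from a43b3c5; all names and thresholds are made up):

```java
import java.time.Duration;

// Flush when the buffered data reaches a full page, or when the first
// buffered row has been waiting longer than maxFlushDelay. This trades a
// bounded amount of latency for larger pages.
final class FlushPolicy
{
    private final long maxPageBytes;
    private final long maxFlushDelayNanos;
    private long firstRowAppendedNanos = -1; // -1 means "buffer is empty"

    FlushPolicy(long maxPageBytes, Duration maxFlushDelay)
    {
        this.maxPageBytes = maxPageBytes;
        this.maxFlushDelayNanos = maxFlushDelay.toNanos();
    }

    // Call when a row (or small page) is appended to the buffer.
    void onAppend(long nowNanos)
    {
        if (firstRowAppendedNanos == -1) {
            firstRowAppendedNanos = nowNanos;
        }
    }

    boolean shouldFlush(long bufferedBytes, long nowNanos)
    {
        if (bufferedBytes == 0) {
            return false;
        }
        return bufferedBytes >= maxPageBytes
                || nowNanos - firstRowAppendedNanos >= maxFlushDelayNanos;
    }

    // Call after the buffer has been flushed downstream.
    void onFlush()
    {
        firstRowAppendedNanos = -1;
    }
}
```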
@arhimondr if you have a plan like: …
then …
Yeah. But it still will wait until …
That's an FTE issue, right? In pipeline mode it would work.
I think it would not, unless I am missing something. We can chat offline.
Force-pushed from a43b3c5 to 06c056b
Actually it looks like today it's not possible for a partial limit to end up between a table scan and a …. So the plan of a LEFT OUTER JOIN (seemingly the only query shape that can be impacted by this change) looks like the following: …
Basically, on the left side the plan looks like the following: … This plan shape may be problematic for very large limits (e.g., LIMIT 1_000_000_000), as it requires all the rows to be processed by a single node before passing them to a join, and we may need to address that at some point. However, at this point it doesn't seem like this PR should introduce any regression. I'm dropping the …
Force-pushed from 06c056b to 5355abe
    exchangerSupplier = () -> new BroadcastExchanger(buffers, memoryManager);
}
else if (partitioning.equals(FIXED_ARBITRARY_DISTRIBUTION)) {
if (partitioning.equals(SINGLE_DISTRIBUTION) || partitioning.equals(FIXED_ARBITRARY_DISTRIBUTION)) {
    exchangerSupplier = () -> new RandomExchanger(buffers, memoryManager);
Is this more expensive potentially?
You mean for single distribution? I don't think so. The broadcast implementation seemed to be more complex.
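For intuition, this is roughly all a random exchanger does (a simplified sketch; the real LocalExchange code also handles memory accounting and blocking). With a single buffer, nextInt(1) is always 0, so routing single distribution through it costs effectively nothing extra:

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Consumer;

final class RandomExchangerSketch<P>
{
    private final List<Consumer<P>> buffers;

    RandomExchangerSketch(List<Consumer<P>> buffers)
    {
        this.buffers = List.copyOf(buffers);
    }

    void accept(P page)
    {
        // Each page goes to exactly one randomly chosen buffer; with
        // SINGLE_DISTRIBUTION there is one buffer and the choice is trivial.
        int index = ThreadLocalRandom.current().nextInt(buffers.size());
        buffers.get(index).accept(page);
    }
}
```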
@@ -41,8 +42,11 @@

private final Consumer<LocalExchangeSource> onFinish;

private final BlockingQueue<PageReference> buffer = new LinkedBlockingDeque<>();
so we never relied on buffer to be blocking?
No. new LinkedBlockingDeque<>() is unbounded.
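To spell the point out with plain JDK behavior (nothing Trino-specific here):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;

public class UnboundedQueueDemo
{
    public static void main(String[] args)
    {
        // A default LinkedBlockingDeque has capacity Integer.MAX_VALUE,
        // so offer() always succeeds and producers never block.
        BlockingQueue<String> blocking = new LinkedBlockingDeque<>();
        System.out.println(blocking.remainingCapacity()); // 2147483647
        System.out.println(blocking.offer("page"));       // true

        // If the blocking behavior is never used, a plain ArrayDeque under
        // the existing lock gives the same semantics without allocating a
        // linked-list node per element.
        Queue<String> plain = new ArrayDeque<>();
        plain.add("page");
        System.out.println(plain.poll()); // "page"
    }
}
```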
@@ -13,6 +13,7 @@
 */
add a comment that PageReference is no longer needed after removing broadcast exchange
Should I add it to the commit message? Not sure if it makes sense to keep mentions of the PageReference in the codebase, since the PageReference itself is going away.
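For context, PageReference existed because broadcast handed the same page to N readers, and the last reader had to trigger the memory release. A minimal sketch of that pattern (illustrative, not the actual class):

```java
import java.util.concurrent.atomic.AtomicInteger;

final class PageReferenceSketch<P>
{
    private final P page;
    private final AtomicInteger references;
    private final Runnable onAllConsumed; // e.g. release the memory reservation

    PageReferenceSketch(P page, int consumers, Runnable onAllConsumed)
    {
        this.page = page;
        this.references = new AtomicInteger(consumers);
        this.onAllConsumed = onAllConsumed;
    }

    // Each consumer calls this exactly once; the last call frees the memory.
    P removePage()
    {
        if (references.decrementAndGet() == 0) {
            onAllConsumed.run();
        }
        return page;
    }
}
```

With broadcast gone, every page has exactly one consumer, so the wrapper (one allocation per page) can be dropped entirely.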
Force-pushed from 5355abe to b05a554
PartitionedOutputOperator maintains buffers for each output partition. When the operator is run in the same pipeline as the TableScanOperator, the buffers are flushed after each split, resulting in small pages being created.
Broadcast local exchanges are never performed
Use ArrayDeque to avoid unnecessary allocations associated with a LinkedList based queue
Avoid creating PageReference objects for every page
A new instance is created for each ExchangeSink
Force-pushed from b05a554 to af5cc04
@@ -743,6 +744,9 @@ public PlanWithProperties visitExchange(ExchangeNode node, StreamPreferredProper
        any().withOrderSensitivity(),
        any().withOrderSensitivity());
}
if (isForceFixedDistributionForPartitionedOutputOperatorEnabled(session) && node.isHashPartitionedExchange()) { |
Shouldn't this behavior only be induced for partitioned output operators on top of table scans? Intermediate stages shouldn't have the same flushing behaviors, so the condition seems potentially overly broad, but maybe that's handled by planAndEnforceChildren somehow?
Description
PartitionedOutputOperator maintains buffers for each output partition.
When the operator is run in the same pipeline as the TableScanOperator the
buffers are flushed after each split resulting in small pages being created.
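For intuition, the intended effect on the stage's pipelines is roughly the following (an illustrative sketch, not actual EXPLAIN ANALYZE output):

```
Before: one pipeline, partition buffers flushed at every split boundary
  TableScanOperator -> ... -> PartitionedOutputOperator

After: a forced local hash exchange splits the stage into two pipelines
  Pipeline 0: TableScanOperator -> ... -> LocalExchangeSinkOperator [HASH]
  Pipeline 1: LocalExchangeSourceOperator -> PartitionedOutputOperator
              (fixed driver count; pages accumulate to full size)
```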
Improvement
Core engine
Performance improvement
Related issues, pull requests, and links
Documentation
(X) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(X) No release notes entries required.
( ) Release notes entries required with the following suggested text: