repartition-based fallback for hash aggregate v3 #11712
Conversation
I have not finished yet. Could you post an explanation of the changes? I see places in the code that appear to have duplicate functionality, not to mention that the old sort-based agg code completely duplicates a lot of the newer hash repartition-based code.
I really just want to understand what the workflow is supposed to be.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchSerializer.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSemaphore.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SpillableColumnarBatch.scala
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SortBasedGpuAggregateExec.scala
@@ -335,7 +513,10 @@ class AggHelper(
     // We need to merge the aggregated batches into 1 before calling post process,
     // if the aggregate code had to split on a retry
     if (aggregatedSeq.size > 1) {
-      val concatted = concatenateBatches(metrics, aggregatedSeq)
+      val concatted =
+        withResource(aggregatedSeq) { _ =>
I am confused by this. Was this a bug? This change feels wrong to me.
concatenateBatches has the contract that it will either close everything in toConcat or, if there is a single item in the sequence, just return it without closing anything. By putting it within a withResource it looks like we are going to double close the data in aggregatedSeq.
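For reference, a minimal self-contained sketch of the ownership contract described above (illustrative names and stand-in classes, not the plugin's actual code) shows why wrapping the call in an extra withResource would close the inputs twice:

object DoubleCloseSketch {
  // Tiny stand-in for a withResource helper: run fn, then close every resource.
  def withResource[T <: AutoCloseable, R](rs: Seq[T])(fn: Seq[T] => R): R =
    try fn(rs) finally rs.foreach(_.close())

  // A batch that fails loudly on a second close, so ownership bugs are visible.
  class Batch extends AutoCloseable {
    private var closed = false
    override def close(): Unit = {
      if (closed) throw new IllegalStateException("double close!")
      closed = true
    }
  }

  // Contract as described: close every input, unless there is exactly one,
  // in which case it is returned un-closed and the caller keeps ownership.
  def concatenateBatches(toConcat: Seq[Batch]): Batch =
    if (toConcat.size == 1) toConcat.head
    else try new Batch() finally toConcat.foreach(_.close())

  def main(args: Array[String]): Unit = {
    val batches = Seq(new Batch, new Batch)
    // concatenateBatches closes both inputs, then withResource closes them again on exit.
    withResource(batches) { _ => concatenateBatches(batches) } // throws "double close!"
  }
}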
SpillableColumnarBatch has the nasty habit (?) of hiding double closes from us (https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SpillableColumnarBatch.scala#L137). I'd like to remove this behavior with my spillable changes.
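As a rough sketch of the behavior being referred to (made-up names, not the real SpillableColumnarBatch code), a close() that quietly ignores repeat calls keeps a double close like the one above from ever surfacing:

class ForgivingSpillable extends AutoCloseable {
  private var released = false
  override def close(): Unit = {
    if (!released) {
      released = true // release the underlying buffer exactly once
    }
    // a second close() is silently a no-op, so a double close never fails loudly
  }
}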
I think I was misled by
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala
Line 955 in 5be4bd5
val concatBatch = withResource(batches) { _ =>
I will review this today
realIter = Some(ConcatIterator.apply(firstPassIter,
  (aggOutputSizeRatio * configuredTargetBatchSize).toLong
))
firstPassAggToggle.set(false)
I don't think that this does what you think it does. The line that reads this is used to create an iterator; it is not within an iterator that decides whether we should or should not do the agg. I added in some print statements and I have verified that it does indeed agg for every batch, even if the first batch set this to false. Which is a good thing, because if you disabled the initial aggregation on something where the output types do not match the input types, you would get a crash or data corruption.
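A minimal sketch of the distinction being pointed out (hypothetical names, not the PR's code): a flag read once while the iterator is being built is frozen before any batch is processed, whereas a flag consulted per batch can actually turn the aggregation off mid-stream:

import java.util.concurrent.atomic.AtomicBoolean

object TogglePlacementSketch {
  val firstPassAggToggle = new AtomicBoolean(true)

  def aggregate(batch: Int): Int = batch // stand-in for the real first-pass aggregation

  // Read once at construction time: later calls to firstPassAggToggle.set(false)
  // have no effect on the iterator that was already built.
  def builtOnce(input: Iterator[Int]): Iterator[Int] =
    if (firstPassAggToggle.get()) input.map(aggregate) else input

  // Checked per batch: flipping the toggle after the first batch actually takes effect.
  def checkedPerBatch(input: Iterator[Int]): Iterator[Int] =
    input.map(b => if (firstPassAggToggle.get()) aggregate(b) else b)
}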
Known issue, will revert this part.
I remember getting a crash or data corruption when I tried to fix the iterator bug here. Do you think it would be beneficial if we still convert the output types, but do not try any row-wise aggregate when heuristics show that the first pass agg does not aggregate away many rows?
// Handle the case of skipping second and third pass of aggregation
// This only works when spark.rapids.sql.agg.skipAggPassReductionRatio < 1
if (!firstBatchChecked && firstPassIter.hasNext
If we are doing an aggregate every time, would it be better to check each batch and skip repartitioning if the batch stayed large?
I don't quite understand this. Can you elaborate?
This is just a follow-on issue. Right now we check the first batch and decide for the entire task if we want to skip full aggregation or not. I think it would probably be better if we did it for each batch individually. We could even do it for each batch when we merge them too.
OK, I can open a follow-up issue on this. The question is: if we have already skipped agg for the first batch, what should we do when the second batch does not meet the skip criteria? The first batch has already been handed off to the next operator by now.
My thought was we keep trying to aggregate it more. Just like the skipped batches never existed. I am not 100% sure on this, we need to do some experiments to be sure. I also thought we should probably do the same thing at all levels of aggregation. If we do a merge aggregation and it does not make the result smaller by enough, then we can release that merged batch instantly.
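One possible shape of that per-batch heuristic, as a sketch with assumed semantics (the reading of spark.rapids.sql.agg.skipAggPassReductionRatio below is my interpretation, and keepAggregating is a hypothetical helper, not the PR's code):

// After aggregating one batch, decide whether it is worth feeding it into the
// later merge/repartition passes or releasing it downstream right away.
def keepAggregating(rowsIn: Long, rowsOut: Long, skipAggPassReductionRatio: Double): Boolean =
  rowsOut < rowsIn * skipAggPassReductionRatio

// e.g. with a ratio of 0.9: a batch that only shrinks from 1000 to 950 rows would be
// released as-is, while one that shrinks to 500 rows would continue to the merge passes.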
Issue opened at #11770.
Hi @revans2 and @abellina, this PR, as I stated in Slack and marked on the PR itself, is still in DRAFT, so I didn't expect a thorough review before I change it from Draft to Ready. (Or do we already have some special usage of Draft PRs in our dev process?) It still had some major problems, such as mixing in other features (e.g. the so-called voluntary release check) and a non-working implementation of skipping the first-iteration agg. The refinement was not ready last Friday, but it is now. Please go ahead and review. The reason I showed you this PR is to prove that "we do a complete pass over all of the batches and cache them into GPU memory before we make a decision on when and how to release a batch" is no longer true. I assume that having access to the state-of-the-art implementation of agg at our customer may help you better understand the problem we're facing now.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala
@@ -2113,6 +2187,7 @@ class DynamicGpuPartialSortAggregateIterator(
     }
     val newIter = if (doSinglePassAgg) {
       metrics.singlePassTasks += 1
+      // TO discuss in PR: is singlePassSortedAgg still necessary?
Should we remove this?
I can come up with contrived situations where the sort is better, but in general I think that the skipAggPassReductionRatio is good enough, although this sort is on by default whereas the skipAggPassReductionRatio is off by default. I think we need to have a follow-on issue to explore what the correct heuristic is and whether any of the contrived cases are enough to keep this extra code around.
I remember you said the essence of singlePassSortedAgg is that "The heuristic is simply looking at the cost of sorting now vs sorting later." (here).
Let's consider a case where there are N rows before agg and 0.89N rows after agg. In such a case skip agg will not be enabled, and sorting later (which becomes "repartitioning later" with this PR checked in) may still be more costly than sorting now. So I don't really understand why skipAggPassReductionRatio could deprecate singlePassSortedAgg.
My original intention with this question was to ask whether we should use some kind of "singlePassRepartitionAgg" to replace "singlePassSortedAgg".
Next steps for this PR will be:
NDS regression test result on Spark2a (default 3K dataset):
build
The premerge tests failed, and there were some memory leaks in there too.
build
Hi @revans2, the premerge test has passed, and the suspicious "memory leak" has been shown to be unrelated to this PR. I'll use another thread to discuss the leak issue with you.
This is looking good. I would like to see the one comment removed that was added in so we could discuss the pre-sort code. But beyond that I think we can merge it in.
Just for full disclosure, I ran some benchmarks to look at the sort-based short circuit.
This I can see as being fairly common: the majority of the keys are unique, and multiple passes don't reduce the size, so we should not worry about it. But with the sort optimization on it is still better than with it off, at least until we try to set skipAggPassReductionRatio to a reasonable default value.
This one is contrived because the keys are not unique, but they are spread evenly throughout a large task. Because we can sort the data quickly (the key is a single long) and the spill does not hit disk (I have 32 GiB of host spill memory), the sort pays for itself. I don't see this happening in the real world. Also, the size reduction in the shuffle is extra important because I don't have the multi-threaded shuffle enabled for this setup. Like I said, it is contrived.
build
Hi @revans2, the comment you mentioned is removed now, and I have added some new comments, which are all about follow-ups. So please feel free to approve the PR if you're okay with it. We can continue the discussion on the follow-ups here.
This PR replaces #11116, since it has diverged too much from #11116.