
repartition-based fallback for hash aggregate v3 #11712

Merged (56 commits) on Nov 26, 2024

Conversation

binmahone (Collaborator) commented Nov 8, 2024

This PR replaces #11116, since it has diverged too much from #11116.

binmahone added 30 commits July 1, 2024 17:14
Signed-off-by: Hongbin Ma (Mahone) <mahongbin@apache.org>

revans2 (Collaborator) left a comment:


I have not finished yet. Could you post an explanation of the changes? I see places in the code that appear to have duplicate functionality, not to mention that the old sort-based agg code completely duplicates a lot of the newer hash repartition-based code.

I really just want to understand what the workflow is supposed to be.

@@ -335,7 +513,10 @@ class AggHelper(
// We need to merge the aggregated batches into 1 before calling post process,
// if the aggregate code had to split on a retry
if (aggregatedSeq.size > 1) {
val concatted = concatenateBatches(metrics, aggregatedSeq)
val concatted =
withResource(aggregatedSeq) { _ =>

Collaborator:

I am confused by this. Was this a bug? This change feels wrong to me.

concatenateBatches has the contract that it will either close everything in toConcat, or if there is a single item in the sequence it will just return it without closing anything. By putting it within a withResource it looks like we are going to double close the data in aggregatedSeq.
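
To illustrate the ownership contract being discussed, here is a minimal, self-contained Scala sketch. `Batch`, `withResource`, and `concatAndClose` are stand-ins for the plugin's SpillableColumnarBatch, withResource, and concatenateBatches, used only to show why the extra withResource would double-close:

```scala
final class Batch(val id: Int) extends AutoCloseable {
  private var closed = false
  override def close(): Unit = {
    if (closed) throw new IllegalStateException(s"batch $id closed twice")
    closed = true
  }
}

// Mirrors the contract described above: a single element is returned unchanged,
// otherwise every input is closed while the concatenated result is built.
def concatAndClose(batches: Seq[Batch]): Batch =
  if (batches.size == 1) {
    batches.head
  } else {
    try new Batch(-1) finally batches.foreach(_.close())
  }

def withResource[T <: AutoCloseable, R](rs: Seq[T])(body: Seq[T] => R): R =
  try body(rs) finally rs.foreach(_.close())

val batches = Seq(new Batch(1), new Batch(2))
// Double close: concatAndClose closes the inputs, then withResource closes them
// again on the way out, so this would throw "closed twice":
//   withResource(batches) { bs => concatAndClose(bs) }
// Respecting the contract means simply handing ownership to the callee:
val concatted = concatAndClose(batches)
```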

Collaborator:

SpillableColumnarBatch has the nasty habit (?) of hiding double closes from us (https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SpillableColumnarBatch.scala#L137). I'd like to remove this behavior with my spillable changes.

Collaborator Author:

I think I was misled by

val concatBatch = withResource(batches) { _ =>

on the main branch, and thought concatenateBatches would not close its input batches. I will revert this part.

abellina (Collaborator) commented Nov 8, 2024

I will review this today

realIter = Some(ConcatIterator.apply(firstPassIter,
(aggOutputSizeRatio * configuredTargetBatchSize).toLong
))
firstPassAggToggle.set(false)

Collaborator:

I don't think this does what you think it does. The line that reads this flag is used to create an iterator; the check is not inside the iterator deciding whether or not to do the agg for each batch. I added some print statements and verified that it does indeed agg every batch, even if the first batch sets this to false. Which is a good thing, because if you disabled the initial aggregation on something where the output types do not match the input types, you would get a crash or data corruption.
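
A tiny, self-contained sketch of the distinction being made here (the names firstPassAggToggle, aggregate, and buildIter are illustrative, not the plugin's actual code): a flag read while constructing the iterator is evaluated once, so flipping it from inside the first batch cannot stop later batches from being aggregated; the check would have to live inside the per-batch path.

```scala
import java.util.concurrent.atomic.AtomicBoolean

val firstPassAggToggle = new AtomicBoolean(true)

// Stand-in for the real first-pass aggregation of one batch.
def aggregate(batch: String): String = s"agg($batch)"

// The flag is read ONCE, when the iterator is constructed: even if the first
// batch later sets it to false, the already-built map keeps aggregating.
def buildIter(input: Iterator[String]): Iterator[String] =
  if (firstPassAggToggle.get()) input.map(aggregate) else input

// A per-batch decision needs the check inside the iterator instead:
def buildPerBatchIter(input: Iterator[String]): Iterator[String] =
  input.map { b => if (firstPassAggToggle.get()) aggregate(b) else b }
```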

Collaborator Author:

Known issue, will revert this part.

Collaborator Author:

I remember getting a crash or data corruption when I tried to fix the iterator bug here. Do you think it would be beneficial if we still convert the output types, but do not try any actual aggregation when heuristics show that the first-pass agg does not aggregate away many rows?


// Handle the case of skipping second and third pass of aggregation
// This only work when spark.rapids.sql.agg.skipAggPassReductionRatio < 1
if (!firstBatchChecked && firstPassIter.hasNext

Collaborator:

If we are doing an aggregate every time, would it be better to check each batch and skip repartitioning if the batch stayed large?

Collaborator Author:

I don't quite understand this. Can you elaborate?

Collaborator:

This is just a follow-on issue. Right now we check the first batch and decide for the entire task whether we want to skip full aggregation or not. I think it would probably be better if we did it for each batch individually. We could even do it for each batch when we merge them too.
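
A rough sketch of what a per-batch version of that heuristic could look like (a hypothetical helper, not the plugin's code): instead of checking only the first batch, compare each aggregated batch's row count against its input and decide batch by batch whether further merge/repartition passes are worth it.

```scala
// Hypothetical per-batch heuristic: rowsBefore/rowsAfter would come from the
// first-pass aggregation of each individual batch.
def shouldSkipFurtherAgg(
    rowsBefore: Long,
    rowsAfter: Long,
    skipAggPassReductionRatio: Double): Boolean =
  rowsAfter.toDouble >= rowsBefore.toDouble * skipAggPassReductionRatio

// With a ratio of 0.9, a batch that only shrank from 1,000,000 to 950,000 rows
// would be emitted directly, while one that shrank to 100,000 rows would keep
// going through the merge/repartition passes.
val skipForLargeBatch = shouldSkipFurtherAgg(1000000L, 950000L, 0.9)  // true
val skipForSmallBatch = shouldSkipFurtherAgg(1000000L, 100000L, 0.9)  // false
```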

Collaborator Author:

OK, I can open a follow-up issue on this. The question is: if we have already skipped agg for the first batch, what should we do when the second batch does not meet the skip criteria? The first batch has already been transferred to the next operator by then.

Collaborator:

My thought was we keep trying to aggregate it more. Just like the skipped batches never existed. I am not 100% sure on this, we need to do some experiments to be sure. I also thought we should probably do the same thing at all levels of aggregation. If we do a merge aggregation and it does not make the result smaller by enough, then we can release that merged batch instantly.

Collaborator Author:

Issue opened at #11770.

binmahone (Collaborator Author) commented Nov 10, 2024

Hi @revans2 and @abellina, this PR, as I stated in Slack and marked on the PR itself, is still a DRAFT, so I didn't expect a thorough review before I changed it from Draft to Ready. (Or do we already have some special usage of Draft PRs in our dev process?) It still had some major problems, such as mixing in other features (e.g. the so-called voluntary release check) and a non-working implementation of skipping the first-pass agg. The refinement was not ready last Friday, but it is now. Please go ahead and review.

The reason I showed you this PR is to prove that "we do a complete pass over all of the batches and cache them into GPU memory before we make a decision on when and how to release a batch" is no longer true. I assume that having access to the state-of-the-art implementation of agg at our customer may help you better understand the problem we're facing now.

@@ -2113,6 +2187,7 @@ class DynamicGpuPartialSortAggregateIterator(
}
val newIter = if (doSinglePassAgg) {
metrics.singlePassTasks += 1
// TO discuss in PR: is singlePassSortedAgg still necessary?

Collaborator Author:

Should we remove this?

Collaborator:

I can come up with contrived situations where the sort is better, but in general I think that skipAggPassReductionRatio is good enough. Although this sort is on by default, whereas skipAggPassReductionRatio is off by default. I think we need a follow-on issue to explore what the correct heuristic is and whether any of the contrived cases are enough to keep this extra code around.

binmahone (Collaborator Author) commented Nov 26, 2024

I remember you said the essence of singlePassSortedAgg is "The heuristic is simply looking at the cost of sorting now vs sorting later." (here).

Let's consider a case where the input to agg is N rows and the output is 0.89N rows. In such a case skip-agg will not be enabled, and sorting later (which becomes "repartitioning later" once this PR is checked in) may still be more costly than sorting now. So I don't really understand why skipAggPassReductionRatio could deprecate singlePassSortedAgg.

My original intention with this question was to ask whether we should use some kind of "singlePassRepartitionAgg" to replace "singlePassSortedAgg".

binmahone (Collaborator Author) commented:

Next steps for this PR will be:

  1. Addressing review comments. When reviewing, please be aware that we found repartition-based agg is more prone to "Container OOM (then getting killed by K8S)" than sort-based agg. I'm troubleshooting on my side, but any insight from the reviewers would be a great help.
  2. We have selected four representative queries at our customer, and before/after results are being collected. Part of the collected results already shows improvements.
  3. I'll run a regression test on NDS to ensure no or only minor regressions.

binmahone (Collaborator Author) commented Nov 20, 2024

NDS regression test result on Spark2a (default 3K dataset):

Regression alerts
-----------------

Speedup results
-----------------
query1: Previous (1809.3333333333333 ms) vs Current (1895.0 ms) Diff -85 E2E 0.95x
query2: Previous (1640.6666666666667 ms) vs Current (2042.0 ms) Diff -401 E2E 0.80x
query3: Previous (565.3333333333334 ms) vs Current (567.0 ms) Diff -1 E2E 1.00x
query4: Previous (11023.333333333334 ms) vs Current (11092.666666666666 ms) Diff -69 E2E 0.99x
query5: Previous (2253.6666666666665 ms) vs Current (2584.3333333333335 ms) Diff -330 E2E 0.87x
query6: Previous (778.3333333333334 ms) vs Current (821.6666666666666 ms) Diff -43 E2E 0.95x
query7: Previous (3349.0 ms) vs Current (3399.6666666666665 ms) Diff -50 E2E 0.99x
query8: Previous (1160.6666666666667 ms) vs Current (1158.0 ms) Diff 2 E2E 1.00x
query9: Previous (9345.333333333334 ms) vs Current (5375.666666666667 ms) Diff 3969 E2E 1.74x
query10: Previous (1755.6666666666667 ms) vs Current (1872.0 ms) Diff -116 E2E 0.94x
query11: Previous (5698.0 ms) vs Current (5664.333333333333 ms) Diff 33 E2E 1.01x
query12: Previous (684.3333333333334 ms) vs Current (645.6666666666666 ms) Diff 38 E2E 1.06x
query13: Previous (1791.6666666666667 ms) vs Current (1822.3333333333333 ms) Diff -30 E2E 0.98x
query14_part1: Previous (7540.666666666667 ms) vs Current (8108.333333333333 ms) Diff -567 E2E 0.93x
query14_part2: Previous (6580.0 ms) vs Current (6696.0 ms) Diff -116 E2E 0.98x
query15: Previous (1225.3333333333333 ms) vs Current (1218.0 ms) Diff 7 E2E 1.01x
query16: Previous (4012.3333333333335 ms) vs Current (4044.3333333333335 ms) Diff -32 E2E 0.99x
query17: Previous (2047.3333333333333 ms) vs Current (2044.6666666666667 ms) Diff 2 E2E 1.00x
query18: Previous (2143.0 ms) vs Current (2340.0 ms) Diff -197 E2E 0.92x
query19: Previous (1456.6666666666667 ms) vs Current (1411.6666666666667 ms) Diff 45 E2E 1.03x
query20: Previous (706.6666666666666 ms) vs Current (731.3333333333334 ms) Diff -24 E2E 0.97x
query21: Previous (755.0 ms) vs Current (714.3333333333334 ms) Diff 40 E2E 1.06x
query22: Previous (1390.6666666666667 ms) vs Current (1477.0 ms) Diff -86 E2E 0.94x
query23_part1: Previous (13383.666666666666 ms) vs Current (14087.0 ms) Diff -703 E2E 0.95x
query23_part2: Previous (17956.666666666668 ms) vs Current (18554.333333333332 ms) Diff -597 E2E 0.97x
query24_part1: Previous (9003.0 ms) vs Current (9031.0 ms) Diff -28 E2E 1.00x
query24_part2: Previous (8799.666666666666 ms) vs Current (9124.0 ms) Diff -324 E2E 0.96x
query25: Previous (1918.6666666666667 ms) vs Current (2010.6666666666667 ms) Diff -92 E2E 0.95x
query26: Previous (1468.0 ms) vs Current (1388.3333333333333 ms) Diff 79 E2E 1.06x
query27: Previous (1416.6666666666667 ms) vs Current (1538.3333333333333 ms) Diff -121 E2E 0.92x
query28: Previous (7904.333333333333 ms) vs Current (7976.0 ms) Diff -71 E2E 0.99x
query29: Previous (2934.3333333333335 ms) vs Current (3454.0 ms) Diff -519 E2E 0.85x
query30: Previous (2604.6666666666665 ms) vs Current (2329.6666666666665 ms) Diff 275 E2E 1.12x
query31: Previous (2442.3333333333335 ms) vs Current (2420.6666666666665 ms) Diff 21 E2E 1.01x
query32: Previous (1191.0 ms) vs Current (948.0 ms) Diff 243 E2E 1.26x
query33: Previous (1175.6666666666667 ms) vs Current (1197.3333333333333 ms) Diff -21 E2E 0.98x
query34: Previous (2532.6666666666665 ms) vs Current (2481.6666666666665 ms) Diff 51 E2E 1.02x
query35: Previous (2159.3333333333335 ms) vs Current (2045.0 ms) Diff 114 E2E 1.06x
query36: Previous (1527.0 ms) vs Current (1550.3333333333333 ms) Diff -23 E2E 0.98x
query37: Previous (1414.0 ms) vs Current (1369.6666666666667 ms) Diff 44 E2E 1.03x
query38: Previous (2735.0 ms) vs Current (2708.6666666666665 ms) Diff 26 E2E 1.01x
query39_part1: Previous (2164.0 ms) vs Current (2088.3333333333335 ms) Diff 75 E2E 1.04x
query39_part2: Previous (1595.0 ms) vs Current (1616.3333333333333 ms) Diff -21 E2E 0.99x
query40: Previous (1349.3333333333333 ms) vs Current (1389.0 ms) Diff -39 E2E 0.97x
query41: Previous (401.6666666666667 ms) vs Current (381.3333333333333 ms) Diff 20 E2E 1.05x
query42: Previous (457.6666666666667 ms) vs Current (398.3333333333333 ms) Diff 59 E2E 1.15x
query43: Previous (1125.3333333333333 ms) vs Current (1153.0 ms) Diff -27 E2E 0.98x
query44: Previous (828.6666666666666 ms) vs Current (815.3333333333334 ms) Diff 13 E2E 1.02x
query45: Previous (1332.0 ms) vs Current (1333.0 ms) Diff -1 E2E 1.00x
query46: Previous (1828.6666666666667 ms) vs Current (1610.0 ms) Diff 218 E2E 1.14x
query47: Previous (2201.3333333333335 ms) vs Current (2231.6666666666665 ms) Diff -30 E2E 0.99x
query48: Previous (1289.0 ms) vs Current (1366.3333333333333 ms) Diff -77 E2E 0.94x
query49: Previous (2375.0 ms) vs Current (2423.6666666666665 ms) Diff -48 E2E 0.98x
query50: Previous (9228.333333333334 ms) vs Current (9034.666666666666 ms) Diff 193 E2E 1.02x
query51: Previous (2085.6666666666665 ms) vs Current (2329.0 ms) Diff -243 E2E 0.90x
query52: Previous (591.3333333333334 ms) vs Current (610.3333333333334 ms) Diff -19 E2E 0.97x
query53: Previous (887.0 ms) vs Current (862.6666666666666 ms) Diff 24 E2E 1.03x
query54: Previous (1849.3333333333333 ms) vs Current (1813.0 ms) Diff 36 E2E 1.02x
query55: Previous (518.6666666666666 ms) vs Current (497.6666666666667 ms) Diff 20 E2E 1.04x
query56: Previous (1090.6666666666667 ms) vs Current (1085.6666666666667 ms) Diff 5 E2E 1.00x
query57: Previous (2208.3333333333335 ms) vs Current (1708.3333333333333 ms) Diff 500 E2E 1.29x
query58: Previous (1030.6666666666667 ms) vs Current (1115.3333333333333 ms) Diff -84 E2E 0.92x
query59: Previous (2516.3333333333335 ms) vs Current (2624.0 ms) Diff -107 E2E 0.96x
query60: Previous (1354.0 ms) vs Current (1294.6666666666667 ms) Diff 59 E2E 1.05x
query61: Previous (1527.3333333333333 ms) vs Current (1408.0 ms) Diff 119 E2E 1.08x
query62: Previous (1490.6666666666667 ms) vs Current (1507.3333333333333 ms) Diff -16 E2E 0.99x
query63: Previous (998.0 ms) vs Current (999.3333333333334 ms) Diff -1 E2E 1.00x
query64: Previous (16576.333333333332 ms) vs Current (16515.333333333332 ms) Diff 61 E2E 1.00x
query65: Previous (3914.6666666666665 ms) vs Current (3947.0 ms) Diff -32 E2E 0.99x
query66: Previous (3383.3333333333335 ms) vs Current (3360.0 ms) Diff 23 E2E 1.01x
query67: Previous (28848.0 ms) vs Current (28505.0 ms) Diff 343 E2E 1.01x
query68: Previous (1472.0 ms) vs Current (1400.0 ms) Diff 72 E2E 1.05x
query69: Previous (1523.3333333333333 ms) vs Current (1569.6666666666667 ms) Diff -46 E2E 0.97x
query70: Previous (1876.3333333333333 ms) vs Current (1797.6666666666667 ms) Diff 78 E2E 1.04x
query71: Previous (4076.0 ms) vs Current (4098.666666666667 ms) Diff -22 E2E 0.99x
query72: Previous (3457.6666666666665 ms) vs Current (3291.3333333333335 ms) Diff 166 E2E 1.05x
query73: Previous (1205.0 ms) vs Current (1249.6666666666667 ms) Diff -44 E2E 0.96x
query74: Previous (4539.0 ms) vs Current (4300.666666666667 ms) Diff 238 E2E 1.06x
query75: Previous (8472.0 ms) vs Current (8588.333333333334 ms) Diff -116 E2E 0.99x
query76: Previous (5985.666666666667 ms) vs Current (3096.3333333333335 ms) Diff 2889 E2E 1.93x
query77: Previous (1246.6666666666667 ms) vs Current (1152.6666666666667 ms) Diff 94 E2E 1.08x
query78: Previous (8933.0 ms) vs Current (9010.333333333334 ms) Diff -77 E2E 0.99x
query79: Previous (1256.6666666666667 ms) vs Current (1458.0 ms) Diff -201 E2E 0.86x
query80: Previous (4989.0 ms) vs Current (4801.333333333333 ms) Diff 187 E2E 1.04x
query81: Previous (2466.0 ms) vs Current (2396.6666666666665 ms) Diff 69 E2E 1.03x
query82: Previous (2457.3333333333335 ms) vs Current (2557.0 ms) Diff -99 E2E 0.96x
query83: Previous (835.6666666666666 ms) vs Current (832.6666666666666 ms) Diff 3 E2E 1.00x
query84: Previous (1933.0 ms) vs Current (1569.3333333333333 ms) Diff 363 E2E 1.23x
query85: Previous (1911.6666666666667 ms) vs Current (1751.0 ms) Diff 160 E2E 1.09x
query86: Previous (1103.3333333333333 ms) vs Current (1107.0 ms) Diff -3 E2E 1.00x
query87: Previous (2615.0 ms) vs Current (2565.0 ms) Diff 50 E2E 1.02x
query88: Previous (5482.0 ms) vs Current (5435.0 ms) Diff 47 E2E 1.01x
query89: Previous (1485.0 ms) vs Current (1514.3333333333333 ms) Diff -29 E2E 0.98x
query90: Previous (1197.3333333333333 ms) vs Current (1106.0 ms) Diff 91 E2E 1.08x
query91: Previous (1237.6666666666667 ms) vs Current (1247.0 ms) Diff -9 E2E 0.99x
query92: Previous (725.3333333333334 ms) vs Current (658.0 ms) Diff 67 E2E 1.10x
query93: Previous (10936.333333333334 ms) vs Current (10472.0 ms) Diff 464 E2E 1.04x
query94: Previous (4522.333333333333 ms) vs Current (4524.333333333333 ms) Diff -2 E2E 1.00x
query95: Previous (6579.0 ms) vs Current (6594.0 ms) Diff -15 E2E 1.00x
query96: Previous (5449.0 ms) vs Current (5373.0 ms) Diff 76 E2E 1.01x
query97: Previous (2201.6666666666665 ms) vs Current (2155.3333333333335 ms) Diff 46 E2E 1.02x
query98: Previous (1754.6666666666667 ms) vs Current (1671.3333333333333 ms) Diff 83 E2E 1.05x
query99: Previous (2092.6666666666665 ms) vs Current (2536.3333333333335 ms) Diff -443 E2E 0.83x
benchmark: Previous (356333.3333333333 ms) vs Current (351000.0 ms) Diff 5333 E2E 1.02x

binmahone (Collaborator Author) commented:

build

binmahone marked this pull request as ready for review November 20, 2024 10:34
revans2 (Collaborator) commented Nov 20, 2024

The premerge tests failed, and there were some memory leaks in there too.

binmahone (Collaborator Author) commented:

build

binmahone (Collaborator Author) commented:

Hi @revans2, the premerge test has passed, and the suspicious "memory leak" has been proven to be unrelated to this PR. I'll use another thread to discuss the leak issue with you.

revans2 (Collaborator) left a comment:

This is looking good. I would like to see the one comment removed that was added in so we could discuss the pre-sort code. But beyond that I think we can merge it in.


revans2 (Collaborator) commented Nov 25, 2024

Just for full disclosure, I ran some benchmarks to look at the sort-based short circuit.

spark.time(spark.range(0, 1000000000L, 1, 2).selectExpr("CAST(id DIV 1 as STRING) as k", "id").groupBy("k").agg(avg(col("id")).alias("id_a"),avg(col("id") + 10).alias("idt_a")).orderBy("k").show())

I can see this as being fairly common. The majority of the keys are unique, and multiple passes don't reduce the size, so we should not worry about it. But having the sort optimization on is still better than having it off, at least until we set skipAggPassReductionRatio to a reasonable default value.

  • DEFAULT
    • 119657 ms, 117633 ms, 117105 ms
    • spill time seconds-ish: 20, 19, 19
  • spark.conf.set("spark.rapids.sql.agg.singlePassPartialSortEnabled",false)
    • 188092 ms, 182510 ms, 180220 ms
    • spill time seconds-ish: 95, 90, 90
  • spark.conf.set("spark.rapids.sql.agg.skipAggPassReductionRatio", "0.99") (sort still disabled)
    • 90188 ms, 89922 ms, 89841 ms
    • spill time seconds-ish: 0, 0, 0
spark.time(spark.range(0, 1000000000L, 1, 2).selectExpr("id % 125000000 as k", "id").groupBy("k").agg(avg(col("id")).alias("id_a"),avg(col("id") + 10).alias("idt_a")).orderBy("k").show())

This one is contrived because the keys are not unique, but they are spread evenly throughout a large task. Because we can sort the data quickly (the key is a single long) and the spill does not hit disk (I have 32 GiB of host spill memory), the sort pays for itself. I don't see this happening in the real world. Also, the size reduction in the shuffle is extra important because I don't have the multi-threaded shuffle enabled for this setup. Like I said, it is contrived.

  • DEFAULT:
    • 12361 ms, 12345 ms, 12306 ms
    • shuffle-size: 3.3 GiB
    • spill time seconds-ish: 0, 0, 0
  • spark.conf.set("spark.rapids.sql.agg.singlePassPartialSortEnabled",false)
    • 60392 ms, 67075 ms, 60071 ms
    • shuffle-size: 4.0 GiB
    • spill time seconds-ish: 48, 55, 48
  • spark.conf.set("spark.rapids.sql.agg.skipAggPassReductionRatio", "0.99") (sort still disabled)
    • 45085 ms, 44738 ms, 45094 ms
    • shuffle-size: 15.4 GiB
    • spill time seconds-ish: 0, 0, 0

binmahone (Collaborator Author) commented:

build

binmahone (Collaborator Author) commented:

> This is looking good. I would like to see the one comment removed that was added in so we could discuss the pre-sort code. But beyond that I think we can merge it in.

Hi @revans2, the comment you mentioned is removed now, and I have added some new comments, which are all about follow-ups. So please feel free to approve the PR if you're okay with it. We can continue the discussion on the follow-ups here.

Labels: performance (A performance related task/issue)