
A small improvement in GPU If and CaseWhen for the all-true and all-false cases. #4329

Closed
firestarman wants to merge 20 commits from the condition-opt branch

Conversation

firestarman
Collaborator

@firestarman firestarman commented Dec 8, 2021

This PR adds lazy evaluation of the true and false expressions in GPU If and CaseWhen: when the predicate values computed from an input batch are all true or all false, only the branch that is actually needed is evaluated.

This can improve performance when the skipped expression is expensive (e.g. a row-based UDF).

fixes #4350
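
The core shape of the change, as a minimal sketch for GpuIf (the helper names isAllTrue, isAllFalse, and evaluateBothAndSelect are illustrative, not the actual plugin code; imports and the surrounding class are omitted):

```scala
// A minimal sketch of the GpuIf idea, assuming `predicate`, `trueExpr` and
// `falseExpr` are the child expressions of the conditional.
override def columnarEval(batch: ColumnarBatch): Any = {
  withResource(GpuExpressionsUtils.columnarEvalToColumn(predicate, batch)) { pred =>
    if (isAllTrue(pred)) {
      // Every row takes the true branch, so the false expression
      // (possibly an expensive row-based UDF) is never evaluated.
      trueExpr.columnarEval(batch)
    } else if (isAllFalse(pred)) {
      // Symmetric case: the true expression is skipped entirely.
      falseExpr.columnarEval(batch)
    } else {
      // Mixed batch: evaluate both sides and select per row, as before.
      evaluateBothAndSelect(pred, batch)
    }
  }
}
```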

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Member

@jlowe jlowe left a comment

> This can improve performance when the skipped expression is expensive (e.g. a row-based UDF).

This skip benefit is very specific to the case where everything is skewing away from one side of the condition. Have we done any performance measurements on this? Personally I do not want yet another config that users will not notice or remember to enable/disable. If this is really cheap, I'd rather just enable it all the time than have a config. We should run this on the NDS benchmark to see if it is noticeable when enabled.

@jlowe jlowe added the performance label Dec 8, 2021
@firestarman
Collaborator Author

firestarman commented Dec 9, 2021

> This skip benefit is very specific to the case where everything is skewing away from one side of the condition. Have we done any performance measurements on this?

We do indeed have this kind of case, which is why I made this change, and @viadea verified that it improved performance (1.5 min -> 40 sec).

> Personally I do not want yet another config that users will not notice or remember to enable/disable. If this is really cheap, I'd rather just enable it all the time than have a config.

Per the doc of getNullCount, it can be an expensive operation, which I thought could slow down the normal cases, so I added this config. I am OK with removing the config and always enabling the optimization if it is cheap enough.

@firestarman
Collaborator Author

firestarman commented Dec 9, 2021

> We should run this on the NDS benchmark to see if it is noticeable when enabled.

@nvliyuan is helping with this.
He has just completed the benchmark with this PR, and the results are good enough to enable this all the time, so I removed the config.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

3 similar comments
@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@pxLi
Collaborator

pxLi commented Dec 13, 2021

Some of the parallel Java agents crashed in CI with this change. It is always reproducible in the pre-merge CI but not locally (even with the pre-merge image). It could be some other issue, such as hardware or kernel level; still not quite sure what the root cause is.

According to the pod memory metrics, the agent was OOM killed (50G+). Still not sure why the OOM happens only in the pre-merge case and not locally.

@pxLi
Collaborator

pxLi commented Dec 13, 2021

build

@firestarman
Collaborator Author

build

Member

@jlowe jlowe left a comment

The new CaseWhen algorithm can take a lot more memory when there is a large number of cases to check, so I'm not sure it makes sense to perform it unconditionally. CI showed that use cases that ran perfectly well before this change can suddenly crash with the new algorithm. As we move down the predicates and cache the result of their evaluation, we really should be making them spillable to avoid the OOM condition. However, even if we make them spillable, if we cache so many that we start spilling, the whole point of doing this algorithm (i.e.: performance) is foiled.

We may want to be smarter about deciding when to perform the algorithm for CaseWhen (e.g.: only perform the new algorithm when the number of predicates to check is small). For IfElse it seems clear this is generally a good thing, but I'm becoming doubtful that this is a net win for CaseWhen as it is written now.

@firestarman
Collaborator Author

firestarman commented Dec 14, 2021

> The new CaseWhen algorithm can take a lot more memory when there is a large number of cases to check, so I'm not sure it makes sense to perform it unconditionally ...

Yes, you are absolutely right. We need to take more care in CaseWhen to strike a balance between performance and memory.

I think it is a little hard to pick a branch count that is small enough to safely take the new CaseWhen path. How about dropping the cache and evaluating the predicates again if execution finally falls into the mixed-case path? That minimizes the memory used for the computation at the cost of some performance, which hopefully would not be noticeable. I will ask Yuan for more NDS benchmarks. At worst, we can bring back the config for CaseWhen only.

One more thing I am a little confused about: why does this take so much host memory (50GB+ for 20 predicates in the premerge build)? We only cache the predicate references, while the data actually lives on the GPU.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@firestarman firestarman marked this pull request as draft December 14, 2021 07:57
@firestarman
Collaborator Author

Moving to draft since premerge still failed even without caching the predicates.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Member

@jlowe jlowe left a comment

Given the complexity of the CaseWhen optimization, we may want to split this PR and get the more straightforward GpuIf optimization in while we work through the complexity of CaseWhen.

}
withResource(col.getBase.any()) { anyTrue =>
  // null values are considered false values in this context
  !anyTrue.isValid || !anyTrue.getBoolean
Member

Now that the row count is being checked against the null count, we shouldn't need to check the validity of the any result (and thus avoid another GPU synchronization to fetch the value). It can only be null if there are no rows or all rows are null, but that case is now covered by the row count == null count check.
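
A minimal sketch of what the suggested check could look like (the method name isAllFalse and its surroundings are illustrative, not the exact plugin code):

```scala
// Illustrative sketch, assuming `col` is a boolean GpuColumnVector and nulls
// count as false. When rowCount == nullCount, every row is null, so the
// answer is known without calling any(); otherwise at least one row is
// valid, the Scalar returned by any() is guaranteed to be valid, and the
// isValid check (with its extra GPU synchronization) can be dropped.
def isAllFalse(col: GpuColumnVector): Boolean = {
  if (col.getBase.getRowCount == col.getBase.getNullCount) {
    true
  } else {
    withResource(col.getBase.any()) { anyTrue =>
      !anyTrue.getBoolean
    }
  }
}
```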

Collaborator Author

Updated

// Re-evaluate the predicate expressions here, sacrificing a little performance
// to minimize the memory footprint of the computation, since caching the
// predicates may cause OOM when there are too many branches.
withResource(GpuExpressionsUtils.columnarEvalToColumn(predExpr, batch)) { p =>
Member

Do we really want to double-evaluate the expression? What if the predicate expression is relatively expensive to evaluate (e.g.: is a regular expression match on a large string column or a UDF)?

Rather than waste work re-evaluating expressions that have unknown time and complexity, I'd rather see an optimization for the common cases where the case/when is really just an if/else or we budget how much memory we're willing to spend on caching the predicate expressions. In other words, if we see a case/when that has a lot of predicate expressions, maybe we abandon trying to perform the optimization (or maybe just check the very first predicate for the all true case but otherwise fallback) to avoid excessive memory computation.

As it is now, I'm not convinced this is a net win, since we are making this significantly more expensive in the case where a batch does not fall entirely into one predicate.
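
One way to read the gating suggestion, as a hypothetical sketch (maxOptimizedBranches and the two eval helpers are invented for illustration, not plugin API):

```scala
// Illustrative only: attempt the all-true/all-false shortcut only when the
// number of branches is small, so the cached predicate columns stay within
// a small, bounded memory footprint; otherwise evaluate as before.
val maxOptimizedBranches = 2 // a deliberately small limit
override def columnarEval(batch: ColumnarBatch): Any = {
  if (branches.size <= maxOptimizedBranches) {
    // Cheap to cache a couple of predicate columns: try the shortcut.
    evalWithShortcut(branches, elseValue, batch)
  } else {
    // Too many predicates to cache safely: regular per-row CaseWhen.
    evalRegular(branches, elseValue, batch)
  }
}
```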

Collaborator Author

This was for OOM testing. Now updated.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman firestarman changed the title A small improvement in GPU If and CaseWhen for the all-true and all-false cases. WIP: A small improvement in GPU If and CaseWhen for the all-true and all-false cases. Dec 15, 2021
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@firestarman firestarman changed the title WIP: A small improvement in GPU If and CaseWhen for the all-true and all-false cases. A small improvement in GPU If and CaseWhen for the all-true and all-false cases. Dec 15, 2021
@firestarman firestarman marked this pull request as ready for review December 15, 2021 10:24
@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

firestarman commented Dec 15, 2021

I have confirmed that the OOM issue in the premerge build has nothing to do with the Scala code change. I can reproduce it with the simple PR #4363. Besides, I also checked the memory metrics when running the tests locally, but found no noticeable increase.

However, the OOM issue made me think about possible GPU OOM when caching too many predicate columns, so I introduced a config to limit the number of columns to be cached.

@firestarman
Collaborator Author

firestarman commented Dec 15, 2021

> Given the complexity of the CaseWhen optimization, we may want to split this PR and get the more straightforward GpuIf optimization in while we work through the complexity of CaseWhen.

Yes. However, our all-true use case goes through case-when, so I would like to get GpuCaseWhen in even more than GpuIf.

Member

@jlowe jlowe left a comment

> However, our all-true use case goes through case-when

Is this use-case using a degenerate case-when (i.e.: just one branch and therefore like an If) or does it have three possible paths (i.e.: two predicate branches + else)? I'm assuming the latter since you defaulted the config to 2.

@@ -108,7 +108,7 @@ ci_2() {
 export TEST_TYPE="pre-commit"
 export TEST_PARALLEL=4
 # separate process to avoid OOM kill
-TEST='conditionals_test or window_function_test' ./integration_tests/run_pyspark_from_build.sh
+TEST_PARALLEL=2 TEST='conditionals_test or window_function_test' ./integration_tests/run_pyspark_from_build.sh
Member

Not sure why this change is part of this PR if the GPU OOM during premerge tests is not related to the case-when changes?

Collaborator Author

@firestarman firestarman Dec 16, 2021

This is to get the premerge build to pass.
The OOM we are talking about is not on the GPU; it is host memory on the CPU.

What I just found is that if we add enough new tests for case_when with the default parallelism (4, according to the build log), the ci_2 process runs into OOM (host memory exceeds the pod's upper limit). When it is reduced to 2, the build passes.

I suspect we either fail to release some memory between tests, or there is a memory-management issue in our premerge build, since the host memory of the ci_2 process kept increasing while the tests were running.

"and else expressions if the predicates on a batch are all-true or all-false. Big number " +
"may get GPU OOM easily since the predicates are cached during the computation.")
.integerConf
.createWithDefault(2)
Member

I'm not a fan of this config. Users will not know what to set it to for their queries, nor will they realize this config is what needs to be tuned if the GPU runs out of memory due to it.

For small input batches we probably could cache far more than 2, but for very large batches we may not even want to do 2. What we really want is a budget, in memory size not branch count or row sizes, that the CaseWhen expression is allowed to use. Unfortunately we don't currently have a clear budget for this (and other expressions could similarly leverage this budget for its processing/optimizations).

cc: @revans2 since he may have further thoughts on this.
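
A hypothetical sketch of the memory-budget idea (the budget constant, the getDeviceMemorySize usage, and all names here are assumptions for illustration; nothing like this existed in the plugin at the time):

```scala
// Illustrative only: cache predicate columns until their total device memory
// crosses a budget, then abandon the optimization. The caller is assumed to
// close everything accumulated in `cached`.
import scala.collection.mutable.ArrayBuffer

val predicateCacheBudgetBytes = 64L * 1024 * 1024 // assumed 64 MiB budget
val cached = ArrayBuffer[GpuColumnVector]()
var cachedBytes = 0L
var withinBudget = true
for (predExpr <- predicateExprs if withinBudget) {
  val col = GpuExpressionsUtils.columnarEvalToColumn(predExpr, batch)
  cached += col
  cachedBytes += col.getBase.getDeviceMemorySize
  if (cachedBytes > predicateCacheBudgetBytes) {
    withinBudget = false // over budget: fall back to the regular path
  }
}
```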

Collaborator Author

@firestarman firestarman Dec 16, 2021

Agreed. Besides, different GPUs have different memory sizes; a value that works on an A100 may not work on a T4.

For most cases, though, I think users will not need to touch this config, since 2 should be small enough to run without GPU OOM. Personally, I don't want to maximize this config value to enable this special optimization as much as possible; after all, it is not a common optimization. But making it configurable gives users a chance to tune it if they really have such cases. Maybe the doc is not good enough and needs more explanation.

@firestarman
Collaborator Author

firestarman commented Dec 16, 2021

> Is this use-case using a degenerate case-when (i.e.: just one branch and therefore like an If) or does it have three possible paths (i.e.: two predicate branches + else)? I'm assuming the latter since you defaulted the config to 2.

It is not related to the use case. 2 is just one more than what GpuIf needs, and it does not take noticeably more GPU memory for the computation.

@jlowe
Member

jlowe commented Dec 17, 2021

Based on offline discussions, we're holding off on the CaseWhen corner-case optimizations for now due to the memory-impact concerns, and the GpuIf all-true / all-false optimization is already folded into #4358.

@firestarman firestarman deleted the condition-opt branch December 30, 2021 03:00
Labels
performance
Development

Successfully merging this pull request may close these issues.

[FEA] Optimize the all-true and all-false cases in GPU If and CaseWhen