A small improvement in GPU `If` and `CaseWhen` for the all-true and all-false cases. #4329
Conversation
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
This can improve performance when the true or false expression is an expensive operation (e.g. a row-based UDF), since that expression will be skipped entirely based on the predicate result.
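For illustration, a minimal CPU sketch of the short-circuit idea, using plain Scala arrays in place of GPU columns; all names here are stand-ins, not the plugin's actual code:

```scala
// Sketch only: `evalTrue`/`evalFalse` stand in for potentially expensive
// branch expressions (e.g. a row-based UDF) that we want to avoid running.
def ifShortCircuit[T](
    pred: Array[Boolean],
    evalTrue: () => Array[T],
    evalFalse: () => Array[T],
    gather: (Array[Boolean], Array[T], Array[T]) => Array[T]): Array[T] = {
  if (pred.forall(identity)) {
    evalTrue() // all-true: the false side is never evaluated
  } else if (!pred.exists(identity)) {
    evalFalse() // all-false: the true side is never evaluated
  } else {
    gather(pred, evalTrue(), evalFalse()) // mixed: normal path, both sides run
  }
}
```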
This skip benefit is very specific to the case where everything skews to one side of the condition. Have we done any performance measurements on this? Personally, I do not want yet another config that users will not notice or remember to enable/disable. If this is really cheap, I'd rather just enable it all the time than have a config. We should run this on the NDS benchmark to see if it is noticeable when enabled.
We do indeed have this kind of case, so I made this change, and @viadea verified that it improved perf (1.5 min -> 40 sec).
Per the doc of
@nvliyuan is helping on this.
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Some of the parallel Java agents crashed in CI with this change. This is always reproducible in pre-merge CI but not locally (even with the pre-merge image). According to the pod memory metrics, the agent was OOM-killed (50G+). Still not sure why the OOM happens only in the pre-merge case and not locally.
build
The new CaseWhen algorithm can take a lot more memory when there's a large number of cases to check, so I'm not sure it makes sense to unconditionally do this new algorithm. CI showed that use-cases that ran perfectly well before this change can suddenly crash with the new algorithm. As we move down the predicates and cache the result of their evaluation, we really should be making them spillable to avoid the OOM condition. However even if we make them spillable, if we cache so many that we start spilling the whole point of doing this algorithm (i.e.: performance) is foiled.
We may want to be smarter about deciding when to perform the algorithm for CaseWhen (e.g.: only perform the new algorithm when the number of predicates to check is small). For IfElse it seems clear this is generally a good thing, but I'm becoming doubtful that this is a net win for CaseWhen as it is written now.
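To make the "spillable" point concrete, here is a toy sketch of a cache whose entries can be evicted under memory pressure and recomputed on access; this is a simplified stand-in, not the plugin's real spill framework:

```scala
// Hypothetical sketch: cached predicate results are held in a store that can
// drop entries under memory pressure and recompute them on the next access.
final class SpillableCache[K, V](recompute: K => V) {
  private val cache = scala.collection.mutable.Map.empty[K, V]
  def get(key: K): V = cache.getOrElseUpdate(key, recompute(key)) // reload if evicted
  def spill(key: K): Unit = cache.remove(key) // evict under memory pressure
}
```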
Yes, you are absolutely right. We need to take more care in CaseWhen to strike a balance between performance and memory. I think it is a little hard to figure out a number that is small enough to safely take this new CaseWhen path. How about dropping the cache and evaluating the predicates again if we end up on the mixed-case path? That would minimize the memory used by the computation at the cost of some performance; hopefully it would not be noticeable. I will ask Yuan for more NDS benchmarks. At worst, we would restore the config for CaseWhen only.

One more thing I am a little confused about: why does it take so much host memory (50GB+ for 20 predicates in the premerge build)? We only cache the predicate references, while the data is actually on the GPU.
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Moved to draft since premerge still failed even without caching the predicates.
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Given the complexity of the CaseWhen optimization, we may want to split this PR and get the more straightforward GpuIf optimization in while we work through the CaseWhen issues.
}
withResource(col.getBase.any()) { anyTrue =>
  // null values are considered false values in this context
  !anyTrue.isValid || !anyTrue.getBoolean
Now that the row count is being checked against the null count, we shouldn't need to check the validity of the `any` result (and thus avoid another GPU synchronization to fetch the value). It can only be null if there are no rows or all rows are null, but that case is now covered by the row count == null count check.
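A CPU sketch of the simplified check, with `Option[Boolean]` standing in for a nullable boolean column (illustrative only):

```scala
// Once rowCount == nullCount is handled up front, the reduction below always
// sees at least one non-null value, so no extra validity check (and, in the
// real code, no extra device round trip) is needed.
def isAllFalse(col: Array[Option[Boolean]]): Boolean = {
  val nullCount = col.count(_.isEmpty)
  if (col.length == nullCount) {
    true // empty or all-null: nulls are treated as false here
  } else {
    !col.flatten.exists(identity) // the any() analogue is guaranteed non-null
  }
}
```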
Updated
// Re-evaluate the predicate expressions here, sacrificing a little perf to
// minimize the memory used by the computation, since caching the predicates
// may cause OOM when there are too many branches.
withResource(GpuExpressionsUtils.columnarEvalToColumn(predExpr, batch)) { p =>
Do we really want to double-evaluate the expression? What if the predicate expression is relatively expensive to evaluate (e.g.: is a regular expression match on a large string column or a UDF)?
Rather than waste work re-evaluating expressions that have unknown time and complexity, I'd rather see an optimization for the common cases where the case/when is really just an if/else or we budget how much memory we're willing to spend on caching the predicate expressions. In other words, if we see a case/when that has a lot of predicate expressions, maybe we abandon trying to perform the optimization (or maybe just check the very first predicate for the all true case but otherwise fallback) to avoid excessive memory computation.
As it is now, I'm not convinced this is a net win, since we are making this significantly more expensive whenever a batch does not fall entirely into one predicate.
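A sketch of the "check only the first predicate, otherwise fall back" variant mentioned above (plain-Scala stand-ins, not plugin code):

```scala
// At most one predicate result is ever held; anything other than the
// everything-hits-the-first-branch case takes the normal path unchanged.
def caseWhenFirstPredOnly[T](
    firstPred: () => Array[Boolean],
    firstValue: () => Array[T],
    normalPath: () => Array[T]): Array[T] = {
  val p = firstPred()
  if (p.nonEmpty && p.forall(identity)) {
    firstValue() // every row hits the first branch: remaining predicates skipped
  } else {
    normalPath() // otherwise: no predicate caching at all
  }
}
```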
This was for the OOM test. Now updated.
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
build
I have confirmed that the OOM issue in the premerge build has nothing to do with the code change in Scala; I can reproduce it with the simple PR #4363. Besides, I also checked the memory metrics when running the tests locally but found no noticeable increase. However, the OOM issue made me think about a possible GPU OOM if too many predicate columns are cached, so I introduced a config to limit the number of columns to be cached.
Yes, however the all-true case is using case-when.
> however the all-true case is using case-when

Is this use-case using a degenerate case-when (i.e.: just one branch, and therefore like an `If`), or does it have three possible paths (i.e.: two predicate branches + else)? I'm assuming the latter since you defaulted the config to 2.
@@ -108,7 +108,7 @@ ci_2() {
   export TEST_TYPE="pre-commit"
   export TEST_PARALLEL=4
   # separate process to avoid OOM kill
-  TEST='conditionals_test or window_function_test' ./integration_tests/run_pyspark_from_build.sh
+  TEST_PARALLEL=2 TEST='conditionals_test or window_function_test' ./integration_tests/run_pyspark_from_build.sh
Not sure why this change is part of this PR, if the GPU OOM during premerge tests is not related to the case-when changes?
This is to pass the premerge build.
The OOM we are talking about is not on the GPU, but in host memory on the CPU.
What I just found is that if we add enough new tests for case_when with the default parallel value (4, according to the build log), the ci_2 process runs into OOM (host memory goes beyond the pod's upper limit). When it is reduced to 2, the build can pass.
I suspect we either missed releasing some memory between tests, or there is a memory-management issue in our premerge build, since the host memory of the ci_2 process kept increasing while the tests were running.
"and else expressions if the predicates on a batch are all-true or all-false. Big number " + | ||
"may get GPU OOM easily since the predicates are cached during the computation.") | ||
.integerConf | ||
.createWithDefault(2) |
I'm not a fan of this config. Users will not know what to set it to for their queries, nor will they realize this config is what needs to be tuned if the GPU runs out of memory due to it.
For small input batches we probably could cache far more than 2, but for very large batches we may not even want to do 2. What we really want is a budget, in memory size not branch count or row sizes, that the CaseWhen expression is allowed to use. Unfortunately we don't currently have a clear budget for this (and other expressions could similarly leverage this budget for its processing/optimizations).
cc: @revans2 since he may have further thoughts on this.
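Sketched out, the byte-budget check might look like this (names and the budget value are illustrative only):

```scala
// Decide per column whether caching one more predicate result stays within a
// fixed memory budget, instead of counting branches.
def canCacheAnother(
    cachedBytes: Long,
    nextColumnBytes: Long,
    budgetBytes: Long = 256L * 1024 * 1024): Boolean =
  cachedBytes + nextColumnBytes <= budgetBytes
```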
Agree. Besides, different GPUs have different memory sizes; a value that works on an A100 may not work on a T4.
However, in most cases I think users will not need to care about this config, since 2 should be small enough to run without GPU OOM. Personally, I don't want to maximize this config value to enable this special optimization as much as possible; after all, it is not a common optimization. But making it configurable gives users a chance to tune it if they really have such cases. Maybe the doc is not good enough and needs more explanation.
It is not related to the use-case. 2 is just one more than what GpuIf uses, and it does not take noticeable additional GPU memory for the computation.
Based on offline discussions we're holding off on CaseWhen corner-case optimizations for now due to the memory impact concerns, and the GpuIf all-true / all-false optimization is already folded into #4358. |
This PR is to support the lazy evaluation of the true or false expression in GPU `If` and `CaseWhen`, only when the predicates evaluated from an input batch are all true or all false. This can improve performance when the true or false expression is an expensive operation (e.g. a row-based UDF), since it will be skipped according to the predicate result.
fixes #4350
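For context, a hypothetical query shape that benefits: when every row in a batch satisfies the predicate, the expensive UDF on the other branch is skipped for that batch. Demo only; the UDF and data are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object IfShortCircuitDemo extends App {
  val spark = SparkSession.builder().appName("if-shortcircuit-demo").getOrCreate()
  import spark.implicits._

  // Stand-in for a costly row-based UDF.
  val expensiveUdf = udf((s: String) => s.reverse)

  val df = spark.range(1000000).withColumn("s", $"id".cast("string"))

  // If a batch is all-true for `id >= 0`, the otherwise() branch is skipped.
  df.select(when($"id" >= 0, $"s").otherwise(expensiveUdf($"s")).as("v")).count()

  spark.stop()
}
```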