
A small improvement in GPU If and CaseWhen for the all-true and all-false cases. #4329

Closed
firestarman wants to merge 20 commits from the condition-opt branch

Conversation

firestarman
Collaborator

@firestarman firestarman commented Dec 8, 2021

This PR adds lazy evaluation of the true and false expressions in GPU If and CaseWhen: when the predicate values computed from an input batch are all true or all false, only the branch that is actually needed is evaluated.

This can improve performance when the skipped expression is expensive (e.g. a row-based UDF).

fixes #4350
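
The core shape of the change, as a minimal sketch for GpuIf (the helper names isAllTrue, isAllFalse, and evaluateBothAndSelect are illustrative, not the actual plugin code; imports and the surrounding class are omitted):

```scala
// A minimal sketch of the GpuIf idea, assuming `predicate`, `trueExpr` and
// `falseExpr` are the child expressions of the conditional.
override def columnarEval(batch: ColumnarBatch): Any = {
  withResource(GpuExpressionsUtils.columnarEvalToColumn(predicate, batch)) { pred =>
    if (isAllTrue(pred)) {
      // Every row takes the true branch, so the false expression
      // (possibly an expensive row-based UDF) is never evaluated.
      trueExpr.columnarEval(batch)
    } else if (isAllFalse(pred)) {
      // Symmetric case: the true expression is skipped entirely.
      falseExpr.columnarEval(batch)
    } else {
      // Mixed batch: evaluate both sides and select per row, as before.
      evaluateBothAndSelect(pred, batch)
    }
  }
}
```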

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Member

@jlowe jlowe left a comment

> This can improve performance when the skipped expression is expensive (e.g. a row-based UDF).

This skip benefit is very specific to the case where everything is skewing away from one side of the condition. Have we done any performance measurements on this? Personally I do not want yet another config that users will not notice or remember to enable/disable. If this is really cheap, I'd rather just enable it all the time than have a config. We should run this on the NDS benchmark to see if it is noticeable when enabled.

@jlowe jlowe added the performance label Dec 8, 2021
@firestarman
Collaborator Author

firestarman commented Dec 9, 2021

> This skip benefit is very specific to the case where everything is skewing away from one side of the condition. Have we done any performance measurements on this?

We do indeed have this kind of case, which is why I made this change, and @viadea verified that it improved performance (1.5 min -> 40 sec).

> Personally I do not want yet another config that users will not notice or remember to enable/disable. If this is really cheap, I'd rather just enable it all the time than have a config.

Per the doc of getNullCount, it can be an expensive operation, which I thought could slow down the normal cases, so I added this config. I am OK with removing the config and always enabling the optimization if it is cheap enough.

@firestarman
Collaborator Author

firestarman commented Dec 9, 2021

> We should run this on the NDS benchmark to see if it is noticeable when enabled.

@nvliyuan is helping with this.
He has just completed the benchmark with this PR, and the results are good enough to enable this all the time, so I removed the config.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

3 similar comments
@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@pxLi
Collaborator

pxLi commented Dec 13, 2021

Some of the parallel Java agents crashed in CI with this change. It is always reproducible in the pre-merge CI but not locally (even with the pre-merge image). It could be some other issue, such as hardware or kernel level; still not quite sure what the root cause is.

According to the pod memory metrics, the agent was OOM killed (50G+). Still not sure why the OOM happens only in the pre-merge case and not locally.

@pxLi
Collaborator

pxLi commented Dec 13, 2021

build

@firestarman
Collaborator Author

build

Member

@jlowe jlowe left a comment

The new CaseWhen algorithm can take a lot more memory when there is a large number of cases to check, so I'm not sure it makes sense to perform it unconditionally. CI showed that use cases that ran perfectly well before this change can suddenly crash with the new algorithm. As we move down the predicates and cache the result of their evaluation, we really should be making them spillable to avoid the OOM condition. However, even if we make them spillable, if we cache so many that we start spilling, the whole point of doing this algorithm (i.e.: performance) is foiled.

We may want to be smarter about deciding when to perform the algorithm for CaseWhen (e.g.: only perform the new algorithm when the number of predicates to check is small). For IfElse it seems clear this is generally a good thing, but I'm becoming doubtful that this is a net win for CaseWhen as it is written now.

@firestarman
Collaborator Author

firestarman commented Dec 14, 2021

> The new CaseWhen algorithm can take a lot more memory when there is a large number of cases to check, so I'm not sure it makes sense to perform it unconditionally ...

Yes, you are absolutely right. We need to take more care in CaseWhen to strike a balance between performance and memory.

I think it is a little hard to pick a branch count that is small enough to safely take the new CaseWhen path. How about dropping the cache and evaluating the predicates again if execution finally falls into the mixed-case path? That minimizes the memory used for the computation at the cost of some performance, which hopefully would not be noticeable. I will ask Yuan for more NDS benchmarks. At worst, we can bring back the config for CaseWhen only.

One more thing I am a little confused about: why does this take so much host memory (50GB+ for 20 predicates in the premerge build)? We only cache the predicate references, while the data actually lives on the GPU.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@firestarman firestarman marked this pull request as draft December 14, 2021 07:57
@firestarman
Collaborator Author

Moving to draft since premerge still failed even without caching the predicates.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Member

@jlowe jlowe left a comment

Given the complexity of the CaseWhen optimization, we may want to split this PR and get the more straightforward GpuIf optimization in while we work through the complexity of CaseWhen.

}
withResource(col.getBase.any()) { anyTrue =>
  // null values are considered false values in this context
  !anyTrue.isValid || !anyTrue.getBoolean
Member

Now that the row count is being checked against the null count, we shouldn't need to check the validity of the any result (and thus avoid another GPU synchronization to fetch the value). It can only be null if there are no rows or all rows are null, but that case is now covered by the row count == null count check.
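
A minimal sketch of what the suggested check could look like (the method name isAllFalse and its surroundings are illustrative, not the exact plugin code):

```scala
// Illustrative sketch, assuming `col` is a boolean GpuColumnVector and nulls
// count as false. When rowCount == nullCount, every row is null, so the
// answer is known without calling any(); otherwise at least one row is
// valid, the Scalar returned by any() is guaranteed to be valid, and the
// isValid check (with its extra GPU synchronization) can be dropped.
def isAllFalse(col: GpuColumnVector): Boolean = {
  if (col.getBase.getRowCount == col.getBase.getNullCount) {
    true
  } else {
    withResource(col.getBase.any()) { anyTrue =>
      !anyTrue.getBoolean
    }
  }
}
```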

Collaborator Author

Updated

// Re-evaluate the predicate expressions here, sacrificing a little performance
// to minimize the memory footprint of the computation, since caching the
// predicates may cause OOM when there are too many branches.
withResource(GpuExpressionsUtils.columnarEvalToColumn(predExpr, batch)) { p =>
Member

Do we really want to double-evaluate the expression? What if the predicate expression is relatively expensive to evaluate (e.g.: is a regular expression match on a large string column or a UDF)?

Rather than waste work re-evaluating expressions that have unknown time and complexity, I'd rather see an optimization for the common cases where the case/when is really just an if/else or we budget how much memory we're willing to spend on caching the predicate expressions. In other words, if we see a case/when that has a lot of predicate expressions, maybe we abandon trying to perform the optimization (or maybe just check the very first predicate for the all true case but otherwise fallback) to avoid excessive memory computation.

As it is now, I'm not convinced this is a net win, since we are making this significantly more expensive in the case where a batch does not fall entirely into one predicate.
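
One way to read the gating suggestion, as a hypothetical sketch (maxOptimizedBranches and the two eval helpers are invented for illustration, not plugin API):

```scala
// Illustrative only: attempt the all-true/all-false shortcut only when the
// number of branches is small, so the cached predicate columns stay within
// a small, bounded memory footprint; otherwise evaluate as before.
val maxOptimizedBranches = 2 // a deliberately small limit
override def columnarEval(batch: ColumnarBatch): Any = {
  if (branches.size <= maxOptimizedBranches) {
    // Cheap to cache a couple of predicate columns: try the shortcut.
    evalWithShortcut(branches, elseValue, batch)
  } else {
    // Too many predicates to cache safely: regular per-row CaseWhen.
    evalRegular(branches, elseValue, batch)
  }
}
```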

Collaborator Author

This was for OOM testing. Now updated.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman firestarman changed the title A small improvement in GPU If and CaseWhen for the all-true and all-false cases. WIP: A small improvement in GPU If and CaseWhen for the all-true and all-false cases. Dec 15, 2021
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

@firestarman firestarman changed the title WIP: A small improvement in GPU If and CaseWhen for the all-true and all-false cases. A small improvement in GPU If and CaseWhen for the all-true and all-false cases. Dec 15, 2021
@firestarman firestarman marked this pull request as ready for review December 15, 2021 10:24
@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

firestarman commented Dec 15, 2021

I have confirmed that the OOM issue in the premerge build has nothing to do with the Scala code change. I can reproduce it with the simple PR #4363. Besides, I also checked the memory metrics when running the tests locally, but found no noticeable increase.

However, the OOM issue made me think about possible GPU OOM when caching too many predicate columns, so I introduced a config to limit the number of columns to be cached.

@firestarman
Collaborator Author

firestarman commented Dec 15, 2021

> Given the complexity of the CaseWhen optimization, we may want to split this PR and get the more straightforward GpuIf optimization in while we work through the complexity of CaseWhen.

Yes. However, our all-true use case goes through case-when, so I would like to get GpuCaseWhen in even more than GpuIf.

Member

@jlowe jlowe left a comment

> However, our all-true use case goes through case-when

Is this use-case using a degenerate case-when (i.e.: just one branch and therefore like an If) or does it have three possible paths (i.e.: two predicate branches + else)? I'm assuming the latter since you defaulted the config to 2.

@@ -108,7 +108,7 @@ ci_2() {
 export TEST_TYPE="pre-commit"
 export TEST_PARALLEL=4
 # separate process to avoid OOM kill
-TEST='conditionals_test or window_function_test' ./integration_tests/run_pyspark_from_build.sh
+TEST_PARALLEL=2 TEST='conditionals_test or window_function_test' ./integration_tests/run_pyspark_from_build.sh
Member

Not sure why this change is part of this PR if the GPU OOM during premerge tests is not related to the case-when changes?

Collaborator Author

@firestarman firestarman Dec 16, 2021

This is to get the premerge build to pass.
The OOM we are talking about is not on the GPU; it is host memory on the CPU.

What I just found is that if we add enough new tests for case_when with the default parallelism (4, according to the build log), the ci_2 process runs into OOM (host memory exceeds the pod's upper limit). When it is reduced to 2, the build passes.

I suspect we either fail to release some memory between tests, or there is a memory-management issue in our premerge build, since the host memory of the ci_2 process kept increasing while the tests were running.

"and else expressions if the predicates on a batch are all-true or all-false. Big number " +
"may get GPU OOM easily since the predicates are cached during the computation.")
.integerConf
.createWithDefault(2)
Member

I'm not a fan of this config. Users will not know what to set it to for their queries, nor will they realize this config is what needs to be tuned if the GPU runs out of memory due to it.

For small input batches we probably could cache far more than 2, but for very large batches we may not even want to do 2. What we really want is a budget, in memory size not branch count or row sizes, that the CaseWhen expression is allowed to use. Unfortunately we don't currently have a clear budget for this (and other expressions could similarly leverage this budget for its processing/optimizations).

cc: @revans2 since he may have further thoughts on this.
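
A hypothetical sketch of the memory-budget idea (the budget constant, the getDeviceMemorySize usage, and all names here are assumptions for illustration; nothing like this existed in the plugin at the time):

```scala
// Illustrative only: cache predicate columns until their total device memory
// crosses a budget, then abandon the optimization. The caller is assumed to
// close everything accumulated in `cached`.
import scala.collection.mutable.ArrayBuffer

val predicateCacheBudgetBytes = 64L * 1024 * 1024 // assumed 64 MiB budget
val cached = ArrayBuffer[GpuColumnVector]()
var cachedBytes = 0L
var withinBudget = true
for (predExpr <- predicateExprs if withinBudget) {
  val col = GpuExpressionsUtils.columnarEvalToColumn(predExpr, batch)
  cached += col
  cachedBytes += col.getBase.getDeviceMemorySize
  if (cachedBytes > predicateCacheBudgetBytes) {
    withinBudget = false // over budget: fall back to the regular path
  }
}
```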

Collaborator Author

@firestarman firestarman Dec 16, 2021

Agreed. Besides, different GPUs have different memory sizes; a value that works on an A100 may not work on a T4.

For most cases, though, I think users will not need to touch this config, since 2 should be small enough to run without GPU OOM. Personally, I don't want to maximize this config value to enable this special optimization as much as possible; after all, it is not a common optimization. But making it configurable gives users a chance to tune it if they really have such cases. Maybe the doc is not good enough and needs more explanation.

@firestarman
Collaborator Author

firestarman commented Dec 16, 2021

> Is this use-case using a degenerate case-when (i.e.: just one branch and therefore like an If) or does it have three possible paths (i.e.: two predicate branches + else)? I'm assuming the latter since you defaulted the config to 2.

It is not related to the use case. 2 is just one more than what GpuIf needs, and it does not take noticeably more GPU memory for the computation.

@jlowe
Member

jlowe commented Dec 17, 2021

Based on offline discussions, we're holding off on the CaseWhen corner-case optimizations for now due to the memory-impact concerns, and the GpuIf all-true / all-false optimization is already folded into #4358.

@firestarman firestarman deleted the condition-opt branch December 30, 2021 03:00
Labels
performance
Development

Successfully merging this pull request may close these issues.

[FEA] Optimize the all-true and all-false cases in GPU If and CaseWhen