[BUG] In ANSI mode we can fail in cases Spark would not due to conditionals #3849
Comments
This is related to #3855 because it could happen for them too.
Talking to @sameerz and @tgravescs, I think the best way to try to fix this would be to have an optional boolean column passed with the batch when doing a project. It would be a mask of which rows are actually being executed and which ones are not. That way, in ANSI mode we could look at this mask and not throw exceptions if the row was not intended to be taken. And in the case of #3855 we could just skip execution on that path and insert a null or some other default value as the result instead.
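A minimal sketch of what that could look like, assuming a hypothetical trait and method name (columnarEvalWithTakenMask is not an existing plugin API):

```scala
// Hypothetical sketch only: the trait and method name are assumptions, not the
// plugin's actual API. The idea is that a conditional passes down a boolean
// column marking which rows it will really take.
import ai.rapids.cudf.ColumnVector
import com.nvidia.spark.rapids.GpuColumnVector
import org.apache.spark.sql.vectorized.ColumnarBatch

trait GpuExpressionWithTakenMask {
  // `mask` has one boolean per row; false means "this row's result will be
  // discarded", so in ANSI mode the row must not be allowed to throw, and for
  // lazily evaluated functions it could simply be filled with null.
  def columnarEvalWithTakenMask(batch: ColumnarBatch,
                                mask: Option[ColumnVector]): GpuColumnVector
}
```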
@jlowe and @tgravescs, I would love feedback if there is a better way you can think of to fix this.

Okay, let's work through an example like the one above.
So in this case we end up with an expression tree like
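The example query and its tree were not captured in this page; as a purely illustrative stand-in, a query of the form IF(a < 100, a + 100, a) (an assumption, not necessarily the issue's actual query) corresponds to a Catalyst tree like the one below, with each node replaced by its Gpu* counterpart in the plugin:

```scala
// Illustrative only: the issue's actual query/tree was not captured here.
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.LongType

val a = AttributeReference("a", LongType)()
val tree = If(
  LessThan(a, Literal(100L)),  // predicate
  Add(a, Literal(100L)),       // trueExpr: can overflow and throw in ANSI mode
  a                            // falseExpr: just passes the value through
)
```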
Right now, with the current code in spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/conditionalExpressions.scala (lines 30 to 55 at 9d5ed8d), both children are evaluated before their results are combined.
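That snippet is embedded from the repo and not reproduced here; as a rough host-side illustration (plain Scala standing in for the cudf column operations), the current path effectively does the following:

```scala
// Rough host-side illustration of the current behavior (the real code operates on
// cudf columns): both branches are fully evaluated before anything is selected.
def eagerIfElse(pred: Array[Boolean],
                input: Array[Long],
                trueExpr: Long => Long,
                falseExpr: Long => Long): Array[Long] = {
  val allTrue  = input.map(trueExpr)   // evaluated for every row, even where pred is false
  val allFalse = input.map(falseExpr)  // likewise
  pred.indices.map(i => if (pred(i)) allTrue(i) else allFalse(i)).toArray
}
```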
The false value has already been executed by the time the results are combined. I think the proposal would be to have an API like the one at spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/arithmetic.scala (line 170 at 9d5ed8d), but one that also takes the mask into account.
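A host-side sketch of how such a check could consult the mask before throwing (the helper name and the mask parameter are assumptions; the real check would operate on cudf columns):

```scala
// Sketch under assumptions: `overflow` flags rows whose result overflowed and
// `mask` marks which rows the conditional actually takes. Only an overflow on a
// taken row should fail the query in ANSI mode.
def assertNoOverflowOnTakenRows(overflow: Array[Boolean],
                                mask: Option[Array[Boolean]]): Unit = {
  val taken = mask.getOrElse(Array.fill(overflow.length)(true))
  if (overflow.zip(taken).exists { case (ov, t) => ov && t }) {
    throw new ArithmeticException("One or more rows overflow for Add operation.")
  }
}
```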
It would combine the result of the overflow check with this mask. This is going to be problematic for a number of reasons. The ones I can think of off the top of my head are:
To work around some of the higher memory usage, etc., I would like it if we could have a new method in the expressions that reports whether they have side effects. Then, when there is nothing under the if that has side effects, no mask would be needed.
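A minimal sketch of that idea, assuming a hasSideEffects-style method (the trait and names are assumptions, not an existing plugin API):

```scala
// Sketch under assumptions: expressions report whether evaluating them can do
// anything beyond producing a value (throw in ANSI mode, raise_error, UDFs, ...).
trait MaybeHasSideEffects {
  // Overridden by expressions that can throw (ANSI overflow, divide by zero,
  // assert_true/raise_error, opaque UDFs).
  def selfHasSideEffects: Boolean = false
  def sideEffectChildren: Seq[MaybeHasSideEffects] = Seq.empty
  // A conditional only needs the masking/filtering machinery when something
  // underneath it actually has side effects.
  final def hasSideEffects: Boolean =
    selfHasSideEffects || sideEffectChildren.exists(_.hasSideEffects)
}
```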
When I first saw this issue go by, I wasn't thinking we would change the general expression evaluation API. This isn't completely thought out, but I was thinking of doing something like the following in the conditional expressions themselves:
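The concrete steps of that comment did not survive this page; the plain-Scala sketch below (an assumption, standing in for the cudf filter/gather operations) captures the shape of the idea: evaluate each branch only on the rows that will actually take it.

```scala
// Host-side illustration only; on the GPU each step would be a cudf column
// operation (filter, evaluate, then stitch the results back into row order).
def lazyIfElse(pred: Array[Boolean],
               input: Array[Long],
               trueExpr: Long => Long,
               falseExpr: Long => Long): Array[Long] = {
  // 1. Split the rows by the predicate (a filter on the GPU).
  val trueRows  = input.zip(pred).collect { case (v, true)  => v }
  val falseRows = input.zip(pred).collect { case (v, false) => v }
  // 2. Evaluate each branch only on its own rows, so a branch that would throw
  //    in ANSI mode never sees the rows that were meant to avoid it.
  val trueResults  = trueRows.map(trueExpr).iterator
  val falseResults = falseRows.map(falseExpr).iterator
  // 3. Put the results back in the original row order (gather/copy_if_else on the GPU).
  pred.map(p => if (p) trueResults.next() else falseResults.next())
}
```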
Having a
The algorithm I described above could prove to be more efficient in practice even without side effects if it's particularly expensive to evaluate one or both of the conditional expressions, as it only evaluates each conditional expression on the minimal set of input values for that expression. The current approach evaluates both expressions on every row regardless.
@jlowe I like your proposal because it would have less impact on existing code. I am concerned about the memory impact it might have. For each if/else in the code we are going to make up to a full copy of the data, so if an expression has lots of nested conditions in it, the amount of memory needed is going to grow proportionally. Possibly not the end of the world, though, and in the future we might even be able to make it spillable.

Because of the memory issues, I think for steps 2 and 3 we would just filter the batch and not bother with a gather map. For step 4 a custom kernel would be best, but I think we could make a version that would work without one if needed. We could do a scan on the conditional and count the true values. This would give us a gather map for the trueExpr result, but with repeated keys. We would then set all of the false values to -1 so we get a null in the gathered result. The null in the gathered result would only be needed for nested types, to avoid using too much memory with repeated values.

Then we do the same thing for the false side, but with the inverse of the condition. Finally, we can use copy_if_else to put it all back together into a final result. Not great, but it should be doable without anything custom.
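A plain-Scala sketch of that fallback (an illustration of the logic only; the real version would use cudf's scan, a gather that turns out-of-range indices into nulls, and copy_if_else):

```scala
// Build a gather map from a filtered branch result back to full row order: a
// prefix sum over the predicate gives each matching row its index into the
// filtered result, and -1 marks rows that should come back as null when gathered.
def gatherMapFor(pred: Array[Boolean], takeWhen: Boolean): Array[Int] = {
  // Exclusive prefix sum: how many matching rows come before each row.
  val before = pred.scanLeft(0)((count, p) => if (p == takeWhen) count + 1 else count)
  pred.indices.map(i => if (pred(i) == takeWhen) before(i) else -1).toArray
}

// Usage sketch: gather the trueExpr result with gatherMapFor(pred, takeWhen = true),
// gather the falseExpr result with gatherMapFor(pred, takeWhen = false), then use
// copy_if_else on the predicate to combine the two gathered columns.
```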
Describe the bug
When Spark evaluates an if or a case/when function it will do lazy evaluation of the children. So the else clause of an if/else is only evaluated if the condition is false. For the cudf version we evaluate both sides no matter what. This is fine when an operation has no side effects, but in ANSI mode many operations can have big side effects (failing the query). This can also be true for UDFs and for assert_true/raise_error, which we do not support yet in non-ANSI mode.

Steps/Code to reproduce bug
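The original reproduction snippet was not captured here; a spark-shell session of roughly this shape (the exact query is an assumption) shows the behavior described below:

```scala
// Illustrative reproduction only; the issue's exact query was not captured here.
spark.conf.set("spark.sql.ansi.enabled", "true")

Seq(Long.MaxValue, 1L).toDF("a")                       // spark-shell implicits in scope
  .selectExpr("IF(a < 100, a + 100, a) AS result")
  .show()

// CPU: succeeds, because a + 100 is only evaluated for rows where a < 100.
// GPU plugin: both branches are evaluated for every row, so a + 100 overflows on
// the Long.MaxValue row and the whole query fails with an ArithmeticException.
```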
The above query works just fine on the CPU because the if/else avoids executing the branch that overflows. On the GPU it fails.
Expected behavior
It should behave the same way as the CPU does.
I think in the short term we can just document this, and possibly fall back to the CPU if ANSI is enabled and we encounter any conditional statements like this. Longer term I think we need to come up with a plan on how to address this.