Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap potentially conditional expressions with IF in PreAggregateCaseAggregations #20652

Merged
merged 1 commit into from
Mar 4, 2024

Conversation

weiatwork
Copy link
Contributor

@weiatwork weiatwork commented Feb 9, 2024

Description

There can be query failures with PreAggregateCaseAggregations.

Repro:

create table t (c1 int, c2 varchar);

insert into t values (1, '1');
insert into t values (2, '2');
insert into t values (3, '3');
insert into t values (4, 'd');

select
  max(case when c1 = 1 then cast(c2 as int) end) as m1,
  max(case when c1 = 2 then c2 end) as m2,
  max(case when c1 = 3 then c2 end) as m3,
  max(case when c1 = 4 then c2 end) as m4
from t;

Query 20240209_015013_00035_6x8dz failed: Cannot cast 'd' to INT

The problem is due to overgeneralized application of the optimization rule. The above example shows a type compatibility issue, but other issues such as division by zero can also happen.

Note: I have not modified the test cases in TestPreAggregateCaseAggregations since it has many uses of CAST. I can update it once this fix is deemed feasible (not sure whether an allowlist or a denylist makes more sense).

Related: #12548

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix `CASE` aggregation queries incorrectly failing when aggregation
  input evaluation could throw an error. ({issue}`20652`)

@cla-bot cla-bot bot added the cla-signed label Feb 9, 2024
@weiatwork weiatwork requested a review from sopel39 February 9, 2024 18:52
Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx! Could you add a test case?

@weiatwork
Copy link
Contributor Author

@sopel39 Thanks for taking a look! I just realized there can be other failing cases, e.g.,

select
  max(CAST(case when c1 = 1 then c2 end AS INT)) as m1,
  max(case when c1 = 2 then c2 end) as m2,
  max(case when c1 = 3 then c2 end) as m3,
  max(case when c1 = 4 then c2 end) as m4
from t;

This example is slightly different in that the CAST appears as the direct argument of the aggregation. I think we need to do something similar to limit the applicability here. But what's tricky is we don't know what type of Expression it is at this point (it may just be a symbol like expr), and we have to go down the tree to find the source assignment for this. Do you know if there is a good way to do this?

@sopel39
Copy link
Member

sopel39 commented Feb 13, 2024

I think the problem might be in PreAggregateCaseAggregations#extractCaseAggregations which doesn't check that aggregation type matches between all case aggregations. IIRC it only resolves that

        if (!cumulativeFunction.getSignature().getReturnType().equals(aggregationType)) {
            // aggregation type after rewrite must not change
            return Optional.empty();
        }

and some other checks

@weiatwork do you think you will be able to fix this?

@sopel39
Copy link
Member

sopel39 commented Feb 13, 2024

it could be that io.trino.sql.planner.iterative.rule.PreAggregateCaseAggregations.PreAggregationKey is missing some type information

@weiatwork
Copy link
Contributor Author

I think regardless the type, the problem is due to the over-generous qualification for CAST in expression like this:

  max(CAST(case when c1 = 1 then c2 end AS INT)) as m1

It shouldn't be allowed in the first place, as pre-aggregating on a CAST can also have side effect.

I added this additional restriction, as well as updating the UT. Please take a look.

@weiatwork weiatwork changed the title Be strict on WhenClause expression with PreAggregateCaseAggregations Make PreAggregateCaseAggregations more restrictive Feb 14, 2024
@sopel39
Copy link
Member

sopel39 commented Feb 14, 2024

BW: the query is correctly rewritten. It correctly fails with Query 20240209_015013_00035_6x8dz failed: Cannot cast 'd' to INT, because d indeed cannot be converted to int

@weiatwork
Copy link
Contributor Author

Yes, VARCHAR 'd' cannot be converted to INT, but the conversion shouldn't have been attempted in the first place: it is conditional based on the context provided by the specific case statement. The optimization's pre-aggregation makes the conversion unconditional, thus causing the failure.

@sopel39
Copy link
Member

sopel39 commented Feb 14, 2024

Yes, VARCHAR 'd' cannot be converted to INT, but the conversion shouldn't have been attempted in the first place:

@martint what does spec says about it?

I would guess we could wrap pre-aggregate expression in if(c1 = 1, cast(c2 as int), null) predicate, so that it only fails when original aggregation would fail

@martint
Copy link
Member

martint commented Feb 14, 2024

In a CASE statement, the conditions are expected to be evaluated eagerly, while the branches only get evaluated for the first condition that matches. So, in the example shown in the description above, this should never fail: case when c1 = 1 then cast(c2 as int) end, as cast(c2 as int) would never get evaluated for c1 = 4

@sopel39
Copy link
Member

sopel39 commented Feb 14, 2024

@weiatwork would you like to work on a fix? For expressions that can fail we need to wrap them in expression similar to if(c1 = 1, cast(c2 as int), null)

@weiatwork weiatwork changed the title Make PreAggregateCaseAggregations more restrictive Wrap potentially conditional expressions with IF in PreAggregateCaseAggregations Feb 15, 2024
@weiatwork
Copy link
Contributor Author

@sopel39 The wrapping approach sounds good. I've identified two scenarios that are subject to this treatment:

  1. agg(CAST(case when c1 = 1 then c2 end AS INT))
  2. agg(case when c1 = 1 then CAST(c2 as int) end)

@sopel39
Copy link
Member

sopel39 commented Feb 15, 2024

@weiatwork there is also division by zero etc.

agg(CAST(case when c1 = 1 then c2 end AS INT))
agg(case when c1 = 1 then CAST(c2 as int) end)

I think these end-up being same plan really even before input to pre-aggregate rule

@weiatwork
Copy link
Contributor Author

The conditional expressions that could potentially fail (incl. division by zero) are covered by #2 above.

You're right the two use cases produce the same plan. I just wanted to exemplify what query patterns can be subject to such failures.

@weiatwork
Copy link
Contributor Author

@sopel39 Do you have any other concerns regarding the fix?

@sopel39
Copy link
Member

sopel39 commented Feb 19, 2024

@sopel39 Do you have any other concerns regarding the fix?

No. I think it should do the job

@weiatwork
Copy link
Contributor Author

Can we have this merged? @sopel39

Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a query test under io.trino.sql.query that replicates the issue before the fix is applied?

@@ -347,9 +349,11 @@ private Optional<CaseAggregation> extractCaseAggregation(Symbol aggregationSymbo
Symbol projectionSymbol = Symbol.from(aggregation.getArguments().get(0));
Expression projection = projectNode.getAssignments().get(projectionSymbol);
Expression unwrappedProjection;
boolean wrappedByCast = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martint do you recall if we have a visitor checking whether expression can throw an exception?

Maybe only then we could wrap expression in IF

@@ -410,13 +468,22 @@ private Optional<CaseAggregation> extractCaseAggregation(Symbol aggregationSymbo
toSqlType(aggregationType)));
}

Expression whenOperand = caseExpression.getWhenClauses().getFirst().getOperand();
Expression whenResult = caseExpression.getWhenClauses().getFirst().getResult();
if (wrappedByCast ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we maybe extend it to simple aritmetic (e.g. mul, addition etc). @martint wdyt? Would it be better to whitelist here or blacklist?

@weiatwork
Copy link
Contributor Author

@sopel39 In my latest revision, there is still one failing case - the CAST-wrapped CASE: #20652 (comment)

In my previous approach, this case was covered, because of the "early wrapping" in extractCaseAggregation when dealing with the original aggregations. In this new approach, we can only see the unwrapped case aggregations and unaware of any original outer CAST.

I'm thinking of creating a mini-rule, something like PushCastIntoRow, that pushes CAST into Case. Wdyt?

CAST(CASE WHEN c1 = 1 THEN c2 END AS INT)

                    ||
                    \/

CASE WHEN c1 = 1 THEN CAST(c2 AS INT) END

@sopel39
Copy link
Member

sopel39 commented Mar 1, 2024

looks much easier, small nits

@sopel39 sopel39 merged commit baf4efe into trinodb:master Mar 4, 2024
92 checks passed
@sopel39
Copy link
Member

sopel39 commented Mar 4, 2024

thx!

@github-actions github-actions bot added this to the 440 milestone Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants