Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sql): add option to enable/disable select merging #9065

Merged
merged 1 commit into from
Apr 30, 2024

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Apr 26, 2024

Add a global option to enable or disable merging select statements where possible. This can serve as an escape route if the optimization fails. It is currently enabled by default, we may change to make it opt-in.

xref #9064 and #9058

Closes #9058.

@cpcloud
Copy link
Member

cpcloud commented Apr 26, 2024

I'm not sure why we should keep select merging around at all. Does it have any non-aesthetic effect or benefit?

@kszucs
Copy link
Member Author

kszucs commented Apr 26, 2024

I'm not sure why we should keep select merging around at all. Does it have any non-aesthetic effect or benefit?

It reduces the query size. Also the user facing compiled query is much more readable with fusing enabled.

@cpcloud
Copy link
Member

cpcloud commented Apr 26, 2024

It reduces the query size. Also the user facing compiled query is much more readable with fusing enabled.

We don't have any evidence that either of these things is worth the bugs we've already had to fix with merging selects, and are likely to continue having to fix if we keep it around.

We do have a bunch of recent examples of correctness bugs with it enabled though and IMO correctness is more important than query size and readability of generated SQL.

@cpcloud
Copy link
Member

cpcloud commented Apr 26, 2024

Why not wait for issues to come in for the others? If we never have to add these optimizations, then that's code we don't have to maintain/support/test, etc.

@cpcloud
Copy link
Member

cpcloud commented Apr 26, 2024

I feel pretty strongly about this given all the time and effort we have spent to avoid these bugs from the previous implementation.

It seems like if we keep merging around we're back to having to whack-a-mole incorrect optimizations yet again.

Personally I don't want to spend time on that anymore.

I think we're in a really great place with our core abstractions otherwise, and the work you've done on it has really moved the whole project forward.

@cpcloud
Copy link
Member

cpcloud commented Apr 26, 2024

I'm not saying we should never explore this problem, only that we should avoid enabling optimizations--right before then next release--that we have seen are at high risk for correctness issues when we can simply disable them.

@kszucs
Copy link
Member Author

kszucs commented Apr 27, 2024

This PR is about providing an option to enable/disable the select merges. I would prefer to discuss the default value in the relevant issue to not scatter the discussion.

@kszucs
Copy link
Member Author

kszucs commented Apr 29, 2024

Closing in favor of #9070

@kszucs kszucs closed this Apr 29, 2024
@kszucs kszucs reopened this Apr 29, 2024
@cpcloud
Copy link
Member

cpcloud commented Apr 29, 2024

I took the weekend to think about some of the issues here and did a bit of research on the SQL standard specification of predicate short-circuiting.

  • According to this SO Q&A the SQL standard does not specify the evaluation order of predicates. That means that an engine is free to reorder (or not) predicates in any way AFAICT, including if it would change the semantics of the query.

For example, an engine can evaluate a fallible expression first, even if a subsequent expression with short circuiting would avoid the fallible expression evaluation failing.

In practice, DuckDB merges a fallible cast with an infallible operation in the same way producing the exact same plan regardless of how the SQL is specified:

D explain select * from (select * from t where cast(x as int) != 0) where x <> '*';

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ ((x != '*') AND (CAST(x AS│
│       INTEGER) != 0))     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             x             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 5           │
└───────────────────────────┘
D explain select * from (select * from t where x <> '*') where cast(x as int) != 0;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ ((x != '*') AND (CAST(x AS│
│       INTEGER) != 0))     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             x             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 5           │
└───────────────────────────┘

This means that even if Ibis decided to generate subqueries that map one-to-one to SQL selects it provides absolutely zero guarantee that generating the SQL in that way produces specific behavior.

IMO this means that Ibis should be free to merge predicates, assuming our implementation of that process is correct.

Now, I'm not trying to gloss over the fact that we have had a number of issues with select merging, and so I think we should provide a way to disable it if it can potentially help unblock users.

I think the opt-in behavior would be kind of pointless, since people will never opt-in without lots of documentation and if we're going to go that route we should disable it.

On the other hand, if we think this is long-term a thing we want enabled, then we definitely need feedback on it from users. The only way to get that feedback AFAICT is to enable this feature.

That does mean we have to accept that the current implementation may not be 100% finished and will elicit bug reports.

The opt-out behavior IMO seems like an acceptable balance here: people can disable select merging on an as-needed basis, and we can continuously evaluate whether keeping it around is causing overhead that isn't worth it.

Disabling select merging it turns out is also not without issue: SQLite fails due to a query we generate being too large for the parser stack, and MS SQL loses the ability to specify order by in many cases as well as LIKE functionality in certain positions.

+1 on merging this PR and cutting a release after #8917 is merged.

@cpcloud cpcloud added this to the 9.0 milestone Apr 29, 2024
@cpcloud cpcloud added feature Features or general enhancements sql Backends that generate SQL labels Apr 29, 2024
@cpcloud cpcloud merged commit 4bc9314 into ibis-project:main Apr 30, 2024
169 checks passed
@cpcloud cpcloud deleted the option-for-select-fusion branch April 30, 2024 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements sql Backends that generate SQL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: merging selections combines filters in incorrect way
2 participants