Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up IN list types analysis #11907

Closed
wants to merge 6 commits into from

Conversation

wendigo
Copy link
Contributor

@wendigo wendigo commented Apr 11, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

How would you describe this change to a non-technical end user or system administrator?

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Apr 11, 2022
@wendigo
Copy link
Contributor Author

wendigo commented Apr 11, 2022

#5950

@wendigo wendigo force-pushed the serafin/speedup-type-analysis branch 2 times, most recently from 567a739 to d7ea68d Compare April 11, 2022 22:50
@wendigo wendigo requested a review from sopel39 April 11, 2022 22:51
@wendigo wendigo changed the title WIP: Speed up IN list types analysis Speed up IN list types analysis Apr 12, 2022
Copy link
Member

@kasiafi kasiafi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some thoughts on the ExpressionAnalyzer part. Please give me a chance to comment, I'll hopefully get back to it tomorrow.

@wendigo wendigo force-pushed the serafin/speedup-type-analysis branch from d7ea68d to 10fee6e Compare April 12, 2022 16:49
@wendigo wendigo requested a review from sopel39 April 12, 2022 16:50
@wendigo
Copy link
Contributor Author

wendigo commented Apr 12, 2022

Together with #11902 it yields following performance improvement:

Screenshot 2022-04-12 at 18 36 45

@wendigo wendigo force-pushed the serafin/speedup-type-analysis branch 2 times, most recently from 8c739ca to b67f405 Compare April 12, 2022 18:37
@wendigo wendigo force-pushed the serafin/speedup-type-analysis branch from b67f405 to 0452821 Compare April 12, 2022 18:38
// We need an explicit copy to avoid ConcurrentModificationException
Set<Type> types = typeExpressions.keySet();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't see explicit copy here.


private void addOrReplaceExpressionsCoercion(List<Expression> expressions, Type type, Type superType)
{
Map<NodeRef<Expression>, Type> expressionRefs = expressions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"expressionRefs"?

(the name could be OK for a list, but that's a Map)

@@ -45,6 +55,14 @@
private final PlannerContext plannerContext;
private final StatementAnalyzerFactory statementAnalyzerFactory;

private final NonEvictableCache<CacheIdentityKey, ExpressionAnalyzer> analyzerCache = buildNonEvictableCache(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExpressionAnalyzer is a ton of mutable state, it's totally insuitable for caching & sharing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was proposed by @sopel39 . My previous version was actually caching only Map<RefNode<Expression>, Type>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type is immutable, so it was definitely a better idea

(not judging whether this particular cache in this place is required)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sopel39 wdyt?

Copy link
Member

@sopel39 sopel39 Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExpressionAnalyzer is a ton of mutable state, it's totally insuitable for caching & sharing.

I don't think it's a problem since it only adds new information to analysis. It already caches via io.trino.sql.analyzer.ExpressionAnalyzer.Visitor#process.

That was proposed by @sopel39 . My previous version was actually caching only Map<RefNode, Type>

IIRC that map was not passed to ExpressionAnalyzer hence it wouldn't work with sub-expressions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, for me planning is a process so keeping ExpressionAnalyzer is not really an issue. Actually, it's probably desirable since I want to cache ExpressionInterpreter too and it needs to be able to fetch type information for new expressions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly opt for caching Map<RefNode, Type>, as that's what you actually need here.

The ExpressionAnalyzer was not designed to be created in multiple instances throughout the Planner. It is intended for the Analysis phase, where it provides different kind of information to the Planner via the Analysis object. If you need to analyze expressions later for some reason, there's the analyzeExpressions method, and createConstantAnalyzer method, depending on your use case. Those methods give you access to the "analyzing" capability of the ExpressionAnalyzer while they hide the complexity which is not relevant at that point.

Copy link
Member

@sopel39 sopel39 Apr 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ExpressionAnalyzer was not designed to be created in multiple instances throughout the Planner.

@kasiafi This is exactly what is happening right now. Every io.trino.sql.planner.TypeAnalyzer#getTypes call creates a new ExpressionAnalyzer instance (via analyzeExpressions) and does a full analysis. Overall it's expensive if done repeatedly.
We want to reduce that cost by keeping ExpressionAnalyzer instance during planning, which should be fine since planning is a process rather than moving from one immutable state to the other.

I strongly opt for caching Map<RefNode, Type>, as that's what you actually need here.

I would like to keep ExpressionInterpreter too. For that I need a utility to return type for a (sub)expression rather than analyzing entire expression to get full Map<NodeRef<Expression>, Type> map. In that context keeping ExpressionAnalyzer is more natural.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cache expensive TypeAnalyzer instance creation and reduce type analys…
…is cost"

.orElse(session.getQueryId().getId());
}

private static class CacheIdentityKey
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe the meaning of the class's name?

typeExpressions.put(process(expression, context), expression);
}

// We need an explicit copy to avoid ConcurrentModificationException
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove obsolete comment

@@ -2592,13 +2594,22 @@ private Type coerceToSingleType(StackableAstVisitorContext<Context> context, Str

private void addOrReplaceExpressionCoercion(Expression expression, Type type, Type superType)
{
NodeRef<Expression> ref = NodeRef.of(expression);
expressionCoercions.put(ref, superType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List.of(expression) -> ImmutableList.of(expression)

@@ -45,6 +55,14 @@
private final PlannerContext plannerContext;
private final StatementAnalyzerFactory statementAnalyzerFactory;

private final NonEvictableCache<CacheIdentityKey, ExpressionAnalyzer> analyzerCache = buildNonEvictableCache(
Copy link
Member

@sopel39 sopel39 Apr 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ExpressionAnalyzer was not designed to be created in multiple instances throughout the Planner.

@kasiafi This is exactly what is happening right now. Every io.trino.sql.planner.TypeAnalyzer#getTypes call creates a new ExpressionAnalyzer instance (via analyzeExpressions) and does a full analysis. Overall it's expensive if done repeatedly.
We want to reduce that cost by keeping ExpressionAnalyzer instance during planning, which should be fine since planning is a process rather than moving from one immutable state to the other.

I strongly opt for caching Map<RefNode, Type>, as that's what you actually need here.

I would like to keep ExpressionInterpreter too. For that I need a utility to return type for a (sub)expression rather than analyzing entire expression to get full Map<NodeRef<Expression>, Type> map. In that context keeping ExpressionAnalyzer is more natural.

@@ -45,6 +55,14 @@
private final PlannerContext plannerContext;
private final StatementAnalyzerFactory statementAnalyzerFactory;

private final NonEvictableCache<CacheIdentityKey, ExpressionAnalyzer> analyzerCache = buildNonEvictableCache(
CacheBuilder.newBuilder()
// Try to evict entries as soon as possible to keep cache relatively small
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO TypeAnalyzer should be created per query instance and possibly extracted as an interface. It should be then created and used in io.trino.sql.planner.LogicalPlanner#plan. No NonEvictableCache<CacheIdentityKey, ExpressionAnalyzer> analyzerCache is needed then

@Override
protected Object[][] largeInValuesCountData()
{
return new Object[][] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to add a case where IN list is in subexpression, e.g: x IS NULL or y IN (...)

.getExpressionTypes();
try {
ExpressionAnalyzer analyzer = analyzerCache.get(new CacheIdentityKey(getIdFromSession(session), expressions), () -> createExpressionAnalyzer(session, plannerContext, statementAnalyzerFactory, inputTypes));
return analyzeExpressions(analyzer, expressions).getExpressionTypes();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just call ExpressionAnalyzer#analyze directly, no need to go through ExpressionAnalysis

@kasiafi
Copy link
Member

kasiafi commented Apr 13, 2022

This is exactly what is happening right now. Every io.trino.sql.planner.TypeAnalyzer#getTypes call creates a new ExpressionAnalyzer instance (via analyzeExpressions) and does a full analysis. Overall it's expensive if done repeatedly.

Wouldn't caching Map<NodeRef<Expression>, Type> solve the issue? We could even entirely avoid re-analyzing expressions if we captured all the expressionTypes established by the ExpresionAnalyzer in the Analysis phase, and initially fed the cache with them.
There could be some nuance around the NodeRef though, if the expression was for some reason swapped for identical copy.

For that I need a utility to return type for a (sub)expression rather than analyzing entire expression to get full Map<NodeRef, Type> map.

Type for a (sub)expression would be there in the cache.

I would like to keep ExpressionInterpreter too.

Why? Is that to avoid re-evaluating constant expressions? We could cache Expression -> Object.

@sopel39
Copy link
Member

sopel39 commented Apr 13, 2022

Type for a (sub)expression would be there in the cache.

Why would it be in cache? It can be a new instance of expression (e.g. symbol changes). Bringing new instance of ExpressionAnalyzer just to analyze an expression seems like an overkill.

Why? Is that to avoid re-evaluating constant expressions? We could cache Expression -> Object.

Yes. We do that repetitively in planning (e.g. with simplifyExpressions). Why to externalize cache from io.trino.sql.planner.ExpressionInterpreter rather than keeping io.trino.sql.planner.ExpressionInterpreter instance? Note that io.trino.sql.planner.ExpressionInterpreter already has some cache internally

@kasiafi
Copy link
Member

kasiafi commented Apr 13, 2022

In my opinion, keeping the ExpressionAnalyzer is much more acceptable than creating and caching many ExpressionAnalyzers.

However, I still consider it kind of "abstraction leak". The ExpressionAnalyzer is here to perform correctness checks, determine coercions, etc on the pre-planning phase. Computing types is kind of "implementation detail". We learned to reuse this capability throughout the Planner, because we found that we need to know the types over and over. The new IR should definitely be enhanced with types (and also with pre-computed constants).

While we currently reuse the ExpressionAnalyzer for getting the expression types, we should try to isolate this capability from all the other work that ExpressionAnalyzer does. For that reason, we use the public methods rather than the constructor, which also involves irrelevant parts, like the Analysis.

@kasiafi
Copy link
Member

kasiafi commented Apr 13, 2022

I'm OK with having a persistent (per query) ExpressionInterpreter with caching.
Regarding the ExpressionAnalyzer, we could extract just the part we need, that is - pure type analysis and the cache, and let it live through the Planning phase.

wendigo and others added 2 commits April 13, 2022 14:53
Creating a new instance of InPredicate would cause expression type cache miss,
which is using node reference as a cache key.
@sopel39
Copy link
Member

sopel39 commented Apr 13, 2022

Regarding the ExpressionAnalyzer, we could extract just the part we need, that is - pure type analysis and the cache, and let it live through the Planning phase.

@kasiafi What does it mean in practice? You mean you would like to have something like ExpressionTypeAnalyzer that just populates io.trino.sql.analyzer.ExpressionAnalyzer#expressionTypes? Is it practical to split ExpressionAnalyzer like that?

@kasiafi
Copy link
Member

kasiafi commented Apr 13, 2022

@kasiafi What does it mean in practice? You mean you would like to have something like ExpressionTypeAnalyzer that just populates io.trino.sql.analyzer.ExpressionAnalyzer#expressionTypes? Is it practical to split ExpressionAnalyzer like that?

Yes, that was my thinking.
It might not be practical, because in the ExpressionAnalyzer the type analysis logic is intermixed with correctness checks, recording coercions, subqueries etc. Also, some kinds of expressions, e.g. SubqueryExpression, should not be present at later phases of planning.

However, I still think that caching Expression -> Type should be sufficient, and occasional creation of new ExpressionAnalyzer is better than "pulling" it throughout the Optimizer. And to make it the most "occasional", we should consider preserving the NodeRefs whenever possible (mostly, in ExpressionInterpreter) instead of creating identical copies of Expressions.

@sopel39
Copy link
Member

sopel39 commented Apr 13, 2022

However, I still think that caching Expression -> Type should be sufficient, and occasional creation of new ExpressionAnalyzer is better than "pulling" it throughout the Optimizer.

Why it would be better? You end up performing all these correctness checks anyway, so why to pretend we don't use ExpressionAnalyzer? Is splitting ExpressionAnalyzer something we could do later?

And to make it the most "occasional", we should consider preserving the NodeRefs whenever possible (mostly, in ExpressionInterpreter) instead of creating identical copies of Expressions.

I don't like that approach mostly because getType call becomes expensive with this indirect approach for what would otherwise be a swift ExpressionAnalyzer#analyze call

@findepi
Copy link
Member

findepi commented Apr 13, 2022

I dismissed my review. @martint ptal if you happen to have time.

@wendigo wendigo force-pushed the serafin/speedup-type-analysis branch from 0452821 to d94e061 Compare April 13, 2022 19:43
@martint martint self-requested a review June 14, 2022 20:26
@bitsondatadev
Copy link
Member

👋 @wendigo - this PR has become inactive. If you're still interested in working on it, please let us know.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@wendigo wendigo closed this Feb 20, 2023
@wendigo wendigo deleted the serafin/speedup-type-analysis branch February 20, 2023 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants