Speed up IN list types analysis #11907

wendigo · 2022-04-11T22:03:57Z

Description

Is this change a fix, improvement, new feature, refactoring, or other?

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

How would you describe this change to a non-technical end user or system administrator?

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

wendigo · 2022-04-11T22:04:26Z

#5950

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java

core/trino-main/src/main/java/io/trino/sql/planner/TypeAnalyzer.java

core/trino-main/src/main/java/io/trino/sql/planner/plan/SimplePlanRewriter.java

testing/trino-tests/src/test/java/io/trino/tests/TestLocalQueries.java

kasiafi

I have some thoughts on the ExpressionAnalyzer part. Please give me a chance to comment, I'll hopefully get back to it tomorrow.

wendigo · 2022-04-12T16:52:12Z

Together with #11902 it yields following performance improvement:

findepi · 2022-04-12T20:06:50Z

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java

+            // We need an explicit copy to avoid ConcurrentModificationException
+            Set<Type> types = typeExpressions.keySet();


i don't see explicit copy here.

findepi · 2022-04-12T20:08:29Z

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java

+
+        private void addOrReplaceExpressionsCoercion(List<Expression> expressions, Type type, Type superType)
+        {
+            Map<NodeRef<Expression>, Type> expressionRefs = expressions


"expressionRefs"?

(the name could be OK for a list, but that's a Map)

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java

findepi · 2022-04-12T20:12:05Z

core/trino-main/src/main/java/io/trino/sql/planner/TypeAnalyzer.java

@@ -45,6 +55,14 @@
    private final PlannerContext plannerContext;
    private final StatementAnalyzerFactory statementAnalyzerFactory;

+    private final NonEvictableCache<CacheIdentityKey, ExpressionAnalyzer> analyzerCache = buildNonEvictableCache(


ExpressionAnalyzer is a ton of mutable state, it's totally insuitable for caching & sharing.

That was proposed by @sopel39 . My previous version was actually caching only Map<RefNode<Expression>, Type>

Type is immutable, so it was definitely a better idea

(not judging whether this particular cache in this place is required)

@sopel39 wdyt?

ExpressionAnalyzer is a ton of mutable state, it's totally insuitable for caching & sharing.

I don't think it's a problem since it only adds new information to analysis. It already caches via io.trino.sql.analyzer.ExpressionAnalyzer.Visitor#process.

That was proposed by @sopel39 . My previous version was actually caching only Map<RefNode, Type>

IIRC that map was not passed to ExpressionAnalyzer hence it wouldn't work with sub-expressions

Anyway, for me planning is a process so keeping ExpressionAnalyzer is not really an issue. Actually, it's probably desirable since I want to cache ExpressionInterpreter too and it needs to be able to fetch type information for new expressions.

I strongly opt for caching Map<RefNode, Type>, as that's what you actually need here.

The ExpressionAnalyzer was not designed to be created in multiple instances throughout the Planner. It is intended for the Analysis phase, where it provides different kind of information to the Planner via the Analysis object. If you need to analyze expressions later for some reason, there's the analyzeExpressions method, and createConstantAnalyzer method, depending on your use case. Those methods give you access to the "analyzing" capability of the ExpressionAnalyzer while they hide the complexity which is not relevant at that point.

The ExpressionAnalyzer was not designed to be created in multiple instances throughout the Planner.

@kasiafi This is exactly what is happening right now. Every io.trino.sql.planner.TypeAnalyzer#getTypes call creates a new ExpressionAnalyzer instance (via analyzeExpressions) and does a full analysis. Overall it's expensive if done repeatedly.
We want to reduce that cost by keeping ExpressionAnalyzer instance during planning, which should be fine since planning is a process rather than moving from one immutable state to the other.

I strongly opt for caching Map<RefNode, Type>, as that's what you actually need here.

I would like to keep ExpressionInterpreter too. For that I need a utility to return type for a (sub)expression rather than analyzing entire expression to get full Map<NodeRef<Expression>, Type> map. In that context keeping ExpressionAnalyzer is more natural.

findepi

"Cache expensive TypeAnalyzer instance creation and reduce type analys…
…is cost"

core/trino-main/src/main/java/io/trino/sql/planner/plan/SimplePlanRewriter.java

core/trino-main/src/main/java/io/trino/sql/planner/TypeAnalyzer.java

findepi · 2022-04-13T06:56:37Z

core/trino-main/src/main/java/io/trino/sql/planner/TypeAnalyzer.java

+                .orElse(session.getQueryId().getId());
+    }
+
+    private static class CacheIdentityKey


Can you describe the meaning of the class's name?

core/trino-main/src/main/java/io/trino/sql/planner/TypeAnalyzer.java

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java

sopel39 · 2022-04-13T08:36:11Z

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java

+                typeExpressions.put(process(expression, context), expression);
+            }
+
+            // We need an explicit copy to avoid ConcurrentModificationException


remove obsolete comment

sopel39 · 2022-04-13T08:37:43Z

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java

@@ -2592,13 +2594,22 @@ private Type coerceToSingleType(StackableAstVisitorContext<Context> context, Str

        private void addOrReplaceExpressionCoercion(Expression expression, Type type, Type superType)
        {
-            NodeRef<Expression> ref = NodeRef.of(expression);
-            expressionCoercions.put(ref, superType);


List.of(expression) -> ImmutableList.of(expression)

sopel39 · 2022-04-13T09:06:15Z

core/trino-main/src/main/java/io/trino/sql/planner/TypeAnalyzer.java

@@ -45,6 +55,14 @@
    private final PlannerContext plannerContext;
    private final StatementAnalyzerFactory statementAnalyzerFactory;

+    private final NonEvictableCache<CacheIdentityKey, ExpressionAnalyzer> analyzerCache = buildNonEvictableCache(


The ExpressionAnalyzer was not designed to be created in multiple instances throughout the Planner.

@kasiafi This is exactly what is happening right now. Every io.trino.sql.planner.TypeAnalyzer#getTypes call creates a new ExpressionAnalyzer instance (via analyzeExpressions) and does a full analysis. Overall it's expensive if done repeatedly.
We want to reduce that cost by keeping ExpressionAnalyzer instance during planning, which should be fine since planning is a process rather than moving from one immutable state to the other.

I strongly opt for caching Map<RefNode, Type>, as that's what you actually need here.

I would like to keep ExpressionInterpreter too. For that I need a utility to return type for a (sub)expression rather than analyzing entire expression to get full Map<NodeRef<Expression>, Type> map. In that context keeping ExpressionAnalyzer is more natural.

sopel39 · 2022-04-13T09:09:42Z

core/trino-main/src/main/java/io/trino/sql/planner/TypeAnalyzer.java

@@ -45,6 +55,14 @@
    private final PlannerContext plannerContext;
    private final StatementAnalyzerFactory statementAnalyzerFactory;

+    private final NonEvictableCache<CacheIdentityKey, ExpressionAnalyzer> analyzerCache = buildNonEvictableCache(
+            CacheBuilder.newBuilder()
+                    // Try to evict entries as soon as possible to keep cache relatively small


IMO TypeAnalyzer should be created per query instance and possibly extracted as an interface. It should be then created and used in io.trino.sql.planner.LogicalPlanner#plan. No NonEvictableCache<CacheIdentityKey, ExpressionAnalyzer> analyzerCache is needed then

sopel39 · 2022-04-13T09:10:12Z

testing/trino-tests/src/test/java/io/trino/tests/TestLocalQueries.java

+    @Override
+    protected Object[][] largeInValuesCountData()
+    {
+        return new Object[][] {


We also need to add a case where IN list is in subexpression, e.g: x IS NULL or y IN (...)

sopel39 · 2022-04-13T09:13:08Z

core/trino-main/src/main/java/io/trino/sql/planner/TypeAnalyzer.java

-                .getExpressionTypes();
+        try {
+            ExpressionAnalyzer analyzer = analyzerCache.get(new CacheIdentityKey(getIdFromSession(session), expressions), () -> createExpressionAnalyzer(session, plannerContext, statementAnalyzerFactory, inputTypes));
+            return analyzeExpressions(analyzer, expressions).getExpressionTypes();


Just call ExpressionAnalyzer#analyze directly, no need to go through ExpressionAnalysis

kasiafi · 2022-04-13T10:31:47Z

This is exactly what is happening right now. Every io.trino.sql.planner.TypeAnalyzer#getTypes call creates a new ExpressionAnalyzer instance (via analyzeExpressions) and does a full analysis. Overall it's expensive if done repeatedly.

Wouldn't caching Map<NodeRef<Expression>, Type> solve the issue? We could even entirely avoid re-analyzing expressions if we captured all the expressionTypes established by the ExpresionAnalyzer in the Analysis phase, and initially fed the cache with them.
There could be some nuance around the NodeRef though, if the expression was for some reason swapped for identical copy.

For that I need a utility to return type for a (sub)expression rather than analyzing entire expression to get full Map<NodeRef, Type> map.

Type for a (sub)expression would be there in the cache.

I would like to keep ExpressionInterpreter too.

Why? Is that to avoid re-evaluating constant expressions? We could cache Expression -> Object.

sopel39 · 2022-04-13T10:38:52Z

Type for a (sub)expression would be there in the cache.

Why would it be in cache? It can be a new instance of expression (e.g. symbol changes). Bringing new instance of ExpressionAnalyzer just to analyze an expression seems like an overkill.

Why? Is that to avoid re-evaluating constant expressions? We could cache Expression -> Object.

Yes. We do that repetitively in planning (e.g. with simplifyExpressions). Why to externalize cache from io.trino.sql.planner.ExpressionInterpreter rather than keeping io.trino.sql.planner.ExpressionInterpreter instance? Note that io.trino.sql.planner.ExpressionInterpreter already has some cache internally

kasiafi · 2022-04-13T10:41:50Z

In my opinion, keeping the ExpressionAnalyzer is much more acceptable than creating and caching many ExpressionAnalyzers.

However, I still consider it kind of "abstraction leak". The ExpressionAnalyzer is here to perform correctness checks, determine coercions, etc on the pre-planning phase. Computing types is kind of "implementation detail". We learned to reuse this capability throughout the Planner, because we found that we need to know the types over and over. The new IR should definitely be enhanced with types (and also with pre-computed constants).

While we currently reuse the ExpressionAnalyzer for getting the expression types, we should try to isolate this capability from all the other work that ExpressionAnalyzer does. For that reason, we use the public methods rather than the constructor, which also involves irrelevant parts, like the Analysis.

kasiafi · 2022-04-13T11:05:38Z

I'm OK with having a persistent (per query) ExpressionInterpreter with caching.
Regarding the ExpressionAnalyzer, we could extract just the part we need, that is - pure type analysis and the cache, and let it live through the Planning phase.

Creating a new instance of InPredicate would cause expression type cache miss, which is using node reference as a cache key.

sopel39 · 2022-04-13T12:53:43Z

Regarding the ExpressionAnalyzer, we could extract just the part we need, that is - pure type analysis and the cache, and let it live through the Planning phase.

@kasiafi What does it mean in practice? You mean you would like to have something like ExpressionTypeAnalyzer that just populates io.trino.sql.analyzer.ExpressionAnalyzer#expressionTypes? Is it practical to split ExpressionAnalyzer like that?

kasiafi · 2022-04-13T13:29:50Z

@kasiafi What does it mean in practice? You mean you would like to have something like ExpressionTypeAnalyzer that just populates io.trino.sql.analyzer.ExpressionAnalyzer#expressionTypes? Is it practical to split ExpressionAnalyzer like that?

Yes, that was my thinking.
It might not be practical, because in the ExpressionAnalyzer the type analysis logic is intermixed with correctness checks, recording coercions, subqueries etc. Also, some kinds of expressions, e.g. SubqueryExpression, should not be present at later phases of planning.

However, I still think that caching Expression -> Type should be sufficient, and occasional creation of new ExpressionAnalyzer is better than "pulling" it throughout the Optimizer. And to make it the most "occasional", we should consider preserving the NodeRefs whenever possible (mostly, in ExpressionInterpreter) instead of creating identical copies of Expressions.

sopel39 · 2022-04-13T13:41:58Z

However, I still think that caching Expression -> Type should be sufficient, and occasional creation of new ExpressionAnalyzer is better than "pulling" it throughout the Optimizer.

Why it would be better? You end up performing all these correctness checks anyway, so why to pretend we don't use ExpressionAnalyzer? Is splitting ExpressionAnalyzer something we could do later?

And to make it the most "occasional", we should consider preserving the NodeRefs whenever possible (mostly, in ExpressionInterpreter) instead of creating identical copies of Expressions.

I don't like that approach mostly because getType call becomes expensive with this indirect approach for what would otherwise be a swift ExpressionAnalyzer#analyze call

.

findepi · 2022-04-13T15:35:53Z

I dismissed my review. @martint ptal if you happen to have time.

bitsondatadev · 2022-11-19T07:31:22Z

👋 @wendigo - this PR has become inactive. If you're still interested in working on it, please let us know.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

cla-bot bot added the cla-signed label Apr 11, 2022

wendigo force-pushed the serafin/speedup-type-analysis branch 2 times, most recently from 567a739 to d7ea68d Compare April 11, 2022 22:50

wendigo requested a review from sopel39 April 11, 2022 22:51

wendigo changed the title ~~WIP: Speed up IN list types analysis~~ Speed up IN list types analysis Apr 12, 2022

sopel39 reviewed Apr 12, 2022

View reviewed changes

kasiafi reviewed Apr 12, 2022

View reviewed changes

wendigo force-pushed the serafin/speedup-type-analysis branch from d7ea68d to 10fee6e Compare April 12, 2022 16:49

wendigo requested a review from sopel39 April 12, 2022 16:50

wendigo force-pushed the serafin/speedup-type-analysis branch 2 times, most recently from 8c739ca to b67f405 Compare April 12, 2022 18:37

Pass TypeAnalyzer singleton in LocalQueryRunner

e62df96

wendigo force-pushed the serafin/speedup-type-analysis branch from b67f405 to 0452821 Compare April 12, 2022 18:38

findepi reviewed Apr 12, 2022

View reviewed changes

findepi previously requested changes Apr 12, 2022

View reviewed changes

findepi reviewed Apr 12, 2022

View reviewed changes

core/trino-main/src/main/java/io/trino/sql/planner/plan/SimplePlanRewriter.java Show resolved Hide resolved

findepi reviewed Apr 13, 2022

View reviewed changes

kasiafi reviewed Apr 13, 2022

View reviewed changes

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java Outdated Show resolved Hide resolved

core/trino-main/src/main/java/io/trino/sql/analyzer/ExpressionAnalyzer.java Outdated Show resolved Hide resolved

sopel39 reviewed Apr 13, 2022

View reviewed changes

wendigo added 3 commits April 13, 2022 14:52

Speed up coercion to a single type for IN predicates

98c881b

Generate unique query id in the tests

16e9ad9

Cache resolved expression types in TypeAnalyzer

3963111

wendigo and others added 2 commits April 13, 2022 14:53

Remove expensive stream collection for known list size

49d2449

Do not create a new instance of InPredicate if it doesn't simplify

d94e061

Creating a new instance of InPredicate would cause expression type cache miss, which is using node reference as a cache key.

wendigo force-pushed the serafin/speedup-type-analysis branch from 0452821 to d94e061 Compare April 13, 2022 19:43

martint self-requested a review June 14, 2022 20:26

wendigo closed this Feb 20, 2023

wendigo deleted the serafin/speedup-type-analysis branch February 20, 2023 15:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up IN list types analysis #11907

Speed up IN list types analysis #11907

wendigo commented Apr 11, 2022

wendigo commented Apr 11, 2022

kasiafi left a comment

wendigo commented Apr 12, 2022

findepi Apr 12, 2022

findepi Apr 12, 2022

findepi Apr 12, 2022

wendigo Apr 12, 2022

findepi Apr 12, 2022

wendigo Apr 12, 2022

sopel39 Apr 12, 2022 •

edited

Loading

sopel39 Apr 12, 2022

kasiafi Apr 13, 2022

sopel39 Apr 13, 2022 •

edited

Loading

findepi left a comment

findepi Apr 13, 2022

sopel39 Apr 13, 2022

sopel39 Apr 13, 2022

sopel39 Apr 13, 2022 •

edited

Loading

sopel39 Apr 13, 2022

sopel39 Apr 13, 2022

sopel39 Apr 13, 2022

kasiafi commented Apr 13, 2022

sopel39 commented Apr 13, 2022 •

edited

Loading

kasiafi commented Apr 13, 2022

kasiafi commented Apr 13, 2022

sopel39 commented Apr 13, 2022

kasiafi commented Apr 13, 2022

sopel39 commented Apr 13, 2022 •

edited

Loading

findepi commented Apr 13, 2022

bitsondatadev commented Nov 19, 2022

		// We need an explicit copy to avoid ConcurrentModificationException
		Set<Type> types = typeExpressions.keySet();

Speed up IN list types analysis #11907

Speed up IN list types analysis #11907

Conversation

wendigo commented Apr 11, 2022

Description

Related issues, pull requests, and links

Documentation

Release notes

wendigo commented Apr 11, 2022

kasiafi left a comment

Choose a reason for hiding this comment

wendigo commented Apr 12, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sopel39 Apr 12, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sopel39 Apr 13, 2022 • edited Loading

Choose a reason for hiding this comment

findepi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sopel39 Apr 13, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kasiafi commented Apr 13, 2022

sopel39 commented Apr 13, 2022 • edited Loading

kasiafi commented Apr 13, 2022

kasiafi commented Apr 13, 2022

sopel39 commented Apr 13, 2022

kasiafi commented Apr 13, 2022

sopel39 commented Apr 13, 2022 • edited Loading

findepi commented Apr 13, 2022

bitsondatadev commented Nov 19, 2022

sopel39 Apr 12, 2022 •

edited

Loading

sopel39 Apr 13, 2022 •

edited

Loading

sopel39 Apr 13, 2022 •

edited

Loading

sopel39 commented Apr 13, 2022 •

edited

Loading

sopel39 commented Apr 13, 2022 •

edited

Loading