remove input argument for `Expression.ConstItem` #49464

lcwangchao · 2023-12-14T08:20:16Z

Enhancement

The method Expression.ConstItem returns a bool to indicate whether the expression returns a constant ignoring the input row. However, it has an input argument with type *stmtctx.StatementContext.

When we take a closer look at its implementation. We found the input stmtctx is required by Constant:

tidb/pkg/expression/constant.go

Lines 426 to 428 in d076bc1

    
           func (c *Constant) ConstItem(sc *stmtctx.StatementContext) bool { 
        
           	return !sc.UseCache || (c.DeferredExpr == nil && c.ParamMarker == nil) 
        
           }

So this function has two meanings:

When UseCache is true. This returns whether the expression is constant ignoring the EvalContext which means it will return a constant value at any time.
When UseCache is false. This returns whether the expression is constant only with the same EvalContext.

The concern of UseCache field is to make sure some intermediate states in ScalarExpression can be cached correctly. See implement of regexp: https://github.com/pingcap/tidb/blob/master/pkg/expression/builtin_regexp.go . When an expression will be cached in plan cache, it should not cache the intermediate state because some other statements may use it later.

However, it will has some problems:

The semantic of ConstItem is confusing, and the expression is coupled with plan cache heavily.
We cannot cache intermediate states when using plan cache sometimes.
Sometimes we do not care UseCache and just want to know whether the expression returns a constant with same context. But the current implement forces us to check it: approx_percentile report error some time in prepare/execute #49343

Some possible refactors

Remove input argument in `ConstItem` to decouple with plan cache

Make it clear whether the expression is a pure constant or context-level constant.

First option, introduce a new enum ConstLevel to tell the expression's level of const

type ConstLevel unit

const (
    ConstLevelNone ConstLevel = iota
    ConstLevelConext
    ConstLevelStrict
)

type Expression interface {
    // ...
    ConstLevel() ConstLevel
    // ...
}

Another option, introduce two methods to indicate the const property seperately

type Expression interface {
    // ...
    ConstItemStrictly() boll 
    ConstItemContextful() bool
    // ...
}

Make `ScalarFunction` stateless and move cache to context

To make sure we can also cache some states when plan cache enabled, we can move the expression cache from scalar function to context:

func (s *xxxFuncSig) evalInt(ctx EvalContext, row chunk.Row) (int64, bool, error) {
    var state Foo
    if s.args[0].ConstLevel > ConstLevelNone {
        state, ok := ctx.GetCache(s.cacheKey()).(Foo)
        if !ok {
            state = makeState()
            ctx.SetCache(s.cacheKey(), state)
        }
    } else {
        state = makeState()
    }
    // other codes...
}

The method cacheKey() should return a unique string for each scaclar function and we can use a global uint64 to generate it.

Another option is to still cache some states in the scalar function but use a contextID to tell which context it belongs to. For example:

func (s *xxxFuncSig) evalInt(ctx EvalContext, row chunk.Row) (int64, bool, error) {
    var state Foo
    if s.args[0].ConstLevel > ConstLevelNone {
        if s.cacheValid && s.cacheCtxID = ctx.ID() {
            state = s.cachedState
        } else {
            state = makeState()
            s.cachedState = state
        }
    } else {
        state = makeState()
    }
    // other codes...
}

Is seems a little wired, but more efficient than the first way

Tasks

The text was updated successfully, but these errors were encountered:

YangKeao · 2023-12-14T09:36:35Z

For the cache inside expression sig, I think we can distinguish two different usages of the cache:

Statement layer. The cache only works for the current statement. Then it's fine to cache all the Constant.
Across multiple statements. The cache will work across multiple statement (e.g. stored in the plan cache).

The layer 2 cache should be prefered: if the layer 2 cache is nil, try to read from layer 1 cache.

For the first situation, we can simple return true for all Constant. For the second situation, we can return c.DeferredExpr == nil && c.ParamMarker == nil. I think it'll also have better performance:

For example

prepare stmt1 from 'select regexp_replace(a, ?, ?) from t';
set p1='a';
execute stmt1 using @p1,@p1;

If t has a lot of rows, I guess using plan cache will be even slower than executing select regexp_replace(a, @p1, @p1) from t directly because the pattern is not cacheable. If we can distinguish two layers cache, we can also accelerate the performance of this prepared statement.

bb7133 · 2023-12-19T07:25:50Z

First option, introduce a new enum ConstLevel to tell the expression's level of const

I prefer this one since ConstLevelStrict and ConstLevelConext is exclusive. But I don't fully get the point from @YangKeao (for now).

To make sure we can also cache some states when plan cache enabled, we can move the expression cache from scalar function to context:

I prefer this one(s.cacheKey()) since it is easier to understand, ContextID looks confusing.

YangKeao · 2023-12-19T08:48:00Z

But I don't fully get the point from @YangKeao (for now).

@bb7133 Let me try to explain in a clearer way. Now, the cache in expression can take effect in two different scopes:

In a single query. For example, run a regex_replace projection on a lot of rows (e.g. select regex_replace(t.str, ?, 'WORLD') from t), then the constant regex can be compiled only once and run on many rows.
Across many queries through plan cache. For example, we prepare a query with regex_replace projection on a single row, e.g. select regex_replace(t.str, 'HELLO', 'WORLD') from t where t.id = ?. Then in the same session, the user may execute this statement for many times, then this constant regex can be cached and don't need to be compiled multiple times.

The requirement of the second cache is stronger than the first one. A Constant in TiDB can represent a literal, a parameter or a deferred function. All of them can be used in the first cache, because the parameter will never change during the execution of a single query, but only the literal can be used in the second cache, because the user may apply different parameters across different executions.

However, we only have a single cache currently. Therefore, the current implementation has an assumption: the prepared statement can only use the second kind of cache. It's unreasonable and can harm the performance.

For example, the statement select regex_replace(t.str, ?, 'WORLD') from t can run faster if we have the first kind of cache, because we don't need to re-compile the regex (though it's a parameter) several times. However, in the current implementation, if we execute this statement through prepare/execute, it'll become slower, because we cannot cache the regex, as this statement is using plan-cache and the regex is a parameter.

I have discussed with @lcwangchao, and we agreed that our plans don't have much difference technically, and he'd like to only take care of the first kind of cache for the simplicity at first. It may cause some performance degradation (for prepared statement), but it'll be easier for us to take the first step 👍 .

lcwangchao · 2023-12-20T01:56:44Z

I prefer this one(s.cacheKey()) since it is easier to understand, ContextID looks confusing.

@bb7133 In PR #49584, I'm using ContextID way to separate caches because its performance is better. The stmtCache in StatementContext seems to have a coarse-grained and I'm not sure of the performance impaction if we use it.

Any way, the cache logic is in builtinFuncCache and we can change its implement any time if we want.

lcwangchao added type/enhancement The issue or PR belongs to an enhancement. component/expression labels Dec 14, 2023

This was referenced Dec 19, 2023

refactor cache implement for function regexp_substr #49583

Closed

expression: refactor cache logic for builtinRegexpSubstrFuncSig #49584

Merged

lcwangchao closed this as completed Dec 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

remove input argument for `Expression.ConstItem` #49464

remove input argument for `Expression.ConstItem` #49464

lcwangchao commented Dec 14, 2023 •

edited

Loading

YangKeao commented Dec 14, 2023 •

edited

Loading

bb7133 commented Dec 19, 2023

YangKeao commented Dec 19, 2023

lcwangchao commented Dec 20, 2023

remove input argument for Expression.ConstItem #49464

remove input argument for Expression.ConstItem #49464

Comments

lcwangchao commented Dec 14, 2023 • edited Loading

Enhancement

Some possible refactors

Remove input argument in ConstItem to decouple with plan cache

Make ScalarFunction stateless and move cache to context

Tasks

YangKeao commented Dec 14, 2023 • edited Loading

bb7133 commented Dec 19, 2023

YangKeao commented Dec 19, 2023

lcwangchao commented Dec 20, 2023

remove input argument for `Expression.ConstItem` #49464

remove input argument for `Expression.ConstItem` #49464

lcwangchao commented Dec 14, 2023 •

edited

Loading

Remove input argument in `ConstItem` to decouple with plan cache

Make `ScalarFunction` stateless and move cache to context

YangKeao commented Dec 14, 2023 •

edited

Loading