proposal: maintaining histograms in plan. #7605

winoros · 2018-09-04T09:25:55Z

What problem does this PR solve?

A proposal for maintaining histogram information in the plan's operators.

Check List

Tests

- No code

Side effects

- Possible performance regression
- Increased code complexity

zz-jason · 2018-09-05T05:19:45Z

@CaitinChen PTAL, thanks!

CaitinChen · 2018-09-05T02:53:03Z

docs/design/2018-09-04-histograms-in-plan.md

+## Background
+
+Currently, TiDB only uses statistics when deciding which physical scan method a table should use. And TiDB only stores simple statistics in the plan structure. But when deciding the join order and considering some other optimization rules, we need more detailed statistics.
+So we need to maintain the statistics in the plan structure to get  sufficient statistics information to do optimizations.


So we need to maintain the statistics in the plan structure to get sufficient statistics information to do optimizations.

CaitinChen · 2018-09-05T03:24:19Z

docs/design/2018-09-04-histograms-in-plan.md

+
+For `Sort`, we can just copy children's `statsInfo` without doing any change.
+
+For `Limit`, we can just copy children's `statsInfo` or ignore the histogram information. As you know, its execution logic is based on randomization. Hard to maintain the statistics information after it. But we may use the information before it to do some estimation in some scenarios.


For Limit, we can just copy children's statsInfo or ignore the histogram information. As you know, its execution logic is based on randomization. It is hard to maintain the statistics information after Limit. But we may use the information before Limit to do some estimation in some scenarios.
?

CaitinChen · 2018-09-05T03:33:11Z

docs/design/2018-09-04-histograms-in-plan.md

+
+For `Join`, there’re joins as follows:
+
+- Inner join: use histograms to do the row count estimation with the join key condition. Since it won’t have one side filter, we only need to consider the composite filters after considering the join key. We can simply multiply `selectionFactor` if there are other composite filters in our first version of implementation. Since `Selectivity` cannot calculate selectivity of expression that containing multiple column.


-> ... selectivity of an expression that contains multiple columns.
or
-> ... selectivity of an expression containing multiple columns.

CaitinChen · 2018-09-05T03:36:17Z

docs/design/2018-09-04-histograms-in-plan.md

+
+- One side outer join: It depends on the join keys’ NDV. And we can just use histograms to estimate it if there’re non-join-key filters.
+
+- Semi join: It’s something similar to inner join. But no data expanding occurs. When we maintain the range information. We can get a nearly accurate answer of its row count.


... When we maintain the range information, we can get a nearly accurate answer of its row count
or
... But no data expanding occurs when we maintain the range information. We can get a nearly accurate answer of its row count
??

CaitinChen · 2018-09-05T03:37:58Z

docs/design/2018-09-04-histograms-in-plan.md

+
+For `Selection`, just use it to calculate the selectivity. 
+
+For `DataSource`, if it’s a non-partitioned table, we just maintain the map. If it’s a partitioned table, we now only store the statistics of each partition So we need to merge them. We’ll need a cache or something else to ensure that we won’t merge them each time we need it, which will consume tooooo much time and memory space.


For DataSource, if it’s a non-partitioned table, we just maintain the map. If it’s a partitioned table, we now only store the statistics of each partition. So we need to merge them. We need a cache or something else to ensure that we won’t merge them each time we need it, which will consume tooooo much time and memory space.

CaitinChen · 2018-09-05T05:23:17Z

docs/design/2018-09-04-histograms-in-plan.md

+
+### What is the impact of not doing this?
+
+Many cases reported by our customer already prove that we need more accurate statistics to choose a better join order and a proper join algorithm. Only maintaining a number about row count and a slice about ndv is not enough for making that decision.


ndv -> NDV

Note: The terms in an article should be consistent.

CaitinChen · 2018-09-05T05:24:01Z

docs/design/2018-09-04-histograms-in-plan.md

+
+## Implementation
+
+First maintain the histogram in `DataSource`. In this step, there will be some changes in the `statistics` package to make it work. It may take a little long time to do this. [PR#7385](https://github.com/pingcap/tidb/pull/7385)


First, maintain the histogram in DataSource. In this step, there will be some changes in the statistics package to make it work. It may take a little long time to do this. PR#7385

CaitinChen · 2018-09-05T05:26:33Z

docs/design/2018-09-04-histograms-in-plan.md

+
+For `Projection`, change the reflection of the map we maintained.
+
+For `Aggregation`, use the histogram to estimate the ndv of group-by items. If one index cannot cover the group-by item, we’ll multiply the ndv of each group-by column. If the output of `Aggregation` includes group-by columns, we’ll maintain the histogram of them for future use.


ndv -> NDV
"ndv" or "NDV"? Please make your terms consistent.

CaitinChen · 2018-09-05T05:32:23Z

docs/design/2018-09-04-histograms-in-plan.md

+
+I’ve looked into Spark. They did nearly the same thing with what I said. They only maintain the max and min values, no `ranges` information. And they don’t have index, so they only maintain the column’s max/min value which make problem far more much easier.
+
+As for Orca and Calcite, I haven’t discovered where they maintain this information. But there’s something about statistics in Orca’s paper. According to the paper, I think they construct new histogram during planning and cache it for not building to often.


... I think they construct a new histogram during planning and cache it to avoid building too many times.
or
... I think they construct a new histogram during planning and cache it to avoid repeated building.

CaitinChen · 2018-09-05T05:36:41Z

docs/design/2018-09-04-histograms-in-plan.md

+
+And the `expectedCount` we used in physical plan is something same with `Limit`. So the row count modification during physical plan won’t be affected.
+
+After we switch to the cascade like planner. The rule that needs cost to make decision is still a small set of all. And the existence of `Group` can also help us. If we lazily construct the `statsInfo`, this may not be the bottleneck.


After we switch to the cascade-like planner. The rule that needs cost to make a decision is still a small set of all.

winoros · 2018-09-06T07:38:51Z

@CaitinChen I've addressed these comments. PTAL thanks!

CaitinChen

Rest LGTM

CaitinChen · 2018-09-06T08:19:03Z

docs/design/2018-09-04-histograms-in-plan.md

+## Background
+
+Currently, TiDB only uses statistics when deciding which physical scan method a table should use. And TiDB only stores simple statistics in the plan structure. But when deciding the join order and considering some other optimization rules, we need more detailed statistics.
+So we need to maintain the statistics in the plan structure to get sufficient statistics information to do optimizations.


You can click "View" in the upper right corner. Then you can see that these two paragraphs are combined together.
Next time, when you want to write another paragraph, please break a line~ Then this article can be displayed normally.

CaitinChen · 2018-09-06T08:24:40Z

docs/design/2018-09-04-histograms-in-plan.md

+
+For `Aggregation`, we only need to cut off the things which are not in ranges when doing estimation. There is no need to update the ranges information.
+
+For `TopN`, we now have the alibity to maintain histograms of the order-by items.


alibity -> ability

CaitinChen · 2018-09-06T08:26:32Z

docs/design/2018-09-04-histograms-in-plan.md

+
+I’ve looked into Spark. They did nearly the same thing with what I said. They only maintain the max and min values, rather than the `ranges` information. And they don’t have the index, so they only maintain the column’s max/min value which make problem much easier to solve.
+
+As for Orca and Calcite, I haven’t discovered where they maintain this information. But there’s something about statistics in Orca’s paper. According to the paper, I think they construct new histogram during planning and cache it to avoid building too many times.


... I think they construct a new histogram during planning...
or
... I think they construct new histograms during planning...

CaitinChen · 2018-09-06T08:28:35Z

docs/design/2018-09-04-histograms-in-plan.md

+
+And the `expectedCount` we used in physical plan is something same with `Limit`. So the row count modification during physical plan won’t be affected.
+
+After we switch to the cascade-like planner. The rule that needs cost to make decision is still a small set of all. And the existence of `Group` can also help us. If we lazily construct the `statsInfo`, this may not be the bottleneck.


After we switch to the cascade-like planner, the rule that needs cost to make a decision...

CaitinChen · 2018-09-06T08:30:34Z

docs/design/2018-09-04-histograms-in-plan.md

@@ -0,0 +1,115 @@
+# Proposal: Maintain statistics in `Plan`
+
+- Author:     Yiding CUI


You can add the link of your GitHub profile page after your name~

shenli · 2018-09-06T12:07:56Z

@CaitinChen PTAL

CaitinChen

LGTM

zz-jason · 2018-09-07T07:51:44Z

docs/design/2018-09-04-histograms-in-plan.md

+
+The new `statsInfo` of `plan` should be something like the following structure:
+
+```


zz-jason · 2018-09-07T07:53:54Z

docs/design/2018-09-04-histograms-in-plan.md

+
+We maintain the histogram in `Projection`, `Selection`, `Join`, `Aggregation`, `Sort`, `Limit` and `DataSource` operators.
+
+For `Sort`, we can just copy children's `statsInfo` without doing any change.


How about using a separate section to describe how to maintain statistics for each operator, like:

### `Sort`
### `Limit`
### `Project`
### ...

zz-jason · 2018-09-07T08:00:08Z

docs/design/2018-09-04-histograms-in-plan.md

+
+For `Join`, there’re joins as follows:
+
+- Inner join: use histograms to do the row count estimation with the join key condition. Since it won’t have one side filter, we only need to consider the composite filters after considering the join key. We can simply multiply `selectionFactor` if there are other composite filters in our first version of implementation. Since `Selectivity` cannot calculate selectivity of an expression containing multiple columns.


s/`Selectivity`/`Selectivity()`/

How are the statistics of Join derived from the statistics of its children?

zz-jason · 2018-09-07T08:03:06Z

docs/design/2018-09-04-histograms-in-plan.md

+
+For `Projection`, change the reflection of the map we maintained.
+
+For `Aggregation`, use the histogram to estimate the NDV of group-by items. If one index cannot cover the group-by item, we’ll multiply the NDV of each group-by column. If the output of `Aggregation` includes group-by columns, we’ll maintain the histogram of them for future use.


How do we maintain the statistics info for the Aggregate operator?

It may not need to do anything. Just using the child's is okay.

winoros · 2018-09-11T05:16:14Z

I'll update this soon.

winoros · 2018-09-14T08:15:57Z

updated.
It seems that the formulas work well.

zz-jason · 2018-09-17T06:22:15Z

docs/design/2018-09-04-histograms-in-plan.md

+
+We maintain the histogram in `Projection`, `Selection`, `Join`, `Aggregation`, `Sort`, `Limit` and `DataSource` operators.
+
+#### `Sort`


I think we should describe the following things for each operator:

- How to maintain and set the content of the statsInfo struct of the operator based on the statsInfo of its child operator?
  - How to calculate the ndv slice?
  - How to calculate the histColl slice?
  - How to calculate the rangesOfXXX slice?
  - How to calculate the max/min Values map?
- How to estimate the output row count for the operator based on the statsInfo of its child operator?

alivxxx · 2018-10-11T07:44:11Z

docs/design/2018-09-04-histograms-in-plan.md

+
+Where <img alt="$joinKeySelectivity = \frac{1}{NDV(t1.key)}*\frac{1}{NDV(t2.key)}*ndvAfterJoin$" src="svgs/291c9eb6e8db885402c716ffc3e17a65.png?invert_in_darkmode" align="middle" width="466.6166208pt" height="27.7756545pt"/>.
+
+The `ndvAfterJoin` can be <img alt="$min(NDV(t1.key), NDV(t2.key))$" src="svgs/30df1c648fa9fe43985776847c8dbe60.png?invert_in_darkmode" align="middle" width="248.4423216pt" height="24.657534pt"/> or a more detailed one if we can caculate it.


caculate -> calculate
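
For readers following the quoted formula, here is a minimal sketch of `joinKeySelectivity` in Python. It is an illustration only, not TiDB code: the function name, the optional `ndv_after_join` parameter, and the guard for non-positive NDVs are assumptions made for this example.

```python
from typing import Optional


def join_key_selectivity(ndv_left: int, ndv_right: int,
                         ndv_after_join: Optional[int] = None) -> float:
    """joinKeySelectivity = 1/NDV(t1.key) * 1/NDV(t2.key) * ndvAfterJoin."""
    # Assumption: an empty side (NDV <= 0) produces no matches.
    if ndv_left <= 0 or ndv_right <= 0:
        return 0.0
    # Default from the proposal: ndvAfterJoin = min(NDV(t1.key), NDV(t2.key)).
    if ndv_after_join is None:
        ndv_after_join = min(ndv_left, ndv_right)
    return ndv_after_join / (ndv_left * ndv_right)
```

With the `min` default, the expression reduces to `1 / max(NDV(t1.key), NDV(t2.key))`, so `join_key_selectivity(100, 50)` is 0.01.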

alivxxx · 2018-10-11T07:45:51Z

docs/design/2018-09-04-histograms-in-plan.md

+##### One side outer join
+It's almost the same as inner join's behavior. But we need to consider two more thing:
+
+- The unmatched row will be filled as `NULL`. This should be calculated in the new histogram. The null count can be caculated when we estimate the matched count bucket by bucket.


caculated -> calculated

alivxxx · 2018-10-11T07:46:52Z

docs/design/2018-09-04-histograms-in-plan.md

+It's almost the same as inner join's behavior. But we need to consider two more thing:
+
+- The unmatched row will be filled as `NULL`. This should be calculated in the new histogram. The null count can be caculated when we estimate the matched count bucket by bucket.
+- There will be one side filters of the outer table. If the filter is about join key and can be converted to range information, it's can be easily involved when we do the caculation bucket by bucket. Otherwise it's a little hard to deal with it. Don't consider this case currently.


caculation -> calculation

alivxxx · 2018-10-11T07:47:26Z

docs/design/2018-09-04-histograms-in-plan.md

+Same with semi join.
+
+#### `Aggregate`
+Just read the NDV information from the `statsInfo` to dicide the row count after aggregate. If there's index can fully match the group-by items. We just use its NDV. Otherwise we multiply the ndv of each column(or index that can match part of the group-by item).


dicide -> decide.

. We just use -> , we just use?
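
The rule in the quoted `Aggregate` paragraph can be sketched as follows. This is a hedged Python illustration, not the proposal's implementation: the dictionary-based inputs and the function name are hypothetical, the cap at the child's row count is an extra assumption, and the case of an index matching only part of the group-by items is left out.

```python
from typing import Dict, FrozenSet, Sequence


def estimate_group_count(group_by_cols: Sequence[str],
                         column_ndv: Dict[str, int],
                         index_ndv: Dict[FrozenSet[str], int],
                         child_row_count: int) -> int:
    """Estimate the row count after aggregation from NDV information."""
    key = frozenset(group_by_cols)
    if key in index_ndv:
        # An index fully matches the group-by items: use its NDV directly.
        estimate = index_ndv[key]
    else:
        # Otherwise multiply the NDV of each group-by column.
        estimate = 1
        for col in group_by_cols:
            estimate *= column_ndv[col]
    # Assumption: aggregation cannot output more rows than its child produces.
    return min(estimate, child_row_count)
```

Multiplying per-column NDVs treats the columns as independent, so it tends to overestimate when they are correlated. That is why the index NDV is preferred when one is available.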

alivxxx · 2018-10-11T07:48:12Z

docs/design/2018-09-04-histograms-in-plan.md

+We can just copy children's `statsInfo` without doing any change. Since the data distribution is not changed.
+
+#### `Limit`
+Currently we won't maintain hitogram information for it. But it can be considered in the future.


hitogram -> histogram

alivxxx · 2018-10-11T07:59:32Z

docs/design/2018-09-04-histograms-in-plan.md

+<img alt="Step 2" src="./histogram-3.png" width="150pt"/>
+</div>
+
+The calculation inside the bucket can be calculated as this formula <img alt="$selecivity=joinKeySelectivity*RowCount(t1)*RowCount(t2)$" src="svgs/35fa60f709be6b9ab8aa9036bd5e7f7f.png?invert_in_darkmode" align="middle" width="476.19356895pt" height="24.657534pt"/>


selecivity -> selectivity.
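
A small sketch of how the quoted per-bucket formula could be applied, written in Python for illustration. It assumes the two histograms have already been split so that their buckets share the same boundaries and can be paired one to one. The `Bucket` type and both function names are hypothetical, and `ndvAfterJoin` is taken as the `min` of the two bucket NDVs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Bucket:
    lower: int      # inclusive lower bound of the join key range
    upper: int      # inclusive upper bound of the join key range
    row_count: int  # rows falling into this bucket
    ndv: int        # distinct join key values in this bucket


def estimate_bucket_join_rows(left: Bucket, right: Bucket) -> float:
    """joinKeySelectivity * RowCount(t1) * RowCount(t2) for one bucket pair."""
    if left.ndv <= 0 or right.ndv <= 0:
        return 0.0
    ndv_after_join = min(left.ndv, right.ndv)
    selectivity = ndv_after_join / (left.ndv * right.ndv)
    return selectivity * left.row_count * right.row_count


def estimate_join_rows(left: Sequence[Bucket], right: Sequence[Bucket]) -> float:
    """Sum the per-bucket estimates over buckets with identical boundaries."""
    return sum(estimate_bucket_join_rows(l, r) for l, r in zip(left, right))
```

For example, a bucket with 100 rows and 10 distinct keys joined with a bucket of 200 rows and 20 distinct keys gives a selectivity of 10 / (10 * 20) = 0.05 and an estimate of 0.05 * 100 * 200 = 1000 rows.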

alivxxx · 2018-10-11T08:03:39Z

docs/design/2018-09-04-histograms-in-plan.md

+#### `Aggregate`
+Just read the NDV information from the `statsInfo` to dicide the row count after aggregate. If there's index can fully match the group-by items. We just use its NDV. Otherwise we multiply the ndv of each column(or index that can match part of the group-by item).
+
+If some of the group-by items are also in the select field. We will create new histograms modify the `totalCnt` of each bucket(set it the same with `NDV`).


. We will -> , we will?

modify the -> by modifying the?

alivxxx · 2018-10-11T08:05:29Z

docs/design/2018-09-04-histograms-in-plan.md

+If some of the group-by items are also in the select field. We will create new histograms modify the `totalCnt` of each bucket(set it the same with `NDV`).
+
+#### `Sort`
+We can just copy children's `statsInfo` without doing any change. Since the data distribution is not changed.


. Since -> , since?

alivxxx · 2018-10-11T08:06:57Z

docs/design/2018-09-04-histograms-in-plan.md

+
+This struct will be maintained when we call `deriveStats`.
+
+Currently we don't change the histogram itself during planning. Because it will consume a lot of time and memory space. I’ll try to maintain ranges slice or the max/min value to improve the accuracy of row count estimation instead.


Actually, we change the histogram in this proposal?

docs/design/2018-09-04-histograms-in-plan.md

alivxxx · 2018-10-11T08:14:18Z

Actually, we only use the *.png in design/svgs?

winoros · 2018-10-11T12:42:17Z

@lamxTyler Yes, it can be removed

zz-jason · 2018-10-18T07:12:01Z

LGTM

zz-jason · 2018-10-23T04:49:43Z

@lamxTyler @eurekaka PTAL

alivxxx · 2018-10-23T05:31:01Z

@winoros There are still some comments not addressed.

winoros · 2018-10-24T09:20:49Z

@lamxTyler So now it can be reviewed.
The conflict will be resolved when addressing new comments.

docs/design/2018-09-04-histograms-in-plan.md

alivxxx · 2018-10-25T05:22:38Z

It seems the files histogram-3.jpeg and DS_Store are not used.

eurekaka

LGTM

alivxxx

LGTM

proposal: add proposal for maintaining histograms in plan.

e595ef1

winoros added proposal sig/planner SIG: Planner labels Sep 4, 2018

shenli added the component/docs label Sep 5, 2018

shenli changed the title ~~proposal: add proposal for maintaining histograms in plan.~~ proposal: maintaining histograms in plan. Sep 5, 2018

CaitinChen reviewed Sep 5, 2018

address comments.

b55c924

CaitinChen reviewed Sep 6, 2018

address comments

e693210

CaitinChen reviewed Sep 6, 2018

zz-jason reviewed Sep 7, 2018

Change structure.

8e012e6

winoros force-pushed the histogram-proposal branch from 8ad555a to 539c59a Compare September 14, 2018 08:12

add formula.

bdd26fa

winoros force-pushed the histogram-proposal branch from 539c59a to bdd26fa Compare September 14, 2018 08:13

zz-jason reviewed Sep 17, 2018

modify proposal.

45054f0

alivxxx reviewed Oct 11, 2018

address comments

bdd89c8

Merge branch 'master' into histogram-proposal

9349f0a

zz-jason added the status/LGT1 Indicates that a PR has LGTM 1. label Oct 18, 2018

Merge branch 'master' into histogram-proposal

0e66398

winoros added 3 commits October 24, 2018 17:13

address comments

3f932a1

add to doc/design's readme

3f5358b

add imgs back

089a502

eurekaka reviewed Oct 24, 2018

eurekaka reviewed Oct 25, 2018

eurekaka added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Oct 25, 2018

winoros added 2 commits October 25, 2018 19:24

remove unused file

799106f

Merge branch 'master' into histogram-proposal

1d95d0b

alivxxx approved these changes Oct 25, 2018

alivxxx merged commit 1f57184 into pingcap:master Oct 25, 2018

winoros deleted the histogram-proposal branch October 25, 2018 12:09

winoros mentioned this pull request Oct 29, 2018

planner, statistics: maintain histogram for inner join #8097

Closed

