Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for Predicate Columns #53567

Closed
Rustin170506 opened this issue May 27, 2024 · 1 comment
Closed

Tracking Issue for Predicate Columns #53567

Rustin170506 opened this issue May 27, 2024 · 1 comment
Labels
sig/planner SIG: Planner type/feature-request Categorizes issue or PR as related to a new feature.

Comments

@Rustin170506
Copy link
Member

Feature Request

Is your feature request related to a problem? Please describe:
The ANALYZE statement would collect the statistics of all columns currently. If the table is big and wide, executing ANALYZE would consume lots of time, memory, and CPU. See #27358 for details.
However, only the statistics of some columns are used in creating query plans, while the statistics of others are not. Predicate columns are those columns whose statistics are used in query plans, usually in where conditions, join conditions, and so on. If ANALYZE only collects statistics for predicate columns and indexed columns (statistics of indexed columns are important for index selection), the cost of ANALYZE can be reduced.

Describe the feature you'd like:
See: #53511
Describe alternatives you've considered:
None
Teachability, Documentation, Adoption, Migration Strategy:
None

@Rustin170506 Rustin170506 added type/feature-request Categorizes issue or PR as related to a new feature. sig/planner SIG: Planner labels May 27, 2024
@Rustin170506
Copy link
Member Author

Rustin170506 commented Jul 18, 2024

How do we deal with the outdated stats?

  1. Create a tidb cluster with the latest master code: tiup playground nightly --db.binpath /Volumes/t7/code/tidb/bin/tidb-server
  2. Create a table with three columns:
CREATE TABLE example_table
(
    id   INT PRIMARY KEY,
    name VARCHAR(255),
    age  INT
);

INSERT INTO example_table (id, name, age)
VALUES (1, 'Alice', 30),
       (2, 'Bob', 25),
       (3, 'Charlie', 35),
       (4, 'Diana', 28),
       (5, 'Evan', 40),
       (6, 'Fiona', 22),
       (7, 'George', 33),
       (8, 'Hannah', 29),
       (9, 'Ian', 31),
       (10, 'Jenny', 27);
  1. Wait for the stats meta:
SELECT * FROM mysql.stats_meta;
+------------------+--------+------------+-----+--------+
|version           |table_id|modify_count|count|snapshot|
+------------------+--------+------------+-----+--------+
|451225860582932481|104     |10          |10   |0       |
+------------------+--------+------------+-----+--------+
  1. Disable tidb_persist_analyze_options: set global tidb_persist_analyze_options='OFF'
  2. Analyze table with all columns: ANALYZE TABLE example_table ALL COLUMNS
  3. Check stats:
SELECT * FROM mysql.stats_meta;
+------------------+--------+------------+-----+------------------+
|version           |table_id|modify_count|count|snapshot          |
+------------------+--------+------------+-----+------------------+
|451225906812289034|104     |0           |10   |451225906799181830|
+------------------+--------+------------+-----+------------------+
+--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+
|table_id|is_index|hist_id|distinct_count|null_count|tot_col_size|modify_count|version           |cm_sketch|stats_ver|flag|correlation         |last_analyze_pos|

SELECT * FROM mysql.stats_histograms;
+--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+
|104     |0       |1      |10            |0         |80          |0           |451225906812289034|null     |2        |1   |1                   |null            |
|104     |0       |2      |10            |0         |59          |0           |451225906812289034|null     |2        |1   |1                   |null            |
|104     |0       |3      |10            |0         |80          |0           |451225906812289034|null     |2        |1   |-0.05454545454545454|null            |
+--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+
  1. Insert more data:
INSERT INTO example_table (id, name, age)
VALUES (1, 'Alice', 30),
       (2, 'Bob', 25),
       (3, 'Charlie', 35),
       (4, 'Diana', 28),
       (5, 'Evan', 40),
       (6, 'Fiona', 22),
       (7, 'George', 33),
       (8, 'Hannah', 29),
       (9, 'Ian', 31),
       (10, 'Jenny', 27);
  1. Select data with column a: SELECT * FROM example_table WHERE name = '';
  2. Wait for the column usage:
SELECT * FROM mysql.column_stats_usage;
+--------+---------+-------------------+-------------------+
|table_id|column_id|last_used_at       |last_analyzed_at   |
+--------+---------+-------------------+-------------------+
|104     |1        |null               |2024-07-18 16:10:32|
|104     |2        |2024-07-18 16:14:19|2024-07-18 16:10:32|
|104     |3        |null               |2024-07-18 16:10:32|
+--------+---------+-------------------+-------------------+
  1. Analyze the table without all columns syntax: ANALYZE TABLE example_table;
  2. Check the stats again:
SELECT * FROM mysql.stats_histograms;
+--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+
|table_id|is_index|hist_id|distinct_count|null_count|tot_col_size|modify_count|version           |cm_sketch|stats_ver|flag|correlation         |last_analyze_pos|
+--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+
|104     |0       |3      |10            |0         |80          |0           |451225906812289034|null     |2        |1   |-0.05454545454545454|null            |
|104     |0       |1      |10            |0         |80          |0           |451226051173679118|null     |2        |1   |1                   |null            |
|104     |0       |2      |10            |0         |59          |0           |451226051173679118|null     |2        |1   |1                   |null            |
+--------+--------+-------+--------------+----------+------------+------------+------------------+---------+---------+----+--------------------+----------------+

As you can see column 3 still retains the old stats.
12. Check we don't delete the old TopN for column 3:

SELECT * FROM mysql.stats_top_n WHERE hist_id =3;
+--------+--------+-------+--------------------+-----+
|table_id|is_index|hist_id|value               |count|
+--------+--------+-------+--------------------+-----+
|104     |0       |3      |0x038000000000000028|1    |
|104     |0       |3      |0x038000000000000023|1    |
|104     |0       |3      |0x038000000000000021|1    |
|104     |0       |3      |0x03800000000000001F|1    |
|104     |0       |3      |0x03800000000000001E|1    |
|104     |0       |3      |0x03800000000000001D|1    |
|104     |0       |3      |0x03800000000000001C|1    |
|104     |0       |3      |0x03800000000000001B|1    |
|104     |0       |3      |0x038000000000000019|1    |
|104     |0       |3      |0x038000000000000016|1    |
+--------+--------+-------+--------------------+-----+
  1. Wait for a GC window(15min) and check the TopN again:
+--------+--------+-------+--------------------+-----+
|table_id|is_index|hist_id|value               |count|
+--------+--------+-------+--------------------+-----+
|104     |0       |3      |0x038000000000000028|1    |
|104     |0       |3      |0x038000000000000023|1    |
|104     |0       |3      |0x038000000000000021|1    |
|104     |0       |3      |0x03800000000000001F|1    |
|104     |0       |3      |0x03800000000000001E|1    |
|104     |0       |3      |0x03800000000000001D|1    |
|104     |0       |3      |0x03800000000000001C|1    |
|104     |0       |3      |0x03800000000000001B|1    |
|104     |0       |3      |0x038000000000000019|1    |
|104     |0       |3      |0x038000000000000016|1    |
+--------+--------+-------+--------------------+-----+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/planner SIG: Planner type/feature-request Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

1 participant