Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

report "update partition record fails" error when upgrade from v4.0.16 to v5.2.0 #30489

Open
seiya-annie opened this issue Dec 7, 2021 · 23 comments · May be fixed by #34109
Open

report "update partition record fails" error when upgrade from v4.0.16 to v5.2.0 #30489

seiya-annie opened this issue Dec 7, 2021 · 23 comments · May be fixed by #34109
Assignees
Labels
affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects the 5.4.x(LTS) versions. component/tablepartition This issue is related to Table Partition of TiDB. severity/moderate sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.

Comments

@seiya-annie
Copy link

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

upgrade from v4.0.16 to v5.2.0
after upgrade, run stmtflow test ddl/all.jsonnet test. found following error in log:
2021/12/07 17:08:16.565 +08:00] [INFO] [domain.go:129] ["diff load InfoSchema success"] [currentSchemaVersion=4402] [neededSchemaVersion=4403] ["start time"=898.82µss
] [phyTblIDs="[]"] [actionTypes="[]"]
[2021/12/07 17:08:16.567 +08:00] [INFO] [schema_validator.go:291] ["the schema validator enqueue, queue is too long"] ["delta max count"=1024] ["remove schema version"=1863]
[2021/12/07 17:08:16.630 +08:00] [INFO] [domain.go:129] ["diff load InfoSchema success"] [currentSchemaVersion=4403] [neededSchemaVersion=4404] ["start time"=602.692µ�
s] [phyTblIDs="[]"] [actionTypes="[]"]
[2021/12/07 17:08:16.886 +08:00] [INFO] [domain.go:129] ["diff load InfoSchema success"] [currentSchemaVersion=4404] [neededSchemaVersion=4405] ["start time"=4.286041ms] [phyTblIDs="[2515,2516,2517]"] [actionTypes="[8,8,8]"]
[2021/12/07 17:08:16.965 +08:00] [ERROR] [partition.go:1218] ["update partition record fails"] [message="new record inserted while old record is not removed"] [error="EncodeRow error: data and columnID count not match 4 vs 3"]

2. What did you expect to see? (Required)

no error

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

v5.2.0

@seiya-annie seiya-annie added the type/bug The issue is confirmed as a bug. label Dec 7, 2021
@seiya-annie
Copy link
Author

tidb-5000.tar.gz

@seiya-annie
Copy link
Author

@seiya-annie
Copy link
Author

this tidb node is scale-out after upgrade

@bb7133
Copy link
Member

bb7133 commented Dec 7, 2021

PTAL @mjonss @tiancaiamao

@bb7133 bb7133 added the component/tablepartition This issue is related to Table Partition of TiDB. label Dec 7, 2021
@seiya-annie
Copy link
Author

the testcases report this error:
negative.tar.gz
table_partition__drop_partition_2
table_partition__drop_partition_3
table_partition__truncate_partition_2
table_partition__truncate_partition_3

@bb7133
Copy link
Member

bb7133 commented Dec 8, 2021

@seiya-annie What is the configuration of the cluster? I noticed that 'amend transaction' is enabled but not sure if it is related to this error.

It would be helpful if we can have a configuration for the cluster as well.

@tiancaiamao
Copy link
Contributor

This is the same issue with #28292

It's introduced in 5.1 by the cherry-pick PR #21148

It's a critical problem and not easy to fix, see our internal document https://pingcap.feishu.cn/docs/doccnDrJ22mWwkFi9NkUfzwlADc

@jebter jebter added affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects the 5.4.x(LTS) versions. labels Jan 11, 2022
@xiongjiwei
Copy link
Contributor

dup of #28292

@github-actions
Copy link

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove 'needs-more-info' label.

@seiya-annie
Copy link
Author

also exists in v5.3.1

@seiya-annie
Copy link
Author

--- /tmp/ddl__negative__table_partition__truncate_partition_2.531154131/a 2022-03-22 07:50:37.956115775 +0000
+++ /tmp/ddl__negative__table_partition__truncate_partition_2.531154131/b 2022-03-22 07:50:37.956115775 +0000
@@ -9,7 +9,7 @@
/* txn / begin;
-- txn >> 0 rows affected
/
txn / update t set id = 7 where id = 1;
--- txn >> 2 rows affected
+-- txn >> E1105: EncodeRow error: data and columnID count not match 4 vs 3
/
ddl / alter table t truncate partition p0;
-- ddl >> 0 rows affected
/
txn / commit; -- E8028
2022/03/22 07:50:37 [ddl/all.jsonnet#ddl/negative/table_partition__truncate_partition_2] failed: event#11 mismatch: txn:return(/
txn */ update t set id = 7 where id = 1;): expect a result, got (E1105: EncodeRow error: data and columnID count not match 4 vs 3)
github.com/zyguan/tidb-test-util/cmd/stmtflow/core.(*matchHistory).Assert
/go/tidb-test-util/cmd/stmtflow/core/test.go:113
github.com/zyguan/tidb-test-util/cmd/stmtflow/core.(*Test).Assert
/go/tidb-test-util/cmd/stmtflow/core/test.go:46
github.com/zyguan/tidb-test-util/cmd/stmtflow/command.testOne
/go/tidb-test-util/cmd/stmtflow/command/test.go:117
github.com/zyguan/tidb-test-util/cmd/stmtflow/command.Test.func1
/go/tidb-test-util/cmd/stmtflow/command/test.go:72
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
main.main
/go/tidb-test-util/cmd/stmtflow/main.go:29

@seiya-annie
Copy link
Author

--- /tmp/ddl__negative__table_partition__truncate_partition_3.699959318/a 2022-03-22 07:50:38.351116125 +0000
+++ /tmp/ddl__negative__table_partition__truncate_partition_3.699959318/b 2022-03-22 07:50:38.351116125 +0000
@@ -9,7 +9,7 @@
/* txn / begin;
-- txn >> 0 rows affected
/
txn / update t set id = 2 where id = 5;
--- txn >> 2 rows affected
+-- txn >> E1105: EncodeRow error: data and columnID count not match 4 vs 3
/
ddl / alter table t truncate partition p0;
-- ddl >> 0 rows affected
/
txn / commit; -- E8028
2022/03/22 07:50:38 [ddl/all.jsonnet#ddl/negative/table_partition__truncate_partition_3] failed: event#11 mismatch: txn:return(/
txn */ update t set id = 2 where id = 5;): expect a result, got (E1105: EncodeRow error: data and columnID count not match 4 vs 3)
github.com/zyguan/tidb-test-util/cmd/stmtflow/core.(*matchHistory).Assert
/go/tidb-test-util/cmd/stmtflow/core/test.go:113
github.com/zyguan/tidb-test-util/cmd/stmtflow/core.(*Test).Assert
/go/tidb-test-util/cmd/stmtflow/core/test.go:46
github.com/zyguan/tidb-test-util/cmd/stmtflow/command.testOne
/go/tidb-test-util/cmd/stmtflow/command/test.go:117
github.com/zyguan/tidb-test-util/cmd/stmtflow/command.Test.func1
/go/tidb-test-util/cmd/stmtflow/command/test.go:72
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
main.main
/go/tidb-test-util/cmd/stmtflow/main.go:29
runtime.main

@seiya-annie
Copy link
Author

--- /tmp/ddl__negative__table_partition__drop_partition_3.222860356/a 2022-03-22 07:50:34.110112366 +0000
+++ /tmp/ddl__negative__table_partition__drop_partition_3.222860356/b 2022-03-22 07:50:34.110112366 +0000
@@ -9,7 +9,7 @@
/* txn / begin;
-- txn >> 0 rows affected
/
txn / update t set id = 2 where id = 5;
--- txn >> 2 rows affected
+-- txn >> E1105: EncodeRow error: data and columnID count not match 4 vs 3
/
ddl / alter table t drop partition p0;
-- ddl >> 0 rows affected
/
txn / commit; -- E8028
2022/03/22 07:50:34 [ddl/all.jsonnet#ddl/negative/table_partition__drop_partition_3] failed: event#11 mismatch: txn:return(/
txn */ update t set id = 2 where id = 5;): expect a result, got (E1105: EncodeRow error: data and columnID count not match 4 vs 3)
github.com/zyguan/tidb-test-util/cmd/stmtflow/core.(*matchHistory).Assert
/go/tidb-test-util/cmd/stmtflow/core/test.go:113
github.com/zyguan/tidb-test-util/cmd/stmtflow/core.(*Test).Assert
/go/tidb-test-util/cmd/stmtflow/core/test.go:46
github.com/zyguan/tidb-test-util/cmd/stmtflow/command.testOne
/go/tidb-test-util/cmd/stmtflow/command/test.go:117
github.com/zyguan/tidb-test-util/cmd/stmtflow/command.Test.func1
/go/tidb-test-util/cmd/stmtflow/command/test.go:72
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
main.main
/go/tidb-test-util/cmd/stmtflow/main.go:29
runtime.main

@seiya-annie
Copy link
Author

2022/03/22 07:50:31 [ddl/all.jsonnet#ddl/negative/table_partition__drop_partition_1] passed
--- /tmp/ddl__negative__table_partition__drop_partition_2.291812921/a 2022-03-22 07:50:32.535110970 +0000
+++ /tmp/ddl__negative__table_partition__drop_partition_2.291812921/b 2022-03-22 07:50:32.535110970 +0000
@@ -9,7 +9,7 @@
/* txn / begin;
-- txn >> 0 rows affected
/
txn / update t set id = 7 where id = 1;
--- txn >> 2 rows affected
+-- txn >> E1105: EncodeRow error: data and columnID count not match 4 vs 3
/
ddl / alter table t drop partition p0;
-- ddl >> 0 rows affected
/
txn / commit; -- E8028
2022/03/22 07:50:32 [ddl/all.jsonnet#ddl/negative/table_partition__drop_partition_2] failed: event#11 mismatch: txn:return(/
txn */ update t set id = 7 where id = 1;): expect a result, got (E1105: EncodeRow error: data and columnID count not match 4 vs 3)
github.com/zyguan/tidb-test-util/cmd/stmtflow/core.(*matchHistory).Assert
/go/tidb-test-util/cmd/stmtflow/core/test.go:113
github.com/zyguan/tidb-test-util/cmd/stmtflow/core.(*Test).Assert
/go/tidb-test-util/cmd/stmtflow/core/test.go:46
github.com/zyguan/tidb-test-util/cmd/stmtflow/command.testOne
/go/tidb-test-util/cmd/stmtflow/command/test.go:117
github.com/zyguan/tidb-test-util/cmd/stmtflow/command.Test.func1
/go/tidb-test-util/cmd/stmtflow/command/test.go:72
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
main.main
/go/tidb-test-util/cmd/stmtflow/main.go:29

@seiya-annie seiya-annie reopened this Mar 23, 2022
@seiya-annie
Copy link
Author

reopen it because issue still exist in v6.0.0

@mjonss
Copy link
Contributor

mjonss commented Mar 30, 2022

i'm still analysing this, but I think there are more issues that can trigger this. I moved the check for equal length for row and colIDs to be done in RemoveRecord which than will fail for several other cases, like TestInsertOnDuplicateKey, with a read row longer than number of colIDs. The test is just not using binary log, so that is why it has no issues.

@mjonss
Copy link
Contributor

mjonss commented Mar 30, 2022

These tests are also using set @@tidb_enable_amend_pessimistic_txn = 1 which is not compatible with binlog (according to the docs).
Also there exists bugs with amending pessimistic transactions (not sure if it is related to this): #20996.

I have not found an easy reproducible test case, but the issue can be triggered by QA, but it needs to be reduced for further investigation.

So my current conclusion is that binlog has similar issues (data and columnID count not match), see #33608.
And tidb_enable_amend_pessimistic_txn is not compatible with binlog, so I will stop investing for now.

@bb7133 & @seiya-annie should we keep this open or close it as duplicate on any of the above bugs? If we keep it open, can we lower the severity?

@tiancaiamao
Copy link
Contributor

also exists in v5.3.1

I execute the SQL in table_partition__drop_partition_2.r.sql and table_partition__drop_partition_3.r.sql,
but can't reproduce it.

@mjonss mjonss assigned bb7133 and unassigned tiancaiamao Apr 7, 2022
@bb7133
Copy link
Member

bb7133 commented Apr 19, 2022

I am able to reproduce this issue, it is actually not related to 'upgrade':

Setup a TiDB cluster with 5.4.0 and nightly(I verified on those versions but I believe it can be reproduced for the others) with binlog enabled, for example:

tiup playground nightly --tiflash 0 --db 1 --kv 1 --pd 1 --pump 1

Then:

set @@tidb_enable_clustered_index=off; 
drop table if exists t;
create table t (id int primary key, v int) partition by range (id) (partition p0 values less than (4), partition p_rest values less than maxvalue);
insert into t values (5, 50);
update t set id = 2 where id = 5;

And the following error is reported:

ERROR 1105 (HY000): EncodeRow error: data and columnID count not match 4 vs 3

@mjonss
Copy link
Contributor

mjonss commented Apr 19, 2022

Great! So the row already includes the HandleID, which is then added a second time here, which causes the bug. Still need to make sure if the handle is always included in the row in the situation where !t.meta.PKIsHandle && !t.meta.IsCommonHandle is true.

@bb7133
Copy link
Member

bb7133 commented Apr 19, 2022

Great! So the row already includes the HandleID, which is then added a second time here, which causes the bug. Still need to make sure if the handle is always included in the row in the situation where !t.meta.PKIsHandle && !t.meta.IsCommonHandle is true.

Exactly, it looks like that this code expect the data without HandleID, but this is not the case for partitions when the data is moved from one partition to another, in which AddRecord and RemoveRecord are called.

Still need to make sure if the handle is always included in the row in the situation where !t.meta.PKIsHandle && !t.meta.IsCommonHandle is true.

@tiancaiamao could you confirm this and see if you have any idea on how to fix it?

@mjonss
Copy link
Contributor

mjonss commented Apr 19, 2022

Great! So the row already includes the HandleID, which is then added a second time here, which causes the bug. Still need to make sure if the handle is always included in the row in the situation where !t.meta.PKIsHandle && !t.meta.IsCommonHandle is true.

Exactly, it looks like that this code expect the data without HandleID, but this is not the case for partitions when the data is moved from one partition to another, in which AddRecord and RemoveRecord are called.

It seems it is corrected for normal deletes like this, so I think we can just use that fix too?

@mjonss
Copy link
Contributor

mjonss commented May 5, 2022

Lowered the severity, since binlog is being deprecated, and there are other limitation between binlog and other features.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects the 5.4.x(LTS) versions. component/tablepartition This issue is related to Table Partition of TiDB. severity/moderate sig/sql-infra SIG: SQL Infra type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants