
data lost after scale-in 3x tikv node #2400

Closed
Tammyxia opened this issue Jul 28, 2021 · 6 comments
Labels: area/ticdc, severity/moderate, subject/correctness, type/bug

Tammyxia commented Jul 28, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
  • This is the upstream TiDB cluster:

[screenshot: upstream TiDB cluster topology]

There are 12 TiKV nodes in total, plus 2 CDC servers and 2 changefeeds. The workload writes to the upstream TiDB cluster:
$./bin/go-ycsb load mysql -P ./workloads/oncall2929 -p mysql.host=172.16.6.25 -p mysql.port=4000 -p operationcount=5000000 -p autocommit=1 --threads=200 -p databasename=test123 -p tablename=test0

Starting component cdc: /root/.tiup/components/cdc/v4.0.14/cdc cli changefeed list --pd=http://172.16.6.25:2379
[
{
"id": "kafka-task-23",
"summary": {
"state": "normal",
"tso": 426624137162391554,
"checkpoint": "2021-07-28 11:12:09.483",
"error": null
}
},
{
"id": "replication-task-24",
"summary": {
"state": "normal",
"tso": 426624137162391554,
"checkpoint": "2021-07-28 11:12:09.483",
"error": null
}
}
]

  • Scale in 3 TiKV nodes: tiup cluster scale-in 360UP -N 172.16.6.12:20170, tiup cluster scale-in 360UP -N 172.16.6.13:20171, tiup cluster scale-in 360UP -N 172.16.6.13:20170

  • Check tiup audit log:
    fGcys7dWFBX:%2Froot%2F.tiup%2Fcomponents%2Fcluster%2Fv1.5.1%2Ftiup-cluster scale-in 360UP -N 172.16.6.12%3A20170
    fGcys7dWFBX:2021-07-28T11:15:03.550+0800 INFO Execute command {"command": "tiup cluster scale-in 360UP -N 172.16.6.12:20170"}
    fGcyt3HVY3p:%2Froot%2F.tiup%2Fcomponents%2Fcluster%2Fv1.5.1%2Ftiup-cluster scale-in 360UP -N 172.16.6.13%3A20171
    fGcyt3HVY3p:2021-07-28T11:15:19.819+0800 INFO Execute command {"command": "tiup cluster scale-in 360UP -N 172.16.6.13:20171"}
    fGcyvgT9nRj:%2Froot%2F.tiup%2Fcomponents%2Fcluster%2Fv1.5.1%2Ftiup-cluster scale-in 360UP -N 172.16.6.13%3A20170
    fGcyvgT9nRj:2021-07-28T11:15:38.292+0800 INFO Execute command {"command": "tiup cluster scale-in 360UP -N 172.16.6.13:20170"}

  • Stop the workload I/O and wait for the sync task to complete.

  • Check data consistency.

  2. What did you expect to see?
  • Data is consistent between the upstream and downstream.
  3. What did you see instead?
  • 50 rows of data are lost in the downstream TiDB:

upstream 6.24:
MySQL [test123]> select count(*) from test0;
+-----------+
| count(*)  |
+-----------+
| 333895176 |
+-----------+
1 row in set (8.26 sec)

downstream 6.30:
MySQL [test123]> select count(*) from test0;
+-----------+
| count(*)  |
+-----------+
| 333895126 |
+-----------+
1 row in set (8.96 sec)

  • The table structure is confidential.
  • The lost rows fall within "> select * from test0 where id>'16585095' and id < '16585198';" -> the upstream has 50 rows here, while the downstream has 0 rows.
  • The data-loss datetime in the downstream TiDB is system: 2021-07-28 11:20:02.247 +0800 CST, derived as follows (a sketch of this TSO-to-time conversion follows the version info below):
    $curl http://172.16.6.30:10080/mvcc/key/test123/test0\(p20210508\)/16585000 ->"start_ts": 426624261094637572
    $ tiup ctl:v4.0.14 pd -u http://72.16.6.30:2379 tso 426624261094637572 ->system: 2021-07-28 11:20:02.247 +0800 CST
  4. Versions of the cluster

    • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

      4.0.14
      
    • TiCDC version (execute cdc version):

      4.0.14
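
For reference on how the TSOs above map to wall-clock times: a TiDB/PD TSO stores the physical time (milliseconds since the Unix epoch) in its high bits and an 18-bit logical counter in its low bits. The following minimal Go sketch performs the same conversion as the pd-ctl tso command used above; it is an illustration under that assumption, not the pd-ctl implementation.

package main

import (
	"fmt"
	"time"
)

// decodeTSO splits a TiDB/PD timestamp oracle value into its physical
// (wall-clock, millisecond) part and its 18-bit logical counter.
func decodeTSO(tso uint64) (time.Time, uint64) {
	const logicalBits = 18
	physicalMillis := int64(tso >> logicalBits)
	logical := tso & ((1 << logicalBits) - 1)
	return time.UnixMilli(physicalMillis), logical
}

func main() {
	// start_ts reported by the /mvcc/key endpoint above.
	t, logical := decodeTSO(426624261094637572)
	fmt.Println(t.In(time.FixedZone("CST", 8*3600)), logical)
	// prints: 2021-07-28 11:20:02.247 +0800 CST 4
}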
      
Tammyxia added the type/bug and severity/critical labels on Jul 28, 2021
overvenus (Member) commented

Could you upload TiCDC and TiKV logs?

Tammyxia (Author) commented

TiKV and CDC logs have been uploaded to qa-minio/nfs/CDC_issue_log/CDC_issue_log/2400.

overvenus self-assigned this on Jul 29, 2021
overvenus (Member) commented

Update:

The root cause is still unknown; here is what I have found so far.

# upstream
mysql> select * from test0 where id>'16585095' and id < '16585198' limit 4;
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
| id       | m_req_all | text                                                                                                                                                                                                                                        | create_date | aid       | atime    | create_time         |
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
| 16585096 |         1 | aLdZREKkzTLeUpmotmKKbPQZWMyeONeynOQzrSjbMOcYXWVJklbiWZsTdwRAUphGmHgHptxarGnaQlVGyYIiqZEovnXuUMIphqJuIArVVjsIffAVfVYNCZdKTOoZeOfGSZuXpUiSqUTMTALLMyXWUHsEnOqNthGKZSjCuQTsKUSnXfiuMPocEpWqnYcLDbwADCvkqDMNogntsuAnVhLsQDhnQvAFiSCYeKDBesOfilz | 2021-05-08  | 100132649 | 12:01:00 | 2021-07-28 11:18:03 |
|*16585097 |         1 | gsaLyOmzhYrSAGNpNufLzmlaNFUmHxSnxuyKwWzSHovXpINtfUWMbyeypzrUsTMZcKkoobNvMAkLCEelaqZopoMvlaeGmeuzqSEHKlrBSYLVhEoumMuKjxDVuMktnQBdQuZUBAZAOFvuBbTjmWUxJUFjsitgLcnlPBMlfBLNXKHHGHBoSrAOWJuxlndfMIczoTrHyUxWoLHdTVNYcFbfpYheoVxVLCZfLgqyMXJKrkO | 2021-05-08  | 100132650 | 12:01:00 | 2021-07-28 11:18:03 |
| 16585098 |         1 | MKnCjhAwKHvNuJyCEifcQybmXsiqjaPgVZyPNgyDHcRikMZkeWQrdPEsKyjysJGnAfZEUBaxVLvTKWvUelSurrOznVabaQJujoZlvfMZrIUuoIkiEtqHnysDdKdqNtAodQMvHJhhoNmkWpdLAoZkFGJYjgvtjCHElcIGnuZsUwVEezoZZAJvQqIltTMpGpDcOStZMGGEroYIGLgXYOnWyUucLDqPGfwnEGnveeNnvuA | 2021-05-08  | 100132651 | 12:01:00 | 2021-07-28 11:18:03 |
| 16585099 |         1 | ltSCDBeWDMJcwhjIEHaQREDiFvwAAtKwstuoPYPayHctnKJMQIuHNxyrGhlnXVnZLNoiHHpsRZvZtufUXyIKLVklTtJdKZgnJFcKvQNTWzuxKUAAhhWEVQszUOzPSlgKQLqasgqjSsNgZygVVdPCcJNJiWjgddpydGtPpWBIBTDJugNOGukmaDLctRzDwZHMUlJwCzfOIkAhniVjqShlCCfhwvgSzdkFJBjWjpumvSF | 2021-05-08  | 100132652 | 12:01:00 | 2021-07-28 11:18:03 |
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+

# downstream
mysql> select * from test0 where id>'16585095' and id < '16585198';
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
| id       | m_req_all | text                                                                                                                                                                                                                                        | create_date | aid       | atime    | create_time         |
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
| 16585096 |         1 | aLdZREKkzTLeUpmotmKKbPQZWMyeONeynOQzrSjbMOcYXWVJklbiWZsTdwRAUphGmHgHptxarGnaQlVGyYIiqZEovnXuUMIphqJuIArVVjsIffAVfVYNCZdKTOoZeOfGSZuXpUiSqUTMTALLMyXWUHsEnOqNthGKZSjCuQTsKUSnXfiuMPocEpWqnYcLDbwADCvkqDMNogntsuAnVhLsQDhnQvAFiSCYeKDBesOfilz | 2021-05-08  | 100132649 | 12:01:00 | 2021-07-28 11:18:03 |
| 16585197 |         1 | RknbktFTIZIVOMnITTCLmoBuylBphYrxPSGCMSuKOdiuGDSEfyYumrvOslQhSxzSsbIQeMDNwaxtBqZTVfmrHwSftlepBRJqlhgUZxHBViYgxcANtEBSjoOqdwKmJfSyQlHAZXRFkeqeYzVCzlDSPUDGUgJCZBrzrVgShyAUxKRTpYkVeHItSUebxpBkCTFrJwHuINrDmHCBTKGKRrQceVJkGtDLDiWvooAeBGvGrjB | 2021-05-08  | 100132700 | 12:01:00 | 2021-07-28 11:18:03 |
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
2 rows in set (0.00 sec)

mysql> select _tidb_rowid from test0 PARTITION (`p20210508`) where id=16585097;
+-------------+
| _tidb_rowid |
+-------------+
|    16585497 |
+-------------+
1 row in set (0.00 sec)

➜ curl http://172.16.6.24:10080/mvcc/key/test123/test0\(p20210508\)/16585497
{
 "key": "748000000000000E905F728000000000FD1319",
 "region_id": 1497431,
 "value": {
  "info": {
   "writes": [
    {
     "start_ts": 426624230013796386,
     "commit_ts": 426624231625982140
    }
   ],
   "values": [
    {
     "start_ts": 426624230013796386,
     "value": "gAAHAAAAAQIDBAUGBwQABQDwAPgA/AAEAQwBiRH9AAFnc2FMeU9temhZclNBR05wTnVmTHptbGFORlVtSHhTbnh1eUt3V3pTSG92WHBJTnRmVVdNYnlleXB6clVzVE1aY0trb29iTnZNQWtMQ0VlbGFxWm9wb012bGFlR21ldXpxU0VIS2xyQlNZTFZoRW91bU11S2p4RFZ1TWt0blFCZFF1WlVCQVpBT0Z2dUJiVGptV1V4SlVGanNpdGdMY25sUEJNbGZCTE5YS0hIR0hCb1NyQU9XSnV4bG5kZk1JY3pvVHJIeVV4V29MSGRUVk5ZY0ZiZnBZaGVvVnhWTENaZkxncXlNWEpLcmtPAAAAAACQqRkq5/cFANjuQFgnAAAAAACDNDiqGQ=="
    }
   ]
  }
}

# commit_ts
➜ pd-ctl tso 426624231625982140
system:  2021-07-28 11:18:09.833 +0800 CST
logic:  188

➜ mok 748000000000000E905F728000000000FD1319
"748000000000000E905F728000000000FD1319"
└─## decode hex key
  └─"t\200\000\000\000\000\000\016\220_r\200\000\000\000\000\375\023\031"
    └─## table row key
      ├─table: 3728
      └─row: 16585497

➜ tikv-ctl --to-escaped 748000000000000E905F728000000000FD1319
t\200\000\000\000\000\000\016\220_r\200\000\000\000\000\375\023\031

➜ tikv-ctl --encode 't\200\000\000\000\000\000\016\220_r\200\000\000\000\000\375\023\031'
748000000000000eff905f728000000000fffd13190000000000fa

The above shows one of the lost rows, id=16585097 (whose _tidb_rowid is 16585497):

  • Its encoded key is 748000000000000eff905f728000000000fffd13190000000000fa
  • It was committed at 2021-07-28 11:18:09.833 (commit_ts 426624231625982140)
[2021/07/28 11:18:09.905 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=247] [regionID=1496186] [startKey=748000000000000eff905f728000000000ffef71390000000000fa] [endKey=748000000000000eff9060000000000000f9] [checkpointTs=426624229908938818]
[2021/07/28 11:18:09.905 +08:00] [INFO] [region_cache.go:842] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=1496186] [currIdx=1] [leaderStoreID=7]
[2021/07/28 11:18:09.905 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=247] [regionID=1496186] [startKey=748000000000000eff905f728000000000ffef71390000000000fa] [endKey=748000000000000eff9060000000000000f9] [checkpointTs=426624229908938818]
[2021/07/28 11:18:20.497 +08:00] [INFO] [client.go:1039] ["EventFeed disconnected"] [regionID=1496186] [requestID=41389] [span="[748000000000000eff905f728000000000ffef71390000000000fa, 748000000000000eff9060000000000000f9)"] [checkpoint=426624234077028424] [error="[CDC:ErrEventFeedEventError]epoch_not_match:<current_regions:<id:1497427 start_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\357q9\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\367\\227@\\000\\000\\000\\000\\000\\372\" region_epoch:<conf_ver:1109 version:6134 > peers:<id:1497428 store_id:15 > peers:<id:1497429 store_id:13 > peers:<id:1497430 store_id:7 > > current_regions:<id:1497431 start_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\367\\227@\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\377\\275]\\000\\000\\000\\000\\000\\372\" region_epoch:<conf_ver:1109 version:6134 > peers:<id:1497432 store_id:15 > peers:<id:1497433 store_id:13 > peers:<id:1497434 store_id:7 > > current_regions:<id:1497435 start_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\377\\275]\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\001\\377\\007\\343e\\000\\000\\000\\000\\000\\372\" region_epoch:<conf_ver:1109 version:6134 > peers:<id:1497436 store_id:15 > peers:<id:1497437 store_id:13 > peers:<id:1497438 store_id:7 > > current_regions:<id:1496186 start_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\001\\377\\007\\343e\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\222\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:1109 version:6134 > peers:<id:1496187 store_id:15 > peers:<id:1496188 store_id:13 > peers:<id:1496689 store_id:7 > > > "]
[2021/07/28 11:18:20.497 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=248] [regionID=1496186] [startKey=748000000000000eff905f728000000000ffef71390000000000fa] [endKey=748000000000000eff9060000000000000f9] [checkpointTs=426624234077028424]

So the data loss should have occurred between 2021/07/28 11:18:09.905 +08:00 and 2021/07/28 11:18:20.497 +08:00.
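
For reference, the table-row key decoding that mok performs above can be reproduced with the minimal Go sketch below. It assumes the standard TiDB row-key layout t{table_id}_r{row_id}, where both int64 values are stored big-endian with the sign bit flipped so keys sort correctly; this is an illustration, not the mok or tikv-ctl implementation.

package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// decodeRowKey decodes an unencoded TiDB table-row key of the form
// 't' + 8-byte table id + "_r" + 8-byte row id.
func decodeRowKey(hexKey string) (tableID, rowID int64, err error) {
	raw, err := hex.DecodeString(hexKey)
	if err != nil {
		return 0, 0, err
	}
	if len(raw) != 19 || raw[0] != 't' || raw[9] != '_' || raw[10] != 'r' {
		return 0, 0, fmt.Errorf("unexpected key layout: %q", hexKey)
	}
	// Flip the sign bit back to recover the signed integers.
	tableID = int64(binary.BigEndian.Uint64(raw[1:9]) ^ (1 << 63))
	rowID = int64(binary.BigEndian.Uint64(raw[11:19]) ^ (1 << 63))
	return tableID, rowID, nil
}

func main() {
	// MVCC key of the lost row, taken from the curl output above.
	tableID, rowID, err := decodeRowKey("748000000000000E905F728000000000FD1319")
	if err != nil {
		panic(err)
	}
	fmt.Println(tableID, rowID) // prints: 3728 16585497
}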

lonng assigned ben1009 and unassigned overvenus on Aug 12, 2021
ben1009 added a commit that referenced this issue Aug 17, 2021
amyangfei (Contributor) commented

Current status

  • We have not reproduced this bug yet.
  • In the user's production environment, we recommend pausing the changefeed before cluster scale-in or scale-out as a workaround.
  • More work on replication correctness will follow.

amyangfei added the subject/correctness label on Nov 8, 2021
AkiraXie added the area/ticdc label on Dec 6, 2021
overvenus (Member) commented

We have fixed several correctness bugs, e.g. #3545, and this issue has not occurred since, so I am changing the severity to severity/moderate.

overvenus (Member) commented

We have also fixed bugs on the TiKV side that may cause data loss, e.g. #4782.

I'm going to close this issue; feel free to reopen it if it happens again.
