
data lost after scale-in 3x tikv node #2400

Closed
Tammyxia opened this issue Jul 28, 2021 · 6 comments
Labels: area/ticdc, severity/moderate, subject/correctness, type/bug

Tammyxia commented Jul 28, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
  • This is the upstream TiDB cluster:

[screenshot: upstream TiDB cluster topology]

There are 12 TiKV nodes in total, plus 2 CDC servers and 2 changefeeds. The workload writes to the upstream TiDB cluster:
$./bin/go-ycsb load mysql -P ./workloads/oncall2929 -p mysql.host=172.16.6.25 -p mysql.port=4000 -p operationcount=5000000 -p autocommit=1 --threads=200 -p databasename=test123 -p tablename=test0

Starting component cdc: /root/.tiup/components/cdc/v4.0.14/cdc cli changefeed list --pd=http://172.16.6.25:2379
[
{
"id": "kafka-task-23",
"summary": {
"state": "normal",
"tso": 426624137162391554,
"checkpoint": "2021-07-28 11:12:09.483",
"error": null
}
},
{
"id": "replication-task-24",
"summary": {
"state": "normal",
"tso": 426624137162391554,
"checkpoint": "2021-07-28 11:12:09.483",
"error": null
}
}
]

  • Scale in 3 TiKV nodes: tiup cluster scale-in 360UP -N 172.16.6.12:20170, tiup cluster scale-in 360UP -N 172.16.6.13:20171, tiup cluster scale-in 360UP -N 172.16.6.13:20170

  • Check tiup audit log:
    fGcys7dWFBX:%2Froot%2F.tiup%2Fcomponents%2Fcluster%2Fv1.5.1%2Ftiup-cluster scale-in 360UP -N 172.16.6.12%3A20170
    fGcys7dWFBX:2021-07-28T11:15:03.550+0800 INFO Execute command {"command": "tiup cluster scale-in 360UP -N 172.16.6.12:20170"}
    fGcyt3HVY3p:%2Froot%2F.tiup%2Fcomponents%2Fcluster%2Fv1.5.1%2Ftiup-cluster scale-in 360UP -N 172.16.6.13%3A20171
    fGcyt3HVY3p:2021-07-28T11:15:19.819+0800 INFO Execute command {"command": "tiup cluster scale-in 360UP -N 172.16.6.13:20171"}
    fGcyvgT9nRj:%2Froot%2F.tiup%2Fcomponents%2Fcluster%2Fv1.5.1%2Ftiup-cluster scale-in 360UP -N 172.16.6.13%3A20170
    fGcyvgT9nRj:2021-07-28T11:15:38.292+0800 INFO Execute command {"command": "tiup cluster scale-in 360UP -N 172.16.6.13:20170"}

  • Stop the workload I/O and wait for the sync task to complete.

  • Check data consistency.

  2. What did you expect to see?
  • Data is consistent between the upstream and downstream.
  3. What did you see instead?
  • 50 rows of data are lost in the downstream TiDB:

upstream 6.24:
MySQL [test123]> select count(*) from test0;
+-----------+
| count(*)  |
+-----------+
| 333895176 |
+-----------+
1 row in set (8.26 sec)

downstream 6.30:
MySQL [test123]> select count(*) from test0;
+-----------+
| count(*)  |
+-----------+
| 333895126 |
+-----------+
1 row in set (8.96 sec)

  • The table structure is confidential.
  • The lost rows fall within "> select * from test0 where id>'16585095' and id < '16585198';" -> the upstream has 50 rows here, while the downstream has 0 rows.
  • The data-loss datetime in the downstream TiDB is system: 2021-07-28 11:20:02.247 +0800 CST, derived as follows (a sketch of this TSO-to-time conversion follows the version info below):
    $curl http://172.16.6.30:10080/mvcc/key/test123/test0\(p20210508\)/16585000 ->"start_ts": 426624261094637572
    $ tiup ctl:v4.0.14 pd -u http://72.16.6.30:2379 tso 426624261094637572 ->system: 2021-07-28 11:20:02.247 +0800 CST
  4. Versions of the cluster

    • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

      4.0.14
      
    • TiCDC version (execute cdc version):

      4.0.14
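
For reference on how the TSOs above map to wall-clock times: a TiDB/PD TSO stores the physical time (milliseconds since the Unix epoch) in its high bits and an 18-bit logical counter in its low bits. The following minimal Go sketch performs the same conversion as the pd-ctl tso command used above; it is an illustration under that assumption, not the pd-ctl implementation.

package main

import (
	"fmt"
	"time"
)

// decodeTSO splits a TiDB/PD timestamp oracle value into its physical
// (wall-clock, millisecond) part and its 18-bit logical counter.
func decodeTSO(tso uint64) (time.Time, uint64) {
	const logicalBits = 18
	physicalMillis := int64(tso >> logicalBits)
	logical := tso & ((1 << logicalBits) - 1)
	return time.UnixMilli(physicalMillis), logical
}

func main() {
	// start_ts reported by the /mvcc/key endpoint above.
	t, logical := decodeTSO(426624261094637572)
	fmt.Println(t.In(time.FixedZone("CST", 8*3600)), logical)
	// prints: 2021-07-28 11:20:02.247 +0800 CST 4
}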
      
Tammyxia added the type/bug and severity/critical labels on Jul 28, 2021
overvenus (Member) commented

Could you upload TiCDC and TiKV logs?

Tammyxia (Author) commented

TiKV and CDC logs have been uploaded to qa-minio/nfs/CDC_issue_log/CDC_issue_log/2400.

overvenus self-assigned this on Jul 29, 2021
overvenus (Member) commented

Update:

The root cause is still unknown; here is what I have found so far.

# upstream
mysql> select * from test0 where id>'16585095' and id < '16585198' limit 4;
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
| id       | m_req_all | text                                                                                                                                                                                                                                        | create_date | aid       | atime    | create_time         |
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
| 16585096 |         1 | aLdZREKkzTLeUpmotmKKbPQZWMyeONeynOQzrSjbMOcYXWVJklbiWZsTdwRAUphGmHgHptxarGnaQlVGyYIiqZEovnXuUMIphqJuIArVVjsIffAVfVYNCZdKTOoZeOfGSZuXpUiSqUTMTALLMyXWUHsEnOqNthGKZSjCuQTsKUSnXfiuMPocEpWqnYcLDbwADCvkqDMNogntsuAnVhLsQDhnQvAFiSCYeKDBesOfilz | 2021-05-08  | 100132649 | 12:01:00 | 2021-07-28 11:18:03 |
|*16585097 |         1 | gsaLyOmzhYrSAGNpNufLzmlaNFUmHxSnxuyKwWzSHovXpINtfUWMbyeypzrUsTMZcKkoobNvMAkLCEelaqZopoMvlaeGmeuzqSEHKlrBSYLVhEoumMuKjxDVuMktnQBdQuZUBAZAOFvuBbTjmWUxJUFjsitgLcnlPBMlfBLNXKHHGHBoSrAOWJuxlndfMIczoTrHyUxWoLHdTVNYcFbfpYheoVxVLCZfLgqyMXJKrkO | 2021-05-08  | 100132650 | 12:01:00 | 2021-07-28 11:18:03 |
| 16585098 |         1 | MKnCjhAwKHvNuJyCEifcQybmXsiqjaPgVZyPNgyDHcRikMZkeWQrdPEsKyjysJGnAfZEUBaxVLvTKWvUelSurrOznVabaQJujoZlvfMZrIUuoIkiEtqHnysDdKdqNtAodQMvHJhhoNmkWpdLAoZkFGJYjgvtjCHElcIGnuZsUwVEezoZZAJvQqIltTMpGpDcOStZMGGEroYIGLgXYOnWyUucLDqPGfwnEGnveeNnvuA | 2021-05-08  | 100132651 | 12:01:00 | 2021-07-28 11:18:03 |
| 16585099 |         1 | ltSCDBeWDMJcwhjIEHaQREDiFvwAAtKwstuoPYPayHctnKJMQIuHNxyrGhlnXVnZLNoiHHpsRZvZtufUXyIKLVklTtJdKZgnJFcKvQNTWzuxKUAAhhWEVQszUOzPSlgKQLqasgqjSsNgZygVVdPCcJNJiWjgddpydGtPpWBIBTDJugNOGukmaDLctRzDwZHMUlJwCzfOIkAhniVjqShlCCfhwvgSzdkFJBjWjpumvSF | 2021-05-08  | 100132652 | 12:01:00 | 2021-07-28 11:18:03 |
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+

# downstream
mysql> select * from test0 where id>'16585095' and id < '16585198';
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
| id       | m_req_all | text                                                                                                                                                                                                                                        | create_date | aid       | atime    | create_time         |
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
| 16585096 |         1 | aLdZREKkzTLeUpmotmKKbPQZWMyeONeynOQzrSjbMOcYXWVJklbiWZsTdwRAUphGmHgHptxarGnaQlVGyYIiqZEovnXuUMIphqJuIArVVjsIffAVfVYNCZdKTOoZeOfGSZuXpUiSqUTMTALLMyXWUHsEnOqNthGKZSjCuQTsKUSnXfiuMPocEpWqnYcLDbwADCvkqDMNogntsuAnVhLsQDhnQvAFiSCYeKDBesOfilz | 2021-05-08  | 100132649 | 12:01:00 | 2021-07-28 11:18:03 |
| 16585197 |         1 | RknbktFTIZIVOMnITTCLmoBuylBphYrxPSGCMSuKOdiuGDSEfyYumrvOslQhSxzSsbIQeMDNwaxtBqZTVfmrHwSftlepBRJqlhgUZxHBViYgxcANtEBSjoOqdwKmJfSyQlHAZXRFkeqeYzVCzlDSPUDGUgJCZBrzrVgShyAUxKRTpYkVeHItSUebxpBkCTFrJwHuINrDmHCBTKGKRrQceVJkGtDLDiWvooAeBGvGrjB | 2021-05-08  | 100132700 | 12:01:00 | 2021-07-28 11:18:03 |
+----------+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+----------+---------------------+
2 rows in set (0.00 sec)

mysql> select _tidb_rowid from test0 PARTITION (`p20210508`) where id=16585097;
+-------------+
| _tidb_rowid |
+-------------+
|    16585497 |
+-------------+
1 row in set (0.00 sec)

➜ curl http://172.16.6.24:10080/mvcc/key/test123/test0\(p20210508\)/16585497
{
 "key": "748000000000000E905F728000000000FD1319",
 "region_id": 1497431,
 "value": {
  "info": {
   "writes": [
    {
     "start_ts": 426624230013796386,
     "commit_ts": 426624231625982140
    }
   ],
   "values": [
    {
     "start_ts": 426624230013796386,
     "value": "gAAHAAAAAQIDBAUGBwQABQDwAPgA/AAEAQwBiRH9AAFnc2FMeU9temhZclNBR05wTnVmTHptbGFORlVtSHhTbnh1eUt3V3pTSG92WHBJTnRmVVdNYnlleXB6clVzVE1aY0trb29iTnZNQWtMQ0VlbGFxWm9wb012bGFlR21ldXpxU0VIS2xyQlNZTFZoRW91bU11S2p4RFZ1TWt0blFCZFF1WlVCQVpBT0Z2dUJiVGptV1V4SlVGanNpdGdMY25sUEJNbGZCTE5YS0hIR0hCb1NyQU9XSnV4bG5kZk1JY3pvVHJIeVV4V29MSGRUVk5ZY0ZiZnBZaGVvVnhWTENaZkxncXlNWEpLcmtPAAAAAACQqRkq5/cFANjuQFgnAAAAAACDNDiqGQ=="
    }
   ]
  }
}

# commit_ts
➜ pd-ctl tso 426624231625982140
system:  2021-07-28 11:18:09.833 +0800 CST
logic:  188

➜ mok 748000000000000E905F728000000000FD1319
"748000000000000E905F728000000000FD1319"
└─## decode hex key
  └─"t\200\000\000\000\000\000\016\220_r\200\000\000\000\000\375\023\031"
    └─## table row key
      ├─table: 3728
      └─row: 16585497

➜ tikv-ctl --to-escaped 748000000000000E905F728000000000FD1319
t\200\000\000\000\000\000\016\220_r\200\000\000\000\000\375\023\031

➜ tikv-ctl --encode 't\200\000\000\000\000\000\016\220_r\200\000\000\000\000\375\023\031'
748000000000000eff905f728000000000fffd13190000000000fa

The above shows one of the lost rows, id=16585097 (whose _tidb_rowid is 16585497):

  • Its encoded key is 748000000000000eff905f728000000000fffd13190000000000fa
  • It was committed at 2021-07-28 11:18:09.833 (commit_ts 426624231625982140)
[2021/07/28 11:18:09.905 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=247] [regionID=1496186] [startKey=748000000000000eff905f728000000000ffef71390000000000fa] [endKey=748000000000000eff9060000000000000f9] [checkpointTs=426624229908938818]
[2021/07/28 11:18:09.905 +08:00] [INFO] [region_cache.go:842] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=1496186] [currIdx=1] [leaderStoreID=7]
[2021/07/28 11:18:09.905 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=247] [regionID=1496186] [startKey=748000000000000eff905f728000000000ffef71390000000000fa] [endKey=748000000000000eff9060000000000000f9] [checkpointTs=426624229908938818]
[2021/07/28 11:18:20.497 +08:00] [INFO] [client.go:1039] ["EventFeed disconnected"] [regionID=1496186] [requestID=41389] [span="[748000000000000eff905f728000000000ffef71390000000000fa, 748000000000000eff9060000000000000f9)"] [checkpoint=426624234077028424] [error="[CDC:ErrEventFeedEventError]epoch_not_match:<current_regions:<id:1497427 start_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\357q9\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\367\\227@\\000\\000\\000\\000\\000\\372\" region_epoch:<conf_ver:1109 version:6134 > peers:<id:1497428 store_id:15 > peers:<id:1497429 store_id:13 > peers:<id:1497430 store_id:7 > > current_regions:<id:1497431 start_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\367\\227@\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\377\\275]\\000\\000\\000\\000\\000\\372\" region_epoch:<conf_ver:1109 version:6134 > peers:<id:1497432 store_id:15 > peers:<id:1497433 store_id:13 > peers:<id:1497434 store_id:7 > > current_regions:<id:1497435 start_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\000\\377\\377\\275]\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\001\\377\\007\\343e\\000\\000\\000\\000\\000\\372\" region_epoch:<conf_ver:1109 version:6134 > peers:<id:1497436 store_id:15 > peers:<id:1497437 store_id:13 > peers:<id:1497438 store_id:7 > > current_regions:<id:1496186 start_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\220_r\\200\\000\\000\\000\\001\\377\\007\\343e\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\016\\377\\222\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:1109 version:6134 > peers:<id:1496187 store_id:15 > peers:<id:1496188 store_id:13 > peers:<id:1496689 store_id:7 > > > "]
[2021/07/28 11:18:20.497 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=248] [regionID=1496186] [startKey=748000000000000eff905f728000000000ffef71390000000000fa] [endKey=748000000000000eff9060000000000000f9] [checkpointTs=426624234077028424]

So the data loss should have occurred between 2021/07/28 11:18:09.905 +08:00 and 2021/07/28 11:18:20.497 +08:00.
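
For reference, the table-row key decoding that mok performs above can be reproduced with the minimal Go sketch below. It assumes the standard TiDB row-key layout t{table_id}_r{row_id}, where both int64 values are stored big-endian with the sign bit flipped so keys sort correctly; this is an illustration, not the mok or tikv-ctl implementation.

package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// decodeRowKey decodes an unencoded TiDB table-row key of the form
// 't' + 8-byte table id + "_r" + 8-byte row id.
func decodeRowKey(hexKey string) (tableID, rowID int64, err error) {
	raw, err := hex.DecodeString(hexKey)
	if err != nil {
		return 0, 0, err
	}
	if len(raw) != 19 || raw[0] != 't' || raw[9] != '_' || raw[10] != 'r' {
		return 0, 0, fmt.Errorf("unexpected key layout: %q", hexKey)
	}
	// Flip the sign bit back to recover the signed integers.
	tableID = int64(binary.BigEndian.Uint64(raw[1:9]) ^ (1 << 63))
	rowID = int64(binary.BigEndian.Uint64(raw[11:19]) ^ (1 << 63))
	return tableID, rowID, nil
}

func main() {
	// MVCC key of the lost row, taken from the curl output above.
	tableID, rowID, err := decodeRowKey("748000000000000E905F728000000000FD1319")
	if err != nil {
		panic(err)
	}
	fmt.Println(tableID, rowID) // prints: 3728 16585497
}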

lonng assigned ben1009 and unassigned overvenus on Aug 12, 2021
ben1009 added a commit that referenced this issue Aug 17, 2021
amyangfei (Contributor) commented

Current status

  • We have not reproduced this bug yet.
  • In the user's production environment, we recommend pausing the changefeed before cluster scale-in or scale-out as a workaround.
  • More work on replication correctness will follow.

amyangfei added the subject/correctness label on Nov 8, 2021
AkiraXie added the area/ticdc label on Dec 6, 2021
overvenus (Member) commented

We have fixed several correctness bugs, e.g. #3545, and this issue has not occurred since, so I am changing the severity to severity/moderate.

overvenus (Member) commented

We have also fixed bugs on the TiKV side that may cause data loss, e.g. #4782.

I'm going to close this issue; feel free to reopen it if it happens again.
