After upgrade from 5.4 to 7.5, my TiFlash data is gone after a TiKV node disconnected #8777
Comments
Hi @uzuki27, thanks for reporting the problem.
Hi @JaySon-Huang, thanks for the reply.
and then
and the databases are dropped
ending with
After that comes the ERROR I posted above; my TiFlash data is gone although its regions are still there.
@uzuki27 The logging you posted here shows the data for those databases being physically dropped. Before that, those databases should have been marked as "tombstone" around 2024/02/22 09:48 +07:00 (from the logging).
@JaySon-Huang, I attached the log around that time, and as you said, it marked some databases as tombstone and has the "Database xxx dropped during ..." log. A few rows above that, we have
but the tables in this list include tables that I haven't set to have a TiFlash replica.
@JaySon-Huang I just realized that the TiFlash schema re-sync begins with
and it is around the time that the disconnected (evicted) TiKV node began to be re-elected as leader for regions again.
@uzuki27 Is there any special task like
@JaySon-Huang I don't think there is any triggered task, because each time the TiFlash table drop occurred it was at a different time of day. I only have a daily br backup job for the databases at 2:30 AM (+07:00). I will continue investigating the logs around that time and will send them to you if I find anything suspicious.
Hi, @uzuki27. The "Meets a schema diff with regenerate_schema_map flag" error means the
In most cases,
@CalvinNeo I see; then it's really strange, because I don't intend to run any task like that.
@uzuki27 I still can't figure out why TiFlash meets regenerate_schema_map. Normally, after users execute DDL statements, the TiDB server will generate some "SchemaDiff" records to indicate what happened, and TiFlash will follow the "SchemaDiff" to figure out what DDL operations need to be executed. However, in the DDL framework refactor since v7.2 (#7437), there is a mistake in the implementation. In the following code, only newly created databases that do not exist locally in TiFlash are put in
tiflash/dbms/src/TiDB/Schema/SchemaBuilder.cpp Lines 1269 to 1391 in fe6621b
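To make the mistake described above concrete, here is a minimal, self-contained sketch of that logic, using assumed names such as created_db_set and local_databases (it is not the actual SchemaBuilder code): because only databases that are new to TiFlash enter the set, databases that already exist locally fall through to the drop path.

```cpp
// Minimal sketch of the flawed "regenerate schema map" handling described above.
// All names here are illustrative assumptions, not the real TiFlash identifiers.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

using DatabaseID = long;

int main()
{
    // Databases TiFlash already stores locally.
    std::map<DatabaseID, std::string> local_databases = {{1, "app_db"}, {2, "orders_db"}};
    // Full database list fetched from TiDB after the schema map is regenerated.
    std::vector<DatabaseID> databases_from_tidb = {1, 2, 3};

    std::set<DatabaseID> created_db_set;
    for (DatabaseID db_id : databases_from_tidb)
    {
        // Bug: only databases that do NOT yet exist locally are remembered here.
        if (local_databases.count(db_id) == 0)
            created_db_set.insert(db_id); // e.g. applyCreateSchema(db_id)
    }

    for (const auto & [db_id, name] : local_databases)
    {
        // Existing local databases (ids 1 and 2) were never inserted into
        // created_db_set, so they take the drop path; the sketch just prints
        // what would be dropped.
        if (created_db_set.count(db_id) == 0)
            std::cout << "would drop local database " << name << " (id=" << db_id << ")\n";
    }
    return 0;
}
```

Under logic like this, a regenerated schema map on an otherwise healthy cluster is enough to tombstone every pre-existing local database, which matches the "Database xxx dropped during ..." logging mentioned earlier in this thread.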
@uzuki27 We will fix this issue ASAP, and we are sorry for any negative impact it has caused you.
Thank you for your enthusiastic support. I'm really looking forward to it and will gladly test it.
@uzuki27 Hi, I still want to figure out why TiFlash meets regenerate_schema_map, because I found that an unstable network between TiFlash and TiKV does not make TiFlash run into it.
@JaySon-Huang I have just scanned all nodes of the working cluster and haven't found any job or task that executes anything like that.
@uzuki27 Got it. Thanks
@JaySon-Huang After reproducing the issue twice more, I can tell the
@uzuki27 It would be helpful if you could provide the DDL actions on TiDB when TiFlash receives it. For example, the following TiFlash logging shows that
Then you can filter which DDL was executed in the tidb logging by the following keywords: For example, the following logging shows what DDL was executed between
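Since the exact TiDB log keywords referenced above did not survive in this thread, the following is only a rough, assumed helper for doing that filtering yourself: it scans a tidb.log file and prints lines whose schema-version number falls inside the range seen in the TiFlash log (762861..762867 in this thread). The "version" pattern in the regex is an assumption about the log format and may need adjusting to your actual tidb.log lines.

```cpp
// Assumed helper (not an official TiDB tool): print tidb.log lines whose schema
// version falls inside a given range, to narrow down which DDL ran in that window.
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main(int argc, char ** argv)
{
    if (argc < 2)
    {
        std::cerr << "usage: filter_ddl <tidb.log> [min_ver] [max_ver]\n";
        return 1;
    }
    const long min_ver = argc > 2 ? std::stol(argv[2]) : 762861;
    const long max_ver = argc > 3 ? std::stol(argv[3]) : 762867;

    std::ifstream in(argv[1]);
    // Assumed pattern for the schema-version field in the log line.
    std::regex ver_re(R"(version["=:\s]+(\d+))");
    std::string line;
    while (std::getline(in, line))
    {
        std::smatch m;
        if (std::regex_search(line, m, ver_re))
        {
            const long ver = std::stol(m[1].str());
            if (ver >= min_ver && ver <= max_ver)
                std::cout << line << '\n'; // candidate DDL / schema-version log line
        }
    }
    return 0;
}
```

For example: `g++ -std=c++17 filter_ddl.cpp -o filter_ddl && ./filter_ddl tidb.log 762861 762867`.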
@JaySon-Huang then, going back to this TiFlash log
versions between 762861 and 762867, node1:
and node2:
another attempt:
tidb node1:
tidb node2:
Does there exist any tidb logging like "resolveLock rollback" or "handle ddl job failed"?
@JaySon-Huang, yes, around these logs we have a lot of
@uzuki27 OK, I got it. It is related to another TiFlash bug: #8578. When multiple DDL jobs run concurrently, or TiKV is not stable, TiDB can generate some "empty" schema diffs. And the TiFlash bug #8578 will try to access the
Both of these bugs are fixed in the next patch version, v7.5.1, which is planned to be released in early March. You can try to upgrade to the new version.
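As an illustration only, using a simplified SchemaDiff and a hypothetical readSchemaDiff helper (not the actual code behind #8578), the sketch below shows why an "empty" schema diff has to be handled explicitly rather than read as if it always carried valid data:

```cpp
// Minimal sketch of the empty-schema-diff hazard discussed above; all names and
// behaviour here are assumptions for illustration, not the real TiFlash code.
#include <iostream>
#include <optional>

struct SchemaDiff
{
    bool regenerate_schema_map = false;
};

// Stand-in for reading the schema diff of a given version; an empty optional
// models the "empty" diff TiDB may produce under concurrent DDL or unstable TiKV.
std::optional<SchemaDiff> readSchemaDiff(long version)
{
    if (version % 2 == 0)
        return std::nullopt; // simulate an empty diff
    return SchemaDiff{};
}

void applyDiff(long version)
{
    auto diff = readSchemaDiff(version);
    // Buggy pattern: reading diff->regenerate_schema_map without checking
    // has_value() can take the wrong branch (or be undefined behaviour).
    if (!diff)
    {
        std::cout << "version " << version << ": empty diff, fall back to a full schema sync\n";
        return;
    }
    if (diff->regenerate_schema_map)
        std::cout << "version " << version << ": regenerate schema map\n";
    else
        std::cout << "version " << version << ": apply normal diff\n";
}

int main()
{
    applyDiff(762861);
    applyDiff(762862);
    return 0;
}
```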
Thank you, I'm really looking forward to it.
/found community |
Bug Report
I’m facing an issue with TiFlash after upgrading my TiDB cluster from v5.4 to v7.5.
A few days after the upgrade, I realized that the data had become out of sync between TiKV and TiFlash. Queries via TiFlash returned almost empty data. When I show tiflash_replica, the progress of all my tables is back to 0 or some value other than 1, but available is still 1.
After that, I decided to set all replicas to 0, scale in both of my TiFlash nodes, wait for Pending Offline to finish, prune them, and scale them out into the cluster again. After setting the replica count back to 2 for my tables, replication ran normally. Once replication of all tables had completed, my queries on TiFlash were in sync again. But after about 1 day the issue occurred again. I did the scale-in/scale-out two more times, and the issue still hasn’t been resolved.
1. Minimal reproduce step (Required)
After several attempts, I realized they all shared the same scenario: about 1 day after all tables had completed replication, one of the TiKV nodes evicted all of its leader regions and then rebalanced them again.
The evicted TiKV node's log raised some ERRORs like this
and re-joined with this log
I don't think an OOM kill or restart occurred here. After some minutes, TiKV has this log
Following that, the 2 TiFlash nodes have some strange logs and the data on TiFlash becomes empty.
and then
As in the image below, the data on the 2 TiFlash nodes (10.0.0.4-5) is gone but the region size still does not decrease. At this point all TiFlash queries return inaccurate results, but TiKV queries are normal.
If I set the replicas to 0 and re-set them to 2 again, the two TiFlash nodes become Disconnected and then Down with some out-of-sync errors.
Retrying the scale-in/scale-out for a 3rd attempt still hits the same issue.
2. What did you expect to see? (Required)
TiFlash not losing data and returning correct query results.
3. What did you see instead (Required)
TiFlash data out of sync with TiKV, leading to wrong TiFlash query results.
4. What is your TiFlash version? (Required)
v7.5.0