tendermint wrong Block.Header.AppHash bug #1090
Comments
Found a similar issue in |
Report 2
Discord link: https://discordapp.com/channels/798583171548840026/842550346865180682/949734501385379850 |
Since
It suggests around 40k So not having 40k |
Report 3
Discord link: https://discordapp.com/channels/798583171548840026/842550346865180682/953176901294366800 |
Report 4
Discord link: https://discordapp.com/channels/798583171548840026/842550346865180682/954245585546911794 |
Analysis of Report 1:
This is first happening during It is failing because +2/3 of validators prevoted for the block. However, our node detects both in From the logs:
Where
Based on the above, I can think of the following cases: Case 1
Case 2
Case 3
UPDATE 1: I queried height
Therefore, it is case 2. We can see in the log that the node committed incorrect state. Also, verified previous heights (3490888, 3490887) against the logs from |
Since in other reports, this bug is happening during a replay, I am wondering if the problem is case 3) from the linked |
Tagging potential unreleased fixes: IAVL: SDK: Osmosis: |
The above fixes are released, waiting for user feedback |
Separated the most important details from the latest app hash log (the original was 1.5GB): TODOs:
There was another log shared with us: https://drive.google.com/file/d/1Uh_hY-mbc1UAiad6NyFGL0ghBdnffqC3/view?usp=sharing It says These are the transactions that could have caused the error. |
With regards to these logs, the store that committed wrong data was The logs:
Actual:
All other stores had correct However, since we observed I also don't think that the error is specific to the Currently, there is still no way to reproduce this. From the discussion with the team, we will focus on our e2e test and hope to expose this problem there. At the same time, we will wait for more logs from node operators to determine if the issue is specific to the |
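For anyone comparing logs from a healthy node and a failing one, the per-store comparison described above can be sketched as follows. This is only a minimal illustration: `StoreHashes`, `DivergentStores`, the store names, and the hash values are all hypothetical placeholders, not the SDK's types or data from the shared logs.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"sort"
)

// StoreHashes maps a store name to its commit hash at one height.
// Hypothetical type for illustration; not part of the SDK.
type StoreHashes map[string][]byte

// DivergentStores returns the sorted names of stores whose hash differs
// between the two nodes, or that are present on only one of them.
func DivergentStores(expected, actual StoreHashes) []string {
	var diff []string
	for name, want := range expected {
		got, ok := actual[name]
		if !ok || !bytes.Equal(want, got) {
			diff = append(diff, name)
		}
	}
	for name := range actual {
		if _, ok := expected[name]; !ok {
			diff = append(diff, name)
		}
	}
	sort.Strings(diff)
	return diff
}

// mustHex decodes a hex string and panics on malformed input.
func mustHex(s string) []byte {
	b, err := hex.DecodeString(s)
	if err != nil {
		panic(err)
	}
	return b
}

func main() {
	// Made-up store names and hashes, standing in for values read from logs.
	expected := StoreHashes{
		"storeA": mustHex("aa01"),
		"storeB": mustHex("bb02"),
	}
	actual := StoreHashes{
		"storeA": mustHex("aa01"),
		"storeB": mustHex("ff99"),
	}
	for _, name := range DivergentStores(expected, actual) {
		fmt.Printf("store %q: expected %X, actual %X\n", name, expected[name], actual[name])
	}
}
```

A single divergent store points at one module's writes, while several divergent stores at the same height would suggest a problem below the module level.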
If there is a data race/inconsistency in IAVL, then yes, |
Here are 3 more instances of this issue. The node has the following characteristics:
I feel like the heavy querying could have something to do with the IAVL weirdness |
I addressed some of the problems related to concurrency between commit and queries in In addition to e2e tests with ibc in Osmosis, I'll be adding more unit tests at all layers related to committing and querying at the same time. Hopefully, that helps to expose more data races, if any |
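To make the commit-versus-query concern concrete, here is a toy sketch of the kind of unit test meant above. It is not the IAVL API and not the actual fix: `VersionedStore` and its methods are hypothetical, and the locking scheme is just one assumed way to keep queries on committed snapshots while a commit is in progress.

```go
package main

import (
	"fmt"
	"sync"
)

// VersionedStore is a hypothetical toy store used only for illustration.
type VersionedStore struct {
	mu       sync.RWMutex
	pending  map[string]string   // uncommitted writes
	versions []map[string]string // versions[i] is the snapshot for version i+1
}

func NewVersionedStore() *VersionedStore {
	return &VersionedStore{pending: map[string]string{}}
}

// Set stages a write; it is not visible to queries until Commit.
func (s *VersionedStore) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending[key] = value
}

// Commit copies the previous snapshot, applies pending writes, and
// returns the new version number. The snapshot is never mutated again.
func (s *VersionedStore) Commit() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	snap := make(map[string]string)
	if n := len(s.versions); n > 0 {
		for k, v := range s.versions[n-1] {
			snap[k] = v
		}
	}
	for k, v := range s.pending {
		snap[k] = v
	}
	s.pending = map[string]string{}
	s.versions = append(s.versions, snap)
	return len(s.versions)
}

// Query reads from a committed version only, never from pending writes.
func (s *VersionedStore) Query(version int, key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if version < 1 || version > len(s.versions) {
		return "", false
	}
	v, ok := s.versions[version-1][key]
	return v, ok
}

func main() {
	s := NewVersionedStore()
	s.Set("balance", "100")
	v1 := s.Commit()

	// Query version 1 from several goroutines while a new commit happens.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				if got, _ := s.Query(v1, "balance"); got != "100" {
					panic("query observed uncommitted or changed state")
				}
			}
		}()
	}
	s.Set("balance", "90")
	v2 := s.Commit()
	wg.Wait()

	got, _ := s.Query(v2, "balance")
	fmt.Println(v1, v2, got)
}
```

Running a test like this with `go run -race` (or `go test -race`) is what would surface an unsynchronized access between the commit path and the query path, if one exists.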
Thanks for all this @mircea-c . I'll inspect the logs I'm wondering if there could be anything else special about your setup? I'm asking this because since |
@p0mvn I'm hoping we don't run some special snowflake of a node 🙂 The only less common aspect of our setup is that these nodes run in docker containers. We have 13 other networks that run in this setup though, and none of them have had an issue like this one so far. The only configuration options we change from default are pruning which is set to You can have a look at our complete node config here: https://github.com/cephalopodequipment/config/tree/main/cosmos-sdk/0.44.x |
@mircea-c are you currently able to state sync? |
The issue was reported by mintscan running Use patterns:
@mircea-c sorry to hear that. We are still working on our e2e tests to expose this problem. Unfortunately, we don't have a better way to approach this at the moment since we can't deterministically reproduce it. However, we just pre-released |
@p0mvn Is this issue still open? |
It is much less apparent but, as far as I know, it is still present in v8. We would like to keep this open until we have the ability to deterministically reproduce this |
It exists in v12 too.
I don't think we've been running into this issue for the same (IAVL) reasons anymore. Therefore, I'm going to close this. Please let me know if there's anything I'm unaware of |
Context
Report 1
100/0/10
Logs
Relevant Slack Thread: