OOM during prune causing state hash conflict on restart #1042
Comments
Manual tests to mimic OOM during prune
|
This leads us to conclude that even if an OOM, from which we can't recover, occurs during prune, the node should still be able to continue after a restart with the latest v7.0.3 change. |
Closing this issue for now |
hrmm, do we know we're not starting the defer logic in these simulated panic cases? |
I will test with |
Sweet, ty! |
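On the question above about deferred logic in simulated panic cases: in Go, deferred calls run while a panic unwinds the stack, whereas a real out-of-memory condition (a fatal runtime error or a kill by the OS) or a call to `os.Exit` terminates the process without running them. A minimal sketch of the panic half of that difference follows. `pruneWithFlush` and its flush step are hypothetical stand-ins, not the SDK's actual commit path.

```go
package main

import "fmt"

// pruneWithFlush is a hypothetical stand-in for a commit path that prunes and
// flushes metadata in a deferred call. It is an assumption for illustration,
// not code from the SDK.
func pruneWithFlush(prune func() error) (flushed bool, err error) {
	defer func() {
		// Deferred calls run during panic unwinding, so this "flush" still
		// happens when prune panics.
		flushed = true
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return false, prune()
}

func main() {
	// A panic inside prune still triggers the deferred flush.
	flushed, err := pruneWithFlush(func() error { panic("simulated OOM") })
	fmt.Println(flushed, err)

	// A clean prune also goes through the deferred flush.
	flushed, err = pruneWithFlush(func() error { return nil })
	fmt.Println(flushed, err)

	// By contrast, os.Exit(1) or a real OOM kill would end the process here
	// without running any deferred function, so no flush would occur.
}
```

So a panic-based simulation exercises the deferred path, while a hard process exit does not; the two are worth testing separately.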
Tested several times with extra logs; every time, there was no issue on restart. This is how it logs normally:
With
With
I think that Tendermint has a mechanism to work around the possibility of the SDK not flushing the commit version. This mechanism is in replay.go. We previously thought that not flushing the commit was the main issue. However, it wasn't, because Tendermint detects when the app version does not match its store version and catches the app up if needed, overwriting the already written but unflushed commit. The main issue, IMO, was iterating over fast storage on replay. That was not fixed at the time we made that assumption, but it is now. To verify that my guess is valid I logged all of
In conclusion, Tendermint has a mechanism to replay the commit even if we fail to flush. When we manage to flush the committed metadata after a prune failure, it replays the latest commit on the mock app before committing the state. When we don't flush the committed metadata after a prune failure in commit, Tendermint replays the latest commit on the actual app. That's why it works in both cases. I don't think the fact that we replay an already app |
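The two restart cases described above can be sketched as a small decision function. This is a simplified model loosely based on the behavior described in this thread, not a copy of replay.go. The type, constants, and `decideReplay` are hypothetical names, and the three heights (block store, Tendermint state, app) are assumed inputs.

```go
package main

import "fmt"

// replayAction names what a simplified handshake would do on restart.
type replayAction string

const (
	inSync          replayAction = "in sync, nothing to replay"
	replayOnMockApp replayAction = "replay last block on mock app"
	replayOnRealApp replayAction = "replay last block on real app"
	catchUpApp      replayAction = "replay missing blocks on real app"
	unexpected      replayAction = "unexpected heights"
)

// decideReplay is a hypothetical, simplified model of the handshake decision.
// storeHeight: last block saved in the block store.
// stateHeight: height of the consensus state that was persisted.
// appHeight:   last version the application reports as committed.
func decideReplay(storeHeight, stateHeight, appHeight int64) replayAction {
	switch {
	case appHeight > storeHeight:
		return unexpected
	case storeHeight == stateHeight:
		if appHeight == storeHeight {
			return inSync
		}
		return catchUpApp
	case storeHeight == stateHeight+1:
		// The block was saved, but the node stopped before saving state.
		switch appHeight {
		case storeHeight:
			// App flushed its commit: only the consensus state is behind,
			// so the block is replayed against a mock app.
			return replayOnMockApp
		case stateHeight:
			// App did not flush its commit: replay on the real app.
			return replayOnRealApp
		default:
			return catchUpApp
		}
	default:
		return unexpected
	}
}

func main() {
	// Crash during prune at height 100, commit metadata flushed.
	fmt.Println(decideReplay(100, 99, 100)) // replay last block on mock app
	// Crash during prune at height 100, commit metadata not flushed.
	fmt.Println(decideReplay(100, 99, 99)) // replay last block on real app
}
```

In both crash cases the model ends with the app and the consensus state at the same height, which matches the observation that the node restarts cleanly either way.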
Amazing, I agree, it should be idempotent! Thanks for the detailed check here, glad that Tendermint handles this so well! So it does indeed just look like it was that iterator bug that caused this problem. Thank you for the detailed check, and improving our understanding of how this works! 🙏 |