core/state, trie: fix trie flush order for proper pruning #25581
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The trie commit rework PR made an unfortunate assumption (https://github.com/ethereum/go-ethereum/pull/25320/files#diff-8348a172e9fd3d3eb93c445d4ca58b8753b0f6d626c7af7db3b30b820d0788daR772) that the flush order doens't matter across storage tries and the account trie.
Unfortunately, this is false, because even though we do reference counting internally, the trie dirty cache drips trie nodes to disk when it's full, and the drip order is the insertion order. If the insertion order doesn't adhere to a strict child -> parent relationship, then we can end up with dangling storage trie nodes in the dirty cache, and missing subtries on disk.
As long as Geth keeps running, it is fine because the dirty cache won't just flat out drop the data, and will eventually leak it to disk. But on shutdown it will happen that the unreferenced dangling storage nodes will be dropped. Upon restart, there might be nodes missing from disk.
E.g.
This PR fixes the insertion order so all storage tries are flushed first and only then the account trie. It also adds a test which creates a bunch of contracts with random slots, flushes them partially to disk and terminates. If the account true is not added to the dirty cache last, the partial flush will result in data loss.