Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OOM during prune causing state hash conflict on restart #1042

Closed
Tracked by #1016
p0mvn opened this issue Mar 4, 2022 · 8 comments
Closed
Tracked by #1016

OOM during prune causing state hash conflict on restart #1042

p0mvn opened this issue Mar 4, 2022 · 8 comments
Labels
needs-investigation T:task ⚙️ A task belongs to a story

Comments

@p0mvn
Copy link
Member

p0mvn commented Mar 4, 2022

No description provided.

@p0mvn
Copy link
Member Author

p0mvn commented Mar 7, 2022

Manual tests to mimic OOM during prune

  1. Forced a panic during prune before database commit in DeleteVersionsRange iavl:
ESC[90m8:18PMESC[0m ESC[32mINFESC[0m Starting MConnection service ESC[36mimpl=ESC[0mMConn{137.184.190.127:26656} ESC[36mmodule=ESC[0mp2p 
ESC[36mpeer=ESC[0m{"id":"4ab030b7fd75ed895c48bcc899b99c17a396736b","ip":"137.184.190.127","port":26656}
ESC[90m8:18PMESC[0m ESC[32mINFESC[0m Added peer ESC[36mmodule=ESC[0mp2p ESC[36mpeer=ESC[0m{"Data":{},"Logger":{}}
panic: force panic in DeleteVersionRange before commit

goroutine 105 [running]:
github.com/cosmos/iavl.(*MutableTree).DeleteVersionsRange(0x1b8c5a0, 0xc000f22880, 0xc000f228b0)
        github.com/cosmos/iavl@v0.17.3/mutable_tree.go:895 +0x59
github.com/cosmos/iavl.(*MutableTree).DeleteVersions(0xc00017a480, {0xc0016b6700, 0xc000b6d4b0, 0xe0})
        github.com/cosmos/iavl@v0.17.3/mutable_tree.go:880 +0x2ca
github.com/cosmos/cosmos-sdk/store/iavl.(*Store).DeleteVersions(...)
        github.com/cosmos/cosmos-sdk@v0.45.1/store/iavl/store.go:210
github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).pruneStores(0xc00017a480)
        github.com/cosmos/cosmos-sdk@v0.45.1/store/rootmulti/store.go:437 +0x10d
github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Commit(0xc00017a480)
        github.com/cosmos/cosmos-sdk@v0.45.1/store/rootmulti/store.go:415 +0x20e
github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit(0xc000d09040)
        github.com/cosmos/cosmos-sdk@v0.45.1/baseapp/abci.go:308 +0x231
github.com/tendermint/tendermint/abci/client.(*localClient).CommitSync(0xc000fac6c0)
        github.com/tendermint/tendermint@v0.34.15/abci/client/local_client.go:264 +0xb6
github.com/tendermint/tendermint/proxy.(*appConnConsensus).CommitSync(0x0)
        github.com/tendermint/tendermint@v0.34.15/proxy/app_conn.go:93 +0x22
github.com/tendermint/tendermint/state.(*BlockExecutor).Commit(_, {{{0xb, 0x0}, {0xc000c05a80, 0x7}}, {0xc000c05a90, 0xb}, 0x310f7e, 0x3393e8, {{0xc002b09720, ...}, ...}, ...}, ...)
        github.com/tendermint/tendermint@v0.34.15/state/execution.go:228 +0x269
github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc000c05a80, 0x7}}, {0xc000c05a90, 0xb}, 0x310f7e, 0x3393e8, {{0xc002b09720, ...}, ...}, ...}, ...)
        github.com/tendermint/tendermint@v0.34.15/state/execution.go:180 +0x6ee
github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc000d01dc0, 0x0)
        github.com/tendermint/tendermint@v0.34.15/blockchain/v0/reactor.go:398 +0xb7b
created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).OnStart
        github.com/tendermint/tendermint@v0.34.15/blockchain/v0/reactor.go:110 +0x85
  • Outcome after fixing: restarted and continued without issues
  1. Forced a panic during prune after database commit in DeleteVersionsRange iavl:
36A818E1B7A8CF3AEEAFEDFD7AC2AED7
ESC[90m9:56PMESC[0m ESC[32mINFESC[0m executed block ESC[36mheight=ESC[0m3381400 ESC[36mmodule=ESC[0mstate ESC[36mnum_invalid_txs=ESC[0m0 
ESC[36mnum_valid_txs=ESC[0m0
ESC[90m9:56PMESC[0m ESC[1mESC[31mERRESC[0mESC[0m CONSENSUS FAILURE!!! ESC[36merr=ESC[0m"panic after Commit() in DeleteVersionsRange" ESC[36
mmodule=ESC[0mconsensus ESC[36mstack=ESC[0m"goroutine 415 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x65\ngit.luolix.top/
tendermint/tendermint/consensus.(*State).receiveRoutine.func2()\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/consensus/state.go:726 +0x4c\n
panic({0x1b0d920, 0x2749fb0})\n\truntime/panic.go:1038 +0x215\ngit.luolix.top/cosmos/iavl.(*MutableTree).DeleteVersionsRange(0xc000d7a780, 0xc0
06abdf18, 0xc006abdf48)\n\tgit.luolix.top/cosmos/iavl@v0.17.3/mutable_tree.go:899 +0x91\ngit.luolix.top/cosmos/iavl.(*MutableTree).DeleteVersions(0
xc000c8c480, {0xc00197cb00, 0xc000c91a60, 0xb0})\n\tgit.luolix.top/cosmos/iavl@v0.17.3/mutable_tree.go:880 +0x2ca\ngit.luolix.top/cosmos/cosmos-sdk
/store/iavl.(*Store).DeleteVersions(...)\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.1/store/iavl/store.go:210\ngit.luolix.top/cosmos/cosmos-sdk/stor
e/rootmulti.(*Store).pruneStores(0xc000c8c480)\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.1/store/rootmulti/store.go:437 +0x10d\ngit.luolix.top/cosm
os/cosmos-sdk/store/rootmulti.(*Store).Commit(0xc000c8c480)\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.1/store/rootmulti/store.go:415 +0x20e\ngi
thub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit(0xc0001e31e0)\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.1/baseapp/abci.go:308 +0x231\ngit.luolix.top/tendermint/tendermint/abci/client.(*localClient).CommitSync(0xc000fd7680)\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/abci/client/local_client.go:264 +0xb6\ngit.luolix.top/tendermint/tendermint/proxy.(*appConnConsensus).CommitSync(0x0)\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/proxy/app_conn.go:93 +0x22\ngit.luolix.top/tendermint/tendermint/state.(*BlockExecutor).Commit(_, {{{0xb, 0x0}, {0xc000b7ce80, 0x7}}, {0xc000b7cea0, 0xb}, 0x310f7e, 0x339898, {{0xc001bb5ce0, ...}, ...}, ...}, ...)\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/state/execution.go:228 +0x269\ngit.luolix.top/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc000b7ce80, 0x7}}, {0xc000b7cea0, 0xb}, 0x310f7e, 0x339898, {{0xc001bb5ce0, ...}, ...}, ...}, ...)\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/state/execution.go:180 +0x6ee\ngit.luolix.top/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc000bc3180, 0x339898)\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/consensus/state.go:1635 +0x9fd\ngit.luolix.top/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc000bc3180, 0x339898)\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/consensus/state.go:1546 +0x305\ngit.luolix.top/tendermint/tendermint/consensus.(*State).addProposalBlockPart(0xc000bc3180, 0x39c, {0xc000bb43c0, 0xc000e91c00})\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/consensus/state.go:1919 +0x92e\ngit.luolix.top/tendermint/tendermint/consensus.(*State).handleMsg(0xc000bc3180, {{0x27741a0, 0xc002b9fe30}, {0xc000bb43c0, 0x0}})\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/consensus/state.go:820 +0x17c\ngit.luolix.top/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc000bc3180, 0x0)\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/consensus/state.go:762 +0x419\ncreated by github.com/tendermint/tendermint/consensus.(*State).OnStart\n\tgit.luolix.top/tendermint/tendermint@v0.34.15/consensus/state.go:378 +0x12f\n"
ESC[90m9:56PMESC[0m ESC[32mINFESC[0m Stopping baseWAL service ESC[36mimpl=ESC[0m{"Logger":{}} ESC[36mmodule=ESC[0mconsensus ESC[36mwal=ESC[0m/root/.osmosisd/data/cs.wal/wal
ESC[90m9:56PMESC[0m ESC[32mINFESC[0m Stopping Group service ESC[36mimpl=ESC[0m{"Dir":"/root/.osmosisd/data/cs.wal","Head":{"ID":"ghPPGW2OM86C:/root/.osmosisd/data/cs.wal/wal","Path":"/root/.osmosisd/data/cs.wal/wal"},"ID":"group:ghPPGW2OM86C:/root/.osmosisd/data/cs.wal/wal","Logger":{}} ESC[36mmodule=ESC[0mconsensus ESC[36mwal=ESC[0m/root/.osmosisd/data/cs.wal/wal
  • Outcome after fixing: restarted and continued without issues

@p0mvn
Copy link
Member Author

p0mvn commented Mar 7, 2022

This leads us to conclude that even if OOM, from which we can't recover, occurs during prune, the node should still be able to continue on restart on the latest v7.0.3 change.

@p0mvn
Copy link
Member Author

p0mvn commented Mar 7, 2022

Closing this issue for now

@p0mvn p0mvn closed this as completed Mar 7, 2022
@ValarDragon
Copy link
Member

hrmm, do we know we're not starting the defer logic in these simulated panic cases?

@p0mvn
Copy link
Member Author

p0mvn commented Mar 8, 2022

I will test with os.Exit instead of panic just to make sure. os.Exit doesn't allow the option to recover and exits immediately, similar to how OOM would work

@ValarDragon
Copy link
Member

Sweet, ty!

@ValarDragon ValarDragon reopened this Mar 8, 2022
@p0mvn
Copy link
Member Author

p0mvn commented Mar 8, 2022

Tested several times with extra logs, every time there is no issue on restart:

This is how it logs normally:

3:31AM INF commit start, height - 3385399
3:31AM INF commit end, height - 3385399
3:31AM INF flushing metadata
3:31AM INF flushing metadata finished
...

With panic() in DeleteVersionsRange(...) (after committing to database during prune):

4:46AM INF commit start, height - 3386300
4:46AM INF commit end, height - 3386300
4:46AM INF prune start, height - 3386300
4:46AM INF flushing metadata
4:46AM INF flushing metadata finished
4:46AM ERR CONSENSUS FAILURE!!! err="panic in DeleteVersionsRange" module=consensus stack="goroutine 264 [running]:\nruntime/debug.Stack()\
  • Note that it flushes. On restart it currently works fine

With os.Exit(1) in DeleteVersionsRange(...) (after committing to database during prune):

3:31AM INF executed block height=3385400 module=state num_invalid_txs=0 num_valid_txs=0
3:31AM INF commit start, height - 3385400
3:31AM INF commit end, height - 3385400
3:31AM INF prune start, height - 3385400
3:31AM INF os.Exit(1)
  • Note that it does not flush. When restarted, it still works as expected.

I think that tendermint has a mechanism to work around the possibility of not flushing the commit version in the SDK. This mechanism is in replay.go.

We previously thought that not flushing the commit was the main issue. However, it wasn't because tendermint detects if app version does not match its store version and catches up the app if needed overwriting the already written but unflushed commit.

The main issue IMO was iterating over fast storage on replay. It was not fixed at the time we made that assumption but it is now.

To verify that my guess is valid I logged all of storeBlockHeight, stateBlockHeight and appBlockHeight in replay.go under 3 circumstances:

  1. On normal restart: storeBlockHeight: 3385963, stateBlockHeight: 3385963, appBlockHeight: 3385963
  • note that they are all equal, as a result, there is no need to catch them up to match them
  • The replay logic takes this branch and never actually replays because all heights are matching
  1. On restart with panic() and flush: storeBlockHeight: 3386300, stateBlockHeight: 3386299, appBlockHeight: 3386300
  • note that stateBlockHeight is 1 less than the other 2
  • Here, the replay logic takes this branch. It detects that the app has already written the commit so it uses a mock app to replay the commit and save state.
  1. On restart with os.Exit and no flush: storeBlockHeight: 3386100, stateBlockHeight: 3386099, appBlockHeight: 3386099
  • note that storeBlockHeight is 1 greater than the other 2
  • In this case, the replay logic takes this branch and replays the last block on the app.

In conclusion, tendermint has a mechanism to replay the commit even if we fail to flush. When we manage to flush the committed metadata after prune failure, it replays the latest commit on the mock app before committing the state. When we don't flush the committed metadata after prune failure in commit, tendermint replays the latest commit on the actual app. That's why it works in both cases.

I don't think the fact that we replay an already app Commit()ed commit is an issue. It should be idempotent and overwrite the already saved transactions in iavl with the same values during replay. Please let me know what you think @ValarDragon

@ValarDragon
Copy link
Member

Amazing, I agree, it should be idempotent! Thanks for the detailed check here, glad that Tendermint handles this so well! So it does indeed just look like it was that iterator bug that caused this problem.

Thank you for the detailed check, and improving our understanding of how this works! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-investigation T:task ⚙️ A task belongs to a story
Projects
Archived in project
Development

No branches or pull requests

2 participants