
uneven store heights prevent pruning and state sync #13933

Closed
faddat opened this issue Nov 20, 2022 · 44 comments
Labels
T:Bug T: Dev UX UX for SDK developers (i.e. how to call our code) T: UX

Comments

@faddat (Contributor) commented Nov 20, 2022

Hi, this issue tracks a set of related issues and may become an epic.

Pruning and/or state sync is broken on:

  • dig
  • konstellation
    • code too old for valid comparisons
  • akash
  • stride

Guess

This is likely because of issues in the `cmd` folder, and possibly validator misconfiguration.

I think that this can be fixed by assisting crypto.org with migrating, then compacting their archival node state, so that they can use fast node and pebble, which make archives much more performant.

Some of these issues were introduced here:

https://github.com/cosmos/iavl/commits/master#:~:text=feat%3A%20Add%20skipFastStorageUpgrade%20to%20MutableTree%20to%20control%20fast%20s%E2%80%A6

@faddat (Contributor, Author) commented Nov 21, 2022

Related

notional-labs/dig#423

@faddat (Contributor, Author) commented Nov 21, 2022

Related

notional-labs/dig#426

@chillyvee (Contributor)

Comdex v5.0.0 also has broken pruning
https://github.com/comdex-official/comdex/releases/tag/v5.0.0

@chillyvee (Contributor) commented Nov 21, 2022

Konstellation snapshot/pruning broke because the feegrant and authz stores (in rootmulti) are at the wrong height (tree.ndb is nil).
Team notified, no response.

@chillyvee (Contributor) commented Nov 21, 2022

Comdex snapshot/pruning broke because the lendV1 store (in rootmulti) is at the wrong height (tree.ndb is nil).
Team notified, no response.

@faddat (Contributor, Author) commented Nov 22, 2022

Stride has this issue as well, in v3.0.1.

@chillyvee (Contributor) commented Nov 22, 2022

Stride just hit the same problem: their authz store being uninitialized caused issues for validators.
Vishal is aware, says it makes sense, and is trying to fix it.

@faddat (Contributor, Author) commented Nov 22, 2022

Probably because of the uneven IAVL state height issue that @chillyvee is mentioning here, state sync broke on Stride:

git clone https://github.com/stride-labs/stride
git checkout v3.0.0
bash scripts/statesync.bash

I don't remember having issues with uneven state heights in the past when adding modules.

@Reecepbcups (Member)

@faddat Juno was having a block header apphash issue as well, back on Nov 7th.

Golden, Windpower, TedCrypto, Coldchain: fixed when Go was bumped from 1.18.3 to 1.19.2.

@faddat (Contributor, Author) commented Nov 22, 2022

@Reecepbcups really good.

Let's break that out separately, though. I'll maybe do an issue/PR for it with some explanation... I think I know why that occurs.

The mixed go runtime issue is now here:

#13976

@chillyvee 's PR related to this issue is here:

#13935

@chillyvee (Contributor)

Another validator also reports that statesync for Konstellation still doesn't work (maybe a store with information at the wrong height?).

@faddat (Contributor, Author) commented Nov 22, 2022

@chillyvee -- my feeling is that you've correctly identified the problem.

Not certain just yet.

Here's my stride working branch:

https://github.com/Stride-Labs/stride/tree/no-state-breaks

That said, I figure that the fix is somewhere between here, iavl, and ibc-go, or such.

@faddat changed the title from "Multi Chain Issue tracker for validator reports" to "uneven store heights prevent pruning and state sync" on Nov 23, 2022
@Highlander-maker
This is a great thing to be doing. Happy to run some tests and provide feedback when we can. Looping in @qf3l3k to the chat.

@chillyvee (Contributor) commented Nov 24, 2022

@faddat

Likely root cause

It's likely along the lines of adding a KVStoreKey without adding the keeper in the migration:

https://github.com/Stride-Labs/stride/blob/v3.0.0/app/app.go#L328

	keys := sdk.NewKVStoreKeys(
		authzkeeper.StoreKey,
		// ...
	)

That will probably continue on to confuse all the other stores/exports/snapshots/etc

To operate correctly, a chain MUST do all of the following at the same time:

  • add the KVStoreKey
  • define the keeper
  • app.SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeInfo.Height, storeUpgrades))

To resolve current breakages:
Upgrade and formally add keeper + add stores in upgrade handler

To completely ruin your node:
Removing keys from keys := sdk.NewKVStoreKeys will cause an instant AppHash mismatch. Don't do it.
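The three requirements above work because the upgrade store loader only mounts the new stores when the node loads exactly the upgrade height, so every validator adds them at the same version. Here is a minimal, self-contained sketch of that gating behavior; `StoreUpgrades` and `upgradeStoreLoader` are illustrative stand-ins mimicking cosmos-sdk's `storetypes.StoreUpgrades` and `upgradetypes.UpgradeStoreLoader`, not the real API:

```go
package main

import "fmt"

// StoreUpgrades mirrors the shape of cosmos-sdk's storetypes.StoreUpgrades
// (illustrative stand-in, not the real type).
type StoreUpgrades struct {
	Added   []string
	Deleted []string
}

// upgradeStoreLoader returns a loader that applies the store changes only
// when loading exactly the upgrade height, so every node mounts the new
// store at the same version.
func upgradeStoreLoader(upgradeHeight int64, upgrades StoreUpgrades) func(int64) []string {
	return func(loadHeight int64) []string {
		if loadHeight == upgradeHeight {
			return upgrades.Added // stores to mount fresh at this height
		}
		return nil // any other height: no store changes
	}
}

func main() {
	loader := upgradeStoreLoader(100, StoreUpgrades{Added: []string{"authz"}})
	fmt.Println(loader(99), loader(100)) // prints: [] [authz]
}
```

In the real SDK the loader operates on the multistore rather than returning store names, but the height gate is the part that keeps store heights even across the network.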

Current breakages

For chains like Konstellation/Stride where the authz/feegrant/etc. keepers are not initialized, statesync restore apphashes will not match. They should probably upgrade and officially issue a store upgrade:

https://github.com/Stride-Labs/stride/pull/384/files#diff-aeec7ddb4cf8b7e9e6e57625770d369cce5f377cc350b42c3aca0427bba8dfbcR52-R55

Surprisingly not broken

Dig statesyncs with patches probably because the uninitialized icahost store at least has a keeper defined.
notional-labs/dig#426

Decline to fix

Probably possible to emulate a broken KV store during snapshot, but that seems like a waste of time if a proper upgrade can be issued instead. (but we tried to fix it anyway, see next message)

A better fix

Check that each key actually has a keeper before entering consensus. This will allow nodes to start, upgrade/migrate, then check that the kv-store set matches before starting up. This is an exercise left to the reader :)

The failure mode in this case is that a chain node could halt, giving time for teams to release a new version with a migration to formally add the missing store.

  1. Start node. Add NewKVStoreKeys() without a keeper
  2. Execute new sanity check. Gracefully exit on mismatch of NewKVStoreKeys vs keepers
  3. Chain team can issue patch
  4. Start node again, execute additional migration to add store
  5. Enter consensus
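The sanity check in step 2 above can be sketched in a few lines. This is a hypothetical helper (`missingKeepers` and the example key names are made up for illustration; nothing here is SDK API):

```go
package main

import "fmt"

// missingKeepers is a hypothetical sanity check: every registered KV store
// key must have a corresponding keeper before the node enters consensus.
func missingKeepers(storeKeys []string, keepers map[string]bool) []string {
	var missing []string
	for _, key := range storeKeys {
		if !keepers[key] {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	storeKeys := []string{"bank", "staking", "authz"}
	keepers := map[string]bool{"bank": true, "staking": true} // authz keeper never wired up
	if m := missingKeepers(storeKeys, keepers); len(m) > 0 {
		// Step 2 of the flow above: exit gracefully instead of entering consensus.
		fmt.Printf("refusing to start: store keys without keepers: %v\n", m)
	}
}
```

A real implementation would compare the app's mounted KVStoreKeys against the keepers actually constructed in app.go, but the pass/fail decision is the same shape.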

@faddat (Contributor, Author) commented Nov 25, 2022

I like your proposal, CV.

@chillyvee (Contributor) commented Nov 25, 2022

This is a potential fix for snapshotting until chains execute the final keeper + store migrations

Root cause

Snapshots are missing IAVL nodes when heights do not match.

Fix

Snapshots can now be created even with mismatched rootmulti/store heights

Snapshot Patch

After checkout of chain code

 go mod edit -replace github.com/cosmos/iavl=github.com/chillyvee/iavl@v0.19.4-blunt.3
 go mod edit -replace github.com/cosmos/cosmos-sdk=github.com/chillyvee/cosmos-sdk@cv0.45.9snap.2
 go mod tidy

Tested it against Konstellation v0.6.0 and it works.

Isolate from peers without snapshot patched

You will need to isolate your peers to avoid getting broken snapshots from others.

This means setting the following on the restoring node (not the source node)

[p2p]
persistent_peers = "only your patched snapshot source node"

pex = false

Tested

Konstellation v0.6.0 snapshot and restore now operates properly

Need assistance

Since stride 3.0.0/comdex 5.0.0 are also on cosmos-sdk v0.45.9, I think the patch will work there as well.

For now the code seems functional, but isn't clean enough to upstream.

Would appreciate some help testing before I submit the PRs.

@chillyvee (Contributor)

For Konstellation

 go mod edit -replace github.com/cosmos/iavl=github.com/chillyvee/iavl@v0.19.4-blunt.3
 go mod edit -replace github.com/cosmos/cosmos-sdk=github.com/chillyvee/cosmos-sdk@cv0.45.9snap.2
 go mod tidy

For Comdex

git clone https://github.com/chillyvee/comdex
cd comdex
git checkout cv5.0.0.fixsnap
go mod edit -replace github.com/cosmos/iavl=github.com/chillyvee/iavl@v0.19.4-blunt.3
go mod edit -replace github.com/cosmos/cosmos-sdk=github.com/chillyvee/cosmos-sdk@cv0.45.9snap.2
go mod tidy

For Stride

go mod edit -replace github.com/cosmos/iavl=github.com/chillyvee/iavl@v0.19.4-blunt.3
go mod edit -replace github.com/cosmos/cosmos-sdk=github.com/chillyvee/cosmos-sdk@cv0.45.9snap.2.stride
go mod tidy

@tac0turtle (Member)

Super interesting breakdown!! Thank you for this. It seems there is a massive UX issue that leads to breakage because these requirements are not well documented.

I like your approach and believe this is the right move; we can add some of the footgun preventers in previous releases as well. Let me know how I can help you.

@tac0turtle added the T:Bug, T: UX, and T: Dev UX labels on Nov 27, 2022
@chillyvee (Contributor)

> Super interesting breakdown!! Thank you for this. It seems there is a massive UX issue that leads to breakage because these requirements are not well documented.
>
> I like your approach and believe this is the right move; we can add some of the footgun preventers in previous releases as well. Let me know how I can help you.

Need another day or two to clean up some code and send in another few helpful PRs. Once you see those, we might find a pattern we can enhance :) Thank you tac0turtle!

@chillyvee (Contributor)

Stride v4 patch

git clone https://github.com/Stride-Labs/stride
cd stride
git checkout v4.0.2
go mod edit -replace github.com/cosmos/iavl=github.com/chillyvee/iavl@v0.19.4-blunt.3
go mod edit -replace github.com/cosmos/cosmos-sdk=github.com/chillyvee/cosmos-sdk@cv0.45.11snap.5       
go mod tidy
make install

@chillyvee (Contributor)

Lum Network broke on v1.3.0 due to a 'feeibc' store height mismatch.

@dsmello commented Jan 16, 2023

Hello everyone, I'm having a similar issue on the starname chain (Cosmos SDK v0.45.9), but in my case I got an empty app hash.
Do you think it's related? I'll add more detail on how it starts in my case. I can also try to implement the "sanity check" on the app object if anyone could describe what's needed.

The state sync error starts at starname v0.10.14, which included:

  • Cosmos SDK v0.42.5 → v0.42.9
  • wasmvm → v0.16.0
  • Tendermint v0.34.10 → v0.34.11 → v0.34.12

And now we're using Cosmos SDK v0.45.9.

@alexanderbez (Contributor)

@chillyvee for whatever reason I cannot find the comment and/or PR, but I swear I commented somewhere that the root fix to many of these issues w.r.t. pruning and upgrading with new stores is that when a store is loaded for the first time, i.e. it does not currently exist, we set its initial version to the current version and not 0. Does this ring a bell?
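A minimal sketch of that proposal, as a toy model of the decision only (the function and parameters here are illustrative; the real mechanism would presumably go through something like iavl's `MutableTree.SetInitialVersion` at first load):

```go
package main

import "fmt"

// initialVersion models the proposed rule: if the store does not exist yet
// (first load), seed it with the multistore's current version instead of 0,
// so the new store's height stays aligned with the rest of the app.
func initialVersion(exists bool, storeVersion, multistoreVersion int64) int64 {
	if !exists {
		return multistoreVersion // first load: align with the multistore
	}
	return storeVersion // existing store: keep its own version
}

func main() {
	// A store first mounted while the app is at height 12345 starts at
	// 12345 rather than 0, avoiding the uneven-height breakage.
	fmt.Println(initialVersion(false, 0, 12345)) // prints: 12345
}
```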

@chillyvee (Contributor)

> I got an empty app hash.

@dsmello - Haven't seen an empty app hash before. Can you share some logs? Was there a recent upgrade and if so what was the tag of the version before, and the version of the upgrade? Is this the repo? https://github.com/iov-one/starnamed

@chillyvee (Contributor)

> the root fix
@alexanderbez

Prevention:

  1. Don't make the mistake in the first place: add all new stores via a migration so that the new store height matches.
  • Future: prevent adding stores outside of a migration / enforce height matching

Fixing broken appstate:

  1. It is not possible to adjust heights, because iavl enforces increment-by-1 and prevents fixing.

  2. If the broken store is empty, you can delete the broken store and make the new one the right way.
  • If you are using a cosmos-sdk module, you will need to split this into 2 upgrades: 1 for the store delete, and 1 for the store add.
  • If you are using an external module, then you can delete the old store and migrate in the new store.

  3. If the broken store is not empty, then there must also be a data migration.
  • In the case of a cosmos-sdk module: not sure where to store the data between the first and second fix migration.
  • In the case of a custom module: maybe it's possible to copy data between the old and new store?
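The two-upgrade fix for an empty broken store (point 2 above) can be modeled in a self-contained way. `StoreUpgrades` and `applyUpgrade` here are illustrative stand-ins that mimic what consecutive upgrade handlers do with cosmos-sdk-style store upgrades; they are not SDK code:

```go
package main

import "fmt"

// StoreUpgrades is an illustrative stand-in for cosmos-sdk's
// storetypes.StoreUpgrades.
type StoreUpgrades struct {
	Added   []string
	Deleted []string
}

// applyUpgrade mutates a map of store name -> height at which the store was
// mounted, the way two consecutive upgrade handlers would.
func applyUpgrade(stores map[string]int64, height int64, up StoreUpgrades) {
	for _, name := range up.Deleted {
		delete(stores, name)
	}
	for _, name := range up.Added {
		stores[name] = height // a freshly added store starts at the upgrade height
	}
}

func main() {
	// "authz" was mounted outside a migration, so its height (0) is wrong.
	stores := map[string]int64{"bank": 50, "authz": 0}

	// Upgrade 1: delete the broken (empty) store.
	applyUpgrade(stores, 100, StoreUpgrades{Deleted: []string{"authz"}})
	// Upgrade 2: formally re-add it so its height matches the chain.
	applyUpgrade(stores, 200, StoreUpgrades{Added: []string{"authz"}})

	fmt.Println(stores["authz"]) // prints: 200
}
```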

@dsmello commented Jan 17, 2023

Hi @chillyvee,
Yes, that's our repo.

The current version of starnamed is v0.11.6 (wasmd → ; Cosmos SDK → v0.45.9; Tendermint → v0.34.21).
We also had a state sync issue at version v0.10.14, but at that version the issue was a different appHash instead of an empty appHash.

We migrated from v0.10.13 → v0.11.6 (Cosmos SDK v0.42.5 → v0.45.9), adding these modules (authzKeeper, feegrant, escrowtypes, and the burner module); the only one that doesn't have a store is the burner module.

This is the error :

ERR appHash verification failed actual= expected="�=[u}A��|�N�9\rd7\x7fl�\x1c�i��,��\x00]Kkj" module=statesync

I tried to do some debugging:

  • The error is raised when vendor/github.com/tendermint/tendermint/statesync/syncer.go (verifyApp:478) runs
    resp, err := s.connQuery.InfoSync(proxy.RequestInfo) and resp.LastBlockAppHash is empty.
RESPONSE: %+vdata:"starnamed" version:"v0.11" last_block_height:12634200  module=statesync
SNAPSHOT: %+v&{12645200 1 1 [185 116 185 254 4 253 237 156 245 219 141 194 180 138 121 209 0 236 44 176 80 47 252 87 27 253 64 19 134 157 60 243] [10 32 185 116 185 254 4 253 237 156 245 219 141 194 180 138 121 209 0 236 44 176 80 47 252 87 27 253 64 19 134 157 60 243] [133 61 91 117 125 65 241 186 124 193 78 140 57 13 100 55 127 108 129 28 215 105 235 173 44 163 155 0 93 75 107 106]}

What logs do you need?

[p2p]
persistent_peers = "0a539ece9f24b682a245327966b62cee9f11161e@35.210.33.5:26656"

[statesync]
enable = true
rpc_servers = "http://35.210.33.5:26657,http://35.210.33.5:26657"
trust_height = 12562301
trust_hash = "6E4FBCF45010667D230382A150461E0DD4467A7E2BEA8840607135D766657E2C"
trust_period = "2000h0m0s"

@alexanderbez (Contributor)

@chillyvee Ok so we have #14410 and the author doesn't seem to be too responsive. @catShaark if you're OK with it, I will open another PR.

I would like to see a single PR address these points.

@chillyvee (Contributor)

> I would like to see a single PR address these points.

@alexanderbez - Correct: the idea of #14410 (or any replacement) will prevent the root cause of the issue we commonly see.

@chillyvee (Contributor)

> We also had a state sync issue at version v0.10.14, but at that version the issue was a different appHash instead of an empty appHash.

Overall an interesting situation. No additional logs needed, but we need to pick a direction.

I didn't see your ID on the committer list for starnamed, so I assume you are not a developer for that repo, only a validator? For a full fix, it would be good to bring in one of the core developers.

How about taking a look at a copy of the data from a working node? Is anybody sharing pruned zip files of the data directory?

@dsmello commented Jan 17, 2023

I never added myself to the contributors list and need to fix my git config. I'll do that this week.

The current version is v0.11.6

Yes, you can use the following:

I still need to gain experience debugging Cosmos. What am I supposed to look at in the data folder?

@chillyvee (Contributor) commented Jan 19, 2023

@dsmello - The issue starname is facing is NOT the one in this thread.

It looks like your stores are not in the array (rs.stores is EMPTY) for snapshotting at this point in the code:
https://github.com/cosmos/cosmos-sdk/blob/v0.45.9/store/rootmulti/store.go#L694

It seems somehow starname is able to define all the stores and operate without setting all the related fields.

That means that any restore with statesync results in no data and no apphash.

@dsmello commented Jan 20, 2023

Hi @chillyvee, Thanks for the help.
I'll try to debug deeper at this point to see if I find something, and if the bug is related to the cosmos-sdk I'll open a PR for that issue 😄. On Monday I'll open an issue for this bug, to leave some history of this kind of problem for others. Thanks, guys.

@tac0turtle (Member)

Part of the issue is solved: Notional submitted a fix to prevent chains from starting when adding a store at a different height. How can we help teams recover from the state they are in? Should we write a tool? I think people need to upgrade their chains to recover, right?

@faddat (Contributor, Author) commented Feb 6, 2023

I think that there is a related issue:

  "message": "\ngit.luolix.top/cosmos/cosmos-sdk/baseapp.(*BaseApp).createQueryContext\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.13-rc.1.0.20230205152327-472729e40bfc/baseapp/abci.go:670\ngit.luolix.top/cosmos/cosmos-sdk/baseapp.(*BaseApp).handleQueryGRPC\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.13-rc.1.0.20230205152327-472729e40bfc/baseapp/abci.go:592\ngit.luolix.top/cosmos/cosmos-sdk/baseapp.(*BaseApp).Query\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.13-rc.1.0.20230205152327-472729e40bfc/baseapp/abci.go:441\ngit.luolix.top/tendermint/tendermint/abci/client.(*localClient).QuerySync\n\tgit.luolix.top/tendermint/tendermint@v0.34.26/abci/client/local_client.go:256\ngit.luolix.top/tendermint/tendermint/proxy.(*appConnQuery).QuerySync\n\tgit.luolix.top/tendermint/tendermint@v0.34.26/proxy/app_conn.go:159\ngit.luolix.top/tendermint/tendermint/rpc/core.ABCIQuery\n\tgit.luolix.top/tendermint/tendermint@v0.34.26/rpc/core/abci.go:20\ngit.luolix.top/tendermint/tendermint/rpc/client/local.(*Local).ABCIQueryWithOptions\n\tgit.luolix.top/tendermint/tendermint@v0.34.26/rpc/client/local/local.go:87\ngit.luolix.top/cosmos/cosmos-sdk/client.Context.queryABCI\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.13-rc.1.0.20230205152327-472729e40bfc/client/query.go:94\ngit.luolix.top/cosmos/cosmos-sdk/client.Context.QueryABCI\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.13-rc.1.0.20230205152327-472729e40bfc/client/query.go:57\ngit.luolix.top/cosmos/cosmos-sdk/client.Context.Invoke\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.13-rc.1.0.20230205152327-472729e40bfc/client/grpc_query.go:81\ngit.luolix.top/cosmos/cosmos-sdk/x/bank/types.(*queryClient).AllBalances\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.13-rc.1.0.20230205152327-472729e40bfc/x/bank/types/query.pb.go:917\ngit.luolix.top/cosmos/cosmos-sdk/x/bank/types.request_Query_AllBalances_0\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.13-rc.1.0.20230205152327-472729e40bfc/x/bank/types/query.pb.gw.go:139\ngit.luolix.top/cosmos/cosmos-sdk/x/bank/types.Regist
erQueryHandlerClient.func2\n\tgit.luolix.top/cosmos/cosmos-sdk@v0.45.13-rc.1.0.20230205152327-472729e40bfc/x/bank/types/query.pb.gw.go:684\ngit.luolix.top/grpc-ecosystem/grpc-gateway/runtime.(*ServeMux).ServeHTTP\n\tgit.luolix.top/grpc-ecosystem/grpc-gateway@v1.16.0/runtime/mux.go:240\ngit.luolix.top/gorilla/mux.(*Router).ServeHTTP\n\tgit.luolix.top/gorilla/mux@v1.8.0/mux.go:210\ngit.luolix.top/tendermint/tendermint/rpc/jsonrpc/server.maxBytesHandler.ServeHTTP\n\tgit.luolix.top/tendermint/tendermint@v0.34.26/rpc/jsonrpc/server/http_server.go:256\ngit.luolix.top/tendermint/tendermint/rpc/jsonrpc/server.RecoverAndLogHandler.func1\n\tgit.luolix.top/tendermint/tendermint@v0.34.26/rpc/jsonrpc/server/http_server.go:229\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991\nfailed to load state at height 69; version mismatch on immutable IAVL tree; version does not exist. Version has either been pruned, or is for a future block height (latest height: 69): invalid request",

Related issue:

@alexanderbez (Contributor)

> Part of the issue is solved: Notional submitted a fix to prevent chains from starting when adding a store at a different height. How can we help teams recover from the state they are in? Should we write a tool? I think people need to upgrade their chains to recover, right?

I think they would need a separate upgrade and corresponding handler to fix the broken state. Perhaps by renaming the store, or by deleting it and creating it again?

@chillyvee (Contributor)

This just hit kujira testnet v0.8.0 in the alliance store.

@tac0turtle (Member)

Da, I don't think we can backport the fix, due to some chains having this issue; they would get hit with errors. We should write documentation around this.

@alexanderbez (Contributor)

Mhhh wait, why is this issue still open? Doesn't #14410 fix it? Also, why can't we backport?

@tac0turtle (Member)

I believe a chain with uneven store heights wouldn't be able to start. Is that right?

@alexanderbez (Contributor)

Ahh yes, the fix is NOT retroactive indeed. So existing chains that already have this issue will need to manually patch it in an upgrade.

@chillyvee (Contributor)

> I believe a chain with uneven store heights wouldn't be able to start. Is that right?

Chain runs, but pruning, snapshots and statesync are all broken without the patches we issued.

@tac0turtle (Member)

ah okay, then we can backport, to help prevent future issues
