-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: removed potential sources of non-determinism in upgrades #10189
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomtau did you try and test this branch? Does it solve your undeterminism? Overall I'm okay to merge this in master.
But if there actually is undeterminism (on 0.44), we should actually think of a backport strategy, because it means it's a state breaking change.
I tried and tested it -- one relevant integration test ("test_manual_upgrade_all " in test-upgrade flow which invoked 0.42.9-based binary->(latest) binary upgrade proposal) that used to be flaky (one validator would sometimes crash with an app hash mismatch immediately after the upgrade, the other would continue producing blocks) seems more stable now: https://github.com/crypto-org-chain/chain-main/runs/3630325664?check_suite_focus=true However, it could have just been "lucky" -- so the source of non-determinism is better to be verified by someone who knows the store and x/upgrade internals more inside out... (the other possibility is that it was resolved by changes in v0.44.0...release/v0.44.x ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you @tomtau! Thank you for tagging me in the review @robert-zaremba.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely okay to merge to master! (pending changelog entry #10189 (comment))
test that used to be flaky [...] seems more stable now:
But did this test fail with non-determinism at least once after the fix? If yes, it means that non-determinism comes from somewhere else, right?
I'm thinking if we should create a 0.45 (on top of 0.44) with this fix asap, or wait until we exactly pinpoint the cause of non-determinism.
Codecov Report
@@ Coverage Diff @@
## master #10189 +/- ##
=======================================
Coverage 63.65% 63.65%
=======================================
Files 573 573
Lines 53761 53796 +35
=======================================
+ Hits 34222 34246 +24
- Misses 17590 17601 +11
Partials 1949 1949
|
@AmauryM With the random migration order, my guess is that the execution options are:
But not sure if it makes a difference. I was trying to look into archived data dirs, but the challenge is that:
I will look if something can be dug out of Tendermint's internal DBs... But if someone has any suggestions for diagnosing this, feel free to share them or DM me on Discord for pointers if you'd like to look at the archived data dirs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK. Let's double check if we are fine to backport it into 0.44
5d2ff99
to
e671364
Compare
We were discussing with @AmauryM if this changeset is indeed breaking for v0.44. (context: #10189 (comment)) This changeset is breaking if network upgrades from 0,.42 to 0.44 without specifying a binary (eg some nodes will run with 0.44 and some with potential 0.44.1). This should only happen if we have uncoordinated, not voted upgrade. If we assume that upgrades are done using x/upgrade, and we specify an app binary there, and the proposal will pass, then with a social consensus (and via cosmovisor) all chains should update to that binary. So the this changeset is not a breaking change for 0.44. However it is potentially a breaking change for uncoordinated (or badly coordinated) updated from 0.42. We already have problems in the ecosystem with tools and developers trying to follow the Cosmos SDK release cycle. If we want to be fully conservative then we should consider all sort of breaking changes and tag 0.45. If we only take into account the release/v0.44 then I think we are fine with tagging 0.44.1. Thoughts? |
@zmanian , @jackzampolin - do you have an infrastructure and provisioning scripts to test 0.42 -> 0.44 upgrade with and without this patch? @tomtau - my understanding is that you still need to further validate that there is no other error which caused your original issue? |
@robert-zaremba I did more tests: https://github.com/crypto-org-chain/chain-main/runs/3659675674?check_suite_focus=true#step:5:26 and it seems all right ("test_manual_upgrade_all " in test-upgrade flow tests 0.42.9-based binary->(latest) binary upgrade proposal). @JayT106 looked if more info can be extracted from Tendermint's internal DBs, but not much luck. |
One small update on the root cause -- it looks like it may be due to the "auth" migrations posted above: #10189 (comment) What I did:
I managed to reproduce the crash (with the post-upgrade block app hash mismatch) as well as "correct" upgrade (with the post-upgrade block app hash matching)... After manually inspecting the root hashes from store infos, all of them matched except for "acc" store (AA1F7FD43F7E9DD77397EF0C9BBCD83B4745523848F535C0D92B9985F886D7BA vs 2A91ADF1259F0F85CABAA1C5DD63A999F27D2C21EB433CD602805AA64DB58D46)... and "acc" is the store key of the "auth" module. |
calling
vs
|
shape:
vs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find @tomtau! So the order of RunMigrations does matter indeed. I preferred the previous ordering using OrderInitGenesis (offers more flexibility), though alphabetical is okay too.
@AmauryM I tried to fetch and decode the proto payload for that For the migration execution order, maybe good to confirm this by people who worked on or reviewed #8865 -- both sorting and genesis init order works, so I assume "auth" should be querying pre-migrated bank and staking? |
forced deterministic iteration order in upgrade migrations, x/upgrade and store during upgrades
2e3d644
to
a950fbd
Compare
@AmauryM both of them seemed to work, but I have just tried to re-test the OrderInitGenesis one in our app integration tests with extra sanity checks checks (e.g. so probably the easiest is to keep this consistent with the alphabetical. |
@AmauryM @robert-zaremba I just tried to debug-log the order of migrations and what's being passed to
If I understand #8865 correctly, either should be fine (i.e. vesting accounts would be able to delegate multiple times), as |
I think we nailed the issue. Shall we merge this PR? |
It feels to me that |
Sure, I think this makes sense. Although does the order really matter? As long as it's deterministic. |
forced deterministic iteration order in upgrade migrations, x/upgrade and store during upgrades Co-authored-by: Robert Zaremba <robert@zaremba.ch> (cherry picked from commit f757c90) # Conflicts: # CHANGELOG.md
BTW, huge thanks for @tomtau for finding this, digging into tree hashes to find the root cause, and coming up with a solution! 🙏 |
#10189) (#10253) * fix: removed potential sources of non-determinism in upgrades (#10189) forced deterministic iteration order in upgrade migrations, x/upgrade and store during upgrades Co-authored-by: Robert Zaremba <robert@zaremba.ch> (cherry picked from commit f757c90) # Conflicts: # CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Tomas Tauber <2410580+tomtau@users.noreply.github.com> Co-authored-by: Robert Zaremba <robert@zaremba.ch>
cosmos#10189) (cosmos#10253) * fix: removed potential sources of non-determinism in upgrades (cosmos#10189) forced deterministic iteration order in upgrade migrations, x/upgrade and store during upgrades Co-authored-by: Robert Zaremba <robert@zaremba.ch> (cherry picked from commit f757c90) # Conflicts: # CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Tomas Tauber <2410580+tomtau@users.noreply.github.com> Co-authored-by: Robert Zaremba <robert@zaremba.ch>
cosmos#10189) (cosmos#10253) * fix: removed potential sources of non-determinism in upgrades (cosmos#10189) forced deterministic iteration order in upgrade migrations, x/upgrade and store during upgrades Co-authored-by: Robert Zaremba <robert@zaremba.ch> (cherry picked from commit f757c90) # Conflicts: # CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Tomas Tauber <2410580+tomtau@users.noreply.github.com> Co-authored-by: Robert Zaremba <robert@zaremba.ch>
… value This pass exists to curtail non-determinism in the cosmos-sdk which stemmed from iterating over maps during upgrades and that caused a chaotic debug for weeks. With this change, we'll now enforce and report failed iterations, with the rule being that a map in a range should involve ONLY one of these 2 operations: * for k := range m { delete(m, k) } for fast map clearing * for k := range m { keys = append(keys, k) } to retrieve keys & sort thus we shall get this report: ```shell [gosec] 2021/11/09 03:18:57 Rule error: *sdk.mapRanging => the value in the range statement should be nil: want: for key := range m (main.go:19) [gosec] 2021/11/09 03:18:57 Rule error: *sdk.mapRanging => the value in the range statement should be nil: want: for key := range m (main.go:27) ``` from the code below: ```go package main func main() { m := map[string]int{ "a": 0, "b": 1, "c": 2, "d": 3, } makeMap := func() map[string]string { return nil } keys := make([]string, 0, len(m)) for k := range m { keys = append(keys, k) } values := make([]int, 0, len(m)) for _, value := range m { values = append(values, value) } type kv struct { k, v interface{} } kvL := make([]*kv, 0, len(m)) for k, v := range m { kvL = append(kvL, &kv{k, v}) } for k := range m { delete(m, k) } for k := range makeMap() { delete(m, k) } for k := range do() { delete(m, k) } } func do() map[string]string { return nil } ``` Updates cosmos/cosmos-sdk#10189 Updates cosmos/cosmos-sdk#10188 Updates cosmos/cosmos-sdk#10190
… value This pass exists to curtail non-determinism in the cosmos-sdk which stemmed from iterating over maps during upgrades and that caused a chaotic debug for weeks. With this change, we'll now enforce and report failed iterations, with the rule being that a map in a range should involve ONLY one of these 2 operations: * for k := range m { delete(m, k) } for fast map clearing * for k := range m { keys = append(keys, k) } to retrieve keys & sort thus we shall get this report: ```shell [gosec] 2021/11/09 03:18:57 Rule error: *sdk.mapRanging => the value in the range statement should be nil: want: for key := range m (main.go:19) [gosec] 2021/11/09 03:18:57 Rule error: *sdk.mapRanging => the value in the range statement should be nil: want: for key := range m (main.go:27) ``` from the code below: ```go package main func main() { m := map[string]int{ "a": 0, "b": 1, "c": 2, "d": 3, } makeMap := func() map[string]string { return nil } keys := make([]string, 0, len(m)) for k := range m { keys = append(keys, k) } values := make([]int, 0, len(m)) for _, value := range m { values = append(values, value) } type kv struct { k, v interface{} } kvL := make([]*kv, 0, len(m)) for k, v := range m { kvL = append(kvL, &kv{k, v}) } for k := range m { delete(m, k) } for k := range makeMap() { delete(m, k) } for k := range do() { delete(m, k) } } func do() map[string]string { return nil } ``` Updates cosmos/cosmos-sdk#10189 Updates cosmos/cosmos-sdk#10188 Updates cosmos/cosmos-sdk#10190
cosmos#10189) (cosmos#10253) * fix: removed potential sources of non-determinism in upgrades (cosmos#10189) forced deterministic iteration order in upgrade migrations, x/upgrade and store during upgrades Co-authored-by: Robert Zaremba <robert@zaremba.ch> (cherry picked from commit f757c90) # Conflicts: # CHANGELOG.md * Update CHANGELOG.md Co-authored-by: Tomas Tauber <2410580+tomtau@users.noreply.github.com> Co-authored-by: Robert Zaremba <robert@zaremba.ch>
Description
Closes: #10188
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
!
to the type prefix if API or client breaking changeCHANGELOG.md
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...
!
in the type prefix if API or client breaking change