fix: removed potential sources of non-determinism in upgrades #10189

Merged 2 commits on Sep 28, 2021

Conversation

tomtau
Contributor

@tomtau tomtau commented Sep 17, 2021

Description

Closes: #10188

  1. iterate upgrade migrations in InitGenesis order
  2. force a deterministic iteration order in x/upgrade and the store during upgrades
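
For readers unfamiliar with the underlying problem: the Go specification leaves map iteration order unspecified, so any state-changing loop over a map can run in a different order on each node. A minimal sketch of the usual remedy (collect the keys, sort them, then iterate) is shown here. The `versions` map and its contents are hypothetical and are not taken from the actual patch.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order, so callers can
// iterate the map in an order that is identical on every node.
func sortedKeys(m map[string]uint64) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Hypothetical module name -> version map, for illustration only.
	versions := map[string]uint64{"staking": 2, "auth": 2, "bank": 2}

	// Iterating over the sorted keys always visits auth, bank, staking.
	for _, name := range sortedKeys(versions) {
		fmt.Printf("migrating %s (version %d)\n", name, versions[name])
	}
}
```

The cost is one extra slice and a sort per iteration, which is negligible for a loop that runs once per upgrade.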

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • included the correct type prefix in the PR title
  • added ! to the type prefix if API or client breaking change
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

Contributor

@amaury1093 amaury1093 left a comment

@tomtau did you try and test this branch? Does it solve your non-determinism? Overall I'm okay to merge this in master.

But if there actually is non-determinism (on 0.44), we should think of a backport strategy, because it means it's a state-breaking change.

types/module/module.go (resolved)
x/upgrade/keeper/keeper.go (resolved)
@tomtau
Contributor Author

tomtau commented Sep 18, 2021

@tomtau did you try and test this branch? Does it solve your non-determinism? Overall I'm okay to merge this in master.

But if there actually is non-determinism (on 0.44), we should think of a backport strategy, because it means it's a state-breaking change.

I tried and tested it. One relevant integration test ("test_manual_upgrade_all" in the test-upgrade flow, which invokes an upgrade proposal from a 0.42.9-based binary to the latest binary) used to be flaky: one validator would sometimes crash with an app hash mismatch immediately after the upgrade, while the other would continue producing blocks. It seems more stable now:

https://github.com/crypto-org-chain/chain-main/runs/3630325664?check_suite_focus=true
https://github.com/crypto-org-chain/chain-main/runs/3630344579?check_suite_focus=true

However, it could have just been "lucky", so the source of non-determinism would be better verified by someone who knows the store and x/upgrade internals inside out... (the other possibility is that it was resolved by changes in v0.44.0...release/v0.44.x )

Collaborator

@odeke-em odeke-em left a comment

LGTM, thank you @tomtau! Thank you for tagging me in the review @robert-zaremba.

store/rootmulti/store.go (outdated, resolved)
x/upgrade/keeper/keeper.go (resolved)
Contributor

@amaury1093 amaury1093 left a comment

Definitely okay to merge to master! (pending changelog entry #10189 (comment))

test that used to be flaky [...] seems more stable now:

But did this test fail with non-determinism at least once after the fix? If yes, it means that non-determinism comes from somewhere else, right?

I'm wondering whether we should create a 0.45 (on top of 0.44) with this fix asap, or wait until we exactly pinpoint the cause of non-determinism.

@codecov

codecov bot commented Sep 21, 2021

Codecov Report

Merging #10189 (7417121) into master (548c986) will increase coverage by 0.00%.
The diff coverage is 68.42%.

@@           Coverage Diff           @@
##           master   #10189   +/-   ##
=======================================
  Coverage   63.65%   63.65%           
=======================================
  Files         573      573           
  Lines       53761    53796   +35     
=======================================
+ Hits        34222    34246   +24     
- Misses      17590    17601   +11     
  Partials     1949     1949           
| Impacted Files | Coverage Δ |
|---|---|
| types/module/module.go | 66.17% <0.00%> (-3.78%) ⬇️ |
| store/rootmulti/store.go | 71.65% <100.00%> (+0.53%) ⬆️ |
| x/upgrade/keeper/keeper.go | 81.11% <100.00%> (+0.93%) ⬆️ |

@tomtau
Contributor Author

tomtau commented Sep 21, 2021

But did this test fail with non-determinism at least once after the fix? If yes, it means that non-determinism comes from somewhere else, right?
I'm wondering whether we should create a 0.45 (on top of 0.44) with this fix asap, or wait until we exactly pinpoint the cause of non-determinism.

@AmauryM
It hasn't failed so far in those few executions. As for the potential cause, I was looking at the upgrade migrations, and the one in "auth" is calling "bank" and "staking": https://github.com/cosmos/cosmos-sdk/blob/master/x/auth/migrations/v043/store.go

With the random migration order, my guess is that the execution options are:

  1. auth is querying pre-migrated bank and pre-migrated staking
  2. auth is querying pre-migrated bank and post-migrated staking
  3. auth is querying post-migrated bank and pre-migrated staking
  4. auth is querying post-migrated bank and post-migrated staking

But not sure if it makes a difference. I was trying to look into archived data dirs, but the challenge is that:

  • the pre-upgrade state matches in both nodes;
  • the post-upgrade state is written in the proposing node's "application.db", but naturally it's not in the other node's "application.db" that disagreed on the proposed block (i.e. crashed with the app hash mismatch error).

I will see whether anything can be dug out of Tendermint's internal DBs... But if someone has any suggestions for diagnosing this, feel free to share them, or DM me on Discord for pointers if you'd like to look at the archived data dirs.

Collaborator

@robert-zaremba robert-zaremba left a comment

utACK. Let's double check if we are fine to backport it into 0.44

types/module/module.go (outdated, resolved)
store/rootmulti/store.go (resolved)
@robert-zaremba
Collaborator

We were discussing with @AmauryM whether this changeset is indeed breaking for v0.44. (context: #10189 (comment))

This changeset is breaking if a network upgrades from 0.42 to 0.44 without specifying a binary (e.g. some nodes will run 0.44 and some a potential 0.44.1). This should only happen with an uncoordinated upgrade that was not voted on.

If we assume that upgrades are done using x/upgrade, that we specify an app binary there, and that the proposal passes, then with social consensus (and via cosmovisor) all chains should update to that binary.
However, if someone wants to hack or run something different, then they are breaking the social consensus.

So this changeset is not a breaking change for 0.44. However, it is potentially a breaking change for uncoordinated (or badly coordinated) updates from 0.42.

We already have problems in the ecosystem with tools and developers trying to follow the Cosmos SDK release cycle. If we want to be fully conservative, then we should consider all sorts of breaking changes and tag 0.45. If we only take release/v0.44 into account, then I think we are fine with tagging 0.44.1.

Thoughts?

@robert-zaremba
Collaborator

@zmanian , @jackzampolin - do you have infrastructure and provisioning scripts to test the 0.42 -> 0.44 upgrade with and without this patch?

@tomtau - my understanding is that you still need to further validate that there is no other error which caused your original issue?

@tomtau
Contributor Author

tomtau commented Sep 24, 2021

@robert-zaremba I did more tests: https://github.com/crypto-org-chain/chain-main/runs/3659675674?check_suite_focus=true#step:5:26 and it seems all right ("test_manual_upgrade_all" in the test-upgrade flow tests an upgrade proposal from a 0.42.9-based binary to the latest binary). @JayT106 checked whether more info can be extracted from Tendermint's internal DBs, but without much luck.

@tomtau
Contributor Author

tomtau commented Sep 24, 2021

One small update on the root cause -- it looks like it may be due to the "auth" migrations posted above: #10189 (comment)

What I did:

  1. recompiled a 0.44-based binary with a small patch to print out store roots: tomtau@dcac289
  2. wrote a script to run the upgrade procedure in a loop (reset data dir, sync with old binary, switch to the new binary)

I managed to reproduce the crash (with the post-upgrade block app hash mismatch) as well as "correct" upgrade (with the post-upgrade block app hash matching)... After manually inspecting the root hashes from store infos, all of them matched except for "acc" store (AA1F7FD43F7E9DD77397EF0C9BBCD83B4745523848F535C0D92B9985F886D7BA vs 2A91ADF1259F0F85CABAA1C5DD63A999F27D2C21EB433CD602805AA64DB58D46)... and "acc" is the store key of the "auth" module.
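
The manual comparison of per-store root hashes described above can be automated. The following is a rough sketch, not the tooling actually used: it assumes each node's store infos have already been parsed into a map from store name to hex root hash. The `acc` hashes are the two quoted above; the `bank` value is a placeholder.

```go
package main

import (
	"fmt"
	"sort"
)

// diffRoots returns, in sorted order, the names of stores whose root hash
// differs between two nodes, including stores present on only one side.
func diffRoots(a, b map[string]string) []string {
	names := make([]string, 0, len(a)+len(b))
	for name := range a {
		names = append(names, name)
	}
	for name := range b {
		if _, ok := a[name]; !ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)

	var diff []string
	for _, name := range names {
		ha, okA := a[name]
		hb, okB := b[name]
		if !okA || !okB || ha != hb {
			diff = append(diff, name)
		}
	}
	return diff
}

func main() {
	nodeA := map[string]string{
		"acc":  "AA1F7FD43F7E9DD77397EF0C9BBCD83B4745523848F535C0D92B9985F886D7BA",
		"bank": "placeholder-bank-root", // placeholder, not a real hash
	}
	nodeB := map[string]string{
		"acc":  "2A91ADF1259F0F85CABAA1C5DD63A999F27D2C21EB433CD602805AA64DB58D46",
		"bank": "placeholder-bank-root", // placeholder, not a real hash
	}
	fmt.Println(diffRoots(nodeA, nodeB)) // prints [acc]
}
```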

@tomtau
Contributor Author

tomtau commented Sep 24, 2021

calling iaviewer data ... "s/k:acc/" on the two stores indeed has a diff (for 01FE8479E809F8247445066A63843EFE6BCC6F4216):

Got version: 55
Printing all keys with hashed values (to detect diff)
  0127F576CAFBB263ED44BE8BD094F66114DA268777
    F9F3A8AC45312425796096574D97AD1705576198BED9370E29C555CDE8345421
  012C01E3362E56675E4E1018DC39BF3CA54A5A9B01
    DADAEF44A465435EBC850F9CCC23578BAFFDFC16B5CCA0A8F502B1790128F104
  013D94E62F0BAB86531982F02200568B78D1AB3EE5
    E1823CD32F549B8D1630B9774D62F60D581EFB9CE93778D11B5E98AB9E8A3C71
  0148B5211DF582170CB8759EB27565D7DBA40A:CA
    3482347A4663FEC56CF7F5627FF0F458F7C647831A26A3A92C4B3682BFE03E41
  014FEA76427B8345861E80A3540A8A9D936FD39391
    0E8994C8AE491F7C0295F6D3059FA7F34112E8A883FBE9B85ECD9C52DBC575A8
  015911B844D7BC224654FE0DCD16BABD2D253F2FDF
    DFFF18B03DA110CC0EDE865E8502F8240439414D27E3DB2115B77BD6B484973A
  015E89DB685F288C9801E62F8459D590EC0BE2CB87
    0B3CEB327C4895AEEC53037D74463B8C644971F137E48FC2DF98E32E652A41E7
  01766B2D6BA10CABCBC07F0E3457F3E9D59356A374
    32B3304714718058B88513028FA927914C733CAE71AE3D42A0A05CA299BE78A3
  017B5FE22B5446F7C62EA27B8BD71CEF94E03F3DF2
    1FE5E7BA13F20410B8B2253B45FB595F1AA142B3589B84663AA29AE8376E6AF6
  0193354845030274CD4BF1686ABD60AB28EC52E1A7
    A4430146B864694BE1BA5E6B7CFBFF0D5EFAFF8E843E36A3A92A2804422ACD87
  01CD:660C753563A08BEEF2A1E52F086C5462BC38
    2985E644348D8678F702ADF08CEF9A8890C8582D7B6F777C35CA970D20D5F90A
  01D635BF1C46D3C450A8A1723F8688834A5D02C906
    59487CF645D323520CB92AFCA06453139848C6D4EB739D6980508E1BE88B63C8
  01DC6F17BBEC824FFF8F86587966B2047DB6AB7367
    0FFA5F25A98D8C18A3048003AF19F103BBA3EC688D5CF51148EA62A9EDE3F595
  01F1829676DB577682E944FC3493D451B67FF3E29F
    43E58E79A96F17852EA596318BCADBCF1C2023D91C7A5D87EBE32B8B6A607DDF
  01FAAFB3CEF99DD295DEE15332242D8B7DFEEA869F
    424E3816C922D272E81138979DC9BED9E967D2B8E8C903ADD5417072227A04D7
  01FE8479E809F8247445066A63843EFE6BCC6F4216
    82CECFA239C643E58EAFB240EFB284300307EBB3D1AEC80AE9DE5877D8ED0B51
  01FE888CDCBEB3418F08D1FC71DBCC32159F15D4C1
    F1D6B2E8C1A7BD978749B6AC80DEB9F99163D31BBB05B18202C8FF6224239F58
  globalAccountNumber
    6AE8E5E931FC9014AC7947E0B7FD4C0058A458A246FCF625D367A071A8FD2349
Hash: AA1F7FD43F7E9DD77397EF0C9BBCD83B4745523848F535C0D92B9985F886D7BA
Size: 12

vs

Got version: 55
Printing all keys with hashed values (to detect diff)
  0127F576CAFBB263ED44BE8BD094F66114DA268777
    F9F3A8AC45312425796096574D97AD1705576198BED9370E29C555CDE8345421
  012C01E3362E56675E4E1018DC39BF3CA54A5A9B01
    DADAEF44A465435EBC850F9CCC23578BAFFDFC16B5CCA0A8F502B1790128F104
  013D94E62F0BAB86531982F02200568B78D1AB3EE5
    E1823CD32F549B8D1630B9774D62F60D581EFB9CE93778D11B5E98AB9E8A3C71
  0148B5211DF582170CB8759EB27565D7DBA40A:CA
    3482347A4663FEC56CF7F5627FF0F458F7C647831A26A3A92C4B3682BFE03E41
  014FEA76427B8345861E80A3540A8A9D936FD39391
    0E8994C8AE491F7C0295F6D3059FA7F34112E8A883FBE9B85ECD9C52DBC575A8
  015911B844D7BC224654FE0DCD16BABD2D253F2FDF
    DFFF18B03DA110CC0EDE865E8502F8240439414D27E3DB2115B77BD6B484973A
  015E89DB685F288C9801E62F8459D590EC0BE2CB87
    0B3CEB327C4895AEEC53037D74463B8C644971F137E48FC2DF98E32E652A41E7
  01766B2D6BA10CABCBC07F0E3457F3E9D59356A374
    32B3304714718058B88513028FA927914C733CAE71AE3D42A0A05CA299BE78A3
  017B5FE22B5446F7C62EA27B8BD71CEF94E03F3DF2
    1FE5E7BA13F20410B8B2253B45FB595F1AA142B3589B84663AA29AE8376E6AF6
  0193354845030274CD4BF1686ABD60AB28EC52E1A7
    A4430146B864694BE1BA5E6B7CFBFF0D5EFAFF8E843E36A3A92A2804422ACD87
  01CD:660C753563A08BEEF2A1E52F086C5462BC38
    2985E644348D8678F702ADF08CEF9A8890C8582D7B6F777C35CA970D20D5F90A
  01D635BF1C46D3C450A8A1723F8688834A5D02C906
    59487CF645D323520CB92AFCA06453139848C6D4EB739D6980508E1BE88B63C8
  01DC6F17BBEC824FFF8F86587966B2047DB6AB7367
    0FFA5F25A98D8C18A3048003AF19F103BBA3EC688D5CF51148EA62A9EDE3F595
  01F1829676DB577682E944FC3493D451B67FF3E29F
    43E58E79A96F17852EA596318BCADBCF1C2023D91C7A5D87EBE32B8B6A607DDF
  01FAAFB3CEF99DD295DEE15332242D8B7DFEEA869F
    424E3816C922D272E81138979DC9BED9E967D2B8E8C903ADD5417072227A04D7
  01FE8479E809F8247445066A63843EFE6BCC6F4216
    A51E31327DBEF1C9C7C0D1D882FE6B94F5C9F0164877AF59F75851590CCDC5FC
  01FE888CDCBEB3418F08D1FC71DBCC32159F15D4C1
    F1D6B2E8C1A7BD978749B6AC80DEB9F99163D31BBB05B18202C8FF6224239F58
  globalAccountNumber
    6AE8E5E931FC9014AC7947E0B7FD4C0058A458A246FCF625D367A071A8FD2349
Hash: 2A91ADF1259F0F85CABAA1C5DD63A999F27D2C21EB433CD602805AA64DB58D46
Size: 12

@tomtau
Contributor Author

tomtau commented Sep 24, 2021

shape:

Got version: 55
        *4 0127F576CAFBB263ED44BE8BD094F66114DA268777
      -3 DC6DBC392E4D9235A3B7D4EACE8C9367C4A57E96AD3FE1F7E9C0A54596315D93
        *4 012C01E3362E56675E4E1018DC39BF3CA54A5A9B01
    -2 6F818D2A4A573DAA313C4FE5CE6C15E4D728AD735C48715FA9AB1B2F29EF5D26
        *4 013D94E62F0BAB86531982F02200568B78D1AB3EE5
      -3 B9E714E0FC6A0CE87E63A7A0716711C685DDA3CF25D1EFB3DF6229FC01F760CB
        *4 0148B5211DF582170CB8759EB27565D7DBA40A:CA
  -1 6E1B7ABC23F4804B8D14F99D1FC37F0A05B773DBB24C28B1B3B0CA309DAE6A9A
        *4 014FEA76427B8345861E80A3540A8A9D936FD39391
      -3 733DCC46C71B35AB067C52AF5140FCF9EA96C9E81B6F2059F849E02AC083748F
        *4 015911B844D7BC224654FE0DCD16BABD2D253F2FDF
    -2 BC6109F5644109CF045E04D3DFA6E9FFB3D453F8F7664BA1A9DBB0BB8E386FE4
        *4 015E89DB685F288C9801E62F8459D590EC0BE2CB87
      -3 300A30534D7E:83F7459D22A20A1BA9F290B41486CD4AF9F5E64F7AE98097B0
        *4 01766B2D6BA10CABCBC07F0E3457F3E9D59356A374
-0 AA1F7FD43F7E9DD77397EF0C9BBCD83B4745523848F535C0D92B9985F886D7BA
        *4 017B5FE22B5446F7C62EA27B8BD71CEF94E03F3DF2
      -3 82638731C7A410ECD1D1054BF52BB9D8F7FA96A1961A97AC00C8A809D3BAA138
        *4 0193354845030274CD4BF1686ABD60AB28EC52E1A7
    -2 C5AE8EABAC5E5227F6C65D338248FF48B6051C027AD81AA97998C3DA21A96591
        *4 01CD:660C753563A08BEEF2A1E52F086C5462BC38
      -3 94306D438ED49E33590663D9233EC2D850AAC5B71CF0F3196E2FF3BC027B46D5
        *4 01D635BF1C46D3C450A8A1723F8688834A5D02C906
  -1 4C40BB57525746D7D455FAD0B3B241C7D2CA665BA8D75EF6E994D2FE756327CD
        *4 01DC6F17BBEC824FFF8F86587966B2047DB6AB7367
      -3 E00CDAD0EE7F177C6BBBA1B63B7B6D273618224D3606DB522B9BE2060135E513
        *4 01F1829676DB577682E944FC3493D451B67FF3E29F
    -2 BE6B4FD18DA59B400F90E4DF8C05201B0986DE7113F69273EA72D20EBC960795
          *5 01FAAFB3CEF99DD295DEE15332242D8B7DFEEA869F
        -4 FCDED9028C21A2FB95:398CD3204701162AA631BA2EC30D15E127E5C39B763F
          *5 01FE8479E809F8247445066A63843EFE6BCC6F4216
      -3 76D027C13327F48C91630722C56036A251A367E5BEA4555F0F92101E46FEC426
          *5 01FE888CDCBEB3418F08D1FC71DBCC32159F15D4C1
        -4 2E4E5AE736232C497F38FDB22604E218A1FC0E2487BAF47125540DEAEC71E6BA
          *5 globalAccountNumber

vs

Got version: 55
        *4 0127F576CAFBB263ED44BE8BD094F66114DA268777
      -3 DC6DBC392E4D9235A3B7D4EACE8C9367C4A57E96AD3FE1F7E9C0A54596315D93
        *4 012C01E3362E56675E4E1018DC39BF3CA54A5A9B01
    -2 6F818D2A4A573DAA313C4FE5CE6C15E4D728AD735C48715FA9AB1B2F29EF5D26
        *4 013D94E62F0BAB86531982F02200568B78D1AB3EE5
      -3 B9E714E0FC6A0CE87E63A7A0716711C685DDA3CF25D1EFB3DF6229FC01F760CB
        *4 0148B5211DF582170CB8759EB27565D7DBA40A:CA
  -1 6E1B7ABC23F4804B8D14F99D1FC37F0A05B773DBB24C28B1B3B0CA309DAE6A9A
        *4 014FEA76427B8345861E80A3540A8A9D936FD39391
      -3 733DCC46C71B35AB067C52AF5140FCF9EA96C9E81B6F2059F849E02AC083748F
        *4 015911B844D7BC224654FE0DCD16BABD2D253F2FDF
    -2 BC6109F5644109CF045E04D3DFA6E9FFB3D453F8F7664BA1A9DBB0BB8E386FE4
        *4 015E89DB685F288C9801E62F8459D590EC0BE2CB87
      -3 300A30534D7E:83F7459D22A20A1BA9F290B41486CD4AF9F5E64F7AE98097B0
        *4 01766B2D6BA10CABCBC07F0E3457F3E9D59356A374
-0 2A91ADF1259F0F85CABAA1C5DD63A999F27D2C21EB433CD602805AA64DB58D46
        *4 017B5FE22B5446F7C62EA27B8BD71CEF94E03F3DF2
      -3 82638731C7A410ECD1D1054BF52BB9D8F7FA96A1961A97AC00C8A809D3BAA138
        *4 0193354845030274CD4BF1686ABD60AB28EC52E1A7
    -2 C5AE8EABAC5E5227F6C65D338248FF48B6051C027AD81AA97998C3DA21A96591
        *4 01CD:660C753563A08BEEF2A1E52F086C5462BC38
      -3 94306D438ED49E33590663D9233EC2D850AAC5B71CF0F3196E2FF3BC027B46D5
        *4 01D635BF1C46D3C450A8A1723F8688834A5D02C906
  -1 309B2EDCFDB1A9ED2CBA32B01C49C1C54C21721BF65FF36E69889FA2881E40BD
        *4 01DC6F17BBEC824FFF8F86587966B2047DB6AB7367
      -3 E00CDAD0EE7F177C6BBBA1B63B7B6D273618224D3606DB522B9BE2060135E513
        *4 01F1829676DB577682E944FC3493D451B67FF3E29F
    -2 F8A3C4BFCAF36B35B9DD887F66BF13CA70AA71FCBEDFC6846DA1980BF4BACE4D
          *5 01FAAFB3CEF99DD295DEE15332242D8B7DFEEA869F
        -4 6BFE8D15E1A284AFD9CB854AEDAB3B50488014E0E79B955F85819D80C9682840
          *5 01FE8479E809F8247445066A63843EFE6BCC6F4216
      -3 682971A580DCBC1D3DCBF3C442746C6E4CED8C3B63D449C15280E7AB9442AAB8
          *5 01FE888CDCBEB3418F08D1FC71DBCC32159F15D4C1
        -4 2E4E5AE736232C497F38FDB22604E218A1FC0E2487BAF47125540DEAEC71E6BA
          *5 globalAccountNumber
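
The two shape dumps differ only on the path from the 01FE8479... leaf up to the root, which is what a Merkle tree is expected to do when a single value changes. A minimal sketch of that behaviour with a toy binary SHA-256 tree follows. This is an assumption-laden simplification for illustration only: real IAVL node hashes cover more fields, and the tree is self-balancing.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// buildLevels hashes the leaves and then pairs nodes upward until a single
// root remains. levels[0] holds the leaf hashes; the last level holds the
// root. An odd node at the end of a level is carried up unchanged.
func buildLevels(leaves []string) [][][32]byte {
	level := make([][32]byte, len(leaves))
	for i, leaf := range leaves {
		level[i] = sha256.Sum256([]byte(leaf))
	}
	levels := [][][32]byte{level}
	for len(level) > 1 {
		next := make([][32]byte, 0, (len(level)+1)/2)
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i])
				continue
			}
			pair := make([]byte, 0, 64)
			pair = append(pair, level[i][:]...)
			pair = append(pair, level[i+1][:]...)
			next = append(next, sha256.Sum256(pair))
		}
		levels = append(levels, next)
		level = next
	}
	return levels
}

// changedPerLevel counts differing nodes at each level of two trees that
// were built from the same number of leaves.
func changedPerLevel(a, b [][][32]byte) []int {
	counts := make([]int, len(a))
	for l := range a {
		for i := range a[l] {
			if a[l][i] != b[l][i] {
				counts[l]++
			}
		}
	}
	return counts
}

func main() {
	before := []string{"v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7"}
	after := []string{"v0", "v1", "v2", "v3", "v4", "v5", "v6", "changed"}

	// Exactly one node differs per level: the leaf and each of its ancestors.
	fmt.Println(changedPerLevel(buildLevels(before), buildLevels(after))) // prints [1 1 1 1]
}
```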

Contributor

@amaury1093 amaury1093 left a comment

Nice find @tomtau! So the order of RunMigrations does matter indeed. I preferred the previous ordering using OrderInitGenesis (offers more flexibility), though alphabetical is okay too.

x/upgrade/keeper/keeper.go (outdated, resolved)
types/module/module.go (resolved)
store/rootmulti/store.go (resolved)
store/rootmulti/store.go (outdated, resolved)
@tomtau
Contributor Author

tomtau commented Sep 24, 2021

@AmauryM I tried to fetch and decode the proto payload for that 01FE8479E809F8247445066A63843EFE6BCC6F4216 key. It's a DelayedVestingAccount, and the only difference between the two versions is that one has delegated_free and delegated_vesting empty, while the other has a single entry in each.

As for the migration execution order, it may be good to have this confirmed by people who worked on or reviewed #8865. Both sorting and genesis init order work, so I assume "auth" should be querying pre-migrated bank and staking?

forced deterministic iteration order in upgrade migrations, x/upgrade and store during upgrades
@tomtau
Contributor Author

tomtau commented Sep 27, 2021

I preferred the previous ordering using OrderInitGenesis (offers more flexibility), though alphabetical is okay too.

@AmauryM both of them seemed to work, but I have just tried to re-test the OrderInitGenesis one in our app integration tests with extra sanity checks (e.g. len(m.Modules) == len(m.OrderInitGenesis)) and it failed on these extra checks. AFAIK it's probably because of x/params, which apps won't add to OrderInitGenesis; see e.g. https://github.com/cosmos/gaia/blob/main/app/app.go#L480 .

So the easiest option is probably to keep this consistent with the alphabetical order.
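
A sketch of the kind of sanity check mentioned above, reporting which registered modules are absent from an explicit order list rather than only comparing lengths. The function name and the sample module names are illustrative assumptions, not SDK code.

```go
package main

import (
	"fmt"
	"sort"
)

// missingFromOrder returns the sorted names of modules that are registered
// but do not appear in the explicit order list.
func missingFromOrder(modules map[string]struct{}, order []string) []string {
	inOrder := make(map[string]struct{}, len(order))
	for _, name := range order {
		inOrder[name] = struct{}{}
	}
	var missing []string
	for name := range modules {
		if _, ok := inOrder[name]; !ok {
			missing = append(missing, name)
		}
	}
	// Sort so the report itself does not depend on map iteration order.
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical app wiring: params is registered but left out of the order.
	modules := map[string]struct{}{
		"auth": {}, "bank": {}, "params": {}, "staking": {},
	}
	order := []string{"auth", "bank", "staking"}

	fmt.Println(missingFromOrder(modules, order)) // prints [params]
}
```

An ordering derived from such a list silently skips the missing modules, whereas sorting the registered module names covers every module by construction.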

@tomtau
Contributor Author

tomtau commented Sep 27, 2021

@AmauryM @robert-zaremba I just tried to debug-log the order of migrations and what's being passed to TrackDelegation during the auth migration:

  • if "staking" is migrated before "auth", then balance and delegations will have values;
  • if "auth" is migrated before "staking", then balance and delegations will be empty.

If I understand #8865 correctly, either should be fine (i.e. vesting accounts would be able to delegate multiple times), as SetAccount is called anyway. I'm not sure about the correctness, though: I guess in the latter case the query results for the vesting account may be incorrect until someone performs some operations on it (and I'm not sure about the overall vesting accounting correctness).
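
To make the order dependence concrete, here is a toy model of the situation described above. It is only a sketch: the store keys, the migration bodies and the `run` helper are invented for illustration and do not mirror the real auth or staking migrations.

```go
package main

import "fmt"

// state is a toy key/value store shared by all modules.
type state map[string]string

// migrateStaking moves delegation data from a legacy key to a new key.
func migrateStaking(s state) {
	if v, ok := s["staking/legacy/delegation"]; ok {
		s["staking/v2/delegation"] = v
		delete(s, "staking/legacy/delegation")
	}
}

// migrateAuth records whatever delegation it can see under the new staking
// key, so its result depends on whether staking has already been migrated.
func migrateAuth(s state) {
	s["auth/tracked_delegation"] = s["staking/v2/delegation"]
}

// run applies the named migrations in the given order to a fresh state.
func run(order []string) state {
	s := state{"staking/legacy/delegation": "100stake"}
	migrations := map[string]func(state){
		"auth":    migrateAuth,
		"staking": migrateStaking,
	}
	for _, name := range order {
		if m, ok := migrations[name]; ok {
			m(s)
		}
	}
	return s
}

func main() {
	stakingFirst := run([]string{"staking", "auth"})
	authFirst := run([]string{"auth", "staking"})

	fmt.Printf("staking first: %q\n", stakingFirst["auth/tracked_delegation"])
	fmt.Printf("auth first:    %q\n", authFirst["auth/tracked_delegation"])
}
```

Both orders terminate without error, but they leave different bytes in the auth store. That is enough to produce different app hashes if nodes pick the order from a map iteration.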

@robert-zaremba
Collaborator

I think we nailed the issue. Shall we merge this PR?

@robert-zaremba
Collaborator

It feels to me that auth is more "fundamental" and staking should depend on auth. So migrating auth before staking makes more sense to me.
@alexanderbez , @aaronc do you have any opinion?

@alexanderbez
Contributor

It feels to me that auth is more "fundamental" and staking should depend on auth. So migrating auth before staking makes more sense to me. @alexanderbez , @aaronc do you have any opinion?

Sure, I think this makes sense. Although does the order really matter? As long as it's deterministic.

@amaury1093
Contributor

auth is more fundamental to me too, so it should be first. But unfortunately, in #8865 auth has vesting accounts which track delegations, so auth also depends on staking. Anyway, it's messy, and should be cleaned up by #9958

Let's put automerge on this PR then.

@amaury1093 amaury1093 added A:automerge Automatically merge PR once all prerequisites pass. backport/0.44.x labels Sep 28, 2021
@amaury1093 amaury1093 merged commit f757c90 into cosmos:master Sep 28, 2021
mergify bot pushed a commit that referenced this pull request Sep 28, 2021
forced deterministic iteration order in upgrade migrations, x/upgrade and store during upgrades

Co-authored-by: Robert Zaremba <robert@zaremba.ch>
(cherry picked from commit f757c90)

# Conflicts:
#	CHANGELOG.md
@amaury1093
Contributor

BTW, huge thanks to @tomtau for finding this, digging into tree hashes to find the root cause, and coming up with a solution! 🙏

amaury1093 pushed a commit that referenced this pull request Sep 29, 2021
#10189) (#10253)

* fix: removed potential sources of non-determinism in upgrades (#10189)

forced deterministic iteration order in upgrade migrations, x/upgrade and store during upgrades

Co-authored-by: Robert Zaremba <robert@zaremba.ch>
(cherry picked from commit f757c90)

# Conflicts:
#	CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Tomas Tauber <2410580+tomtau@users.noreply.github.com>
Co-authored-by: Robert Zaremba <robert@zaremba.ch>
evan-forbes pushed a commit to evan-forbes/cosmos-sdk that referenced this pull request Oct 12, 2021
evan-forbes pushed a commit to evan-forbes/cosmos-sdk that referenced this pull request Nov 1, 2021
odeke-em added a commit to cosmos/gosec that referenced this pull request Nov 10, 2021
… value

This pass exists to curtail non-determinism in the cosmos-sdk
which stemmed from iterating over maps during upgrades and that
caused a chaotic debug for weeks. With this change, we'll now
enforce and report failed iterations, with the rule being that
a map in a range should involve ONLY one of these 2 operations:
* for k := range m { delete(m, k) } for fast map clearing
* for k := range m { keys = append(keys, k) } to retrieve keys & sort

thus we shall get this report:
```shell
[gosec] 2021/11/09 03:18:57 Rule error: *sdk.mapRanging => the value in the range statement should be nil: want: for key := range m (main.go:19)
[gosec] 2021/11/09 03:18:57 Rule error: *sdk.mapRanging => the value in the range statement should be nil: want: for key := range m (main.go:27)
```

from the code below:

```go
package main

func main() {
	m := map[string]int{
		"a": 0,
		"b": 1,
		"c": 2,
		"d": 3,
	}

	makeMap := func() map[string]string { return nil }

	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}

	values := make([]int, 0, len(m))
	for _, value := range m {
		values = append(values, value)
	}

	type kv struct {
		k, v interface{}
	}
	kvL := make([]*kv, 0, len(m))
	for k, v := range m {
		kvL = append(kvL, &kv{k, v})
	}

	for k := range m {
		delete(m, k)
	}

	for k := range makeMap() {
		delete(m, k)
	}

	for k := range do() {
		delete(m, k)
	}
}

func do() map[string]string { return nil }
```

Updates cosmos/cosmos-sdk#10189
Updates cosmos/cosmos-sdk#10188
Updates cosmos/cosmos-sdk#10190
odeke-em added a commit to cosmos/gosec that referenced this pull request Nov 15, 2021
JeancarloBarrios pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request Sep 28, 2024
Labels
A:automerge Automatically merge PR once all prerequisites pass. C:x/upgrade T:Bug
Development

Successfully merging this pull request may close these issues.

fix: upgrade non-determinism
5 participants