Add upgrade module #4233
Codecov Report
@@ Coverage Diff @@
## master #4233 +/- ##
==========================================
- Coverage 54.61% 53.61% -1.01%
==========================================
Files 299 298 -1
Lines 18177 17897 -280
==========================================
- Hits 9928 9596 -332
- Misses 7464 7550 +86
+ Partials 785 751 -34 |
@rigelrozanski I was just thinking I'd mention after our chat last week that the approach here is really agnostic to whether it's governance or validator signaling that makes the upgrade happen. I think that really comes down to a larger governance discussion. All this module does is coordinate chain halts and restarts based on whoever calls |
Nice idea and good start on providing migration/upgrade tooling.
I would love more comments and thoughts from the cosmos-sdk team, as killing the blockchain, dumping state, and starting a new chain doesn't seem like a viable long-term solution.
I wonder what happens to all the exchanges next time this happens....
The app must then integrate the upgrade keeper with its governance module as appropriate. The governance module
should call ScheduleUpgrade to schedule an upgrade and ClearUpgradePlan to cancel a pending upgrade.

Performing Upgrades
This is an interesting idea. And wonderful documentation. Both these package-level comments, as well as all the comments on types.
For halting, this works well. However, I think we need a more complete strategy for upgrades. I will expand on this on the issue. But yeah, this seems a nice first step.
Does this file need to get updated with the changes that have been made?
we need to eventually transition or duplicate part of this godoc to a new docs/ guideline for how to upgrade a live chain. cc: @gamarin2 @hschoenburg
x/upgrade/keeper.go
Outdated
upgradeTime := keeper.plan.Time
upgradeHeight := keeper.plan.Height
if (!upgradeTime.IsZero() && !blockTime.Before(upgradeTime)) || upgradeHeight <= blockHeight {
Nice logic. I think the feature switch needs to be more integrated in the rest of the framework.
I think this is a great beginning to the upgrade plans for cosmos-sdk. What this does provide is a way to halt the chain (in the future at a predefined time/height schedule - smart). And a way to restart the chain when the new software is deployed.
However, if I understand properly, it still requires - chain halts, validators deploy new software, chain restarts (as soon as >2/3 are live on the new code). Also, if I want to replay the chain, in order to reproduce it, I will have to run v1 until it halts, then replace with v2, then start up again.
TL;DR: looks great. key addition is some switch so when syncing a chain later v2 of the software can dete
I think a nice addition here would be something like "feature switches". (which may be possible with the current infrastructure, using the
Better than storing the config files, one could store this information on-chain. This means v2 will query for "cip 21 enabled" for example and decide whether to run in v1 or v2 mode. Then, when replaying a chain, v2 will run in v1 mode until the halt point. Since the handler is already registered, it will immediately execute
Then we also need a way to see if users are sending v1 or v2 messages, whether models are v1 or v2. We recently added migrations support in weave and one side-effect was adding a Metadata/Schema attribute to top-level Message and Model structs (which are the ones that we first get on de-serialization). We then check this against which are enabled, possibly do on-the-fly upgrades, or simply reject false versions. This code has not been tried out on a live chain yet, and likely has many points for improvement.
Glad to get feedback and cross-pollination on such techniques.
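The on-chain feature switch idea from this comment can be sketched roughly as below. Everything here is illustrative: `FeatureStore`, `processBlock` and the flag key are hypothetical names, and a Go map stands in for on-chain state. The point is only that a single v2 binary runs in v1 mode until the flag is set at the upgrade point, so replaying from genesis takes the same code paths as the original run.

```go
package main

import "fmt"

// FeatureStore is a hypothetical stand-in for on-chain state that
// records which features have been enabled.
type FeatureStore map[string]bool

// Enabled reports whether the named feature has been switched on.
func (s FeatureStore) Enabled(name string) bool { return s[name] }

// featureName reuses the "cip 21" label from the comment; the key format is made up.
const featureName = "cip-21"

// processBlock flips the switch at the upgrade height and then chooses the
// code path from on-chain state rather than from the binary's version.
func processBlock(store FeatureStore, height, upgradeHeight int64) string {
	if height == upgradeHeight {
		store[featureName] = true // the registered upgrade handler would do this
	}
	if store.Enabled(featureName) {
		return "v2"
	}
	return "v1"
}

func main() {
	store := FeatureStore{}
	// Replaying from height 1 with the v2 binary: v1 mode until the upgrade point.
	for h := int64(1); h <= 4; h++ {
		fmt.Printf("height %d runs in %s mode\n", h, processBlock(store, h, 3))
	}
}
```

As noted later in the thread, the cost of this design is that every behaviour change has to be coded behind a switch and both paths have to be tested.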
brain dump from concepts discussed during an sdk core-dev call on chain upgrades:
I have been working on having a smooth solution to this using NixOS which may be a solution for nodes that are willing to run that. I could see something similar being done in a docker-based deployment scenario or even with some sort of binary manager that downloaded the correct binaries at upgrade points. Validators, however, might have more complicated setups that wouldn't be covered by these approaches. What I would suggest as a general principle is that, at a minimum, we create an easy replay recipe for casual users that want to run a full node. The replay path for validators may or may not be as simple.
I think the biggest challenge with doing a feature switch type approach is that it places quite a bit of burden on engineers to correctly code the feature switches. I actually think it would be good if projects were coded with that sort of discipline, but it might not be too realistic near term. Using some sort of binary switching (like I'm doing with NixOS) would make things a bit easier so that the exact same binary would be replayed at each phase of the upgrade. |
Why does panic'ing in an ABCI handler like |
You're correct, I believe it should. What are you thinking? I'll also mention a point I forgot earlier: We need the capability for the validator set to hard fork the hub without governance approval under a broken-governance or last-resort scenario. Kinda like the ability to have manual controls for the blockchain if need be |
I agree and the overhead of testing multiple code paths and transitions is rather high. I love the idea of triggering an OS-level (NixOS / docker / etc) switch of the binary at some point. Like we register binaries with tags Great idea |
Well just that panic'ing in the ABCI app may be all that's needed to stop both the app and Tendermint. There might not be any special changes required at the Tendermint level
Does Tendermint have some sort of "backdoor" that allows one to set the expected validator set outside of the ABCI process? Would that be maybe the main functionality needed to support a hard-fork? Another similar scenario that's occurred to me is what if some indeterminism causes a consensus failure when there is no bad behavior, just bad code. I think a similar hard-fork like fix would be needed, but in this case you might need to delete the last block because it causes consensus failure on the ABCI side. I think for this just the ability to import blocks only up to a certain height would support this. Although, maybe it's not needed because the failure is only on the ABCI app side and the consensus failure can probably be fixed with an app upgrade without having to delete the failing Tendermint block. |
I second @aaronc here Seems like it is doable. And it is fine if this upgrade path only covers 95% of cases, not things like reverting state after a DAO hack or fending off ETC-fork craziness. For such extreme cases, some custom upgrade coordination would be needed. But that should be the exception, not the rule. How are you going to adjust the inflation rate calculation logic gracefully? State dump and chain restart? I think this proposal would work there and provide a much nicer experience |
Yes, this PR is for the happy path. I agree there should be some mechanism to support the unhappy path but let's make that a separate issue.
@ethanfrey I'm not really familiar with how this happens. My hope has been to avoid state dumps because that approach causes transaction history to be lost (not viable for our use case). |
Sorry for my unclear comment. I was asking how any changes to the cosmos hub can be made in the current state? I think this proposal would allow a way of gracefully upgrading binaries at the proper locations and thus not requiring a state dump/chain restart, as was done on the last hub upgrade. So far that seems the only existing path, and I encourage the core dev team to support useful tools that cover 90+% of upgrades, rather than challenge them because of some possible edge cases where they would not work (and which would have to revert to the current extreme upgrade path). Basically, I am asking.... @rigelrozanski why is this proposal frozen for many weeks without any real feedback? (except a braindump saying you are generally cool with this line of thought) If there is a serious design or code error here, please point it out. If not, it would be great to have a path forward on this. (I have also been the victim of my PRs hanging for months with little to no feedback, and I think this doesn't encourage open source contributions outside of the core team. If you (cosmos/icf/all in bits) want open source contributions from the community, it would be good to give a bit more feedback, direction, and support to such initiatives as this. I think some healthy feedback here could help evolve a very nice solution with input from all parties).
After discussing a bit with @zmanian , one thing that is clear to me now that wasn't before is that it will be a while before the Tendermint block structure is stable. So while it may be possible to restructure state smoothly without creating a new chain, rewriting Tendermint blocks is impossible because signatures will be invalid. So, this upgrade approach could still be useful for cases where an upgrade doesn't involve any breaking changes on the Tendermint side. We are planning to test this with a public https://github.com/regen-network/regen-ledger testnet, hopefully as soon as next week. An alternate idea proposed by @AFDudley for maintaining the continuity of transactions even when a new chain needs to start from height 0 is including some reference to the block hash and chain-id of the previous chain in the genesis file of the new chain. Then some sort of transaction indexer could build up a continuous transaction history. But again, it doesn't sound like this issue negates the usefulness of this "happy path" upgrade support in cases where it will work.
9a2b150
to
0889aa6
Compare
We discussed the plans for doing a test of this upgrade module with Regen Network's testnet in our community meeting today. The planned timing is as follows:
We also discussed governance deciding on a predetermined time vs upgrade signalling as proposed here: #1079 (comment). It was brought up that the downside of the signalling approach is that it sort of forces validators to race to get the upgrade and could produce anxiety because you can't predict how quickly others will upgrade and if you are in the last third you could get slashed for being slow. So a pre-determined time or block-height decided in the governance process seemed preferable to those present because it gives a sense of predictability and allows for planning. In Berlin, I chatted briefly with @ebuchman about Tendermint stability. While there are some important changes coming, it seems like there is the possibility and willingness to do this in such a way that "happy path" upgrades could still be possible. We could make that process easier to manage by using Prototool breaking change checker on the Tendermint block .proto definitions. |
4b99328
to
aba301f
Compare
4d95253
to
9009405
Compare
Please note that this PR will soon depend on #4724 in order to perform store migrations that can't be done within the ABCI methods (because the store won't even load without these migrations). #4724 handles cases when Also as a follow-up to our discussion last week @sunnya97, I want to point out that having a managing process that downloads new binaries would work well on top of this upgrade module approach. At a very basic level the managing process could watch stdout of the |
x/upgrade/internal/types/keys.go
Outdated
// QuerierKey is used to handle abci_query requests
QuerierKey = ModuleName
File is not gofmt-ed with -s (from gofmt)
x/upgrade/internal/keeper/keeper.go
Outdated
upgradeHandlers map[string]types.UpgradeHandler
}
File is not gofmt-ed with -s (from gofmt)
@bez all issues should be addressed now, along with a few more detected while integrating with gaia. Also, please check out cosmos/gaia#184 and run through the demo upgrade procedure (you will want a machine where |
ACK
This is a reopening of PR #3979 which was closed when the develop branch was removed. See that PR for previous discussion.
Upgrading live chains has been previously discussed in #1079 and there is a WIP spec in #2116. Neither of these provide an actual implementation of how to coordinate a live chain upgrade on the software level. My understanding and experience with Tendermint chains is that without a software coordination mechanism, validators can easily get into inconsistent state because they all need to be stopped at precisely the same point in the state machine cycle.
This PR provides a module for performing live chain upgrades that has been developed for Regen Ledger and tested against our testnets. It may or may not be what Cosmos SDK wants, but just sharing it in case it is...
This module attempts to take a minimalist approach to coordinating a live chain upgrade and can be integrated with any governance mechanism. Here are a few of its features:
BeginBlock when an upgrade is required and doesn't allow it to restart until new software with the expected upgrade is started
This PR doesn't currently include any integration with the Cosmos gov module, but that could be easily done if this upgrade method works for Cosmos hub.
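The halt-and-restart behaviour described above can be illustrated with a small, self-contained model. `ScheduleUpgrade`, `ClearUpgradePlan` and `UpgradeHandler` are names that appear in this PR, but their signatures here, along with `Keeper`, `SetUpgradeHandler` and `halted`, are simplified assumptions and not the module's actual API.

```go
package main

import "fmt"

// Plan names an upgrade and the height at which it takes effect.
type Plan struct {
	Name   string
	Height int64
}

// UpgradeHandler runs the migration for a named upgrade (signature assumed).
type UpgradeHandler func(plan Plan)

// Keeper is a simplified stand-in for the upgrade keeper.
type Keeper struct {
	plan     *Plan
	handlers map[string]UpgradeHandler
}

func NewKeeper() *Keeper { return &Keeper{handlers: map[string]UpgradeHandler{}} }

func (k *Keeper) ScheduleUpgrade(p Plan) { k.plan = &p }
func (k *Keeper) ClearUpgradePlan()      { k.plan = nil }

// SetUpgradeHandler is what new software would call at startup (hypothetical name).
func (k *Keeper) SetUpgradeHandler(name string, h UpgradeHandler) { k.handlers[name] = h }

// BeginBlock panics when a due upgrade has no registered handler, which
// stops old software; new software runs the handler and continues.
func (k *Keeper) BeginBlock(height int64) {
	if k.plan == nil || height < k.plan.Height {
		return
	}
	handler, ok := k.handlers[k.plan.Name]
	if !ok {
		panic(fmt.Sprintf("upgrade %q needed at height %d", k.plan.Name, k.plan.Height))
	}
	handler(*k.plan)
	k.plan = nil
}

// halted runs one BeginBlock and reports whether it panicked.
func halted(k *Keeper, height int64) (stopped bool) {
	defer func() {
		if recover() != nil {
			stopped = true
		}
	}()
	k.BeginBlock(height)
	return false
}

func main() {
	plan := Plan{Name: "test-upgrade", Height: 10}

	oldNode := NewKeeper() // old software: no handler registered
	oldNode.ScheduleUpgrade(plan)
	fmt.Println("old binary halts at 9: ", halted(oldNode, 9))
	fmt.Println("old binary halts at 10:", halted(oldNode, 10))

	newNode := NewKeeper() // new software registers the expected handler
	newNode.SetUpgradeHandler("test-upgrade", func(p Plan) { fmt.Println("migrating for", p.Name) })
	newNode.ScheduleUpgrade(plan)
	fmt.Println("new binary halts at 10:", halted(newNode, 10))
}
```

Because the plan stays in state until a handler runs, restarting the old binary hits the same panic again, which is what keeps validators from resuming on stale software.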