Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: add a signalling mechanism for coordinated upgrades #2832

Merged
merged 39 commits into from
Nov 30, 2023

Conversation

cmwaters
Copy link
Contributor

@cmwaters cmwaters commented Nov 9, 2023

Overview

As per ADR-018, this PR extends the existing minimal upgrade module with a signalling mechanism.

Validators are expected to submit an on-chain message to signal that they wish to change version of the network. When a quorum has signalled the next version the upgrade module signals to the app that it is ready to switch to the next state machine. If the app version is not supported the node will panic. Note that this feature does not currently support downgrading. The only permissible app version change is the very next increment. To cancel the upgrade, the same validators need only to submit an on-chain message with the current version they are on.

There are some remaining design decisions that I would appreciate feedback on:

  1. The current default signalling quorum is 5/6. That is halfway between 2/3 and 3/3. The reason being is that during an upgrade process, it's impossible to maintain 1/3 fault tolerance for both omission and commission faults. The best we can achieve is 1/6. Imagine it being set to 2/3. Say 2/3 - 1 had v3 and wanted to move to v3 and the other 1/3 only had v2 and did not want to upgrade. A malicious node with just 1 voting power could halt the network by signalling for v3 and then refusing to vote afterwards. This context is all to ask the question whether we want this parameter to be modified by governance or hard-coded to the version. I initially have set it as a parameter but I think it may be better to have it hard-coded to the version (since we should practically never change this parameter)
  2. The way for tallying the votes is very inefficient. Every block, at the end, we loop through all votes, query the voting power of each node and tally accordingly. This is fine as a safe naive path but we may want to reduce the unnecessary load on the network. There are a few options:
    a. Use diffs. Whenever there is a voting power change to a validator we add a hook that fires with the old voting power and the new voting power and the upgrade module listens to this hook and updates the total accordingly. This would require modifying the staking module and may cause problems down the track with needing to maintain the staking fork. It's also more prone to errors.
    b. Use epochs. Tally all the signal votes once every 1000 blocks or once a day. There is no need to tally continuously and we spare computation by reducing the frequency that we need to update. Most major upgrades are relatively time agnostic and can be done at any point. Emergency major upgrades operate differently and would involve the network halting temporarily.

Some remaining work that will be done in follow up PRs:

  • add integration tests for the signalling upgrade process
  • add CLI commands and update readme
  • add support for automation (use separate upgrade keys with authz)

Checklist

  • New and updated code has appropriate documentation
  • New and updated code has new and/or updated testing
  • Required CI checks are passing
  • Visual proof for any user facing features like CLI or documentation updates
  • Linked issues closed with keywords

@evan-forbes evan-forbes added this to the v2 milestone Nov 13, 2023
Base automatically changed from cal/major-upgrades to main November 20, 2023 14:40
Copy link
Contributor

coderabbitai bot commented Nov 20, 2023

Walkthrough

Walkthrough

The changes involve a significant update to the upgrade module of a blockchain application, introducing new package imports, altering existing functions, and adding new ones. The upgrade logic has been enhanced to handle version upgrades more effectively, with new parameters, tallying logic, and end-block processing. The changes also include the introduction of protocol buffer files for defining gRPC services and messages related to upgrade parameters and version signaling.

Changes

File(s) Change Summary
app/app.go Import of upgradetypes and v1 packages, modification of ModuleEncodingRegisters, replacement of upgrade.StoreKey with upgradetypes.StoreKey, addition of upgrade.NewAppModule, and logic changes in EndBlocker function.
app/version.go, x/upgrade/types/params.go Addition of imports, new global variables for module version maps, addition of DefaultInitialVersion, and inclusion of upgrade module in version maps with consistency checks.
proto/celestia/upgrade/v1/... New protocol buffer files for celestia.upgrade.v1 package, defining gRPC services, request and response messages, and annotations for HTTP routing.
x/upgrade/ibc.go, x/upgrade/interfaces.go, x/upgrade/keeper.go, x/upgrade/module.go New functions and interfaces for IBC compatibility, StakingKeeper interface, updates to Keeper struct, and significant changes to upgrade package logic and control flow.
x/upgrade/tally_test.go Additions to test file for upgrade package, including new test functions, mockStakingKeeper struct, and initialization logic for testing.
x/upgrade/types/codec.go, x/upgrade/types/errors.go, x/upgrade/types/msgs.go Functions for codec registration, new error variable ErrInvalidVersion, and custom message type MsgSignalVersion for signaling upgrades.

Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • If you reply to a review comment from CodeRabbit, the bot will automatically respond.
  • To engage with CodeRabbit bot directly around the specific lines of code in the PR, mention @coderabbitai in your review comment
  • Note: Review comments are made on code diffs or files, not on the PR overview.
  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Note: For conversation with the bot, please use the review comments on code diffs or files.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

@cmwaters cmwaters requested a review from rootulp as a code owner November 28, 2023 13:39
x/upgrade/interfaces.go Show resolved Hide resolved
proto/celestia/upgrade/v1/tx.proto Outdated Show resolved Hide resolved
x/upgrade/types/msgs.go Show resolved Hide resolved
app/version.go Outdated Show resolved Hide resolved
x/upgrade/types/params.go Outdated Show resolved Hide resolved
x/upgrade/types/params.go Outdated Show resolved Hide resolved
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@celestia-bot celestia-bot requested a review from a team November 28, 2023 14:02
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
x/upgrade/types/params.go Outdated Show resolved Hide resolved
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
x/upgrade/types/params.go Outdated Show resolved Hide resolved
@rootulp
Copy link
Collaborator

rootulp commented Nov 28, 2023

Hmm proto-gen is still failing so I can push a commit that tries to resolve.

Copy link
Collaborator

@rootulp rootulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall really great work! I left a few questions to understand more. I think my only blocking feedback pertains to the threshold power calculation. I think it could be problematic if it rounds down.

x/upgrade/keeper.go Show resolved Hide resolved
Comment on lines 172 to 173
// upgrade to a new version. It converts the signal quorum parameter which
// is a number between 0 and math.MaxUint32 representing a fraction and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] I thought signalQuorum was a decimal potentially in the range of 0 to 1. How does a number between 0 and math.MaxUint32 get translated into a fraction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it was originally a uint32 but I changed it to use the sdk's decimal struct and forgot to update the documentation. Thanks for pointing it out

x/upgrade/keeper.go Show resolved Hide resolved
Comment on lines +88 to +94
store := sdkCtx.KVStore(k.storeKey)
iterator := store.Iterator(nil, nil)
defer iterator.Close()
for ; iterator.Valid(); iterator.Next() {
valAddress := sdk.ValAddress(iterator.Key())
power := k.stakingKeeper.GetLastValidatorPower(sdkCtx, valAddress)
version := VersionFromBytes(iterator.Value())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[extremely optional] wow does Cosmos SDK not provide any convenience methods for iterating over this store which seems like a map from validator address => validator version?

If there aren't any convenience methods, would it make sense to extract methods that enable using basic map operations on this so that the code can look something like:

for valAddress, version := range validatorToVersionMap() {
		power := k.stakingKeeper.GetLastValidatorPower(sdkCtx, valAddress)
		// ...
}

}

const (
consensusVersion uint64 = 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just directly taken from the existing upgrade module (since it shares the same name). It should probably actually be 3 instead of 2 now that I think of it

Comment on lines +10 to +11
// RegisterLegacyAminoCodec registers the upgrade types on the LegacyAmino codec.
func RegisterLegacyAminoCodec(cdc *codec.LegacyAmino) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] does this module need to support legacy Amino codec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so but I'm not entirely sure what legacy amino is for. IIRC, it's needed for ledger signing right? I'm not sure if our other modules support the legacy amino encoding

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Legacy Amino is needed for Ledger but in my experience the Cosmos SDK docs are poor at explaining when modules should implement this and if it's actually deprecated / slated for removal.

x/upgrade/types/errors.go Show resolved Hide resolved
})
require.NoError(t, err)
require.EqualValues(t, 30, res.VotingPower)
require.EqualValues(t, 100, res.Threshold)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] this usage of Threshold makes me realize that the variable name ThresholdPower may make the code clearer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup I can modify that

x/upgrade/tally_test.go Outdated Show resolved Hide resolved
x/upgrade/keeper.go Outdated Show resolved Hide resolved
@celestia-bot celestia-bot requested a review from a team November 29, 2023 14:20
Comment on lines +47 to +50
func (AppModuleBasic) RegisterGRPCGatewayRoutes(clientCtx client.Context, mux *runtime.ServeMux) {
if err := types.RegisterQueryHandlerClient(context.Background(), mux, types.NewQueryClient(clientCtx)); err != nil {
panic(err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider handling the error returned by RegisterQueryHandlerClient without causing a panic, to improve the robustness of the application.

Comment on lines +92 to +95
// LegacyQuerierHandler registers a query handler to respond to the module-specific queries
func (am AppModule) LegacyQuerierHandler(_ *codec.LegacyAmino) sdk.Querier {
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for LegacyQuerierHandler should be updated to indicate that it is a no-op.

- // LegacyQuerierHandler registers a query handler to respond to the module-specific queries
+ // LegacyQuerierHandler is a no-op.

Committable suggestion

IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
// LegacyQuerierHandler registers a query handler to respond to the module-specific queries
func (am AppModule) LegacyQuerierHandler(_ *codec.LegacyAmino) sdk.Querier {
return nil
}
// LegacyQuerierHandler is a no-op.
func (am AppModule) LegacyQuerierHandler(_ *codec.LegacyAmino) sdk.Querier {
return nil
}

Copy link
Collaborator

@rootulp rootulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Excited for this.

Documenting from a call a future improvement: clearing out the list of validator votes for a version after a successful upgrade. That way shouldUpgrade doesn't return true for subsequent blocks.

func (k Keeper) GetVotingPowerThreshold(ctx sdk.Context) sdkmath.Int {
quorum := k.SignalQuorum(ctx)
// contract: totalVotingPower should not exceed MaxUit64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[typo]

Suggested change
// contract: totalVotingPower should not exceed MaxUit64
// contract: totalVotingPower should not exceed MaxUint64

totalVotingPower := k.stakingKeeper.GetLastTotalPower(ctx)
return quorum.MulInt(totalVotingPower).RoundInt()
thresholdFraction := SignalThreshold(ctx.BlockHeader().Version.App)
return totalVotingPower.MulRaw(thresholdFraction.Numerator).QuoRaw(thresholdFraction.Denominator)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to investigate / enforce that this never overflows

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref: #2878

Copy link
Member

@evan-forbes evan-forbes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dope!

to document here notes from a sync discussion, we could also use a "crank" tx instead of checking if the upgrade is ready at some interval.

added #2875 and #2874 so I think we can merge and continue

Comment on lines +22 to +25
type Fraction struct {
Numerator int64
Denominator int64
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be done later, but could we have a quick doc here for this type since its exported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah okay, maybe I just make it private. There's no need to have it public as it's not queryable


// RegisterLegacyAminoCodec registers the upgrade types on the LegacyAmino codec.
func RegisterLegacyAminoCodec(cdc *codec.LegacyAmino) {
cdc.RegisterConcrete(upgradetypes.Plan{}, "cosmos-sdk/Plan", nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might just want to leave a mental note to try out signing the tx on a ledger to make sure that we don't need to register the new Msg type with the legacy codec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ref: #2876

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants