Skip to content

Commit

Permalink
docs: update Gaia processes (#2818)
Browse files Browse the repository at this point in the history
* update Gaia processes

* Update STATE-COMPATIBILITY.md

Co-authored-by: Simon Noetzlin <simon.ntz@gmail.com>

* update cosmos sdk link

---------

Co-authored-by: Simon Noetzlin <simon.ntz@gmail.com>
  • Loading branch information
mpoke and sainoe authored Nov 30, 2023
1 parent 3419724 commit 7ec28dd
Show file tree
Hide file tree
Showing 3 changed files with 277 additions and 35 deletions.
13 changes: 9 additions & 4 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -243,7 +243,8 @@ where:
- `section` is one of
`dependencies`, `improvements`, `features`, `bug-fixes`, `state-breaking`, `api-breaking`,
and _**if multiple apply, create multiple files**_;
and _**if multiple apply, create multiple files**_,
not necessarily with the same `short-description` or content;
- `pr-number` is the PR number;
- `short-description` is a short (4 to 6 word), hyphen separated description of the change;
- `component` is used for changes that affect one of the components defined in the [config](.changelog/config.toml), e.g., `tests`, `globalfee`.
Expand Down Expand Up @@ -272,14 +273,18 @@ where `${description}` is a detailed description of the changelog entry.
For example,
```bash
# add an entry for bumping IBC to v4.4.2
unclog add -i "2554-bump-ibc" -p 2554 -s "dependencies" -m "Bump [ibc-go](https://github.com/cosmos/ibc-go) to [v4.4.2](https://github.com/cosmos/ibc-go/releases/tag/v4.4.2)"
unclog add -i "2554-bump-ibc" -p 2554 -s dependencies -m "Bump [ibc-go](https://github.com/cosmos/ibc-go) to [v4.4.2](https://github.com/cosmos/ibc-go/releases/tag/v4.4.2)"

# add an entry for changing the global fee module;
# note that the entry is added to both state-breaking and api-breaking sections
unclog add -i "2424-params" -p 2424 -c globalfee -s "state-breaking" -m "Add \`bypass-min-fee-msg-types\` and \`maxTotalBypassMinFeeMsgGagUsage\` to globalfee params"
unclog add -i "2424-params" -p 2424 -c globalfee -s "api-breaking" -m "Add \`bypass-min-fee-msg-types\` and \`maxTotalBypassMinFeeMsgGagUsage\` to globalfee params"
unclog add -i "2424-params" -p 2424 -c globalfee -s state-breaking -m "Add \`bypass-min-fee-msg-types\` and \`maxTotalBypassMinFeeMsgGagUsage\` to globalfee params"
unclog add -i "2424-params" -p 2424 -c globalfee -s api-breaking -m "Add \`bypass-min-fee-msg-types\` and \`maxTotalBypassMinFeeMsgGagUsage\` to globalfee params"
```
**Note:** `unclog add` requires an editor. This can be set either by configuring
an `$EDITOR` environment variable or by manually specify an editor binary path
via the `--editor` flag.
**Note:** Changelog entries should answer the question: "what is important about this
change for users to know?" or "what problem does this solve for users?". It
should not simply be a reiteration of the title of the associated PR, unless the
Expand Down
88 changes: 57 additions & 31 deletions RELEASE_PROCESS.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,12 @@
# Release Process

- [Release Process](#release-process)
- [Breaking Changes](#breaking-changes)
- [Major Release Procedure](#major-release-procedure)
- [Changelog](#changelog)
- [Creating a new release branch](#creating-a-new-release-branch)
- [Cutting a new release](#cutting-a-new-release)
- [Update the changelog on main](#update-the-changelog-on-main)
- [Release Notes](#release-notes)
- [Tagging Procedure](#tagging-procedure)
- [Test building artifacts](#test-building-artifacts)
Expand All @@ -19,20 +21,19 @@ This document outlines the release process for Cosmos Hub (Gaia).
Gaia follows [semantic versioning](https://semver.org), but with the following deviations to account for state-machine and API breaking changes:

- State-machine breaking changes will result in an increase of the major version X (X.y.z).
- Emergency releases & API breaking changes (changes in node interactions e.g. queries) will result in an increase of the minor version Y (x.Y.z | x > 0).
- Emergency releases & API breaking changes will result in an increase of the minor version Y (x.Y.z | x > 0).
- All other changes will result in an increase of the patch version Z (x.y.Z | x > 0).

**State compatibility**:
It is critical for the patch releases to be state-machine compatible with prior releases in the same minor version.
For example, v9.2.1 must be compatible with v9.2.0.
**Note:** In case a major release is deprecated before ending up on the network (due to potential bugs),
it is replaced by a minor release (eg: `v14.0.0``v14.1.0`).
As a result, this minor release is considered state-machine breaking.

Minor releases will be handled on a case-by-case basis, but generally should only arise in case of safety or security issues that require a co-ordinated network upgrade without a governance process.
### Breaking Changes

This is to ensure determinism, i.e. that given the same input, the nodes will always produce the same output.
State-incompatibility is allowed for major upgrades because all nodes in the network perform it at the same time.
Therefore, after the upgrade, the nodes continue functioning in a deterministic way.
A change is considered to be ***state-machine breaking*** if it requires a coordinated upgrade for the network to preserve [state compatibility](./STATE-COMPATIBILITY.md).
Note that when bumping the dependencies of [Cosmos SDK](https://github.com/cosmos/cosmos-sdk), [IBC](https://github.com/cosmos/ibc-go), and [ICS](https://github.com/cosmos/interchain-security) we will only treat patch releases as non state-machine breaking.

**Note**: State-machine breaking changes include changes that impact the amount of gas needed to execute a transaction as it results in a different `apphash` after the code is executed.
A change is considered to be ***API breaking*** if it modifies the provided API. This includes events, queries, CLI interfaces.

## Major Release Procedure

Expand All @@ -41,7 +42,7 @@ A _major release_ is an increment of the first number (eg: `v9.1.0` → `v10.0.0
**Note**: Generally, PRs should target either `main` or a long-lived feature branch (see [CONTRIBUTING.md](./CONTRIBUTING.md#pull-requests)).
An exception are PRs open via the Github mergify integration (i.e., backported PRs).

* Once the team feels that `main` is _**feature complete**_, we create a `release/vY` branch (going forward known a release branch),
* Once the team feels that `main` is _**feature complete**_, we create a `release/vY` branch (going forward known as release branch),
where `Y` is the version number, with the minor and patch part substituted to `x` (eg: 11.x).
* Update the [GitHub mergify integration](./.mergify.yml) by adding instructions for automatically backporting commits from `main` to the `release/vY` using the `A:backport/vY` label.
* **PRs targeting directly a release branch can be merged _only_ when exceptional circumstances arise**.
Expand Down Expand Up @@ -76,43 +77,68 @@ Unreleased changes are collected on `main` in `.changelog/unreleased/`.
However, `.changelog/` on `main` contains also existing releases (e.g., `v10.0.0`).
Thus, when creating a new release branch (e.g., `release/v11.x`), the following steps are necessary:

- move to the release branch, e.g., `release/v11.x`
- create a new release branch, e.g., `release/v11.x`
```bash
git checkout main
git pull
git checkout -b release/v11.x
```
- delete all the sub-folders in `.changelog/` except `unreleased/`
```bash
find ./.changelog -mindepth 1 -maxdepth 1 -type d -not -name unreleased | xargs rm -r
```
- replace the content of `.changelog/epilogue.md` with the following text
```md
## Previous Versions
```md
## Previous Versions
[CHANGELOG of previous versions](https://github.com/cosmos/gaia/blob/main/CHANGELOG.md)
```
[CHANGELOG of previous versions](https://github.com/cosmos/gaia/blob/main/CHANGELOG.md)
```
- push the release branch upstream
```bash
git push
```

#### Cutting a new release

Before cutting a _**release candidate**_ (e.g., `v11.0.0-rc0`), the following steps are necessary:

- move to the release branch, e.g., `release/v11.x`
```bash
git checkout release/v11.x
```
- move all entries in ".changelog/unreleased" to the release version, e.g., `v11.0.0`, i.e.,
```bash
unclog release v11.0.0
```
- update the CHANGELOG.md, i.e.,
```bash
unclog build > CHANGELOG.md
```
```bash
unclog release v11.0.0
```
- update `CHANGELOG.md`, i.e.,
```bash
unclog build > CHANGELOG.md
```
- open a PR (from this new created branch) against the release branch, e.g., `release/v11.x`

Now you can cut the release candidate, e.g., v11.0.0-rc0 (follow the [Tagging Procedure](#tagging-procedure)).

❗Once the **final release** is cut, the new changelog section must be added to main:
#### Update the changelog on main

Once the **final release** is cut, the new changelog section must be added to main:

- checkout a new branch from the `main` branch, i.e.,
```bash
git checkout main
git pull
git checkout -b <username>/backport_changelog
```
- bring the new changelog section from the release branch into this branch, e.g.,
```bash
git checkout release/v11.x .changelog/v11.0.0
```
```bash
git checkout release/v11.x .changelog/v11.0.0
```
- remove duplicate entries that are both in `.changelog/unreleased/` and the new changelog section, e.g., `.changelog/v11.0.0`
- update the CHANGELOG.md
```bash
unclog build > CHANGELOG.md
```
- update `CHANGELOG.md`, i.e.,
```bash
unclog build > CHANGELOG.md
```
- open a PR (from this new created branch) against `main`

### Release Notes

Release notes will be created using the `RELEASE_NOTES.md` from the release branch.
Expand Down
211 changes: 211 additions & 0 deletions STATE-COMPATIBILITY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,211 @@
# State-Compatibility

- [State-Compatibility](#state-compatibility)
- [Scope](#scope)
- [Validating State-Compatibility](#validating-state-compatibility)
- [AppHash](#apphash)
- [LastResultsHash](#lastresultshash)
- [Major Sources of State-incompatibility](#major-sources-of-state-incompatibility)
- [Creating Additional State](#creating-additional-state)
- [Changing Proto Field Definitions](#changing-proto-field-definitions)
- [Returning Different Errors Given Same Input](#returning-different-errors-given-same-input)
- [Variability in Gas Usage](#variability-in-gas-usage)
- [Secondary Limitations To Keep In Mind](#secondary-limitations-to-keep-in-mind)
- [Network Requests to External Services](#network-requests-to-external-services)
- [Randomness](#randomness)
- [Parallelism and Shared State](#parallelism-and-shared-state)
- [Hardware Errors](#hardware-errors)


It is critical for the patch and minor releases to be state-machine compatible with prior releases in the same minor version.
For example, v13.0.2 must be state-machine compatible with v13.0.1.
_An exception are minor releases that are either emergency releases or replacements of deprecated major releases_.

This is to ensure **determinism**, i.e., given the same input, the nodes will always produce the same output.

State-incompatibility is allowed for major upgrades because all nodes in the network perform it at the same time. Therefore, after the upgrade, the nodes continue functioning in a deterministic way.

## Scope

The state-machine scope includes the following areas:

- All ICS messages including:
- Every msg's ValidateBasic method
- Every msg's MsgServer method
- Net gas usage, in all execution paths
- Error result returned
- State changes (namely every store write)
- AnteHandlers in "DeliverTx" mode
- All `BeginBlock`/`EndBlock` logic

The following are **NOT** in the state-machine scope:

- Events
- Queries that are not whitelisted
- CLI interfaces

## Validating State-Compatibility

CometBFT ensures state compatibility by validating a number of hashes that can be found [here](https://github.com/cometbft/cometbft/blob/v0.38.2/proto/tendermint/types/types.proto#L59-L66).

`AppHash` and `LastResultsHash` are the common sources of problems stemming from our work.
To avoid these problems, let's now examine how these hashes work.

### AppHash

**Note:** The following explanation is simplified for clarity.

An app hash is a hash of hashes of every store's Merkle root that is returned by ABCI's `Commit()` from Cosmos-SDK to CometBFT.
Cosmos-SDK [takes an app hash of the application state](https://github.com/cosmos/cosmos-sdk/blob/v0.47.6/store/rootmulti/store.go#L468), and propagates it to CometBFT which, in turn, compares it to the app hash of the rest of the network.
Then, CometBFT ensures that the app hash of the local node matches the app hash of the network.

### LastResultsHash

`LastResultsHash` is the root hash of all results from the transactions in the block returned by the ABCI's `DeliverTx`.

The [`LastResultsHash`](https://github.com/cometbft/cometbft/blob/v0.34.29/types/results.go#L47-L54)
in CometBFT [v0.34.29](https://github.com/cometbft/cometbft/releases/tag/v0.34.29) contains:

1. Tx `GasWanted`

2. Tx `GasUsed`
> `GasUsed` being Merkelized means that we cannot freely reorder methods that consume gas.
> We should also be careful of modifying any validation logic since changing the
> locations where we error or pass might affect transaction gas usage.
>
> There are plans to remove this field from being Merkelized in a subsequent CometBFT release,
> at which point we will have more flexibility in reordering operations / erroring.
3. Tx response `Data`

> The `Data` field includes the proto marshalled Tx response. Therefore, we cannot
> change these in patch releases.
4. Tx response `Code`

> This is an error code that is returned by the transaction flow. In the case of
> success, it is `0`. On a general error, it is `1`. Additionally, each module
> defines its custom error codes.
>
> As a result, it is important to avoid changing custom error codes or change
> the semantics of what is valid logic in transaction flows.
Note that all of the above stem from `DeliverTx` execution path, which handles:

- `AnteHandler`'s marked as deliver tx
- `msg.ValidateBasic`
- execution of a message from the message server

The `DeliverTx` return back to the CometBFT is defined [here](https://github.com/cosmos/cosmos-sdk/blob/d11196aad04e57812dbc5ac6248d35375e6603af/baseapp/abci.go#L293-L303).

## Major Sources of State-incompatibility

### Creating Additional State

By erroneously creating database entries that exist in Version A but not in
Version B, we can cause the app hash to differ across nodes running
these versions in the network. Therefore, this must be avoided.

### Changing Proto Field Definitions

For example, if we change a field that gets persisted to the database,
the app hash will differ across nodes running these versions in the network.

Additionally, this affects `LastResultsHash` because it contains a `Data` field that is a marshaled proto message.

### Returning Different Errors Given Same Input

```go
// Version A
func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error {
if amount.IsNegative() {
return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive or zero")
}
return nil
}
```

```go
// Version B
func (sk Keeper) validateAmount(ctx context.Context, amount math.Int) error {
if amount.IsNegative() || amount.IsZero() {
return sdkerrors.Wrap(sdkerrors.ErrInvalidRequest, "amount must be positive")
}
return nil
}
```

Note that now an amount of 0 can be valid in "Version A", but invalid in "Version B".
Therefore, if some nodes are running "Version A" and others are running "Version B",
the final app hash might not be deterministic.

Additionally, a different error message does not matter because it
is not included in any hash. However, an error code `sdkerrors.ErrInvalidRequest` does.
It translates to the `Code` field in the `LastResultsHash` and participates in
its validation.

### Variability in Gas Usage

For transaction flows (or any other flow that consumes gas), it is important
that the gas usage is deterministic.

Currently, gas usage is being Merklized in the state. As a result, reordering functions
becomes risky.

Suppose my gas limit is 2000 and 1600 is used up before entering
`someInternalMethod`. Consider the following:

```go
func someInternalMethod(ctx sdk.Context) {
object1 := readOnlyFunction1(ctx) # consumes 1000 gas
object2 := readOnlyFunction2(ctx) # consumes 500 gas
doStuff(ctx, object1, object2)
}
```

- It will run out of gas with `gasUsed = 2600` where 2600 getting merkelized
into the tx results.

```go
func someInternalMethod(ctx sdk.Context) {
object2 := readOnlyFunction2(ctx) # consumes 500 gas
object1 := readOnlyFunction1(ctx) # consumes 1000 gas
doStuff(ctx, object1, object2)
}
```

- It will run out of gas with `gasUsed = 2100` where 2100 is getting merkelized
into the tx results.

Therefore, we introduced a state-incompatibility by merklezing diverging gas
usage.

## Secondary Limitations To Keep In Mind

### Network Requests to External Services

It is critical to avoid performing network requests to external services
since it is common for services to be unavailable or rate-limit.

Imagine a service that returns exchange rates when clients query its HTTP endpoint.
This service might experience downtime or be restricted in some geographical areas.

As a result, nodes may get diverging responses where some
get successful responses while others errors, leading to state breakage.

### Randomness

Randomness cannot be used in the state machine, as the state machine must be deterministic.

**Note:** Iteration order over `map`s is non-deterministic, so to be deterministic
you must gather the keys, and sort them all prior to iterating over all values.

### Parallelism and Shared State

Threads and Goroutines might preempt differently in different hardware. Therefore,
they should be avoided for the sake of determinism. Additionally, it is hard
to predict when the multi-threaded state can be updated.

### Hardware Errors

This is out of the developer's control but is mentioned for completeness.

0 comments on commit 7ec28dd

Please sign in to comment.