Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: speed up crisis invariant check (juno genesis) #12886

Closed

Conversation

Cashmaney
Copy link

@Cashmaney Cashmaney commented Aug 10, 2022

Description

Looking at the time it takes to handle the crisis invariant check when starting a node, the vast majority of time taken by the invariant check is due to the can-withdraw function, which takes ages because it sequentially loops every validator and every delegator. I switched the looping to be concurrent instead, resulting in a pretty huge speedup.

This check isn't part of consensus or anything, so non-deterministic concurrency should be fair game here seeing as how rewards are independent between validators.

I'm not super familiar with the internals of x/crisis, but it doesn't seem to me that I broke anything in doing this - behaviour may differ since the code will panic on the first invariant error it encounters (which may be different between executions if there are multiple), but I feel this is a compromise that we should be willing to make. Still, I might be missing something here, so feel free to correct me if that isn't the case.

When testing using Pheonix-2 genesis file

On my 24-core AMD Ryzen 9 3900x:
Before:

TBD - will update when it finishes

After:

12:27PM INF starting node with ABCI Tendermint in-process
12:28PM INF Starting multiAppConn service impl=multiAppConn module=proxy
12:28PM INF Starting localClient service connection=query impl=localClient module=abci-client
12:28PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
12:28PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
12:28PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
12:28PM INF Starting EventBus service impl=EventBus module=events
12:28PM INF Starting PubSub service impl=PubSub module=pubsub
12:28PM INF Starting IndexerService service impl=IndexerService module=txindex
12:28PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0
12:28PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
12:28PM INF asserting crisis invariants inv=1/11 module=x/crisis name=distribution/nonnegative-outstanding
12:28PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/can-withdraw
12:29PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/reference-count
12:29PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/module-account
12:29PM INF asserting crisis invariants inv=5/11 module=x/crisis name=bank/nonnegative-outstanding
12:29PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/total-supply
12:29PM INF asserting crisis invariants inv=7/11 module=x/crisis name=gov/module-account
12:29PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
12:29PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
12:29PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
12:29PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
12:29PM INF asserted all invariants duration=49032.084869 height=4136532 module=x/crisis

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • included the correct type prefix in the PR title
  • added ! to the type prefix if API or client breaking change
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules
  • included the necessary unit and integration tests
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

… a long time running sequentially for each validator
@Cashmaney Cashmaney requested a review from a team as a code owner August 10, 2022 10:27
@github-actions github-actions bot added the C:x/distribution distribution module related label Aug 10, 2022
@alexanderbez
Copy link
Contributor

This is a substantially different approach taken from #12885 (which I don't really understand and fear is too complicated).

I feel like running these calls in parallel should be OK. Can you please post benchmarks pre/post?

@Cashmaney
Copy link
Author

Cashmaney commented Aug 10, 2022

Yeah, it's actually pretty cool to see how different the approaches were.

I posted the bench of the after in the PR (the entire crisis check took about a minute). I gave up on running the before check after about an hour, but I'll let it run for a bit and update the PR later on when it's over

@alexanderbez
Copy link
Contributor

@Cashmaney I meant native go benchmarks that should you the ns/ms delta where you just run the invariant function against some state, preferably Juno's :)

@Cashmaney
Copy link
Author

@alexanderbez You mean like BenchmarkInvariants? The problem is taking a real-world state and creating such a complex context in a test. Is there an easy way to pull a genesis file to simapp and run a test on that? I went with a simpler approach - debug prints inside the function😅

Post:

2022/08/10 17:00:11 CanWithdrawInvariant check took 53.20056586s

Pre:

Now on 1 hour+ ...

I also have the two binaries here - https://github.com/Cashmaney/cosmos-sdk/releases/tag/concurrent-invariant-check

@alexanderbez
Copy link
Contributor

@Cashmaney yes, take the genesis state and load the file into memory and populate the distribution and staking keepers. Then you can run benchmarks off of that.

@amaury1093 amaury1093 self-assigned this Aug 10, 2022
@faddat
Copy link
Contributor

faddat commented Aug 10, 2022

thank you

@Cashmaney
Copy link
Author

Cashmaney commented Aug 10, 2022

Couldn't figure out how to get the simapp working properly with the Juno state, so I just ran another real-world bench on gaia with the cosmoshub-4 genesis state:

Post fix:

2022/08/10 22:24:55 CanWithdrawInvariant took: 10.956378769s

Pre fix:

2022/08/10 23:47:04 CanWithdrawInvariant check took 1h17m46.518379309s

@blazeroni
Copy link
Contributor

hey @Cashmaney, nice solution!

From my testing, the primary bottleneck in the original code is creating the iterator in IterateValidatorSlashEventsBetween. It takes about 180ms in the original, single-threaded code. My solution focused on caching those results.

This solution doesn't directly address that problem, but seems to have fixed it anyway. I checked how long those calls take in this PR and its well under 1ms.

I don't know why the call behavior would change so drastically by wrapping parent calls in go routines. I wonder if anyone can shed some light on what's going on here.

@Cashmaney
Copy link
Author

hey @Cashmaney, nice solution!

From my testing, the primary bottleneck in the original code is creating the iterator in IterateValidatorSlashEventsBetween. It takes about 180ms in the original, single-threaded code. My solution focused on caching those results.

This solution doesn't directly address that problem, but seems to have fixed it anyway. I checked how long those calls take in this PR and its well under 1ms.

I don't know why the call behavior would change so drastically by wrapping parent calls in go routines. I wonder if anyone can shed some light on what's going on here.

Hey. Thanks - it was really funny to see two radically different solutions posted basically at the same time.

You seemed to have dived a bit deeper than I did. I'm not even sure how and where that function is called from, so I'll leave it to someone more knowledgeable in both the sdk and Go to weigh in on that

@alexanderbez
Copy link
Contributor

@Cashmaney there's no need to create a SimApp, you just need to benchmark CanWithdrawInvariant and to do that you only need a Keeper. I would just create a keeper and call InitGenesis on it perhaps?

Comment on lines +91 to +110
go func(val stakingtypes.ValidatorI, wg *sync.WaitGroup) {
defer wg.Done()

_, _ = k.WithdrawValidatorCommission(ctx, val.GetOperator())

delegationAddrs, ok := valDelegationAddrs[val.GetOperator().String()]
if ok {
for _, delAddr := range delegationAddrs {
if _, err := k.WithdrawDelegationRewards(ctx, delAddr, val.GetOperator()); err != nil {
panic(err)
}
}
}
}

remaining = k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
if len(remaining) > 0 && remaining[0].Amount.IsNegative() {
return true
}
remainingRewards := k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
if len(remainingRewards) > 0 && remainingRewards[0].Amount.IsNegative() {
broken = true
remaining = k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
}
}(val, wg)

Check notice

Code scanning / CodeQL

Spawning a Go routine

Spawning a Go routine may be a possible source of non-determinism
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marbar3778 is there an annotation we can add to comments/code to suppress such warnings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found one for codeql (the crypto.com version)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am checking now though

@kocubinski
Copy link
Member

kocubinski commented Aug 16, 2022

I'm in favor of #12885 since it's a general case solution. @alexanderbez I don't think we should be afraid to improve old code, I made an analysis of the cache iterator and it seems OK to me.

@blazeroni
Copy link
Contributor

@kocubinski to be fair, @alexanderbez saw a much different version of the code that was merely a workaround for the actual problem. The assessment of it being too complicated was probably correct.

@kocubinski
Copy link
Member

a much different version of the code that was merely a workaround for the actual problem

Ah, I see there was a force push. This is my first look at the PR, nice simplification.

This solution doesn't directly address that problem, but seems to have fixed it anyway. I checked how long those calls take in this PR and its well under 1ms.

I'd like to uncover why we see these perf improvements here too, but in end I don't see why we can't merge both.

@blazeroni
Copy link
Contributor

@kocubinski Based on my testing, it seems the goroutines help the code find data that will shrink/clear the unsorted cache, allowing the rest of the processing to go much more quickly. How fast that happens depends on the number of cpu cores, and varies slightly from run to run.

mergify bot pushed a commit that referenced this pull request Aug 18, 2022
…2885)

This change fixes a bounty by the Juno team.  Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 30 seconds.  See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the `can-withdraw` invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337.  Iterators have a chain of parents and in this case creates an iterator from the `cachekv` store.  For the genesis file, it has a cache of 500,000+ unsorted entries, which are sorted as strings here: https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L314.  Each delegation from `can-withdraw` uses this cache and many of the cache checks miss or are a very small range.  This means very few entries get removed from the unsorted cache and they have to be re-sorted on the next call.  With a full cache it takes about 180ms on my machine to sort them.

This change introduce a minimum number of entries that will get processed and removed from the unsorted list. It's set at the same value that directs the code to sort them in the first place.  This ensures the unsorted values get removed in a relative short amount of time, and amortizes the cost to ensure an individual check does not have to process the entire cache.

## Benchmarks
On running the benchmarks included in this change produces:
```shell
name                    old time/op    new time/op    delta
LargeUnsortedMisses-32     21.2s ± 9%      0.0s ± 1%   -99.91%  (p=0.000 n=20+17)

name                    old alloc/op   new alloc/op   delta
LargeUnsortedMisses-32    1.64GB ± 0%    0.00GB ± 0%   -99.83%  (p=0.000 n=19+19)

name                    old allocs/op  new allocs/op  delta
LargeUnsortedMisses-32     20.0k ± 0%     41.1k ± 0%  +105.23%  (p=0.000 n=19+20)
```

## Invariant checks results
This is what the invariant checks for Juno look like with this change (on a Hetzner AX101):

```shell
INF starting node with ABCI Tendermint in-process
4:11PM INF Starting multiAppConn service impl=multiAppConn module=proxy
4:11PM INF Starting localClient service connection=query impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
4:11PM INF Starting EventBus service impl=EventBus module=events
4:11PM INF Starting PubSub service impl=PubSub module=pubsub
4:11PM INF Starting IndexerService service impl=IndexerService module=txindex
4:11PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0-36-g8fd6f16
4:11PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
4:12PM INF asserting crisis invariants inv=1/11 module=x/crisis name=gov/module-account
4:12PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/can-withdraw
4:12PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/reference-count
4:12PM INF asserting crisis invariants inv=5/11 module=x/crisis name=distribution/module-account
4:12PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=7/11 module=x/crisis name=bank/total-supply
4:12PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
4:12PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
4:12PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
4:12PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
4:12PM INF asserted all invariants duration=28383.559601 height=4136532 module=x/crisis
```

## Alternatives
There is another PR which fixes this problem for the Juno genesis file #12886. However, because of its concurrent nature, it happens to hit a large range relatively early, clearing the unsorted entries and allowing the rest of the checks to not sort it.
mergify bot pushed a commit that referenced this pull request Aug 18, 2022
…2885)

This change fixes a bounty by the Juno team.  Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 30 seconds.  See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the `can-withdraw` invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337.  Iterators have a chain of parents and in this case creates an iterator from the `cachekv` store.  For the genesis file, it has a cache of 500,000+ unsorted entries, which are sorted as strings here: https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L314.  Each delegation from `can-withdraw` uses this cache and many of the cache checks miss or are a very small range.  This means very few entries get removed from the unsorted cache and they have to be re-sorted on the next call.  With a full cache it takes about 180ms on my machine to sort them.

This change introduce a minimum number of entries that will get processed and removed from the unsorted list. It's set at the same value that directs the code to sort them in the first place.  This ensures the unsorted values get removed in a relative short amount of time, and amortizes the cost to ensure an individual check does not have to process the entire cache.

## Benchmarks
On running the benchmarks included in this change produces:
```shell
name                    old time/op    new time/op    delta
LargeUnsortedMisses-32     21.2s ± 9%      0.0s ± 1%   -99.91%  (p=0.000 n=20+17)

name                    old alloc/op   new alloc/op   delta
LargeUnsortedMisses-32    1.64GB ± 0%    0.00GB ± 0%   -99.83%  (p=0.000 n=19+19)

name                    old allocs/op  new allocs/op  delta
LargeUnsortedMisses-32     20.0k ± 0%     41.1k ± 0%  +105.23%  (p=0.000 n=19+20)
```

## Invariant checks results
This is what the invariant checks for Juno look like with this change (on a Hetzner AX101):

```shell
INF starting node with ABCI Tendermint in-process
4:11PM INF Starting multiAppConn service impl=multiAppConn module=proxy
4:11PM INF Starting localClient service connection=query impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
4:11PM INF Starting EventBus service impl=EventBus module=events
4:11PM INF Starting PubSub service impl=PubSub module=pubsub
4:11PM INF Starting IndexerService service impl=IndexerService module=txindex
4:11PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0-36-g8fd6f16
4:11PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
4:12PM INF asserting crisis invariants inv=1/11 module=x/crisis name=gov/module-account
4:12PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/can-withdraw
4:12PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/reference-count
4:12PM INF asserting crisis invariants inv=5/11 module=x/crisis name=distribution/module-account
4:12PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=7/11 module=x/crisis name=bank/total-supply
4:12PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
4:12PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
4:12PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
4:12PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
4:12PM INF asserted all invariants duration=28383.559601 height=4136532 module=x/crisis
```

## Alternatives
There is another PR which fixes this problem for the Juno genesis file #12886. However, because of its concurrent nature, it happens to hit a large range relatively early, clearing the unsorted entries and allowing the rest of the checks to not sort it.

(cherry picked from commit 4fc1f73)

# Conflicts:
#	CHANGELOG.md
mergify bot pushed a commit that referenced this pull request Aug 18, 2022
…2885)

This change fixes a bounty by the Juno team.  Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 30 seconds.  See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the `can-withdraw` invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337.  Iterators have a chain of parents and in this case creates an iterator from the `cachekv` store.  For the genesis file, it has a cache of 500,000+ unsorted entries, which are sorted as strings here: https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L314.  Each delegation from `can-withdraw` uses this cache and many of the cache checks miss or are a very small range.  This means very few entries get removed from the unsorted cache and they have to be re-sorted on the next call.  With a full cache it takes about 180ms on my machine to sort them.

This change introduce a minimum number of entries that will get processed and removed from the unsorted list. It's set at the same value that directs the code to sort them in the first place.  This ensures the unsorted values get removed in a relative short amount of time, and amortizes the cost to ensure an individual check does not have to process the entire cache.

## Benchmarks
On running the benchmarks included in this change produces:
```shell
name                    old time/op    new time/op    delta
LargeUnsortedMisses-32     21.2s ± 9%      0.0s ± 1%   -99.91%  (p=0.000 n=20+17)

name                    old alloc/op   new alloc/op   delta
LargeUnsortedMisses-32    1.64GB ± 0%    0.00GB ± 0%   -99.83%  (p=0.000 n=19+19)

name                    old allocs/op  new allocs/op  delta
LargeUnsortedMisses-32     20.0k ± 0%     41.1k ± 0%  +105.23%  (p=0.000 n=19+20)
```

## Invariant checks results
This is what the invariant checks for Juno look like with this change (on a Hetzner AX101):

```shell
INF starting node with ABCI Tendermint in-process
4:11PM INF Starting multiAppConn service impl=multiAppConn module=proxy
4:11PM INF Starting localClient service connection=query impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
4:11PM INF Starting EventBus service impl=EventBus module=events
4:11PM INF Starting PubSub service impl=PubSub module=pubsub
4:11PM INF Starting IndexerService service impl=IndexerService module=txindex
4:11PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0-36-g8fd6f16
4:11PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
4:12PM INF asserting crisis invariants inv=1/11 module=x/crisis name=gov/module-account
4:12PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/can-withdraw
4:12PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/reference-count
4:12PM INF asserting crisis invariants inv=5/11 module=x/crisis name=distribution/module-account
4:12PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=7/11 module=x/crisis name=bank/total-supply
4:12PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
4:12PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
4:12PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
4:12PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
4:12PM INF asserted all invariants duration=28383.559601 height=4136532 module=x/crisis
```

## Alternatives
There is another PR which fixes this problem for the Juno genesis file #12886. However, because of its concurrent nature, it happens to hit a large range relatively early, clearing the unsorted entries and allowing the rest of the checks to not sort it.

(cherry picked from commit 4fc1f73)

# Conflicts:
#	CHANGELOG.md
@tac0turtle
Copy link
Member

@kocubinski @alexanderbez do we still want this pr as well?

tac0turtle added a commit that referenced this pull request Aug 19, 2022
…ckport #12885) (#12961)

* perf: Amortize clearing unsorted cache entries (Juno genesis fix) (#12885)

This change fixes a bounty by the Juno team.  Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 30 seconds.  See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the `can-withdraw` invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337.  Iterators have a chain of parents and in this case creates an iterator from the `cachekv` store.  For the genesis file, it has a cache of 500,000+ unsorted entries, which are sorted as strings here: https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L314.  Each delegation from `can-withdraw` uses this cache and many of the cache checks miss or are a very small range.  This means very few entries get removed from the unsorted cache and they have to be re-sorted on the next call.  With a full cache it takes about 180ms on my machine to sort them.

This change introduce a minimum number of entries that will get processed and removed from the unsorted list. It's set at the same value that directs the code to sort them in the first place.  This ensures the unsorted values get removed in a relative short amount of time, and amortizes the cost to ensure an individual check does not have to process the entire cache.

## Benchmarks
On running the benchmarks included in this change produces:
```shell
name                    old time/op    new time/op    delta
LargeUnsortedMisses-32     21.2s ± 9%      0.0s ± 1%   -99.91%  (p=0.000 n=20+17)

name                    old alloc/op   new alloc/op   delta
LargeUnsortedMisses-32    1.64GB ± 0%    0.00GB ± 0%   -99.83%  (p=0.000 n=19+19)

name                    old allocs/op  new allocs/op  delta
LargeUnsortedMisses-32     20.0k ± 0%     41.1k ± 0%  +105.23%  (p=0.000 n=19+20)
```

## Invariant checks results
This is what the invariant checks for Juno look like with this change (on a Hetzner AX101):

```shell
INF starting node with ABCI Tendermint in-process
4:11PM INF Starting multiAppConn service impl=multiAppConn module=proxy
4:11PM INF Starting localClient service connection=query impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
4:11PM INF Starting EventBus service impl=EventBus module=events
4:11PM INF Starting PubSub service impl=PubSub module=pubsub
4:11PM INF Starting IndexerService service impl=IndexerService module=txindex
4:11PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0-36-g8fd6f16
4:11PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
4:12PM INF asserting crisis invariants inv=1/11 module=x/crisis name=gov/module-account
4:12PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/can-withdraw
4:12PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/reference-count
4:12PM INF asserting crisis invariants inv=5/11 module=x/crisis name=distribution/module-account
4:12PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=7/11 module=x/crisis name=bank/total-supply
4:12PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
4:12PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
4:12PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
4:12PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
4:12PM INF asserted all invariants duration=28383.559601 height=4136532 module=x/crisis
```

## Alternatives
There is another PR which fixes this problem for the Juno genesis file #12886. However, because of its concurrent nature, it happens to hit a large range relatively early, clearing the unsorted entries and allowing the rest of the checks to not sort it.

(cherry picked from commit 4fc1f73)

# Conflicts:
#	CHANGELOG.md

* fix conflict

Co-authored-by: blazeroni <blazeroni@gmail.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>
Co-authored-by: Marko <marbar3778@yahoo.com>
@alexanderbez
Copy link
Contributor

@marbar3778 yeah I don't see why not actually. Let me give this one quick look again, and we can merge 👍

Any objections @kocubinski?

Copy link
Contributor

@alexanderbez alexanderbez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a few minor nit comments

x/distribution/keeper/invariants.go Outdated Show resolved Hide resolved
x/distribution/keeper/invariants.go Outdated Show resolved Hide resolved
Comment on lines +91 to +110
go func(val stakingtypes.ValidatorI, wg *sync.WaitGroup) {
defer wg.Done()

_, _ = k.WithdrawValidatorCommission(ctx, val.GetOperator())

delegationAddrs, ok := valDelegationAddrs[val.GetOperator().String()]
if ok {
for _, delAddr := range delegationAddrs {
if _, err := k.WithdrawDelegationRewards(ctx, delAddr, val.GetOperator()); err != nil {
panic(err)
}
}
}
}

remaining = k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
if len(remaining) > 0 && remaining[0].Amount.IsNegative() {
return true
}
remainingRewards := k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
if len(remainingRewards) > 0 && remainingRewards[0].Amount.IsNegative() {
broken = true
remaining = k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
}
}(val, wg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marbar3778 is there an annotation we can add to comments/code to suppress such warnings?

@Cashmaney
Copy link
Author

LGTM! Just a few minor nit comments

Fixed:)

@alexanderbez
Copy link
Contributor

Would like another approval prior to merge :)

Comment on lines +91 to +110
go func(val stakingtypes.ValidatorI, wg *sync.WaitGroup) {
defer wg.Done()

_, _ = k.WithdrawValidatorCommission(ctx, val.GetOperator())

delegationAddrs, ok := valDelegationAddrs[val.GetOperator().String()]
if ok {
for _, delAddr := range delegationAddrs {
if _, err := k.WithdrawDelegationRewards(ctx, delAddr, val.GetOperator()); err != nil {
panic(err)
}
}
}
}

remaining = k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
if len(remaining) > 0 && remaining[0].Amount.IsNegative() {
return true
}
remainingRewards := k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
if len(remainingRewards) > 0 && remainingRewards[0].Amount.IsNegative() {
broken = true
remaining = k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
}
}(val, wg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found one for codeql (the crypto.com version)

Comment on lines +91 to +110
go func(val stakingtypes.ValidatorI, wg *sync.WaitGroup) {
defer wg.Done()

_, _ = k.WithdrawValidatorCommission(ctx, val.GetOperator())

delegationAddrs, ok := valDelegationAddrs[val.GetOperator().String()]
if ok {
for _, delAddr := range delegationAddrs {
if _, err := k.WithdrawDelegationRewards(ctx, delAddr, val.GetOperator()); err != nil {
panic(err)
}
}
}
}

remaining = k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
if len(remaining) > 0 && remaining[0].Amount.IsNegative() {
return true
}
remainingRewards := k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
if len(remainingRewards) > 0 && remainingRewards[0].Amount.IsNegative() {
broken = true
remaining = k.GetValidatorOutstandingRewardsCoins(ctx, val.GetOperator())
}
}(val, wg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am checking now though

@faddat
Copy link
Contributor

faddat commented Aug 21, 2022

Oh and teeny nit... I will soon post a PR that fumpts this :)

Cashmaney#1

@faddat
Copy link
Contributor

faddat commented Aug 21, 2022

@alexanderbez hey, I checked on codeql.

Answer is no.

github/codeql#9298

@codecov
Copy link

codecov bot commented Aug 31, 2022

Codecov Report

Merging #12886 (bad3bab) into main (7728516) will decrease coverage by 0.93%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #12886      +/-   ##
==========================================
- Coverage   54.10%   53.16%   -0.94%     
==========================================
  Files         662      641      -21     
  Lines       56426    54885    -1541     
==========================================
- Hits        30528    29181    -1347     
+ Misses      23515    23378     -137     
+ Partials     2383     2326      -57     
Impacted Files Coverage Δ
x/distribution/keeper/invariants.go 0.00% <0.00%> (ø)
depinject/one_per_module.go
depinject/internal/codegen/import.go
depinject/internal/codegen/file.go
depinject/simple.go
depinject/supply.go
depinject/module_dep.go
depinject/module_key.go
depinject/internal/codegen/ident.go
... and 15 more

@tac0turtle
Copy link
Member

@alexanderbez are we merging this?

MissingNO57 pushed a commit to fetchai/cosmos-sdk that referenced this pull request Sep 16, 2022
…ckport #12885) (#12961)

* perf: Amortize clearing unsorted cache entries (Juno genesis fix) (#12885)

This change fixes a bounty by the Juno team.  Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 30 seconds.  See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the `can-withdraw` invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337.  Iterators have a chain of parents and in this case creates an iterator from the `cachekv` store.  For the genesis file, it has a cache of 500,000+ unsorted entries, which are sorted as strings here: https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L314.  Each delegation from `can-withdraw` uses this cache and many of the cache checks miss or are a very small range.  This means very few entries get removed from the unsorted cache and they have to be re-sorted on the next call.  With a full cache it takes about 180ms on my machine to sort them.

This change introduce a minimum number of entries that will get processed and removed from the unsorted list. It's set at the same value that directs the code to sort them in the first place.  This ensures the unsorted values get removed in a relative short amount of time, and amortizes the cost to ensure an individual check does not have to process the entire cache.

## Benchmarks
On running the benchmarks included in this change produces:
```shell
name                    old time/op    new time/op    delta
LargeUnsortedMisses-32     21.2s ± 9%      0.0s ± 1%   -99.91%  (p=0.000 n=20+17)

name                    old alloc/op   new alloc/op   delta
LargeUnsortedMisses-32    1.64GB ± 0%    0.00GB ± 0%   -99.83%  (p=0.000 n=19+19)

name                    old allocs/op  new allocs/op  delta
LargeUnsortedMisses-32     20.0k ± 0%     41.1k ± 0%  +105.23%  (p=0.000 n=19+20)
```

## Invariant checks results
This is what the invariant checks for Juno look like with this change (on a Hetzner AX101):

```shell
INF starting node with ABCI Tendermint in-process
4:11PM INF Starting multiAppConn service impl=multiAppConn module=proxy
4:11PM INF Starting localClient service connection=query impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
4:11PM INF Starting EventBus service impl=EventBus module=events
4:11PM INF Starting PubSub service impl=PubSub module=pubsub
4:11PM INF Starting IndexerService service impl=IndexerService module=txindex
4:11PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0-36-g8fd6f16
4:11PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
4:12PM INF asserting crisis invariants inv=1/11 module=x/crisis name=gov/module-account
4:12PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/can-withdraw
4:12PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/reference-count
4:12PM INF asserting crisis invariants inv=5/11 module=x/crisis name=distribution/module-account
4:12PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=7/11 module=x/crisis name=bank/total-supply
4:12PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
4:12PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
4:12PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
4:12PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
4:12PM INF asserted all invariants duration=28383.559601 height=4136532 module=x/crisis
```

## Alternatives
There is another PR which fixes this problem for the Juno genesis file cosmos/cosmos-sdk#12886. However, because of its concurrent nature, it happens to hit a large range relatively early, clearing the unsorted entries and allowing the rest of the checks to not sort it.

(cherry picked from commit 4fc1f73)

# Conflicts:
#	CHANGELOG.md

* fix conflict

Co-authored-by: blazeroni <blazeroni@gmail.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>
Co-authored-by: Marko <marbar3778@yahoo.com>
@alexanderbez
Copy link
Contributor

We already merged #12885 -- we can close this PR.

MissingNO57 added a commit to fetchai/cosmos-sdk that referenced this pull request Sep 27, 2022
* fix: update x/mint parameter validation (backport #12384) (#12396)

* chore: optimize get last completed upgrade (#12268)

* feat: improve GetLastCompletedUpgrade

* rename

* use var block

* fix: Simulation is not deterministic due to GenTx (backport #12374) (#12437)

* fix: use go install instead of go get in makefile (#12435)

* chore: fumpt sdk v45 series #12442

* feat: Move AppModule.BeginBlock and AppModule.EndBlock to extension interfaces (backport #12603) (#12638)

* feat: Move AppModule.BeginBlock and AppModule.EndBlock to extension interfaces (#12603)

## Description
Most modules right now have a no-op for AppModule.BeginBlock and AppModule.EndBlock. We should move these methods off the AppModule interface so we have less deadcode, and instead move them to extension interfaces.

1. I added `BeginBlockAppModule` and `EndBlockAppModule` interface.
2. Remove the dead-code from modules that do no implement them
3. Add type casting in the the module code to use the new interface

Closes: #12462

* feat: add message index event attribute to authz message execution (backport #12668) (#12673)

* chore(store): upgrade iavl to v0.19.0 (backport #12626) (#12697)

* feat: Add `GetParamSetIfExists` to prevent panic on breaking param changes (#12724)

* feat: Add convenience method for constructing key to access account's balance for a given denom (backport #12674) (#12745)

* feat: Add convenience method for constructing key to access account's balance for a given denom (#12674)

This PR adds a convenience method for constructing the key necessary to query for the account's balance of a given denom.

I ran into this issue since we are using ABCI query now to perform balance requests because we are also requesting merkle proofs for the returned balance [here](https://github.com/celestiaorg/celestia-node/pull/911/files#diff-0ee31f5a7bd88e9f758e6bebdf3ee36365519e55a451098d9638c39afe5eac42R144).

It would be nice to have a definitive convenience method for constructing the key.

[Ref.](github.com/celestiaorg/celestia-node/pull/911)

(cherry picked from commit a1777a8)

# Conflicts:
#	CHANGELOG.md
#	x/bank/legacy/v043/store.go
#	x/bank/types/keys.go

* Update CHANGELOG.md

* fix conflict

Co-authored-by: rene <41963722+renaynay@users.noreply.github.com>
Co-authored-by: Marko <marbar3778@yahoo.com>

* chore: bump tm in 0.45.x (#12784)

* bump tendermint version

* add changelog entry

* replace on jhump

* updates

* updates

* updates

Co-authored-by: Aleksandr Bezobchuk <aleks.bezobchuk@gmail.com>

* chore: 0.45.7 changelog prep (#12821)

* prepare for release

* modify release notes

* docs(staking): typo in staking/state (backport #12834) (#12836)

* fix(docs): typo in staking/state (#12834)

(cherry picked from commit fe89212)

# Conflicts:
#	x/staking/spec/01_state.md

* updates

Co-authored-by: Ari Rubinstein <arirubinstein@users.noreply.github.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>

* chore: fee payer event (backport #12850) (#12856)

* chore: changelog update (backport #12859) (#12862)

* fix: Use fixed length hex for pointer at FwdCapabilityKey (backport #11737) (#12818)

* fix: Use fixed length hex for pointer at FwdCapabilityKey (backport #11737)

* Update CHANGELOG.md

* add comments and unit tests

* feat: deterministic map iteration (backport #12781) (#12944)

* feat: deterministic map iteration (#12781)

## Description

We should ensure that events emitted are in a deterministic order on any node.

Sorry, the previous [PR-12693](cosmos/cosmos-sdk#12693) was closed due to wrong operation.

Close: #12693

* chore: bump tendermint to `0.34.21` and iavl to `0.19.1` (#12970)

* perf: Amortize clearing unsorted cache entries (Juno genesis fix) (backport #12885) (#12961)

* perf: Amortize clearing unsorted cache entries (Juno genesis fix) (#12885)

This change fixes a bounty by the Juno team.  Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 30 seconds.  See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the `can-withdraw` invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337.  Iterators have a chain of parents and in this case creates an iterator from the `cachekv` store.  For the genesis file, it has a cache of 500,000+ unsorted entries, which are sorted as strings here: https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L314.  Each delegation from `can-withdraw` uses this cache and many of the cache checks miss or are a very small range.  This means very few entries get removed from the unsorted cache and they have to be re-sorted on the next call.  With a full cache it takes about 180ms on my machine to sort them.

This change introduce a minimum number of entries that will get processed and removed from the unsorted list. It's set at the same value that directs the code to sort them in the first place.  This ensures the unsorted values get removed in a relative short amount of time, and amortizes the cost to ensure an individual check does not have to process the entire cache.

## Benchmarks
On running the benchmarks included in this change produces:
```shell
name                    old time/op    new time/op    delta
LargeUnsortedMisses-32     21.2s ± 9%      0.0s ± 1%   -99.91%  (p=0.000 n=20+17)

name                    old alloc/op   new alloc/op   delta
LargeUnsortedMisses-32    1.64GB ± 0%    0.00GB ± 0%   -99.83%  (p=0.000 n=19+19)

name                    old allocs/op  new allocs/op  delta
LargeUnsortedMisses-32     20.0k ± 0%     41.1k ± 0%  +105.23%  (p=0.000 n=19+20)
```

## Invariant checks results
This is what the invariant checks for Juno look like with this change (on a Hetzner AX101):

```shell
INF starting node with ABCI Tendermint in-process
4:11PM INF Starting multiAppConn service impl=multiAppConn module=proxy
4:11PM INF Starting localClient service connection=query impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
4:11PM INF Starting EventBus service impl=EventBus module=events
4:11PM INF Starting PubSub service impl=PubSub module=pubsub
4:11PM INF Starting IndexerService service impl=IndexerService module=txindex
4:11PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0-36-g8fd6f16
4:11PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
4:12PM INF asserting crisis invariants inv=1/11 module=x/crisis name=gov/module-account
4:12PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/can-withdraw
4:12PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/reference-count
4:12PM INF asserting crisis invariants inv=5/11 module=x/crisis name=distribution/module-account
4:12PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=7/11 module=x/crisis name=bank/total-supply
4:12PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
4:12PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
4:12PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
4:12PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
4:12PM INF asserted all invariants duration=28383.559601 height=4136532 module=x/crisis
```

## Alternatives
There is another PR which fixes this problem for the Juno genesis file cosmos/cosmos-sdk#12886. However, because of its concurrent nature, it happens to hit a large range relatively early, clearing the unsorted entries and allowing the rest of the checks to not sort it.

(cherry picked from commit 4fc1f73)

# Conflicts:
#	CHANGELOG.md

* fix conflict

Co-authored-by: blazeroni <blazeroni@gmail.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>
Co-authored-by: Marko <marbar3778@yahoo.com>

* fix: proper error when parsing telemetry configuration (backport #12981) (#12999)

* ci: fix release notes not populated by goreleaser (#13019)

Co-authored-by: Julien Robert <julien@rbrt.fr>

* fix: missing return statement in BaseApp.Query (backport #13046) (#13050)

* fix: missing return statement in BaseApp.Query (#13046)

## Description

Closes: #13040

* chore: v0.45.8 release changelog (#13053)

* chore: v0.45.8 release changelog

* fix issue name

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Robert Zaremba <robert@zaremba.ch>
Co-authored-by: Adu <foriteration@gmail.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>
Co-authored-by: Jacob Gadikian <jacobgadikian@gmail.com>
Co-authored-by: Federico Kunze Küllmer <31522760+fedekunze@users.noreply.github.com>
Co-authored-by: rene <41963722+renaynay@users.noreply.github.com>
Co-authored-by: Marko <marbar3778@yahoo.com>
Co-authored-by: Aleksandr Bezobchuk <aleks.bezobchuk@gmail.com>
Co-authored-by: Ari Rubinstein <arirubinstein@users.noreply.github.com>
Co-authored-by: yihuang <huang@crypto.com>
Co-authored-by: blazeroni <blazeroni@gmail.com>
omritoptix pushed a commit to dymensionxyz/RDK that referenced this pull request Nov 29, 2022
…ckport #12885) (#12961)

* perf: Amortize clearing unsorted cache entries (Juno genesis fix) (#12885)

This change fixes a bounty by the Juno team.  Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 30 seconds.  See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the `can-withdraw` invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337.  Iterators have a chain of parents and in this case creates an iterator from the `cachekv` store.  For the genesis file, it has a cache of 500,000+ unsorted entries, which are sorted as strings here: https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L314.  Each delegation from `can-withdraw` uses this cache and many of the cache checks miss or are a very small range.  This means very few entries get removed from the unsorted cache and they have to be re-sorted on the next call.  With a full cache it takes about 180ms on my machine to sort them.

This change introduce a minimum number of entries that will get processed and removed from the unsorted list. It's set at the same value that directs the code to sort them in the first place.  This ensures the unsorted values get removed in a relative short amount of time, and amortizes the cost to ensure an individual check does not have to process the entire cache.

## Benchmarks
On running the benchmarks included in this change produces:
```shell
name                    old time/op    new time/op    delta
LargeUnsortedMisses-32     21.2s ± 9%      0.0s ± 1%   -99.91%  (p=0.000 n=20+17)

name                    old alloc/op   new alloc/op   delta
LargeUnsortedMisses-32    1.64GB ± 0%    0.00GB ± 0%   -99.83%  (p=0.000 n=19+19)

name                    old allocs/op  new allocs/op  delta
LargeUnsortedMisses-32     20.0k ± 0%     41.1k ± 0%  +105.23%  (p=0.000 n=19+20)
```

## Invariant checks results
This is what the invariant checks for Juno look like with this change (on a Hetzner AX101):

```shell
INF starting node with ABCI Tendermint in-process
4:11PM INF Starting multiAppConn service impl=multiAppConn module=proxy
4:11PM INF Starting localClient service connection=query impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
4:11PM INF Starting EventBus service impl=EventBus module=events
4:11PM INF Starting PubSub service impl=PubSub module=pubsub
4:11PM INF Starting IndexerService service impl=IndexerService module=txindex
4:11PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0-36-g8fd6f16
4:11PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
4:12PM INF asserting crisis invariants inv=1/11 module=x/crisis name=gov/module-account
4:12PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/can-withdraw
4:12PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/reference-count
4:12PM INF asserting crisis invariants inv=5/11 module=x/crisis name=distribution/module-account
4:12PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=7/11 module=x/crisis name=bank/total-supply
4:12PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
4:12PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
4:12PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
4:12PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
4:12PM INF asserted all invariants duration=28383.559601 height=4136532 module=x/crisis
```

## Alternatives
There is another PR which fixes this problem for the Juno genesis file cosmos/cosmos-sdk#12886. However, because of its concurrent nature, it happens to hit a large range relatively early, clearing the unsorted entries and allowing the rest of the checks to not sort it.

(cherry picked from commit 4fc1f73)

# Conflicts:
#	CHANGELOG.md

* fix conflict

Co-authored-by: blazeroni <blazeroni@gmail.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>
Co-authored-by: Marko <marbar3778@yahoo.com>
JeancarloBarrios pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request Sep 28, 2024
…ckport cosmos#12885) (cosmos#12961)

* perf: Amortize clearing unsorted cache entries (Juno genesis fix) (cosmos#12885)

This change fixes a bounty by the Juno team.  Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 30 seconds.  See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the `can-withdraw` invariant check, which calls this repeatedly: https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337.  Iterators have a chain of parents and in this case creates an iterator from the `cachekv` store.  For the genesis file, it has a cache of 500,000+ unsorted entries, which are sorted as strings here: https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L314.  Each delegation from `can-withdraw` uses this cache and many of the cache checks miss or are a very small range.  This means very few entries get removed from the unsorted cache and they have to be re-sorted on the next call.  With a full cache it takes about 180ms on my machine to sort them.

This change introduce a minimum number of entries that will get processed and removed from the unsorted list. It's set at the same value that directs the code to sort them in the first place.  This ensures the unsorted values get removed in a relative short amount of time, and amortizes the cost to ensure an individual check does not have to process the entire cache.

## Benchmarks
On running the benchmarks included in this change produces:
```shell
name                    old time/op    new time/op    delta
LargeUnsortedMisses-32     21.2s ± 9%      0.0s ± 1%   -99.91%  (p=0.000 n=20+17)

name                    old alloc/op   new alloc/op   delta
LargeUnsortedMisses-32    1.64GB ± 0%    0.00GB ± 0%   -99.83%  (p=0.000 n=19+19)

name                    old allocs/op  new allocs/op  delta
LargeUnsortedMisses-32     20.0k ± 0%     41.1k ± 0%  +105.23%  (p=0.000 n=19+20)
```

## Invariant checks results
This is what the invariant checks for Juno look like with this change (on a Hetzner AX101):

```shell
INF starting node with ABCI Tendermint in-process
4:11PM INF Starting multiAppConn service impl=multiAppConn module=proxy
4:11PM INF Starting localClient service connection=query impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
4:11PM INF Starting EventBus service impl=EventBus module=events
4:11PM INF Starting PubSub service impl=PubSub module=pubsub
4:11PM INF Starting IndexerService service impl=IndexerService module=txindex
4:11PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0-36-g8fd6f16
4:11PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
4:12PM INF asserting crisis invariants inv=1/11 module=x/crisis name=gov/module-account
4:12PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/can-withdraw
4:12PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/reference-count
4:12PM INF asserting crisis invariants inv=5/11 module=x/crisis name=distribution/module-account
4:12PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=7/11 module=x/crisis name=bank/total-supply
4:12PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
4:12PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
4:12PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
4:12PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
4:12PM INF asserted all invariants duration=28383.559601 height=4136532 module=x/crisis
```

## Alternatives
There is another PR which fixes this problem for the Juno genesis file cosmos#12886. However, because of its concurrent nature, it happens to hit a large range relatively early, clearing the unsorted entries and allowing the rest of the checks to not sort it.

(cherry picked from commit 4fc1f73)

# Conflicts:
#	CHANGELOG.md

* fix conflict

Co-authored-by: blazeroni <blazeroni@gmail.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>
Co-authored-by: Marko <marbar3778@yahoo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C:x/distribution distribution module related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants