perf: Amortize clearing unsorted cache entries (Juno genesis fix) (backport #12885) #12961
Merged: tac0turtle merged 3 commits into release/v0.45.x from mergify/bp/release/v0.45.x/pr-12885 on Aug 19, 2022
…2885)

This change fixes a bounty by the Juno team. Juno's invariant checks took 10 hours during their most recent chain halt. This PR cuts that down to 30 seconds. See https://github.com/CosmosContracts/bounties#improve-speed-of-invariant-checks.

The root problem is deep in the `can-withdraw` invariant check, which repeatedly calls https://github.com/cosmos/cosmos-sdk/blob/main/x/distribution/keeper/store.go#L337. Iterators have a chain of parents, and in this case one is created from the `cachekv` store. For the genesis file, the store has a cache of 500,000+ unsorted entries, which are sorted as strings here: https://github.com/cosmos/cosmos-sdk/blob/main/store/cachekv/store.go#L314. Each delegation from `can-withdraw` uses this cache, and many of the cache checks miss or cover a very small range. As a result, very few entries are removed from the unsorted cache, and the remaining entries have to be re-sorted on the next call. With a full cache, sorting takes about 180ms on my machine.

This change introduces a minimum number of entries that will be processed and removed from the unsorted list. It is set to the same value that directs the code to sort them in the first place. This ensures the unsorted values are removed in a relatively short amount of time, and amortizes the cost so that an individual check does not have to process the entire cache.
## Benchmarks

Running the benchmarks included in this change produces:

```shell
name                    old time/op    new time/op    delta
LargeUnsortedMisses-32     21.2s ± 9%      0.0s ± 1%   -99.91%  (p=0.000 n=20+17)

name                    old alloc/op   new alloc/op   delta
LargeUnsortedMisses-32    1.64GB ± 0%    0.00GB ± 0%   -99.83%  (p=0.000 n=19+19)

name                    old allocs/op  new allocs/op  delta
LargeUnsortedMisses-32     20.0k ± 0%     41.1k ± 0%  +105.23%  (p=0.000 n=19+20)
```

## Invariant checks results

This is what the invariant checks for Juno look like with this change (on a Hetzner AX101):

```shell
INF starting node with ABCI Tendermint in-process
4:11PM INF Starting multiAppConn service impl=multiAppConn module=proxy
4:11PM INF Starting localClient service connection=query impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=snapshot impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=mempool impl=localClient module=abci-client
4:11PM INF Starting localClient service connection=consensus impl=localClient module=abci-client
4:11PM INF Starting EventBus service impl=EventBus module=events
4:11PM INF Starting PubSub service impl=PubSub module=pubsub
4:11PM INF Starting IndexerService service impl=IndexerService module=txindex
4:11PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=v9.0.0-36-g8fd6f16
4:11PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
4:12PM INF asserting crisis invariants inv=1/11 module=x/crisis name=gov/module-account
4:12PM INF asserting crisis invariants inv=2/11 module=x/crisis name=distribution/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=3/11 module=x/crisis name=distribution/can-withdraw
4:12PM INF asserting crisis invariants inv=4/11 module=x/crisis name=distribution/reference-count
4:12PM INF asserting crisis invariants inv=5/11 module=x/crisis name=distribution/module-account
4:12PM INF asserting crisis invariants inv=6/11 module=x/crisis name=bank/nonnegative-outstanding
4:12PM INF asserting crisis invariants inv=7/11 module=x/crisis name=bank/total-supply
4:12PM INF asserting crisis invariants inv=8/11 module=x/crisis name=staking/module-accounts
4:12PM INF asserting crisis invariants inv=9/11 module=x/crisis name=staking/nonnegative-power
4:12PM INF asserting crisis invariants inv=10/11 module=x/crisis name=staking/positive-delegation
4:12PM INF asserting crisis invariants inv=11/11 module=x/crisis name=staking/delegator-shares
4:12PM INF asserted all invariants duration=28383.559601 height=4136532 module=x/crisis
```

## Alternatives

There is another PR, #12886, which fixes this problem for the Juno genesis file. However, because of its concurrent nature, it happens to hit a large range relatively early, clearing the unsorted entries and allowing the rest of the checks to skip sorting.

(cherry picked from commit 4fc1f73)

# Conflicts:
# CHANGELOG.md
julienrbrt
approved these changes
Aug 18, 2022
JeancarloBarrios pushed a commit to agoric-labs/cosmos-sdk that referenced this pull request on Sep 28, 2024
…ckport cosmos#12885) (cosmos#12961)

* perf: Amortize clearing unsorted cache entries (Juno genesis fix) (cosmos#12885)
* fix conflict

Co-authored-by: blazeroni <blazeroni@gmail.com>
Co-authored-by: Julien Robert <julien@rbrt.fr>
Co-authored-by: Marko <marbar3778@yahoo.com>
This is an automatic backport of pull request #12885 done by Mergify.
Cherry-pick of 4fc1f73 has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/github/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
Mergify commands and options

More conditions and actions can be found in the documentation.

You can also trigger Mergify actions by commenting on this pull request:

- `@Mergifyio refresh` will re-evaluate the rules
- `@Mergifyio rebase` will rebase this PR on its base branch
- `@Mergifyio update` will merge the base branch into this PR
- `@Mergifyio backport <destination>` will backport this PR on `<destination>` branch

Additionally, on Mergify dashboard you can:

Finally, you can contact us on https://mergify.com