Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Restart application issue #5579

Merged
merged 26 commits into from
Feb 6, 2020
Merged

Fix Restart application issue #5579

merged 26 commits into from
Feb 6, 2020

Conversation

AdityaSripal
Copy link
Member

@AdityaSripal AdityaSripal commented Jan 28, 2020

Description

Fixes the restart issue by making sure that CommitID contains the last hash that was persisted to disk, rather than the hash of the last block that was executed

Improves Pruning flexibility by splitting the old KeepEvery parameter into => KeepEvery and SnapshotEvery parameters.

The KeepEvery parameter defines how often the state gets flushed to disk. The SnapshotEvery parameter defines the snapshot blocks that will remain in disk permanently.
On each flush, the SDK removes the last flushed state unless it is a snapshot version.

closes: #5570


For contributor use:

  • Targeted PR against correct branch (see CONTRIBUTING.md)
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Code follows the module structure standards.
  • Wrote unit and integration tests
  • Updated relevant documentation (docs/) or specification (x/<module>/spec/)
  • Added relevant godoc comments.
  • Added a relevant changelog entry to the Unreleased section in CHANGELOG.md
  • Re-reviewed Files changed in the Github PR explorer

For admin use:

  • Added appropriate labels to PR (ex. WIP, R4R, docs, etc)
  • Reviewers assigned
  • Squashed all commits, uses message "Merge pull request #XYZ: [title]" (coding standards)

store/types/pruning.go Outdated Show resolved Hide resolved
store/types/pruning.go Show resolved Hide resolved
store/iavl/store.go Show resolved Hide resolved
store/iavl/store.go Outdated Show resolved Hide resolved
store/iavl/store.go Outdated Show resolved Hide resolved
@AdityaSripal AdityaSripal changed the title WIP: Fix Restart application issue Fix Restart application issue Jan 29, 2020
@AdityaSripal AdityaSripal added R4R and removed WIP labels Jan 29, 2020
@AdityaSripal
Copy link
Member Author

On further thinking, I don't think the sim tests are necessary here. Don't believe they test anything that baseapp doesn't.

So ready for review!!

@AdityaSripal
Copy link
Member Author

Also needs CHANGELOG, but i can add that in once people agree on what needs to be added

  • New SnapshotEvery parameter, and difference between old KeepEvery parameter
  • Version method on CommitStore
  • CommitID stores latest persisted hash/version

anything else?

store/iavl/store.go Outdated Show resolved Hide resolved
store/iavl/store.go Outdated Show resolved Hide resolved
store/iavl/store.go Show resolved Hide resolved
store/rootmulti/store.go Outdated Show resolved Hide resolved
store/types/pruning.go Outdated Show resolved Hide resolved
store/types/pruning.go Outdated Show resolved Hide resolved
Co-Authored-By: Alexander Bezobchuk <alexanderbez@users.noreply.github.com>
@ethanfrey
Copy link
Contributor

Hmm so i tried implementing the approach Ethan recommended. Run into the issue that the Query function works by grabbing commitInfo of the version from disk. If the latest version's commitInfo is not flushed to disk, we cannot query latest information, and can only query from flushed versions:

Very interesting, so Query is reading from disk directly, not just via iavl. There was another outstanding issue about queries and db locks... maybe this is related?
tendermint/tendermint#3962 and the original issue from sdk

@alexanderbez
Copy link
Contributor

Hmm so i tried implementing the approach Ethan recommended. Run into the issue that the Query function works by grabbing commitInfo of the version from disk. If the latest version's commitInfo is not flushed to disk, we cannot query latest information, and can only query from flushed versions:

https://github.com/cosmos/cosmos-sdk/blob/master/store/rootmulti/store.go#L444

If we were able to query (by height), why can we also not get the commit info by that same height?

@AdityaSripal
Copy link
Member Author

AdityaSripal commented Feb 5, 2020

Ready for review again. Manual tests performed:

  1. Ran gaia with syncable option, sent a tx, queried successfully, restarted successfully, queried correct state again successfully

  2. Ran local-testnet with different pruning options for each container. (1 everything, 1 nothing, 2 syncable) and ran for 150 blocks. Note: no txs

@alexanderbez @ethanfrey

Copy link
Contributor

@alexanderbez alexanderbez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. There are a few small outstanding items to be addressed.

store/iavl/store_test.go Outdated Show resolved Hide resolved
store/rootmulti/store.go Outdated Show resolved Hide resolved
store/rootmulti/store.go Show resolved Hide resolved
@@ -94,18 +138,41 @@ func (st *Store) Commit() types.CommitID {
panic(err)
}

return types.CommitID{
st.version = version
flushed := st.pruning.FlushVersion(version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AdityaSripal do you have a response to this? Or is this even valid?

store/rootmulti/store_test.go Outdated Show resolved Hide resolved
store/types/pruning.go Outdated Show resolved Hide resolved
store/types/pruning.go Outdated Show resolved Hide resolved
Copy link
Contributor

@alexanderbez alexanderbez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

Copy link
Member Author

@AdityaSripal AdityaSripal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK, Great work with documentation @alexanderbez! thanks for helping

@alexanderbez alexanderbez mentioned this pull request Feb 6, 2020
8 tasks
@codecov
Copy link

codecov bot commented Feb 6, 2020

Codecov Report

Merging #5579 into master will decrease coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5579      +/-   ##
==========================================
- Coverage   44.53%   44.51%   -0.03%     
==========================================
  Files         324      324              
  Lines       24672    24700      +28     
==========================================
+ Hits        10988    10995       +7     
- Misses      12627    12645      +18     
- Partials     1057     1060       +3     

@alexanderbez alexanderbez merged commit dba80ca into master Feb 6, 2020
@alexanderbez alexanderbez deleted the aditya/pruning-fixes branch February 6, 2020 20:58
@ethanfrey
Copy link
Contributor

ethanfrey commented Feb 7, 2020

I just saw this was merged to master. Which has some major breaking changes from v0.38.0. I was expecting this to go into releases/v0.38.0 for a v0.38.1 release. Such a v0.38.1 bugfix is blocking at least 3 chains.

I will do a review on the diff now

Copy link
Contributor

@ethanfrey ethanfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and seems quite correct to me.

Added a number of relatively minor comments that you may want to address in a smaller follow-up PR.

determines which committed heights are flushed to disk and `SnapshotEvery` determines which of these
heights are kept after pruning. The `IsValid` method should be called whenever using these options. Methods
`SnapshotVersion` and `FlushVersion` accept a version arugment and determine if the version should be
flushed to disk or kept as a snapshot. Note, `KeepRecent` is automatically inferred from the options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for automatically inferring, and less confusion in configs

// If KeepEvery = 1, keepRecent should be 0 since there is no need to keep
// latest version in a in-memory cache.
//
// If KeepEvery > 1, keepRecent should be 1 so that state changes in between
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, that's what 1 does?
I would have though it would have to equal KeepEvery or something.
But I guess it doesn't if we only want "query last block"


// Previous flushed version should only be pruned if the previous version is
// not a snapshot version OR if snapshotting is disabled (SnapshotEvery == 0).
if previous != 0 && !st.pruning.SnapshotVersion(previous) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe previous > 0 is "safer".
Though I cannot think of a scenario when this could be negative, still better safe

// not a snapshot version OR if snapshotting is disabled (SnapshotEvery == 0).
if previous != 0 && !st.pruning.SnapshotVersion(previous) {
err := st.tree.DeleteVersion(previous)
if errCause := errors.Cause(err); errCause != nil && errCause != iavl.ErrVersionDoesNotExist {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

if err != nil && !iavl.ErrVersionDoesNotExist.Is(err)

https://github.com/cosmos/cosmos-sdk/blob/master/types/errors/errors.go#L175

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

setLatestVersion(batch, version)
batch.Write()
// write CommitInfo to disk only if this version was flushed to disk
if rs.pruningOpts.FlushVersion(version) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much nicer. We update in memory and response every time, and just use this check to keep CommitInfo in sync with the underlying iavl substores

// Otherwise, we query for the commit info from disk.
var commitInfo commitInfo

if res.Height == rs.lastCommitInfo.Version {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so we can always query the last one (guaranteed to be in memory cache via auto-set KeepRecent) and an other one that was flushed to disk. Seems good.

User providing height = 0 leading to "current height" is taken care of elsewhere (in baseapp), right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User providing height = 0 leading to "current height" is taken care of elsewhere (in baseapp), right?

Yes!

// PruneEverything defines a pruning strategy where all committed states will
// be deleted, persisting only the current state.
PruneEverything = PruningOptions{
KeepEvery: 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting with this option, you will have significantly more IO than PruneSyncable, yet much less disk usage.

What about this just keeping one flushed to disk (as now), but only flushing every 10 blocks or so? (just increment KeepEvery). This would be helpful for nodes that need throughput

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to keep sound and most "general" strategies as we have now, except allow the CLI/start command to accept manual values for these instead of providing a bunch of other strategies. ie. gaiad start --pruning-keep-every ... --pruning-snapshot-every ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually probably best to set these in app.toml instead of (or in additon to) the cli.

But I agree with your point, leaving these 3 safe named posibilities, and then allowing the user to take the risk and shoot themself in the foot if they want to do advanced configuration

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't restart node with cosmos-sdk v0.38.0
7 participants