-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!: account for time already elapsed when waiting after the commit #965
Conversation
marking ready for review to begin accepting comments. while this change is small and seems to be working here and in the app, we should expect other consequences. this has also only been tested in a contrived testnets, we should ideally test this in testground before including in arabica and mocha |
config/config.go
Outdated
TimeoutCommit: 1000 * time.Millisecond, | ||
TargetRoundDuration: 3500 * time.Millisecond, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to better names for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No blocking feedback. IIUC it isn't possible for tendermint to guarantee a consistent block interval but this change tries to target a more consistent block time.
Relevant context: tendermint/tendermint#5911 and https://docs.tendermint.com/v0.34/tendermint-core/configuration.html#consensus-timeouts-explained
config/config.go
Outdated
// from upstream tendermint to be dynamic, where we account for the amount of | ||
// time already taken in a round. | ||
func (cfg *ConsensusConfig) Commit(t time.Time, elapsed time.Duration) time.Time { | ||
remaining := cfg.TargetRoundDuration - elapsed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a reasonable, simple change to me. Would it make sense to:
- test this in a live network or at least in testground
- have informal review this as well
before we merge this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that's a good idea. We were trying to test this with testground this morning, but didn't have time to finish.
We might not want to include this change in the incentivized testnet, but we should push and test in mocha as soon as we can get data back from testground.
One side thought: after playing around with this, I noticed that if we increase the timeouts too long, then validators begin to miss producing blocks, even on the e2e test. After discussing with @cmwaters, when the timeouts are long, we think there is a significantly higher probability validators are progressing through consensus at different rates, and thus potentially missing rounds entirely. This would actually explain what we are seeing in mocha quite well. Having more consistent round times, potentially via this mechansim, but likely via similar arbitary waiting mechanisms to more closely follow the configured timeouts instead of moving to the next step in consensus after reaching 2/3s, could be effective in helping us come to consensus on the first block when we expect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to think through this a bit and I think it all makes sense. If you really wanted regular block times, we should implement proposer based timestamps, but this should get us something in the range. We already have metrics for block times right? So would be good to track them.
config/config.go
Outdated
@@ -932,7 +932,7 @@ type ConsensusConfig struct { | |||
// height (this gives us a chance to receive some more precommits, even | |||
// though we already have +2/3). | |||
// NOTE: when modifying, make sure to update time_iota_ms genesis parameter | |||
TimeoutCommit time.Duration `mapstructure:"timeout_commit"` | |||
TargetRoundDuration time.Duration `mapstructure:"target_round_duration"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really per round but per height because we only commit once per height. What this means of course is that any multiround height will likely be above 15 seconds and instantly jump to the next height meaning only 2/3 voting power will be included in the commit (i.e. there will be no time for straggler voters). We should be wary of this if we are rewarding validators on a per-height basis for voting involvement. Alternatively we could reward validators for involvement across 100 heights (instead of just one). Or we could introduce another parameter which states a minimum "wait for straggler votes" period.
Actually, It might be fine if there's also a non-negligible time taken in FinalizeBlock
and PrepareProposal
. If it takes 1-2 seconds, then that's another 1 to 2 seconds that more votes (above the 2/3 line) can be received in.
We also might not care to have commits with all signatures.
Conclusion: I think we should rename this variable to TargetHeightDuration
and we should make sure we are measuring the percentage of voting power in commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really per round but per height because we only commit once per height.
ahh I see! good catch. I wrongly assumed that since the start time is in RoundState
it would get updated each round. Renamed as well
We should be wary of this if we are rewarding validators on a per-height basis for voting involvement.
I was worried about this as well (briefly mentioned in top comment) as iirc, there was somewhere in the sdk where we're rewarding validators per signature. I looked again though and couldn't find anything, so it might be obfuscated. Regardless, the number of signatures included should be something that we collect data on in testground, and might even want to force this to occur if it isn't happening naturally.
config/toml.go
Outdated
# How long we wait after committing a block, before starting on the new | ||
# height (this gives us a chance to receive some more precommits, even | ||
# though we already have +2/3). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe extend the comments here
I still need to debug why the e2e test is failing after using a simpler timeout, but we also shouldn't merge this in until we collect more data on this change's impact via testground, so I'm converting to a draft |
@@ -457,12 +457,9 @@ timeout_prevote = "1s" | |||
timeout_prevote_delta = "500ms" | |||
timeout_precommit = "1s" | |||
timeout_precommit_delta = "500ms" | |||
timeout_commit = "1s" | |||
target_height_duration = "1s" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[optional] these are just docs so the value doesn't matter but proposal for it to match the 15s on line 354
target_height_duration = "1s" | |
target_height_duration = "15s" |
consensus/state.go
Outdated
cs.eventCollector.WritePoint("consensus", map[string]interface{}{ | ||
"round_data": []interface{}{rs.Height, rs.Round, rs.Step}, | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still want to add this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
…/celestia-core into evan/dynamic-timeout-commits
forgot to update this with tests that were run on testground also had 14-16s block times, without having any heights that required more than a single round of consensus, so I think we're clear to try this on a testnet. I'll try to get a pretty graph and compare it to a control to see if the standard deviation of blocktimes is the same. afaict, testground tests are not surfacing the some of the same issues that we see on a larger scale testnet such as mocha, so I wouldn't be surprised if this still has issues there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…#965) * feat!: consider time already elapsed when waiting after the commit * chore: minor doc changes * chore: try a different default config time * chore: try increasing the config again * fix: use appropriate default time * fix: docs * chore: simplify addition and rename * chore: revert pointless go mod tidy change * chore: consistent config * chore: replace config comments * chore: fix a few remaining round -> height name changes * fix: lingering compile errors from merge * fix: silly bug * fix: use correct next start time * dynamic block time modifications (#983) * chore: remove event collector * Update test/maverick/consensus/state.go Co-authored-by: Callum Waters <cmwaters19@gmail.com> * Update test/maverick/consensus/state.go Co-authored-by: Callum Waters <cmwaters19@gmail.com> * chore: formatting --------- Co-authored-by: Callum Waters <cmwaters19@gmail.com>
Description
This PR uses dynamic timeout commits, meaning that we account for time that has already elapsed before we decide how long to wait after the commit is finalized. This effectively adds a "target round duration", where we don't wait the same amount of time for the timeout commit arbitrarily each round. Instead, we subtract the time elapsed between the round's start time, and the commit time.
Note that this will likely have yet unmeasured consequences in the application. For instance, if we somehow end up not waiting at all for remaining votes, then those votes won't get included in the commit, and validators that otherwise would have gotten counted as signing (and thus gotten rewards), will not be counted.
part of #939 and celestiaorg/celestia-app#1340