feat!: account for time already elapsed when waiting after the commit #965

evan-forbes · 2023-02-07T11:56:46Z

Description

This PR uses dynamic timeout commits, meaning that we account for the time that has already elapsed before deciding how long to wait after the commit is finalized. This effectively adds a "target round duration": instead of waiting the same fixed timeout commit every round, we subtract the time elapsed between the round's start time and the commit time from the target, and wait only for the remainder.

Note that this will likely have as yet unmeasured consequences in the application. For instance, if we somehow end up not waiting at all for remaining votes, then those votes won't get included in the commit, and validators that would otherwise have been counted as signing (and thus received rewards) will not be counted.

part of #939 and celestiaorg/celestia-app#1340

evan-forbes · 2023-02-07T15:00:13Z

marking ready for review to begin accepting comments. while this change is small and seems to be working here and in the app, we should expect other consequences.

this has also only been tested in contrived testnets; ideally we should test this in testground before including it in arabica and mocha

evan-forbes · 2023-02-07T17:21:26Z

config/config.go

-		TimeoutCommit:               1000 * time.Millisecond,
+		TargetRoundDuration:         3500 * time.Millisecond,


I'm open to better names for this

rootulp

No blocking feedback. IIUC it isn't possible for tendermint to guarantee a consistent block interval but this change tries to target a more consistent block time.

Relevant context: tendermint/tendermint#5911 and https://docs.tendermint.com/v0.34/tendermint-core/configuration.html#consensus-timeouts-explained

liamsi · 2023-02-07T19:52:06Z

config/config.go

+// from upstream tendermint to be dynamic, where we account for the amount of
+// time already taken in a round.
+func (cfg *ConsensusConfig) Commit(t time.Time, elapsed time.Duration) time.Time {
+	remaining := cfg.TargetRoundDuration - elapsed


This looks like a reasonable, simple change to me. Would it make sense to:

* test this in a live network or at least in testground
* have informal review this as well

before we merge this?

yes that's a good idea. We were trying to test this with testground this morning, but didn't have time to finish.

We might not want to include this change in the incentivized testnet, but we should push and test in mocha as soon as we can get data back from testground.

One side thought: after playing around with this, I noticed that if we make the timeouts too long, validators begin to miss producing blocks, even in the e2e test. After discussing with @cmwaters, we think that when the timeouts are long there is a significantly higher probability that validators progress through consensus at different rates, and thus potentially miss rounds entirely. This would actually explain what we are seeing in mocha quite well. Having more consistent round times, potentially via this mechanism, but more likely via similar arbitrary waiting mechanisms that follow the configured timeouts more closely instead of moving to the next step in consensus as soon as 2/3 is reached, could be effective in helping us come to consensus on the first block when we expect.

cmwaters

I've tried to think through this a bit and I think it all makes sense. If we really wanted regular block times, we should implement proposer based timestamps, but this should get us something in the range. We already have metrics for block times, right? It would be good to track them.

cmwaters · 2023-02-10T23:11:01Z

config/config.go

@@ -932,7 +932,7 @@ type ConsensusConfig struct {
 	// height (this gives us a chance to receive some more precommits, even
 	// though we already have +2/3).
 	// NOTE: when modifying, make sure to update time_iota_ms genesis parameter
-	TimeoutCommit time.Duration `mapstructure:"timeout_commit"`
+	TargetRoundDuration time.Duration `mapstructure:"target_round_duration"`


It's not really per round but per height because we only commit once per height. What this means of course is that any multiround height will likely be above 15 seconds and instantly jump to the next height meaning only 2/3 voting power will be included in the commit (i.e. there will be no time for straggler voters). We should be wary of this if we are rewarding validators on a per-height basis for voting involvement. Alternatively we could reward validators for involvement across 100 heights (instead of just one). Or we could introduce another parameter which states a minimum "wait for straggler votes" period.

Actually, it might be fine if there's also a non-negligible time taken in FinalizeBlock and PrepareProposal. If it takes 1-2 seconds, then that's another 1 to 2 seconds in which more votes (above the 2/3 line) can be received.

We also might not care to have commits with all signatures.

Conclusion: I think we should rename this variable to TargetHeightDuration and we should make sure we are measuring the percentage of voting power in commits.

It's not really per round but per height because we only commit once per height.

ahh I see! good catch. I wrongly assumed that since the start time is in RoundState it would get updated each round. Renamed as well

We should be wary of this if we are rewarding validators on a per-height basis for voting involvement.

I was worried about this as well (briefly mentioned in top comment) as iirc, there was somewhere in the sdk where we're rewarding validators per signature. I looked again though and couldn't find anything, so it might be obfuscated. Regardless, the number of signatures included should be something that we collect data on in testground, and might even want to force this to occur if it isn't happening naturally.
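One way to collect the data discussed here is to compute, per commit, the fraction of voting power that actually signed. The sketch below uses a simplified, hypothetical `vote` type rather than the real commit and validator set types, so it only shows the arithmetic, not how to read the values from a block.

```go
package main

import "fmt"

// vote is a hypothetical, simplified stand-in for one validator's entry in a
// commit: its voting power and whether its signature was included.
type vote struct {
	power  int64
	signed bool
}

// signedPowerFraction returns the share of total voting power whose
// signatures made it into the commit, as a value between 0 and 1.
// It returns 0 for an empty validator set.
func signedPowerFraction(votes []vote) float64 {
	var total, signed int64
	for _, v := range votes {
		total += v.power
		if v.signed {
			signed += v.power
		}
	}
	if total == 0 {
		return 0
	}
	return float64(signed) / float64(total)
}

func main() {
	// Four equally weighted validators, one of which missed the commit.
	commit := []vote{
		{power: 10, signed: true},
		{power: 10, signed: true},
		{power: 10, signed: false},
		{power: 10, signed: true},
	}
	fmt.Printf("signed voting power: %.2f\n", signedPowerFraction(commit))
}
```

Tracking this value per height would show whether commits cluster just above 2/3 when the dynamic wait drops to zero.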

cmwaters · 2023-02-10T23:12:29Z

config/toml.go

 # How long we wait after committing a block, before starting on the new
 # height (this gives us a chance to receive some more precommits, even
 # though we already have +2/3).


Maybe extend the comments here

evan-forbes · 2023-02-15T02:21:55Z

I still need to debug why the e2e test is failing after using a simpler timeout, but we also shouldn't merge this in until we collect more data on this change's impact via testground, so I'm converting to a draft

rootulp · 2023-02-17T19:41:44Z

docs/tendermint-core/configuration.md

@@ -457,12 +457,9 @@ timeout_prevote = "1s"
 timeout_prevote_delta = "500ms"
 timeout_precommit = "1s"
 timeout_precommit_delta = "500ms"
-timeout_commit = "1s"
+target_height_duration = "1s"


[optional] these are just docs so the value doesn't matter but proposal for it to match the 15s on line 354

Suggested change

-target_height_duration = "1s"
+target_height_duration = "15s"

cmwaters · 2023-03-24T16:40:30Z

consensus/state.go

+	cs.eventCollector.WritePoint("consensus", map[string]interface{}{
+		"round_data": []interface{}{rs.Height, rs.Round, rs.Step},
+	})


Do we still want to add this?

cmwaters

LGTM

evan-forbes · 2023-04-17T04:18:59Z

forgot to update this: the tests that were run on testground also had 14-16s block times, without any heights that required more than a single round of consensus, so I think we're clear to try this on a testnet.

I'll try to get a pretty graph and compare it to a control to see if the standard deviation of block times is the same. afaict, testground tests are not surfacing some of the same issues that we see on a larger scale testnet such as mocha, so I wouldn't be surprised if this still has issues there.

cmwaters

LGTM

…#965)
* feat!: consider time already elapsed when waiting after the commit
* chore: minor doc changes
* chore: try a different default config time
* chore: try increasing the config again
* fix: use appropriate default time
* fix: docs
* chore: simplify addition and rename
* chore: revert pointless go mod tidy change
* chore: consistent config
* chore: replace config comments
* chore: fix a few remaining round -> height name changes
* fix: lingering compile errors from merge
* fix: silly bug
* fix: use correct next start time
* dynamic block time modifications (#983)
* chore: remove event collector
* Update test/maverick/consensus/state.go Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* Update test/maverick/consensus/state.go Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* chore: formatting
---------
Co-authored-by: Callum Waters <cmwaters19@gmail.com>

evan-forbes added 5 commits February 7, 2023 05:53

feat!: consider time already elapsed when waiting after the commit

3dc00cb

chore: minor doc changes

eaba332

chore: try a different default config time

c98755d

chore: try increasing the config again

10e0cb4

fix: use appropriate default time

ccfffb3

evan-forbes added this to the Mainnet milestone Feb 7, 2023

evan-forbes self-assigned this Feb 7, 2023

evan-forbes marked this pull request as ready for review February 7, 2023 15:00

evan-forbes requested review from liamsi, rootulp and cmwaters February 7, 2023 15:00

fix: docs

e2ce341

evan-forbes commented Feb 7, 2023

rootulp reviewed Feb 7, 2023

liamsi reviewed Feb 7, 2023

rootulp previously approved these changes Feb 7, 2023

cmwaters requested changes Feb 10, 2023

chore: simplify addition and rename

45072f3

evan-forbes dismissed rootulp’s stale review via 45072f3 February 15, 2023 01:43

evan-forbes added 3 commits February 14, 2023 19:46

chore: revert pointless go mod tidy change

b604775

chore: consistent config

033c4e4

chore: replace config comments

c104ece

evan-forbes marked this pull request as draft February 15, 2023 02:22

rootulp reviewed Feb 17, 2023

evan-forbes mentioned this pull request Mar 8, 2023

Investigate and fix peers frequently ending up on different rounds of consensus #963

Closed

chore: fix a few remaining round -> height name changes

8e5b17b

evan-forbes mentioned this pull request Mar 9, 2023

feat: add influxdb trace collection #970

Merged

Merge branch 'v0.34.x-celestia' into evan/dynamic-timeout-commits

d477caa

cmwaters reviewed Mar 24, 2023

cmwaters previously approved these changes Mar 24, 2023

evan-forbes and others added 3 commits March 29, 2023 09:42

Merge branch 'v0.34.x-celestia' into evan/dynamic-timeout-commits

3b5b343

chore: remove event collector

ec7a58d

Update test/maverick/consensus/state.go

5391ac0

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

evan-forbes dismissed cmwaters’s stale review via 5391ac0 April 17, 2023 01:59

evan-forbes and others added 4 commits April 16, 2023 20:59

Update test/maverick/consensus/state.go

d8b6dc1

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

Merge branch 'evan/dynamic-timeout-commits' of github.com:celestiaorg…

86c1add

…/celestia-core into evan/dynamic-timeout-commits

chore: formatting

ea9a7ee

Merge branch 'v0.34.x-celestia' into evan/dynamic-timeout-commits

3913ed5

evan-forbes requested review from rootulp, liamsi and cmwaters April 17, 2023 04:21

rootulp approved these changes Apr 17, 2023

cmwaters approved these changes Apr 18, 2023

evan-forbes merged commit cc1bc3f into v0.34.x-celestia Apr 20, 2023

evan-forbes deleted the evan/dynamic-timeout-commits branch April 20, 2023 11:48

evan-forbes added a commit that referenced this pull request Jul 11, 2023

Revert "feat!: account for time already elapsed when waiting after th…

8da9e6a

…e commit (#965)" This reverts commit cc1bc3f.

evan-forbes added a commit that referenced this pull request Jul 11, 2023

Revert "feat!: account for time already elapsed when waiting after th…

d24c81a

…e commit (#965)" (#1033) This reverts commit cc1bc3f.

cmwaters pushed a commit that referenced this pull request Jul 20, 2023

Revert "feat!: account for time already elapsed when waiting after th…

d60260b

…e commit (#965)" (#1033) This reverts commit cc1bc3f.

cmwaters pushed a commit that referenced this pull request Jul 24, 2023

Revert "feat!: account for time already elapsed when waiting after th…

3042aa0

…e commit (#965)" (#1033) This reverts commit cc1bc3f.

cmwaters pushed a commit that referenced this pull request Jul 27, 2023

Revert "feat!: account for time already elapsed when waiting after th…

eb19a80

…e commit (#965)" (#1033) This reverts commit cc1bc3f.

cmwaters pushed a commit that referenced this pull request Jul 27, 2023

Revert "feat!: account for time already elapsed when waiting after th…

97a4b57

…e commit (#965)" (#1033) This reverts commit cc1bc3f.

faddat mentioned this pull request Feb 22, 2024

fix: remove legacy docker folder #1240

Closed

3 tasks

evan-forbes mentioned this pull request Mar 4, 2024

Backport proposer based timestamps #1255

Open

evan-forbes mentioned this pull request May 3, 2024

Replicate the dynamic timeout issues in a knuu test #1333

Closed

evan-forbes mentioned this pull request May 15, 2024

docs: ADR-115 predictable block times cometbft/cometbft#2966

Merged

4 tasks

cmwaters added a commit that referenced this pull request Jun 25, 2024

Revert "feat!: account for time already elapsed when waiting after th…

8b95598

…e commit (#965)" This reverts commit 4e0b060.
