consensus: disconnect from bad peers #2871

ebuchman · 2018-11-17T20:11:25Z

The consensus reactor only disconnects from peers in three cases:

msg fails to decode
msg fails basic validity checks
bad VoteSetMaj23Message

While we log other kinds of processing errors (invalid blocks, proposals, votes), we never disconnect from peers because of them, which opens us up to being spammed with bad consensus messages.

We also don't enforce that our peers ever send us anything useful. While we mark them as good if they send us 10,000 unique block parts or votes, we never mark them bad if they just fail to send us anything. This could result in us being connected to lots of useless peers.

We need to address both of these things:

peers sending us bad messages
peers not sending us any good messages

Bad Messages

Let's enumerate all messages and see when they indicate a bad peer (beyond failing their ValidateBasic() check).

In general we could probably add some more rules around all of these to prevent eg. receiving the same information multiple times from the same peer, but we'd need to ensure on the sending side that it never happens too.
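As a rough illustration of the "same information multiple times from the same peer" rule, here is a minimal sketch of a per-peer seen-set. The `voteKey` fields, the `seenVotes` type and the `Observe` method are hypothetical names for this example and are not part of Tendermint. A real version would also need pruning by height, and it is only safe once the sending side guarantees it never re-sends.

```go
package main

import "fmt"

// voteKey identifies one vote slot. The fields are assumptions chosen for
// illustration, not the reactor's actual bookkeeping.
type voteKey struct {
	Height         int64
	Round          int32
	Type           byte
	ValidatorIndex int32
}

// seenVotes maps a peer ID to the set of vote slots that peer has already sent.
type seenVotes map[string]map[voteKey]struct{}

// Observe records that peerID sent the vote identified by key and reports
// whether the same peer had already sent it before.
func (s seenVotes) Observe(peerID string, key voteKey) (duplicate bool) {
	perPeer, ok := s[peerID]
	if !ok {
		perPeer = make(map[voteKey]struct{})
		s[peerID] = perPeer
	}
	if _, dup := perPeer[key]; dup {
		return true
	}
	perPeer[key] = struct{}{}
	return false
}

func main() {
	seen := seenVotes{}
	k := voteKey{Height: 5, Round: 0, Type: 1, ValidatorIndex: 2}
	fmt.Println(seen.Observe("peerA", k)) // first time from peerA
	fmt.Println(seen.Observe("peerA", k)) // repeat from peerA
	fmt.Println(seen.Observe("peerB", k)) // first time from peerB
}
```

Whether a repeat should count against the peer is a policy question the issue leaves open; the sketch only detects it.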

NewRoundStepMessage, NewValidBlockMessage, HasVoteMessage

These are just informational so we know what to send the peer.

HasVote may contain an index beyond the size of the validator set for the given height, which should be a sign of malice, but currently we just ignore it.

VoteSetMaj23Message

We already stop peers for errors here.

ProposalHeartbeatMessage

Comes with a signature that could be checked, but the message is not actually helpful and should probably be eliminated outright (#2626).

ProposalMessage

Handled in the receiveRoutine - calls setProposal.

Any error here should result in disconnecting from the peer (as noted in #2158 (comment)).

ProposalPOLMessage

Don't think there's much we can do here

BlockPartMessage

Handled in the receiveRoutine - calls addProposalBlockPart.

Some errors here (eg. in AddPart) should cause us to disconnect from the peer. Others (eg. an error in unmarshaling) are not the fault of the particular peer but of the original proposer. We don't know which peer corresponds to the actual proposer, so there's nothing we can do here (ultimately, we'd like to publish evidence about this so the app can punish the proposer).

VoteMessage

Handled in the receiveRoutine - calls tryAddVote. This can result in many kinds of errors, some of which warrant disconnecting and others not, hence we should use sentinel errors (#1327).

tryAddVote calls addVote, which may return ErrVoteHeightMismatch, for which we shouldn't disconnect.

We then call cs.LastCommit.AddVote. Some errors here could come from honest peers, so we shouldn't disconnect:

ErrVoteUnexpectedStep
ErrVoteNonDeterministicSignature
NewConflictingVoteError

The rest are from bad peers, and we should disconnect:

ErrVoteInvalidValidatorIndex
ErrVoteInvalidValidatorAddress
vote.Verify errors (needs sentinel)

VoteSetBitsMessage

Don't think there's much we can do here

Useless Peers

A peer could just never send us any consensus messages and we wouldn't disconnect from it.

We need to enforce some cadence on our peers. While this is partially mitigated by preferentially connecting to at least some peers marked good via the PEX, we would probably benefit from having additional protections within the reactor, though we can leave that to a separate issue.

Summary

For now we should do the following, as described above:

Stop peers for bad proposals
Stop peers for bad block parts
Stop peers for bad votes
Remove the heartbeat message

This subsumes #2158, #1327, #2626

srmo · 2018-11-17T21:24:07Z

@ebuchman I'd love to try my hand at removing ProposalHeartbeatMessage, though I struggle to understand why it was required in the first place. As a heavy user of the "no-empty-blocks" option I'm interested in any feature that helps with stability. So why is it OK to remove, i.e. why was it needed in the first place?

Anyway, when I start working on it (even if the first step is just removing the signing), should that be done in a separate issue or just a branch referencing this issue?

srmo · 2018-11-17T22:16:48Z

I've done https://github.com/srmo/tendermint/tree/2871-remove-proposal-hearbeat
It shows that proposalHeartbeat signing (?) is mentioned in ADR-24.
Do we need an additional ADR for this issue here?
I'd really like to discuss the ramifications of removing a heartbeat mechanism in "no-empty-blocks" scenarios, where it is likely to have peers not sending anything meaningful for extended periods of time. How will this be covered?

ebuchman · 2018-11-17T23:34:01Z

> So why is it OK to remove, i.e. why was it needed in the first place?

It was never introduced for any purpose other than the thought that "maybe it will be useful for debugging some time".

While it might be true, so far in debugging issues with this feature we've always been able to resort to the logs of each instance, so the extra message wasn't useful. In a case where you don't have access to the proposer's logs or otherwise lose them, we'd be losing that information.

Note that this message isn't gossipped, only sent to directly connected peers, so, in deployments where you want to hide the validator, eg. behind sentries, only the sentries would receive this message.

If there's a strong desire to keep this message, we would have to think about securing it, and it's not clear it's worth it.

srmo · 2018-11-17T23:40:24Z

OK. Understood.
I just wanted to make sure that it isn't some kind of keep alive mechanism in no-empty-blocks world.
So please, feel free to have a look at #2874

jacohend · 2019-10-03T23:29:58Z

Possibly relevant: I'd like to revisit #1557- I would like a way to remove bad peers from the ABCI layer. I think app-level validation should matter (in some cases, and for some applications) when deciding who to devote network resources to.

For example, if a peer misbehaves in a very specific way (more than simply x number of messages failing CheckTx, but perhaps for certain types of important messages), I would like to be able to remove that peer.

ebuchman · 2019-10-31T19:39:57Z

This is tricky. Right now there is a way for the app to filter peers when they are first connected to, but it's not clear how the app would tell tendermint to disconnect peers in other cases. Since I believe this is quite distinct from the current issue, can you open a new issue describing how you think this might work? I can imagine how it would work for CheckTx - ie. certain responses would cause the sending peer to be disconnected, but it's hard for other cases because the app doesn't otherwise really get information about particular peers ...

I went through #2871 and there are several issues. This PR tackles the case of a `HasVoteMessage` with an invalid validator index sent by a bad peer, and prevents the bad message from reaching the peerMsgQueue. Future work: check the other bad-message cases and plumb the reactor errors through to the peer manager so that it can disconnect the peer sending the bad messages.

ebuchman added T:bug Type Bug (Confirmed) C:p2p Component: P2P pkg C:consensus Component: Consensus T:security Type: Security (specify priority) labels Nov 17, 2018

ebuchman modified the milestone: launch Nov 17, 2018

ebuchman added this to the v1.0 milestone Nov 17, 2018

This was referenced Nov 17, 2018

ErrInvalidProposalSignature Question #2158

Closed

ErrVoteInvalid #1327

Closed

consensus: do we need to sign heartbeat? #2626

Closed

Tendermint starts with inconsistent genesis.json #2815

Closed

milosevic mentioned this issue Nov 21, 2018

Prioritize validator nodes in p2p communications #2860

Closed

ebuchman mentioned this issue Dec 4, 2018

Disconnect peer properly in consensus #1613

Closed

ebuchman mentioned this issue Jan 4, 2019

Tendermint on non-deterministic signatures #1664

Closed

ebuchman modified the milestones: v1.0, v0.31.0 Jan 14, 2019

abelliumnt mentioned this issue Jan 22, 2019

Keep track of important Tendermint issues and PRs irisnet/irishub#1137

Closed

JekaMas mentioned this issue Feb 28, 2019

Add bad random evidance corestario/tendermint#41

Closed

ebuchman modified the milestones: v0.32.0, v0.33.0, v0.34 Feb 28, 2019

JayT106 mentioned this issue Jan 28, 2022

consensus: HasVoteMessage index boundary check #7720

Merged

ebuchman mentioned this issue Sep 12, 2022

#2871 disconnect from bad peers in consensus #9417

Draft

3 tasks

jmalicevic mentioned this issue Oct 13, 2022

Tracking issue for more aggressive removal of bad peers #9545

Closed

9 tasks

jmalicevic mentioned this issue Dec 27, 2022

Tracking issue for more aggressive removal of bad peers cometbft/cometbft#65

Open

9 tasks

greg-szabo mentioned this issue May 3, 2023

consensus: disconnect from bad peers cometbft/cometbft#789

Open

github-actions bot added the stale for use by stalebot label Sep 24, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 29, 2023
