Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: fix panic on MsgApp after log truncation #31

Merged
merged 4 commits into from
Mar 9, 2023

Conversation

pav-kv
Copy link
Contributor

@pav-kv pav-kv commented Mar 7, 2023

This PR fixes a panic and adds a regression test for the following scenario:

  1. The flow of MsgApp from the leader to a follower is throttled (i.e. Inflights is full).
  2. The leader doesn't fetch entries from storage and only periodically sends MsgApps with empty Entries.
  3. A log compaction/truncation happens in the background, and cuts out a portion of the log beyond what's still in-flight towards the slow follower (i.e. Progress.Match < log cutoff)
  4. Some messages to the slow follower get dropped, and as a result it replies with a rejection MsgAppResp.
  5. The leader resets Progress.Next = Progress.Match+1, and is about to retry sending entries from this point.
  6. In raft.maybeSendAppend it calls raftLog.term and gets 0 for the missing entry (instead of some indication/error that the log was truncated at this index). It also skips fetching entries (as in step 2), and goes ahead sending an empty MsgApp (with LogTerm = 0 and a fresh Commit index).
  7. When the follower gets this MsgApp, in raftLog.maybeAppend it a) wrongly passes the matchTerm check because the 0 index matches the 0 corresponding to a missing entry in the local log, b) tries to bump the Commit index and panics because this index is beyond its local log's lastIndex().

This bug was introduced in 42419da. Specifically, the steps (2) and (6) previously used to unconditionally fetch raftLog.entries(), which would return ErrCompacted in the above scenario, and prevent sending the problematic MsgApp. The commit above introduced a condition under which the ErrCompacted would be unnoticed.

This PR makes maybeSendAppend more aware of this compaction scenario, and prevents sending the problematic MsgApp.

@pav-kv pav-kv changed the title Fix panic when log compaction happens during MsgApp throttling fix panic on log compaction during MsgApp throttling Mar 7, 2023
@pav-kv pav-kv force-pushed the panic-on-truncation branch 10 times, most recently from 00532c3 to 2da1d51 Compare March 8, 2023 12:34
@pav-kv pav-kv marked this pull request as ready for review March 8, 2023 12:34
@pav-kv pav-kv changed the title fix panic on log compaction during MsgApp throttling raft: fix panic on MsgApp after log truncation Mar 8, 2023
@pav-kv
Copy link
Contributor Author

pav-kv commented Mar 8, 2023

@nvanbenschoten @tbg PTAL

Commit 1: the minimal fix.
Commit 2: the regression test (it panicked if placed before the first commit).
Commit 3: refactoring of raftLog.term() behaviour (can replace commit 1 or go separately as is).

@pav-kv
Copy link
Contributor Author

pav-kv commented Mar 8, 2023

@ahrtr PTAL

raft.go Outdated Show resolved Hide resolved
log.go Show resolved Hide resolved
log.go Outdated Show resolved Hide resolved
This commit fixes a panic in the following scenario:

1. The flow of MsgApp from the leader to a follower is throttled (i.e.
   Inflights is full).
2. The leader doesn't fetch entries from storage and only periodically sends
   MsgApps with empty Entries.
3. A log compaction/truncation happens in the background, and cuts out a
   portion of the log beyond what's still in-flight towards the slow follower
   (i.e. Progress.Match < log cutoff)
4. Some messages to the slow follower get dropped, and as a result it replies
   with a rejection MsgAppResp.
5. The leader resets Progress.Next = Progress.Match+1, and is about to retry
   sending entries from this point.
6. In raft.maybeSendAppend it calls raftLog.term and gets 0 for the missing
   entry (instead of some indication/error that the log was truncated at this
   index). It also skips fetching entries (as in step 2), and goes ahead
   sending an empty MsgApp (with LogTerm = 0 and a fresh Commit index).
7. When the follower gets this MsgApp, in raftLog.maybeAppend it a) wrongly
   passes the matchTerm check because the 0 index matches the 0 corresponding
   to a missing entry in the local log, b) tries to bump the Commit index and
   panics because this index is beyond its local log's lastIndex().

This bug was introduced in 42419da. Specifically, the steps (2) and (6)
previously used to unconditionally fetch raftLog.entries(), which would return
ErrCompacted in the above scenario, and prevent sending the problematic MsgApp.
The commit above inroduced a condition under which the ErrCompacted would be
unnoticed.

This commit makes maybeSendAppend more aware of this compaction scenario, and
prevents sending the problematic MsgApp.

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
This commit makes raftLog.term() call return ErrCompacted and ErrUnavailable
errors if the requested log index is out of bounds. Previously it would return
0 which was an error-prone behaviour.

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
Copy link
Contributor

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nvanbenschoten
Copy link
Contributor

Github isn't liking my approval. It may be because I'm not a maintainer.

This commit makes the testing loggers print Panic and Fatal agruments before
redirecting them to the panic() call. Previously they would be displayed in a
non-human-readable way, as something like:

  panic: ([]interface {}) 0x1400000eb88

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@pav-kv pav-kv requested a review from ahrtr March 9, 2023 20:29
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @pavelkalinnikov

@ahrtr ahrtr merged commit d086538 into etcd-io:main Mar 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants