
kv: use RaftAppliedIndexTerm to generate SnapshotMetadata, don't scan log #88596

Merged

Conversation

nvanbenschoten
Member

This commit replaces the call to Term(raftAppliedIndex) with direct use of the new RaftAppliedIndexTerm field (added in c3bc064) when generating a SnapshotMetadata in service of the raft.Storage.Snapshot interface. As of v22.2, this field has been fully migrated in.

First and foremost, this is a code simplification. However, it also helps with projects like #87050, where async Raft log writes make it possible for a Raft leader to apply an entry before it has been appended to the leader's own log. Such flexibility[^1] would help smooth out tail latency in any single replica's local log writes, even if that replica is the leader itself. This is an important characteristic of quorum systems that we fail to provide because of the tight coupling between the Raft leader's own log writes and the Raft leader's acknowledgment of committed proposals.

Release justification: None. Don't backport to release-22.2.

Release note: None.

Footnotes

  1. If safe, that is; I haven't convinced myself that it is in all cases. It certainly is not for operations like non-loosely-coupled log truncation.
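For illustration, here is a minimal sketch of the shape of the change; the type and function names are simplified stand-ins, not the actual CockroachDB code:

```go
package kvserver

import "go.etcd.io/etcd/raft/v3/raftpb"

// replicaState is a simplified stand-in for the replica's applied
// state. The RaftAppliedIndexTerm field (fully migrated in as of
// v22.2) records the term of the entry at the applied index.
type replicaState struct {
	RaftAppliedIndex     uint64
	RaftAppliedIndexTerm uint64
}

// snapshotMetadata builds the metadata returned by the
// raft.Storage.Snapshot implementation. Previously the term had to be
// recovered with a call like Term(RaftAppliedIndex), which may scan
// the raft log; now it is read directly from the applied state.
func snapshotMetadata(state replicaState) raftpb.SnapshotMetadata {
	return raftpb.SnapshotMetadata{
		Index: state.RaftAppliedIndex,
		Term:  state.RaftAppliedIndexTerm,
	}
}
```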


This code is no longer needed on master.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/raftAppliedIndexTerm branch from 8fe893b to 89058d5 Compare September 23, 2022 23:40
Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 1 of 1 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


-- commits line 20 at r1:
I didn't quite understand this. Is it saying that strongly coupled truncation is not safe if the leader is lagging? But the leader is the one making the decision and knows its own state, so I must be missing something?

Member

@tbg tbg left a comment


Nice cleanup.

I didn't quite understand this. Is it saying that strongly coupled truncation is not safe if the leader is lagging? But the leader is the one making the decision and knows its own state, so I must be missing something?

I can't really construct a big problem out of this either.

  • n1 (leader) has its append method stuck (in particular, let's assume nothing gets made durable)
  • n2 and n3 are working fine
  • indexes 90..100 get appended to the log and commit via n2 and n3
  • n1 starts a non-cooperative truncation; this may affect indexes <= 100 (since 100 is committed and there's currently no special casing of the leader's local state in that code IIRC)
  • this commits at index 101 with quorum (n2,n3)
  • n2 and n3 can apply the truncation, that's fine - worst case it'll lead to a snapshot for n1 later
  • but if n1 applies it and then crash-restarts, we have:
    • applied index = 90 (say)
    • first index = 100

This isn't a great counter-example because we can easily fix this. The truncation code today already has to make sure the truncation is compatible with the local raft log, so it would be straightforward to turn it into a truncation that only affects [0, ..., appliedIndex] if that isn't already the case anyway.
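A minimal sketch of that guard, with illustrative names (this is not the actual truncation code):

```go
// clampTruncationIndex caps a proposed truncation so that it never
// reaches past what this replica has locally applied. In the scenario
// above, this keeps n1's first index from leaping ahead of its
// applied index after a crash-restart.
func clampTruncationIndex(proposedIndex, appliedIndex uint64) uint64 {
	if proposedIndex > appliedIndex {
		return appliedIndex
	}
	return proposedIndex
}
```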

Today we don't expect to see a truncation that extends "past" the local log, which we do on n1 in this example, but this isn't too important either and can be fixed.

@tbg
Member

tbg commented Sep 26, 2022

@nvanbenschoten re: "if safe, I haven't convinced myself that it is in all cases"

I think we briefly chatted about this a while ago and I mentioned that the leader completeness property would get in the way. I no longer think so. The property is:

Leader Completeness
If a log entry is committed in a given term, then that entry will be present in the logs
of the leaders for all higher-numbered terms. §3.6

There's nothing requiring the current leader to have durably stored all entries for the current term. The leader will certainly have to be aware of all entries in its term (after all, how else is it going to assign log indexes, etc.), but the entries do not have to be on stable storage for the property to hold true.

I wrote up #88699 to capture the explicit goal of reducing the blast radius of a slow leader.

@sumeerbhola
Collaborator

sumeerbhola commented Sep 26, 2022

  • n1 (leader) has its append method stuck (in particular, let's assume nothing gets made durable)
    ...
  • but if n1 applies it and then crash-restarts, we have:
    • applied index = 90 (say)
    • first index = 100

...
The truncation code today already has to make sure the truncation is compatible with the local raft log, so it would be straightforward to turn it into a truncation that only affects [0, ..., appliedIndex] if that isn't already the case anyway.

I'm still confused (bear with me), so I'll try to elaborate on my understanding and you can correct me.

My understanding was that the async work was making the appends to the raft log async, but what is appended is fsynced. So when the leader decides on a truncation up to 100 and sends it via the raft log, it would be enacted (in the strongly-coupled case) by the async worker thread after it has finished the work of appending and fsyncing up to 100.

I just realized you said "applied index = 90", which is the state machine. We would already have a problem today, because we do non-fsynced application; we don't, only because appending and application are both synchronous in the same thread and share the same engine, so the fsync for a log append also makes the preceding state machine applications durable.

With async appending to the raft log, the state machine application can both lead and lag what is durable in the local raft log. If it leads, then by virtue of sharing the same engine, the fsyncing of the appends will make the applied index durable (we will still need to fix up the mutual consistency of the raft log and the state machine at crash-recovery time, like what is done in ReplicasStorage.Init). If the application lags, because it happens asynchronously via the stableTo stuff in @nvanbenschoten's prototype, then we have the problem you indicated: not only is the application not durable, it is possible that the application has not even been attempted.

Your proposed solution seems to ensure that the application has been attempted on the leader, but it is not clear to me that it ensures (a) that the application has been attempted on the follower before the follower truncates, or (b) that the attempted application is durable. It seems to me that we need loosely-coupled truncation to handle this correctly.

What am I missing?
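For concreteness, a crash-recovery fix-up of the kind mentioned above (in the spirit of ReplicasStorage.Init, though the real logic is more involved) might boil down to a check like this sketch, with illustrative names:

```go
package kvserver

import "fmt"

// checkLogStateMachineConsistency flags the dangerous post-crash
// state from the example above: the durable raft log has been
// truncated past the durable applied state, so the entries needed to
// catch the state machine up are gone and the replica can only
// recover via a snapshot.
func checkLogStateMachineConsistency(firstIndex, appliedIndex uint64) error {
	if firstIndex > appliedIndex+1 {
		return fmt.Errorf(
			"raft log truncated past applied state (firstIndex=%d, appliedIndex=%d); snapshot required",
			firstIndex, appliedIndex)
	}
	return nil
}
```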

@nvanbenschoten
Member Author

I agree that tightly coupled log truncation is safe even if applied > stable_last_index on the Raft leader, as long as the Raft leader is only ever proposing log truncation up to stable_last_index. That is guaranteed today by newTruncateDecision.

When I wrote this, I was conflating a discussion about the safety (in terms of the Raft protocol's invariants) of a leader's applied index outpacing its stable last index with the safety (in terms of CRDB's use of Raft) of any replica's applied index outpacing its stable last index.

The case where you can see this go wrong is on follower replicas with tightly coupled log truncation. In a world with async log writes and async log application, you can pretty easily construct a situation that looks like:

  1. leader proposes + commits index 10-20
  2. leader proposes + commits log truncation at index 21 that instructs replicas to truncate up to entry 15
  3. follower (not part of original quorum) catches up on log after delay, receives log indexes 10-21 all at once, is immediately told that they are committed
  4. follower applies indexes 10-21 before appending them to its local log; the truncation up to index 15 is performed before the corresponding appends
  5. Leaked log entries (see the sketch below)
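A sketch of the follower-side check this scenario seems to call for, with illustrative names:

```go
// canApplyTruncation reports whether a follower may enact a tightly
// coupled truncation. Under async log writes, the truncation must not
// run until the local log durably contains every entry it removes;
// otherwise entries that were applied but never appended (steps 3-4
// above) are leaked.
func canApplyTruncation(truncatedIndex, durableLastIndex uint64) bool {
	return truncatedIndex <= durableLastIndex
}
```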

@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Sep 27, 2022

Build succeeded.

@craig craig bot merged commit 03b8afb into cockroachdb:master Sep 27, 2022
@tbg
Member

tbg commented Sep 29, 2022

The case where you can see this go wrong is on follower replicas with tightly coupled log truncation. In a world with async log writes and async log application, you can pretty easily construct a situation that looks like:

Oh, are we planning to let replicas apply entries they don't have in their durable log? I hadn't considered that but it makes sense to decouple the two as much as possible.

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/raftAppliedIndexTerm branch September 30, 2022 07:17
@nvanbenschoten
Member Author

My plan was to expose this as an option (default off) through etcd/raft. It raises too many questions to assume that all users of the library can properly handle the applied state outpacing the log. Many might have unintentional dependencies between the two. I also don't want the integration of the raft refactor to depend on CRDB resolving all of these dependencies immediately.
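A sketch of the opt-in shape this could take; the field name below is hypothetical, not an actual etcd/raft API:

```go
// Config sketches a hypothetical extension of raft.Config; the field
// below does not exist in etcd/raft as of this PR.
type Config struct {
	// ... existing etcd/raft Config fields elided ...

	// ApplyUnstableEntries (hypothetical, default false) permits the
	// state machine to apply committed entries before they are durable
	// in the local log. Applications with implicit dependencies between
	// the applied state and the durable log should leave this disabled.
	ApplyUnstableEntries bool
}
```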
