storage: account for below Raft stats changes when using DeprecatedDelta #38976

Merged

nvanbenschoten

Fixes #38859.

The explanation of what is going wrong is in #38859 (comment) and the next comment.

The problem was that we modify the stats delta for entries in applyRaftCommandToBatch here and here. We then conditionally replace this stats delta here if the raft command was proposed with a DeprecatedDelta instead of a new Delta field.
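The ordering problem described above can be sketched in isolation. This is a minimal illustration, not the actual patch: `MVCCStats` is reduced to two fields, and `command`, `buggyDelta`, `fixedDelta`, and the `belowRaft` adjustment are hypothetical names chosen for the example. It assumes the intended behavior is that below-Raft adjustments are kept regardless of which field carried the proposer's delta.

```go
package main

import "fmt"

// MVCCStats is a hypothetical, heavily reduced stand-in for the real stats type.
type MVCCStats struct {
	SysBytes  int64
	LiveBytes int64
}

// Add accumulates another delta into s.
func (s *MVCCStats) Add(o MVCCStats) {
	s.SysBytes += o.SysBytes
	s.LiveBytes += o.LiveBytes
}

// command models a replicated command that carries its stats delta in one of
// two fields (hypothetical shape, for illustration only).
type command struct {
	Delta           MVCCStats  // new field
	DeprecatedDelta *MVCCStats // set only by proposers that still use the old field
}

// buggyDelta applies the below-Raft adjustment first and then replaces the
// whole delta when DeprecatedDelta is set, which silently drops the adjustment.
func buggyDelta(cmd command, belowRaft MVCCStats) MVCCStats {
	delta := cmd.Delta
	delta.Add(belowRaft)
	if cmd.DeprecatedDelta != nil {
		delta = *cmd.DeprecatedDelta // overwrites the adjustment made above
	}
	return delta
}

// fixedDelta picks the proposer's delta first and only then adds the
// below-Raft adjustment, so the adjustment survives in both cases.
func fixedDelta(cmd command, belowRaft MVCCStats) MVCCStats {
	delta := cmd.Delta
	if cmd.DeprecatedDelta != nil {
		delta = *cmd.DeprecatedDelta
	}
	delta.Add(belowRaft)
	return delta
}

func main() {
	cmd := command{DeprecatedDelta: &MVCCStats{LiveBytes: 100}}
	adj := MVCCStats{SysBytes: 8}
	fmt.Printf("buggy: %+v\n", buggyDelta(cmd, adj))
	fmt.Printf("fixed: %+v\n", fixedDelta(cmd, adj))
}
```

With a command proposed through `DeprecatedDelta`, the buggy ordering reports no `SysBytes` change at all, which is the kind of stats divergence a mixed-version test would surface.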

I'm not adding a unit test here because this is a glaring bug and we don't test either of these migrations at all anymore (since 78e1866). Our acceptance test suite is catching the issue, so I think that's sufficient.

Release note: None

@nvanbenschoten nvanbenschoten requested review from ajwerner and a team July 18, 2019 22:47
@ajwerner ajwerner left a comment

:lgtm: Thanks!

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replica_application.go, line 594 at r1 (raw file):

	// upgrades. Thanks to commutativity, the spanlatch manager does not have to
	// serialize on the stats key.
	ms := *replicaState.Stats

nit: this doesn't necessarily seem like the place to do this dereference, it doesn't get used until the next set of conditionals and even then not in the common case. I don't think it'd be crazy to do it in each of the two branches below that use it.
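A small sketch of what this nit is asking for, under loud assumptions: `stats`, `replicaState`, and the two rare branches `branchA` and `branchB` are hypothetical stand-ins, since the real conditionals are not shown in this thread. The point is only that the copy moves next to its uses and off the common path.

```go
package main

import "fmt"

// stats and replicaState are hypothetical, reduced stand-ins for the real types.
type stats struct{ KeyCount int64 }

type replicaState struct{ Stats *stats }

// eager copies the stats before it knows whether any branch needs them.
func eager(rs replicaState, branchA, branchB bool) int64 {
	ms := *rs.Stats // runs on every call, including the common path
	switch {
	case branchA:
		return ms.KeyCount / 2
	case branchB:
		return ms.KeyCount * 2
	}
	return 0
}

// scoped copies the stats only inside the branches that actually read them.
func scoped(rs replicaState, branchA, branchB bool) int64 {
	switch {
	case branchA:
		ms := *rs.Stats
		return ms.KeyCount / 2
	case branchB:
		ms := *rs.Stats
		return ms.KeyCount * 2
	}
	return 0
}

func main() {
	rs := replicaState{Stats: &stats{KeyCount: 10}}
	fmt.Println(eager(rs, true, false), scoped(rs, true, false))
	fmt.Println(eager(rs, false, false), scoped(rs, false, false))
}
```

Both functions return the same values; the scoped form just keeps the dereference out of the path that never uses it.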


pkg/storage/replica_application.go, line 687 at r1 (raw file):

	start := timeutil.Now()

	// TODO(ajwerner): This assertion no longer makes much sense.

Indeed, it is garbage. We read the same thing from the same batch twice and compare it to itself.

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/storage/replica_application.go, line 594 at r1 (raw file):

Previously, ajwerner wrote…

nit: this doesn't necessarily seem like the place to do this dereference, it doesn't get used until the next set of conditionals and even then not in the common case. I don't think it'd be crazy to do it in each of the two branches below that use it.

Done.

@ajwerner ajwerner left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

@nvanbenschoten
I'm going to see what I can do about adding real unit tests around this behavior when I address the feedback on #38954 and add some tighter mocking around entry application.

bors r=ajwerner

craig bot pushed a commit that referenced this pull request Jul 19, 2019
38976: storage: account for below Raft stats changes when using DeprecatedDelta r=ajwerner a=nvanbenschoten

Fixes #38859.

The explanation of what is going wrong is in #38859 (comment) and the next comment.

The problem was that we modify the stats delta for entries in `applyRaftCommandToBatch` [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L600) and [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L631). We then conditionally replace this stats delta [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L325) if the raft command was proposed with a `DeprecatedDelta` instead of a new `Delta` field.

I'm not adding a unit test here because this is a glaring bug and we don't test either of these migrations at all anymore (since 78e1866). Our acceptance test suite is catching the issue, so I think that's sufficient.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot commented Jul 19, 2019

Build succeeded

@craig craig bot merged commit 19c5c09 into cockroachdb:master Jul 19, 2019
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/fixStatsPanic branch July 19, 2019 01:37
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 2, 2019
The new package provides abstractions and routines associated with the
application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about
  making storage abstractions more clear and easier to understand
  in isolation. One commonly discussed proposal is introducing a
  `storage/replicate` package that would encapsulate the concerns of
  raft replication (e.g. log manipulation, snapshots, leader election,
  heartbeats, etc.). This `storage/apply` package will fit in nicely
  alongside a replication abstraction.
- Initial discussion on cockroachdb#38954 concluded that adding an optimization
  to acknowledge clients after their raft entries have committed but
  before they have been applied with the current code structure was
  moving in the opposite direction and making things even harder to
  understand due to the introduction of more complex state management.
- Recent instability in this area (cockroachdb#38976, cockroachdb#39064, cockroachdb#39135, cockroachdb#39203) has
  revealed that there exists a high degree of difficulty involved in testing
  any of the logic in the area of raft entry application. This has naturally
  led to testing at a distance using tools like testing hooks, which is
  frustrating and delicate. As a result, we're missing tests for things
  like old migrations that we still need to support. We also have trouble
  writing regression tests when bugs do pop up.
- The proposed optimization in cockroachdb#17500 (comment)
  to apply committed raft entries to the Replica storage engine asynchronously
  in a thread separate from the raft processing thread will make entry
  application significantly more complex. For instance, we'll likely
  need to introduce a separate scheduler to coordinate entry application
  passes across Ranges on a node. The scheduler will likely want to
  prioritize leaders over followers and embed other policies to optimize
  for total system throughput. There's a strong desire to isolate this new
  complexity and to give the logic a place to live.
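
As a rough illustration of the scheduling policy mentioned in the last bullet (leaders before followers), here is a minimal sketch built on `container/heap`. Nothing here is part of the proposal itself: `rangeWork`, `workQueue`, and `drainOrder` are hypothetical names, and a real scheduler would presumably weigh more than leadership.

```go
package main

import (
	"container/heap"
	"fmt"
)

// rangeWork is a hypothetical unit of pending application work for one Range.
type rangeWork struct {
	rangeID  int64
	isLeader bool
	seq      int // arrival order, keeps scheduling FIFO within a class
}

// workQueue is a min-heap that orders leaders ahead of followers.
type workQueue []rangeWork

func (q workQueue) Len() int { return len(q) }
func (q workQueue) Less(i, j int) bool {
	if q[i].isLeader != q[j].isLeader {
		return q[i].isLeader // leaders sort first
	}
	return q[i].seq < q[j].seq
}
func (q workQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *workQueue) Push(x interface{}) { *q = append(*q, x.(rangeWork)) }
func (q *workQueue) Pop() interface{} {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

// drainOrder returns the range IDs in the order this policy would process them.
func drainOrder(items []rangeWork) []int64 {
	q := &workQueue{}
	for _, it := range items {
		heap.Push(q, it)
	}
	var order []int64
	for q.Len() > 0 {
		order = append(order, heap.Pop(q).(rangeWork).rangeID)
	}
	return order
}

func main() {
	fmt.Println(drainOrder([]rangeWork{
		{rangeID: 1, isLeader: false, seq: 0},
		{rangeID: 2, isLeader: true, seq: 1},
		{rangeID: 3, isLeader: false, seq: 2},
		{rangeID: 4, isLeader: true, seq: 3},
	}))
}
```

Keeping the policy inside a single `Less` method is one way to give this kind of logic "a place to live" that can be unit tested without any raft machinery.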

The PR begins to address these concerns by formalizing the process of
applying committed raft entries. To start, this makes the process easier
to understand both in terms of the macro-level steps that are taken during
application of a batch of entries and in terms of the impact that an individual
command has on the replicated state machine. For instance, the PR helps provide
answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit
to write targeted unit tests. Not only can the `storage/apply` package be tested
with a mock state machine (done in this PR), but we can test Replica's
implementation of the state machine interface in isolation without needing
to touch raft at all.
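
To make the testing claim concrete, here is a minimal sketch of driving an application loop against a mock state machine. It is not the `storage/apply` API: `Command`, `StateMachine`, `ApplyCommitted`, and `mockStateMachine` are hypothetical names, and the real interfaces are likely richer (batching, decoding, acknowledgement).

```go
package main

import (
	"errors"
	"fmt"
)

// Command is a hypothetical decoded replicated command.
type Command struct {
	Index   uint64
	Payload string
}

// StateMachine is a hypothetical, minimal interface that an application
// routine could depend on instead of a concrete Replica.
type StateMachine interface {
	Apply(cmd Command) error
}

// ApplyCommitted applies commands in order and stops at the first error.
// It returns how many commands were applied successfully.
func ApplyCommitted(sm StateMachine, cmds []Command) (int, error) {
	for i, cmd := range cmds {
		if err := sm.Apply(cmd); err != nil {
			return i, fmt.Errorf("applying index %d: %w", cmd.Index, err)
		}
	}
	return len(cmds), nil
}

// mockStateMachine records what it was asked to apply and can inject a
// failure at a chosen index (0 disables injection).
type mockStateMachine struct {
	applied []uint64
	failAt  uint64
}

func (m *mockStateMachine) Apply(cmd Command) error {
	if m.failAt != 0 && cmd.Index == m.failAt {
		return errors.New("injected failure")
	}
	m.applied = append(m.applied, cmd.Index)
	return nil
}

func main() {
	sm := &mockStateMachine{failAt: 2}
	n, err := ApplyCommitted(sm, []Command{{Index: 1}, {Index: 2}, {Index: 3}})
	fmt.Println(n, err, sm.applied)
}
```

Because the loop only sees the interface, a test can assert on ordering and error handling without a storage engine or a raft group.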

Finally, the refactor paves the way for making the proposed change in cockroachdb#38954
in a much cleaner way. This is demonstrated in the next commit, which is being
included here to show why certain things were designed the way they were
but will not be merged with this PR.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 5, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 6, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 6, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 6, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 7, 2019
craig bot pushed a commit that referenced this pull request Aug 7, 2019
39254: storage/apply: create apply package for raft entry application r=nvanbenschoten a=nvanbenschoten

The new package provides abstractions and routines associated with the application of committed raft entries to a replicated state machine.

This was inspired by four driving forces:
- We've been having a number of discussions on the Core team about making storage abstractions more clear and easier to understand in isolation. One commonly discussed proposal is introducing a `storage/replicate` package that would encapsulate the concerns of raft replication (e.g. log manipulation, snapshots, leader election, heartbeats, etc.). This `storage/apply` package will fit in nicely alongside a replication abstraction.
- Initial discussion on #38954 concluded that adding an optimization to acknowledge clients after their raft entries have committed but before they have been applied with the current code structure was moving in the opposite direction and making things even harder to understand due to the introduction of more complex state management.
- Recent instability in this area (#38976, #39064, #39135, #39203) has revealed that there exists a high degree of difficulty involved in testing any of the logic in the area of raft entry application. This has naturally led to testing at a distance using tools like testing hooks, which is frustrating and delicate. As a result, we're missing tests for things like old migrations that we still need to support. We also have trouble writing regression tests when bugs do pop up.
- The proposed optimization in #17500 (comment) to apply committed raft entries to the Replica storage engine asynchronously in a thread separate from the raft processing thread will make entry application significantly more complex. For instance, we'll likely need to introduce a separate scheduler to coordinate entry application passes across Ranges on a node. The scheduler will likely want to prioritize leaders over followers and embed other policies to optimize for total system throughput. There's a strong desire to isolate this new complexity and to give the logic a place to live.

The PR begins to address these concerns by formalizing the process of applying committed raft entries. To start, this makes the process easier to understand both in terms of the macro-level steps that are taken during application of a batch of entries and in terms of the impact that an individual command has on the replicated state machine. For instance, the PR helps provide answers to all of the following questions:

- What are the stages of raft entry application?
- What is the difference between a "raft entry" and a "replicated command"?
- What can a command do besides apply its write batch to the storage engine?
- What does it mean for a successfully replicated command to be rejected during application?
- When can we acknowledge the outcome of a raft proposal?

The refactor also uncovers a large testing surface that future PRs will exploit to write targeted unit tests. Not only can the `storage/apply` package be tested with a mock state machine (done in this PR), but we can test Replica's implementation of the state machine interface in isolation without needing to touch raft at all.

Finally, the refactor paves the way for making the proposed change in #38954 in a much cleaner way. This is demonstrated in the second commit, which is being included here to show why certain things were designed the way they were but will not be merged with this PR.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Aug 15, 2019
Successfully merging this pull request may close these issues.

teamcity: failed test: version-upgrade
3 participants