Add WAL documentation. Also fix some minor metrics registration details #16270
Conversation
agent/metrics_test.go (outdated diff)

    respRec := httptest.NewRecorder()
    recordPromMetrics(t, a, respRec)

    require.NotContains(t, respRec.Body.String(), "agent_4_raft_wal")
I think we might still need to defer/call respRec.Body.Close()
to free resources?
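As a rough sketch of where that could go (not the PR's actual code; `recordPromMetrics` and the agent value `a` are assumed from the existing test file, and the test name is made up): `ResponseRecorder.Body` is a `*bytes.Buffer` with no `Close` method, so an explicit close would have to go through `Result()`.

```go
import (
	"io"
	"net/http/httptest"
	"testing"

	"github.com/stretchr/testify/require"
)

// Sketch only: recordPromMetrics and the agent value `a` are assumed from the
// surrounding test file in this PR.
func TestAgentMetrics_WALNotRegistered_sketch(t *testing.T) {
	respRec := httptest.NewRecorder()
	recordPromMetrics(t, a, respRec)

	// ResponseRecorder.Body is a *bytes.Buffer and has no Close method; the
	// closable io.ReadCloser only appears on the response from Result().
	res := respRec.Result()
	defer res.Body.Close()

	body, err := io.ReadAll(res.Body)
	require.NoError(t, err)
	require.NotContains(t, string(body), "agent_4_raft_wal")
}
```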
agent/metrics_test.go (outdated diff)

    func TestHTTPHandlers_AgentMetrics_WAL_Prometheus(t *testing.T) {
        skipIfShortTesting(t)
        // This test cannot use t.Parallel() since we modify global state, ie the global metrics instance
Leaving this comment with some gaps in context, so apologies in advance.
I understand why we're doing this, and it seems to be an existing pattern in this file, but modifying global state is usually bad. Not necessarily one for this PR, but could we just clone/create new metrics instances and operate on these instead?
Great callout. This is a shortcoming of go-metrics in general. I'd love to move away from go-metrics to something more modern, but that comes with a lot of compatibility issues since all our metrics config and output depends on quirks of how go-metrics has always done things. All HashiCorp products are also built on it, so changing how go-metrics works is a big change.
If we wanted to fix just the global state part it would at least mean updating every instrumentation point in our code that calls metrics.*() functions to instead call a method on a specific instance which is part of the agent dependencies and plumbed through all the other modules and layers. I'd love to get there one day but it's not been something we've chosen to bite off just yet!
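For anyone skimming without context, a minimal sketch of the difference being discussed (using go-metrics as it exists today; the metric key is made up): the package-level helpers write to a single process-global instance, whereas a per-agent instance would have to be plumbed to every caller.

```go
package main

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

func main() {
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)

	// Today's pattern: install a process-global Metrics instance once, then any
	// instrumentation point can call the package-level helpers.
	metrics.NewGlobal(metrics.DefaultConfig("consul"), sink)
	metrics.IncrCounter([]string{"example", "wal", "appends"}, 1) // illustrative key

	// The non-global alternative discussed above: a dedicated instance that
	// would need to be passed through the agent's dependencies to every caller.
	m, err := metrics.New(metrics.DefaultConfig("consul"), sink)
	if err != nil {
		panic(err)
	}
	m.IncrCounter([]string{"example", "wal", "appends"}, 1)
}
```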
agent/setup.go (outdated diff)

    // above. We should probably fix that but I want to not change behavior right
    // now. If we are a server, add summaries for WAL and verifier metrics.
    if isServer && cfg.RaftLogStoreConfig.Verification.Enabled {
        verifierCounters := make([]prometheus.CounterDefinition, 0)
Might be a simple optimization to pass the cap here?
Suggested change:
- verifierCounters := make([]prometheus.CounterDefinition, 0)
+ verifierCounters := make([]prometheus.CounterDefinition, 0, len(verifier.MetricDefinitions.Counters))
agent/setup.go (outdated diff)

        counters = append(counters, verifierCounters)
    }
    if isServer && cfg.RaftLogStoreConfig.Backend == consul.LogStoreBackendWAL {
        walCounters := make([]prometheus.CounterDefinition, 0)
Might be a simple optimization to pass the cap here?
Suggested change:
- walCounters := make([]prometheus.CounterDefinition, 0)
+ walCounters := make([]prometheus.CounterDefinition, 0, len(wal.MetricDefinitions.Counters))
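Both suggestions are the same pre-allocation pattern. A standalone sketch of it below, where `walCounterDefs` is a made-up stand-in for the `wal.MetricDefinitions.Counters` slice referenced in the diff:

```go
package main

import (
	"fmt"

	"github.com/armon/go-metrics/prometheus"
)

// walCounterDefs stands in for wal.MetricDefinitions.Counters from the PR.
var walCounterDefs = []prometheus.CounterDefinition{
	{Name: []string{"raft", "wal", "example_a"}, Help: "Illustrative counter."},
	{Name: []string{"raft", "wal", "example_b"}, Help: "Illustrative counter."},
}

func main() {
	// Passing the capacity up front means append never has to reallocate the
	// backing array while the definitions are copied over.
	walCounters := make([]prometheus.CounterDefinition, 0, len(walCounterDefs))
	for _, d := range walCounterDefs {
		walCounters = append(walCounters, d)
	}
	fmt.Println(len(walCounters), cap(walCounters)) // 2 2
}
```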
Just a few suggestions here and there.
    may help to reduce disk IO and log storage operation times. Disabling free list syncing will however increase
    the startup time for a server as it must scan the raft.db file for free space instead of loading the already
    populated free list structure.

    ~> Note that as of Consul 1.15 there is a [new experimental storage
    backend](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
Suggested change:
- backend](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
+ backend](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
    may help to reduce disk IO and log storage operation times. Disabling free list syncing will however increase
    the startup time for a server as it must scan the raft.db file for free space instead of loading the already
    populated free list structure.

    ~> Note that as of Consul 1.15 there is a [new experimental storage
Should we call out what the storage backend is?
@@ -452,6 +456,11 @@ These metrics are used to monitor the health of the Consul servers.

    | `consul.raft.last_index` | Represents the raft applied index. | index | gauge |
    | `consul.raft.leader.dispatchLog` | Measures the time it takes for the leader to write log entries to disk. | ms | timer |
    | `consul.raft.leader.dispatchNumLogs` | Measures the number of logs committed to disk in a batch. | logs | gauge |
    | `consul.raft.logstore.verifier.checkpoints_written` | Counts the number of checkpoint entries written to the LogStore. | checkpoints | counter |
    | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](http://localhost:3000/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
Suggested change:
- | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](http://localhost:3000/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
+ | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
@@ -452,6 +456,11 @@ These metrics are used to monitor the health of the Consul servers.

    | `consul.raft.last_index` | Represents the raft applied index. | index | gauge |
    | `consul.raft.leader.dispatchLog` | Measures the time it takes for the leader to write log entries to disk. | ms | timer |
    | `consul.raft.leader.dispatchNumLogs` | Measures the number of logs committed to disk in a batch. | logs | gauge |
    | `consul.raft.logstore.verifier.checkpoints_written` | Counts the number of checkpoint entries written to the LogStore. | checkpoints | counter |
    | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](http://localhost:3000/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
    | `consul.raft.logstore.verifier.ranges_verified` | Counts the number of log ranges for which a verification report has been completed. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | log ranges verifications | counter |
Suggested change:
- | `consul.raft.logstore.verifier.ranges_verified` | Counts the number of log ranges for which a verification report has been completed. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | log ranges verifications | counter |
+ | `consul.raft.logstore.verifier.ranges_verified` | Counts the number of log ranges for which a verification report has been completed. See [monitoring experimental backends](/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | log ranges verifications | counter |
    | `consul.raft.logstore.verifier.checkpoints_written` | Counts the number of checkpoint entries written to the LogStore. | checkpoints | counter |
    | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](http://localhost:3000/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
    | `consul.raft.logstore.verifier.ranges_verified` | Counts the number of log ranges for which a verification report has been completed. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | log ranges verifications | counter |
    | `consul.raft.logstore.verifier.read_checksum_failures` | Counts the number of times a range of logs between two check points contained at least one disk corruption. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | disk corruptions | counter |
Suggested change:
- | `consul.raft.logstore.verifier.read_checksum_failures` | Counts the number of times a range of logs between two check points contained at least one disk corruption. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | disk corruptions | counter |
+ | `consul.raft.logstore.verifier.read_checksum_failures` | Counts the number of times a range of logs between two check points contained at least one disk corruption. See [monitoring experimental backends](/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | disk corruptions | counter |
    **7. Monitor target server raft metrics and logs.**

    See the section below on [monitoring WAL tests](#monitoring-wal-tests).
The link does not work. Maybe (#monitoring) is the one you meant.
    Consul 1.15 introduced a new experimental storage backend. This guide explains
    how to configure it and test it out safely. The new backend is called `wal` in
    configuration.
Suggested change:
- Consul 1.15 introduced a new experimental storage backend. This guide explains
- how to configure it and test it out safely. The new backend is called `wal` in
- configuration.
+ Consul 1.15 introduced WAL, a new experimental storage backend. This guide explains
+ how to configure it and test it out safely. The new backend is called `wal` in
+ configuration.
    files from disk. With BoltDB uses as a log, sudden burst of writes say three
    times larger than the normal volume can suddenly cause the file to grow to about
    several times it's steady-state size. After the next snapshot is taken, and the
This, until the end of the paragraph, might be a bit confusing to digest.
Suggested change:
- files from disk. With BoltDB uses as a log, sudden burst of writes say three
- times larger than the normal volume can suddenly cause the file to grow to about
- several times it's steady-state size. After the next snapshot is taken, and the
+ files from disk. When BoltDB is used as a log backend, sudden burst of writes at a rate 2-3x higher than the normal volume, can suddenly cause the log file to grow to several times it's steady-state size. After the next snapshot is taken, and the
    files from disk. With BoltDB uses as a log, sudden burst of writes say three
    times larger than the normal volume can suddenly cause the file to grow to about
    several times it's steady-state size. After the next snapshot is taken, and the
    oldest logs truncated again, the file is left as mostly empty space. Tracking
Regarding "the file is left as mostly empty space": shall we clarify that it keeps the same size?
    Even if this has never happened to a catastrophic degree in a cluster, the fact
    that it's a risk has meant that Consul has erred on the side of never letting
    too many logs accumulate in the LogStore. Significantly larger BoltDB files are
    somewhat slower in general because it's a tree and so still has log(N) work to
    do n every write. But out user's experience showed that the larger the file, the
    more likely it is to have a large freelist or suddenly form one after a burst of
    writes. For this reason, the default options for how frequently we make a full
    snapshot and truncate the logs, and for how many logs we keep around have always
    been aggressively set towards keeping BoltDB small rather than using disk IO the
    most efficiently.
Suggested change:
- Even if this has never happened to a catastrophic degree in a cluster, the fact
- that it's a risk has meant that Consul has erred on the side of never letting
- too many logs accumulate in the LogStore. Significantly larger BoltDB files are
- somewhat slower in general because it's a tree and so still has log(N) work to
- do n every write. But out user's experience showed that the larger the file, the
- more likely it is to have a large freelist or suddenly form one after a burst of
- writes. For this reason, the default options for how frequently we make a full
- snapshot and truncate the logs, and for how many logs we keep around have always
- been aggressively set towards keeping BoltDB small rather than using disk IO the
- most efficiently.
+ Even if this has never happened to a catastrophic degree in a cluster, Consul was tuned to avoid for too many logs to accumulate in the LogStore, in order to reduce/mitigate the risk. Significantly larger BoltDB files are
+ somewhat slower in general because it's a tree and so still has log(N) work to
+ do n every write. But out user's experience showed that the larger the file, the
+ more likely it is to have a large freelist or suddenly form one after a burst of
+ writes. For this reason, the default options for how frequently we make a full
+ snapshot and truncate the logs, and for how many logs we keep around have always
+ been aggressively set towards keeping BoltDB small rather than using disk IO the
+ most efficiently.
Also:
"Significantly larger BoltDB files are somewhat slower in general because it's a tree and so still has log(N) work to do n every write. But out user's experience showed that the larger the file, the more likely it is to have a large freelist or suddenly form one after a burst of writes."
Do we mean that both (the fact it is a tree and the higher probability of a freelist) are a reason for the bad performance? Or that despite the good performance offered by the tree, freelists drag it down?
Also, in "because it's a tree and so still has log(N) work to do n every write": did we mean "on every write"?
Also, I am assuming that by "freelist" we refer either to chunks of free space in the file or to the data structures we use to track the free space in the file.
It is not mentioned anywhere else in the file so it might be worth a clarification.
    Other reliability issues such as [followers being unable to catch
    up](/consul/docs/agent/telemetry#raft-replication-capacity-issues) also stem
    from this need to carefully balance the size of the BoltDB log store against how
    long snapshots take to restore - there is a simple solution to that issue if
We do not state the solution, and the jump to WAL is not fully obvious. We could state the solution and then clarify the technical issues to be solved before being able to use the approach (letting the file grow), something like:
"A simple solution for that issue is ... but could not get implemented unless/until letting logs grow ..."
And then specify that WAL follows this approach.
    While not every user will experience a huge difference in performance, the WAL
    backend avoids these performance concerns entirely. It is more performant when
    directly measured due to solving a simpler storage problem than BoltDB was
    designed for. For example it can commit a single log entry with on fsync instead
Suggested change:
- designed for. For example it can commit a single log entry with on fsync instead
+ designed for. For example it can commit a single log entry with one fsync instead

Shall we use monospace for `fsync`?
    and so reducing disk IO with slower snapshots or keeping logs around to catch up
    slower followers are all possible.
Suggested change:
- and so reducing disk IO with slower snapshots or keeping logs around to catch up
- slower followers are all possible.
+ and so strategies for reducing disk IO with slower snapshots or for keeping logs around to permit slower followers to catch up with cluster state, are all possible.
    - All servers in the Datacenter should be upgraded to Consul 1.15 using the
    [standard upgrade procedure](/consul/docs/upgrading/general-process) and
    the [1.15 upgrade notes](/consul/docs/upgrading/upgrade-specific#consul-1-15-x).
    - You need a Consul cluster with at least 3 nodes to safely test the new
Is this enough? We mention below that enabling WAL on a voting server might itself generate an issue:

    If you enable a server to use WAL using Consul OSS or on a voting server with
    Consul Enterprise, it's possible that the WAL could cause corruption of
    that server's state (with the caveats above) and then become the leader and
    replicate that corruption to all other servers.

Or is this more related to the fact that, even if it all goes well, users with one server will have downtime? To be safe, the real requirement is to be able to use non-voting servers, i.e. to use Consul Enterprise.
    As noted in [Risks](#risks), Consul Enterprise users should select a non-voting
    server at first. For Consul OSS users, or Enterprise users who don't have
    non-voting servers, select one of the follower servers.
Shall we hint at the fact that ENT users could test it by adding a new non-voting server in case they don't have one deployed yet? This would reduce their risk.
    * `consul.raft.commitTime` measures the time to commit new writes on a quorum of
    servers. It should be the same or lower after deploying WAL. Even if WAL is
In telemetry.mdx we use a different definition:
Measures the time it takes to commit a new entry to the Raft log on the leader.
It might be better to reuse the same definition.
    * `consul.raft.rpc.appendEntries.storeLogs` measures the time spent persisting
    logs to disk on each _follower_. It should be the same or lower for
    WAL-enabled followers.
In telemetry.mdx we use a different definition:
Measures the time taken to add any outstanding logs for an agent, since the last appendEntries was invoked
I feel like the extra specificity here is useful in terms of what it means for WAL. The one in telemetry is arguably misleading even: it's only emitted by a follower, it's not "any outstanding logs" but just whatever the next batch the leader chose to send is (and there may be multiple in flight at once), and "since the last appendEntries..." is just wrong; it's the time from receiving an append entries RPC to responding to it. If the previous one was 30s ago it won't suddenly jump up to 30s... Maybe we should just fix the telemetry.mdx definition?
    * `consul.raft.replication.appendEntries.rpc` measures the time taken for each
    `AppendEntries` RPC from the leader's perspective. If this is significantly
In telemetry.mdx we use a different definition:
Measures the time taken by the append entries RFC, to replicate the log entries of a leader agent onto its follower agent(s)
    * `consul.raft.leader.dispatchLog` measures the time spent persisting logs to
    disk on the _leader_. It is only relevant if a WAL-enabled server becomes a
In telemetry.mdx we use a different definition:
Measures the time it takes for the leader to write log entries to disk.
    for by matching the `peer_id` label value to the server IDs listed by `consul
    operator raft list-peers`.

    * `consul.raft.compactLogs` measures the time take to truncate the logs after a
I could not find this in telemetry.mdx.
Should it be there too?
I've spent a few hours on this and still haven't made it all the way through. I have a few big things. First, there is too much information on the usage page; and maybe too much information in general. I wonder if you can offload a lot of this background information about our decision to create a new backend to a blog or white paper or some other channel that isn't core documentation. The second point is related to the first in that there are two content types in one page -- the conceptual information and the usage information. If we want to keep the conceptual information (first 100 lines) in the docs, we should put that into an overview page. To that end, placing this type of usage information in the Upgrade documentation seems like an odd choice. As far as I can tell, the configuration is on the Consul server agent, so I would probably put this with the Agent topics.
    - `raft_logstore` ((#raft_logstore)) This is a nested object that allows
    configuring options for Raft's LogStore component which is used to persist
    logs and crucial Raft state on disk during writes. This was added in Consul
    1.15.
Suggested change:
- - `raft_logstore` ((#raft_logstore)) This is a nested object that allows
-   configuring options for Raft's LogStore component which is used to persist
-   logs and crucial Raft state on disk during writes. This was added in Consul
-   1.15.
+ - `raft_logstore` ((#raft_logstore)) This is a nested object that allows
+   configuring options for Raft's LogStore component, which persists
+   logs and crucial Raft state on disk during writes.
I think we need to get out of the habit of including the minimum supported version number at this level. Doing so is appropriate for usage page where we have a dedicated Requirements section, but for reference and concept pages we should rely on the doc versioning mechanism. I think the deprecation notices are OK, though.
    - `backend` ((#raft_logstore_backend)) This allows selection of which storage
    engine to use to persist logs. Valid options are `boltdb` or `wal`. Default
    is `boltdb`. As of Consul 1.15, `wal` is a new and experimental backend that
    should be used with caution. See [our experimental WAL backend testing
    guide](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
    to learn how to safely evaluate it for your workload.
Suggested change:
- - `backend` ((#raft_logstore_backend)) This allows selection of which storage
-   engine to use to persist logs. Valid options are `boltdb` or `wal`. Default
-   is `boltdb`. As of Consul 1.15, `wal` is a new and experimental backend that
-   should be used with caution. See [our experimental WAL backend testing
-   guide](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
-   to learn how to safely evaluate it for your workload.
+ - `backend` ((#raft_logstore_backend)) Specifies which storage
+   engine to use to persist logs. Valid options are `boltdb` or `wal`. Default
+   is `boltdb`. The `wal` option specifies an experimental backend that
+   should be used with caution. Refer to [Testing the Experimental WAL LogStore Backend](/consul/docs/upgrading/instructions/testing-experimental-raft-backend) to learn how to safely evaluate it for your workload.
    - `disable_log_cache` ((#raft_logstore_disable_log_cache)) This allows
    disabling of the in-memory cache of recent logs. This exists mostly for
    performance testing purposes. In theory the log cache prevents disk reads
    for recent logs. In practice recent logs are still in OS page cache so tend
    not to be slow to read using either backend. We recommend leaving it enabled
    for now as we've not measured a significant improvement in any metric by
    disabling.
Suggested change:
- - `disable_log_cache` ((#raft_logstore_disable_log_cache)) This allows
-   disabling of the in-memory cache of recent logs. This exists mostly for
-   performance testing purposes. In theory the log cache prevents disk reads
-   for recent logs. In practice recent logs are still in OS page cache so tend
-   not to be slow to read using either backend. We recommend leaving it enabled
-   for now as we've not measured a significant improvement in any metric by
-   disabling.
+ - `disable_log_cache` ((#raft_logstore_disable_log_cache)) Boolean that disables in-memory caching for recent logs. Default is `true`. You can set this option to `false` to test backend performance.
+   The log cache prevents disk reads
+   for recent logs, but recent logs are still cached in the OS page. As a result, either backend reads the logs quickly. We recommend using the default setting.
I'm not sure how valuable the explanation of this setting is. If I understand it correctly, we're saying that it sort of doesn't work, anyway, because of the OS page cache. I would consider just deleting the part in the middle.
Honestly I was considering not documenting this at all but I feel equally icky about having undocumented settings. It's really there just so we can actually evaluate whether it even makes a difference or not. So far the answer seems "probably not" but it's less risky to leave it alone than change the behaviour since it doesn't help to remove it either.
Do you think this would be better:
- Undocumented (but still present for when we want to do perf testing)
- Documented as "don't use this it's just for internal testing"
- Documented as you proposed above, less verbose but with a general "you probably don't need to touch this ever" vibe?
    - `verification` ((#raft_logstore_verification)) This is a nested object that
    allows configuring online verification of the LogStore. Verification
    provides additional assurances that LogStore backends are correctly storing
    data. It imposes very low overhead on servers and is safe to run in
    production, however it's mostly useful when evaluating a new backend
    implementation.

    Verification must be enabled on the leader to have any effect and can be
    used with any backend. When enabled, the leader will periodically write a
    special "checkpoint" log message including checksums of all log entries
    written to Raft since the last checkpoint. Followers that have verification
    enabled will run a background task for each checkpoint that reads all logs
    directly from the LogStore and recomputes the checksum. A report is output
    as an INFO level log for each checkpoint.

    Checksum failure should never happen and indicate unrecoverable corruption
    on that server. The only correct response is to stop the server, remove its
    data directory, and restart so it can be caught back up with a correct
    server again. Please report verification failures including details about
    your hardware and workload via GitHub issues. See [our experimental WAL
    backend testing
    guide](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
    for more details on using these to evaluate a new backend.
Suggested change:
- - `verification` ((#raft_logstore_verification)) This is a nested object that
-   allows configuring online verification of the LogStore. Verification
-   provides additional assurances that LogStore backends are correctly storing
-   data. It imposes very low overhead on servers and is safe to run in
-   production, however it's mostly useful when evaluating a new backend
-   implementation.
-   Verification must be enabled on the leader to have any effect and can be
-   used with any backend. When enabled, the leader will periodically write a
-   special "checkpoint" log message including checksums of all log entries
-   written to Raft since the last checkpoint. Followers that have verification
-   enabled will run a background task for each checkpoint that reads all logs
-   directly from the LogStore and recomputes the checksum. A report is output
-   as an INFO level log for each checkpoint.
-   Checksum failure should never happen and indicate unrecoverable corruption
-   on that server. The only correct response is to stop the server, remove its
-   data directory, and restart so it can be caught back up with a correct
-   server again. Please report verification failures including details about
-   your hardware and workload via GitHub issues. See [our experimental WAL
-   backend testing
-   guide](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
-   for more details on using these to evaluate a new backend.
+ - `verification` ((#raft_logstore_verification)) Object that configures online verification of the LogStore.
+   Verification
+   provides additional assurances that LogStore backends are correctly storing
+   data. It imposes very low overhead on servers and is safe to run in
+   production. Verification is most useful for evaluating a new backend
+   implementation.
+   Verification must be enabled on the leader and can be
+   used with any backend. When enabled, the leader periodically writes a
+   special `checkpoint` log message that includes checksums of all log entries
+   written to Raft since the last checkpoint. Followers that have verification
+   enabled run a background task for each checkpoint that reads all logs
+   directly from the LogStore and recomputes the checksum. Consul outputs a report
+   as an `INFO` level log for each checkpoint.
+   Checksums should always pass, but if a failure occurs, it indicates unrecoverable corruption
+   on that Consul server. To resolve failure, stop the server, remove its
+   data directory, and restart so it can be caught back up with a correct
+   server again. Report verification failures including details about
+   your hardware and workload via GitHub issues. Refer to [Testing the Experimental WAL LogStore Backend](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
+   for more details on using these to evaluate a new backend.
There is a significant amount of usage information documented in this reference. While we expect there to be some overlap, we need to make sure that each content type serves its main purpose. I haven't gotten to the "Testing the Experimental WAL LogStore Backend" topic yet, but if the usage information is in both places, I think we should remove it from this section and just refer people to the usage page.
Yeah, I felt that. I found it hard to meaningfully describe what the config does without a baseline description of what "verification" is, and the other doc is more focussed on "This is how to test X" rather than "this is a thing called verification". If you think this content belongs somewhere else I'm happy with that choice too. One thing that we didn't have historically was a really good place/separation between concepts and background on subsystems and the actual reference for how to configure them. If we have a better place for the concepts now then I totally agree this should move there and just link to it.
    - `enabled` ((#raft_logstore_verification_enabled)) - Setting this to `true`
    will enable log verification checkpoints to be written (if leader) and
    verified on this server.
Suggested change:
- - `enabled` ((#raft_logstore_verification_enabled)) - Setting this to `true`
-   will enable log verification checkpoints to be written (if leader) and
-   verified on this server.
+ - `enabled` ((#raft_logstore_verification_enabled)) - Set to `true` to
+   allow this Consul server to write and verify log verification checkpoints when it is elected leader.
Per the style guide, use present tense.
    | `consul.raft.wal.stable_gets` | Counts how many calls to StableStore.Get or GetUint64. | calls | counter |
    | `consul.raft.wal.stable_sets` | Counts how many calls to StableStore.Set or SetUint64. | calls | counter |
    | `consul.raft.wal.tail_truncations` | Counts how many log entries have been truncated from the head - i.e. the newest entries. by graphing the rate of change over time you can see individual truncate calls as spikes. | logs entries truncated | counter |
Suggested change:
- | `consul.raft.wal.stable_gets` | Counts how many calls to StableStore.Get or GetUint64. | calls | counter |
- | `consul.raft.wal.stable_sets` | Counts how many calls to StableStore.Set or SetUint64. | calls | counter |
- | `consul.raft.wal.tail_truncations` | Counts how many log entries have been truncated from the head - i.e. the newest entries. by graphing the rate of change over time you can see individual truncate calls as spikes. | logs entries truncated | counter |
+ | `consul.raft.wal.stable_gets` | Counts how many calls to `StableStore.Get` or `GetUint64`. | calls | counter |
+ | `consul.raft.wal.stable_sets` | Counts how many calls to `StableStore.Set` or `SetUint64`. | calls | counter |
+ | `consul.raft.wal.tail_truncations` | Counts how many log entries have been truncated from the head. The newest entries are truncated first. You can graph the rate of change over time to see individual truncate calls as spikes. | logs entries truncated | counter |

Not sure if the words I put into backticks should be rendered as code, so feel free to ignore those suggestions.
heh, this (in my version too) is actually wrong and opposite!
    Consul 1.15 introduced a new experimental storage backend option. Learn how to
    configure and test it out in a safe way.
    ---

    # Testing the Experimental WAL LogStore Backend

    ## Introduction

    Consul 1.15 introduced a new experimental storage backend. This guide explains
    how to configure it and test it out safely. The new backend is called `wal` in
    configuration.
Suggested change:
- Consul 1.15 introduced a new experimental storage backend option. Learn how to
- configure and test it out in a safe way.
- ---
- # Testing the Experimental WAL LogStore Backend
- ## Introduction
- Consul 1.15 introduced a new experimental storage backend. This guide explains
- how to configure it and test it out safely. The new backend is called `wal` in
- configuration.
+ Learn how to safely configure and test the experimental WAL backend in your Consul deployment.
+ ---
+ # Enable the experimental WAL LogStore backend
+ This topic describes how to safely configure and test the WAL backend in your Consul deployment.
+ ## Introduction
+ Consul ships with an experimental storage backend called write-ahead log (WAL).
This is a usage page, so we should use the dedicated Requirements section to describe Consul version that includes WAL. Also, per recent changes to the style guide, article titles now use sentence style case.
    WAL is a acronym for "Write-Ahead Log". We called it this because it implements
    a traditional log with rotating, append-only log files. The current `LogStore`
    uses BoltDB which is a copy-on-write BTree which is less optimized for
    append-only workloads.

    ~> The `wal` backend is considered **experimental** in Consul 1.15. Please test
    it safely in pre-production first and verify on a subset of servers in
    production using this guide before fully enabling it on all servers.
Suggested change:
- WAL is a acronym for "Write-Ahead Log". We called it this because it implements
- a traditional log with rotating, append-only log files. The current `LogStore`
- uses BoltDB which is a copy-on-write BTree which is less optimized for
- append-only workloads.
- ~> The `wal` backend is considered **experimental** in Consul 1.15. Please test
- it safely in pre-production first and verify on a subset of servers in
- production using this guide before fully enabling it on all servers.
+ WAL implements a traditional log with rotating, append-only log files. The current `LogStore`
+ uses BoltDB, which is a copy-on-write BTree, which is less optimized for append-only workloads.
+ !> **Upgrade warning:** The WAL backend is experimental. You should test and verify the WAL backend on a subset of servers in pre-production before fully enabling it on all servers.
This is one of the designated uses for callouts per the style guide.
    it safely in pre-production first and verify on a subset of servers in
    production using this guide before fully enabling it on all servers.

    ## Why build a new backend?
Suggested change:
- ## Why build a new backend?
+ ### WAL versus BoltDB

I think this is still part of the introduction. In fact, most of the information in this section, I would argue, isn't documentation and would be better suited to a blog article.
I actually had `// TODO should this be in a blog?` right up until I posted the PR on this section. My main reasoning not to was just that I think this needs to live somewhere, and I don't think a blog will be ready on release day, so I wanted to land it somewhere at least initially so the content is available.
I'll go back and look at timing to see if we can make a blog happen in time.
    The WAL backend has been written to resolve some long-standing issues with the
    current BoltDB backend. The existing BoltDB log store has worked reliably for
    most users for years, however it is not the most efficient way to store
    append-only logs to disk since it was designed as a full key-value database. It
    was an expedient option when our raft library was first written and always
    assumed we'd replace it with something more purpose-built.

    Importantly, a BoltDB database is a single file that only ever grows. Deleting
    the oldest logs which we do regularly when we've made a new snapshots of the
    state, leaves free space in the file that needs to be tracked to be re-used on
    future writes. By contrast a simple segmented log can just delete the oldest log
    files from disk. With BoltDB uses as a log, sudden burst of writes say three
    times larger than the normal volume can suddenly cause the file to grow to about
    several times it's steady-state size. After the next snapshot is taken, and the
    oldest logs truncated again, the file is left as mostly empty space. Tracking
    this free space requires writing extra metadata proportional to the amount of
    free pages to disk with every write and so after such a burst, write latencies
    tend to increase - in some cases dramatically causing serious performance
    degradation to the cluster.

    Even if this has never happened to a catastrophic degree in a cluster, the fact
    that it's a risk has meant that Consul has erred on the side of never letting
    too many logs accumulate in the LogStore. Significantly larger BoltDB files are
    somewhat slower in general because it's a tree and so still has log(N) work to
    do n every write. But out user's experience showed that the larger the file, the
    more likely it is to have a large freelist or suddenly form one after a burst of
    writes. For this reason, the default options for how frequently we make a full
    snapshot and truncate the logs, and for how many logs we keep around have always
    been aggressively set towards keeping BoltDB small rather than using disk IO the
    most efficiently.
Suggested change:
- The WAL backend has been written to resolve some long-standing issues with the
- current BoltDB backend. The existing BoltDB log store has worked reliably for
- most users for years, however it is not the most efficient way to store
- append-only logs to disk since it was designed as a full key-value database. It
- was an expedient option when our raft library was first written and always
- assumed we'd replace it with something more purpose-built.
- Importantly, a BoltDB database is a single file that only ever grows. Deleting
- the oldest logs which we do regularly when we've made a new snapshots of the
- state, leaves free space in the file that needs to be tracked to be re-used on
- future writes. By contrast a simple segmented log can just delete the oldest log
- files from disk. With BoltDB uses as a log, sudden burst of writes say three
- times larger than the normal volume can suddenly cause the file to grow to about
- several times it's steady-state size. After the next snapshot is taken, and the
- oldest logs truncated again, the file is left as mostly empty space. Tracking
- this free space requires writing extra metadata proportional to the amount of
- free pages to disk with every write and so after such a burst, write latencies
- tend to increase - in some cases dramatically causing serious performance
- degradation to the cluster.
- Even if this has never happened to a catastrophic degree in a cluster, the fact
- that it's a risk has meant that Consul has erred on the side of never letting
- too many logs accumulate in the LogStore. Significantly larger BoltDB files are
- somewhat slower in general because it's a tree and so still has log(N) work to
- do n every write. But out user's experience showed that the larger the file, the
- more likely it is to have a large freelist or suddenly form one after a burst of
- writes. For this reason, the default options for how frequently we make a full
- snapshot and truncate the logs, and for how many logs we keep around have always
- been aggressively set towards keeping BoltDB small rather than using disk IO the
- most efficiently.
+ The WAL backend stores append-only logs to disk more efficiently than BoltDB because it is a full key-value database. A BoltDB database is a single file that continuously grows and deleting
+ the oldest logs leaves free space in the file that must be tracked to be re-used on
+ future writes. WAL is a simple segmented log that can delete the oldest log
+ files from disk.
+ With BoltDB, a burst of writes can cause the file to grow exponentially. After taking the next snapshot and the oldest logs are truncated, the file is mostly empty space. Tracking
+ the free space requires Consul to write extra metadata proportional to the amount of
+ free pages to disk with every write. After a burst of writes, write latencies
+ tend to increase that cause serious performance degradation to the cluster.
+ Consul never lets too many logs accumulate in the LogStore. Large BoltDB files are
+ slow as a result of its tree structure. It requires additional work for every write. But large files tend to either have a large freelist or suddenly form one after a burst of
+ writes. For this reason, the default options for how frequently we make a full
+ snapshot and truncate the logs, and for how many logs we keep around have always
+ been aggressively set toward keeping BoltDB small rather than using disk IO the
+ most efficiently.
I think this is too much detail for a usage doc. Can't we just say that the WAL backend is more efficient for writing logs than BoltDB because it has a truncating mechanism more suitable to rolling logs than BoltDB? We can move the details to either a blog about the change or to the release notes.
👍 Nothing blocking. Had a minor suggestion/question about whether we want to emit metrics for failed verifications in addition to logs.
@@ -62,12 +62,12 @@ func makeLogVerifyReportFn(logger hclog.Logger) verifier.ReportFn {

    if r.WrittenSum > 0 && r.WrittenSum != r.ExpectedSum {
        // The failure occurred before the follower wrote to the log so it
        // must be corrupted in flight from the leader!
        l2.Info("verification checksum FAILED: in-flight corruption",
Not for now... but would it be helpful to have metrics here for failures? Or is discovering these via aggregated logs enough?
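Purely to illustrate the question (the reply below notes the verifier already emits failure metrics itself), a hypothetical counter next to the log line could look something like this with go-metrics; the metric key here is invented and `r`/`l2` come from the surrounding report function:

```go
// Hypothetical sketch only, not the PR's code: r and l2 are from the
// surrounding makeLogVerifyReportFn closure, and this metric key is invented.
if r.WrittenSum > 0 && r.WrittenSum != r.ExpectedSum {
	// The failure occurred before the follower wrote to the log so it
	// must be corrupted in flight from the leader!
	metrics.IncrCounter([]string{"raft", "logstore", "verifier", "in_flight_corruptions"}, 1)
	l2.Info("verification checksum FAILED: in-flight corruption")
}
```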
We do already - they are emitted by the verifier directly - see the telemetry.mdx additions.
I think the documentation for a feature like this should include as much technical information about the conceptual design as possible. I'd have to respectfully disagree with @trujillo-adam about putting this in a white paper or blog, since this is a core feature that underpins the overall Consul system.
The functional parts of this PR were merged in #16388. The documentation was refactored by @im2nguyen and @trujillo-adam in the new PR #16387, so I'm going to close this one and continue with their suggestions there. Thanks everyone for the reviews, which have helped shape one or other of those PRs!
This PR includes the public documentation for the experimental WAL backend merged in #16176.
It also includes a couple of minor non-functional fixes discovered:
Feedback Requested
PR Checklist
- [ ] updated test coverage