Add WAL documentation. Also fix some minor metrics registration details #16270
Conversation
agent/metrics_test.go (outdated diff)

    respRec := httptest.NewRecorder()
    recordPromMetrics(t, a, respRec)

    require.NotContains(t, respRec.Body.String(), "agent_4_raft_wal")
I think we might still need to defer/call respRec.Body.Close()
to free resources?
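As a rough sketch of where that could go (not the PR's actual code; `recordPromMetrics` and the agent value `a` are assumed from the existing test file, and the test name is made up): `ResponseRecorder.Body` is a `*bytes.Buffer` with no `Close` method, so an explicit close would have to go through `Result()`.

```go
import (
	"io"
	"net/http/httptest"
	"testing"

	"github.com/stretchr/testify/require"
)

// Sketch only: recordPromMetrics and the agent value `a` are assumed from the
// surrounding test file in this PR.
func TestAgentMetrics_WALNotRegistered_sketch(t *testing.T) {
	respRec := httptest.NewRecorder()
	recordPromMetrics(t, a, respRec)

	// ResponseRecorder.Body is a *bytes.Buffer and has no Close method; the
	// closable io.ReadCloser only appears on the response from Result().
	res := respRec.Result()
	defer res.Body.Close()

	body, err := io.ReadAll(res.Body)
	require.NoError(t, err)
	require.NotContains(t, string(body), "agent_4_raft_wal")
}
```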
agent/metrics_test.go (outdated diff)

    func TestHTTPHandlers_AgentMetrics_WAL_Prometheus(t *testing.T) {
        skipIfShortTesting(t)
        // This test cannot use t.Parallel() since we modify global state, ie the global metrics instance
Leaving this comment with some gaps in context, so apologies in advance.
I understand why we're doing this, and it seems to be an existing pattern in this file, but modifying global state is usually bad. Not necessarily one for this PR, but could we just clone/create new metrics instances and operate on these instead?
Great callout. This is a shortcoming of go-metrics in general. I'd love to move away from go-metrics to something more modern, but that comes with a lot of compatibility issues since all our metrics config and output depends on quirks of how go-metrics has always done things. All HashiCorp products are also built on it, so changing how go-metrics works is a big change.
If we wanted to fix just the global state part it would at least mean updating every instrumentation point in our code that calls metrics.*() functions to instead call a method on a specific instance which is part of the agent dependencies and plumbed through all the other modules and layers. I'd love to get there one day but it's not been something we've chosen to bite off just yet!
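For anyone skimming without context, a minimal sketch of the difference being discussed (using go-metrics as it exists today; the metric key is made up): the package-level helpers write to a single process-global instance, whereas a per-agent instance would have to be plumbed to every caller.

```go
package main

import (
	"time"

	metrics "github.com/armon/go-metrics"
)

func main() {
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)

	// Today's pattern: install a process-global Metrics instance once, then any
	// instrumentation point can call the package-level helpers.
	metrics.NewGlobal(metrics.DefaultConfig("consul"), sink)
	metrics.IncrCounter([]string{"example", "wal", "appends"}, 1) // illustrative key

	// The non-global alternative discussed above: a dedicated instance that
	// would need to be passed through the agent's dependencies to every caller.
	m, err := metrics.New(metrics.DefaultConfig("consul"), sink)
	if err != nil {
		panic(err)
	}
	m.IncrCounter([]string{"example", "wal", "appends"}, 1)
}
```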
agent/setup.go (outdated diff)

    // above. We should probably fix that but I want to not change behavior right
    // now. If we are a server, add summaries for WAL and verifier metrics.
    if isServer && cfg.RaftLogStoreConfig.Verification.Enabled {
        verifierCounters := make([]prometheus.CounterDefinition, 0)
Might be a simple optimization to pass the cap here?
Suggested change:
- verifierCounters := make([]prometheus.CounterDefinition, 0)
+ verifierCounters := make([]prometheus.CounterDefinition, 0, len(verifier.MetricDefinitions.Counters))
agent/setup.go (outdated diff)

        counters = append(counters, verifierCounters)
    }
    if isServer && cfg.RaftLogStoreConfig.Backend == consul.LogStoreBackendWAL {
        walCounters := make([]prometheus.CounterDefinition, 0)
Might be a simple optimization to pass the cap here?
Suggested change:
- walCounters := make([]prometheus.CounterDefinition, 0)
+ walCounters := make([]prometheus.CounterDefinition, 0, len(wal.MetricDefinitions.Counters))
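Both suggestions are the same pre-allocation pattern. A standalone sketch of it below, where `walCounterDefs` is a made-up stand-in for the `wal.MetricDefinitions.Counters` slice referenced in the diff:

```go
package main

import (
	"fmt"

	"github.com/armon/go-metrics/prometheus"
)

// walCounterDefs stands in for wal.MetricDefinitions.Counters from the PR.
var walCounterDefs = []prometheus.CounterDefinition{
	{Name: []string{"raft", "wal", "example_a"}, Help: "Illustrative counter."},
	{Name: []string{"raft", "wal", "example_b"}, Help: "Illustrative counter."},
}

func main() {
	// Passing the capacity up front means append never has to reallocate the
	// backing array while the definitions are copied over.
	walCounters := make([]prometheus.CounterDefinition, 0, len(walCounterDefs))
	for _, d := range walCounterDefs {
		walCounters = append(walCounters, d)
	}
	fmt.Println(len(walCounters), cap(walCounters)) // 2 2
}
```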
Just a few suggestions here and there.
    may help to reduce disk IO and log storage operation times. Disabling free list syncing will however increase
    the startup time for a server as it must scan the raft.db file for free space instead of loading the already
    populated free list structure.

    ~> Note that as of Consul 1.15 there is a [new experimental storage
    backend](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
Suggested change:
- backend](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
+ backend](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
    may help to reduce disk IO and log storage operation times. Disabling free list syncing will however increase
    the startup time for a server as it must scan the raft.db file for free space instead of loading the already
    populated free list structure.

    ~> Note that as of Consul 1.15 there is a [new experimental storage
Should we call out what the storage backend is?
@@ -452,6 +456,11 @@ These metrics are used to monitor the health of the Consul servers.

    | `consul.raft.last_index` | Represents the raft applied index. | index | gauge |
    | `consul.raft.leader.dispatchLog` | Measures the time it takes for the leader to write log entries to disk. | ms | timer |
    | `consul.raft.leader.dispatchNumLogs` | Measures the number of logs committed to disk in a batch. | logs | gauge |
    | `consul.raft.logstore.verifier.checkpoints_written` | Counts the number of checkpoint entries written to the LogStore. | checkpoints | counter |
    | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](http://localhost:3000/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
Suggested change:
- | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](http://localhost:3000/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
+ | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
@@ -452,6 +456,11 @@ These metrics are used to monitor the health of the Consul servers.

    | `consul.raft.last_index` | Represents the raft applied index. | index | gauge |
    | `consul.raft.leader.dispatchLog` | Measures the time it takes for the leader to write log entries to disk. | ms | timer |
    | `consul.raft.leader.dispatchNumLogs` | Measures the number of logs committed to disk in a batch. | logs | gauge |
    | `consul.raft.logstore.verifier.checkpoints_written` | Counts the number of checkpoint entries written to the LogStore. | checkpoints | counter |
    | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](http://localhost:3000/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
    | `consul.raft.logstore.verifier.ranges_verified` | Counts the number of log ranges for which a verification report has been completed. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | log ranges verifications | counter |
Suggested change:
- | `consul.raft.logstore.verifier.ranges_verified` | Counts the number of log ranges for which a verification report has been completed. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | log ranges verifications | counter |
+ | `consul.raft.logstore.verifier.ranges_verified` | Counts the number of log ranges for which a verification report has been completed. See [monitoring experimental backends](/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | log ranges verifications | counter |
    | `consul.raft.logstore.verifier.checkpoints_written` | Counts the number of checkpoint entries written to the LogStore. | checkpoints | counter |
    | `consul.raft.logstore.verifier.dropped_reports` | Counts how many times the verifier routine was still busy when the next checksum came in and so verification for a range was skipped. If you see this happen, consider increasing the interval between checkpoints with [`raft_logstore.verification.interval`](http://localhost:3000/consul/docs/agent/config/config-files#raft_logstore_verification) | reports dropped | counter |
    | `consul.raft.logstore.verifier.ranges_verified` | Counts the number of log ranges for which a verification report has been completed. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | log ranges verifications | counter |
    | `consul.raft.logstore.verifier.read_checksum_failures` | Counts the number of times a range of logs between two check points contained at least one disk corruption. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | disk corruptions | counter |
Suggested change:
- | `consul.raft.logstore.verifier.read_checksum_failures` | Counts the number of times a range of logs between two check points contained at least one disk corruption. See [monitoring experimental backends](http://localhost:3000/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | disk corruptions | counter |
+ | `consul.raft.logstore.verifier.read_checksum_failures` | Counts the number of times a range of logs between two check points contained at least one disk corruption. See [monitoring experimental backends](/consul/docs/upgrading/instructions/testing-experimental-raft-backend#monitoring) for more information. | disk corruptions | counter |
    **7. Monitor target server raft metrics and logs.**

    See the section below on [monitoring WAL tests](#monitoring-wal-tests).
The link does not work. Maybe (#monitoring) is the one you meant.
    Consul 1.15 introduced a new experimental storage backend. This guide explains
    how to configure it and test it out safely. The new backend is called `wal` in
    configuration.
Suggested change:
- Consul 1.15 introduced a new experimental storage backend. This guide explains
- how to configure it and test it out safely. The new backend is called `wal` in
- configuration.
+ Consul 1.15 introduced WAL, a new experimental storage backend. This guide explains
+ how to configure it and test it out safely. The new backend is called `wal` in
+ configuration.
    files from disk. With BoltDB uses as a log, sudden burst of writes say three
    times larger than the normal volume can suddenly cause the file to grow to about
    several times it's steady-state size. After the next snapshot is taken, and the
This, until the end of the paragraph, might be a bit confusing to digest.
Suggested change:
- files from disk. With BoltDB uses as a log, sudden burst of writes say three
- times larger than the normal volume can suddenly cause the file to grow to about
- several times it's steady-state size. After the next snapshot is taken, and the
+ files from disk. When BoltDB is used as a log backend, sudden burst of writes at a rate 2-3x higher than the normal volume, can suddenly cause the log file to grow to several times it's steady-state size. After the next snapshot is taken, and the
    files from disk. With BoltDB uses as a log, sudden burst of writes say three
    times larger than the normal volume can suddenly cause the file to grow to about
    several times it's steady-state size. After the next snapshot is taken, and the
    oldest logs truncated again, the file is left as mostly empty space. Tracking
Regarding "the file is left as mostly empty space": shall we clarify that it keeps the same size?
    Even if this has never happened to a catastrophic degree in a cluster, the fact
    that it's a risk has meant that Consul has erred on the side of never letting
    too many logs accumulate in the LogStore. Significantly larger BoltDB files are
    somewhat slower in general because it's a tree and so still has log(N) work to
    do n every write. But out user's experience showed that the larger the file, the
    more likely it is to have a large freelist or suddenly form one after a burst of
    writes. For this reason, the default options for how frequently we make a full
    snapshot and truncate the logs, and for how many logs we keep around have always
    been aggressively set towards keeping BoltDB small rather than using disk IO the
    most efficiently.
Suggested change:
- Even if this has never happened to a catastrophic degree in a cluster, the fact
- that it's a risk has meant that Consul has erred on the side of never letting
- too many logs accumulate in the LogStore. Significantly larger BoltDB files are
- somewhat slower in general because it's a tree and so still has log(N) work to
- do n every write. But out user's experience showed that the larger the file, the
- more likely it is to have a large freelist or suddenly form one after a burst of
- writes. For this reason, the default options for how frequently we make a full
- snapshot and truncate the logs, and for how many logs we keep around have always
- been aggressively set towards keeping BoltDB small rather than using disk IO the
- most efficiently.
+ Even if this has never happened to a catastrophic degree in a cluster, Consul was tuned to avoid for too many logs to accumulate in the LogStore, in order to reduce/mitigate the risk. Significantly larger BoltDB files are
+ somewhat slower in general because it's a tree and so still has log(N) work to
+ do n every write. But out user's experience showed that the larger the file, the
+ more likely it is to have a large freelist or suddenly form one after a burst of
+ writes. For this reason, the default options for how frequently we make a full
+ snapshot and truncate the logs, and for how many logs we keep around have always
+ been aggressively set towards keeping BoltDB small rather than using disk IO the
+ most efficiently.
Also:
"Significantly larger BoltDB files are somewhat slower in general because it's a tree and so still has log(N) work to do n every write. But out user's experience showed that the larger the file, the more likely it is to have a large freelist or suddenly form one after a burst of writes."
Do we mean that both (the fact it is a tree and the higher probability of a freelist) are a reason for the bad performance? Or that despite the good performance offered by the tree, freelists drag it down?
Also, in "because it's a tree and so still has log(N) work to do n every write": did we mean "on every write"?
Also, I am assuming that by "freelist" we refer either to chunks of free space in the file or to the data structures we use to track the free space in the file.
It is not mentioned anywhere else in the file so it might be worth a clarification.
    Other reliability issues such as [followers being unable to catch
    up](/consul/docs/agent/telemetry#raft-replication-capacity-issues) also stem
    from this need to carefully balance the size of the BoltDB log store against how
    long snapshots take to restore - there is a simple solution to that issue if
We do not state the solution, and the jump to WAL is not fully obvious. We could state the solution and then clarify the technical issues to be solved before being able to use the approach (letting the file grow), something like:
"A simple solution for that issue is ... but could not get implemented unless/until letting logs grow ..."
And then specify that WAL follows this approach.
    While not every user will experience a huge difference in performance, the WAL
    backend avoids these performance concerns entirely. It is more performant when
    directly measured due to solving a simpler storage problem than BoltDB was
    designed for. For example it can commit a single log entry with on fsync instead
Suggested change:
- designed for. For example it can commit a single log entry with on fsync instead
+ designed for. For example it can commit a single log entry with one fsync instead

Shall we use monospace for `fsync`?
    and so reducing disk IO with slower snapshots or keeping logs around to catch up
    slower followers are all possible.
Suggested change:
- and so reducing disk IO with slower snapshots or keeping logs around to catch up
- slower followers are all possible.
+ and so strategies for reducing disk IO with slower snapshots or for keeping logs around to permit slower followers to catch up with cluster state, are all possible.
    - All servers in the Datacenter should be upgraded to Consul 1.15 using the
    [standard upgrade procedure](/consul/docs/upgrading/general-process) and
    the [1.15 upgrade notes](/consul/docs/upgrading/upgrade-specific#consul-1-15-x).
    - You need a Consul cluster with at least 3 nodes to safely test the new
Is this enough? We mention below that enabling WAL on a voting server might itself generate an issue:

    If you enable a server to use WAL using Consul OSS or on a voting server with
    Consul Enterprise, it's possible that the WAL could cause corruption of
    that server's state (with the caveats above) and then become the leader and
    replicate that corruption to all other servers.

Or is this more related to the fact that, even if it all goes well, users with one server will have downtime? To be safe, the real requirement is to be able to use non-voting servers, i.e. to use Consul Enterprise.
    As noted in [Risks](#risks), Consul Enterprise users should select a non-voting
    server at first. For Consul OSS users, or Enterprise users who don't have
    non-voting servers, select one of the follower servers.
Shall we hint at the fact that ENT users could test it by adding a new non-voting server in case they don't have one deployed yet? This would reduce their risk.
    * `consul.raft.commitTime` measures the time to commit new writes on a quorum of
    servers. It should be the same or lower after deploying WAL. Even if WAL is
In telemetry.mdx we use a different definition:
Measures the time it takes to commit a new entry to the Raft log on the leader.
It might be better to reuse the same definition.
    * `consul.raft.rpc.appendEntries.storeLogs` measures the time spent persisting
    logs to disk on each _follower_. It should be the same or lower for
    WAL-enabled followers.
In telemetry.mdx we use a different definition:
Measures the time taken to add any outstanding logs for an agent, since the last appendEntries was invoked
I feel like the extra specificity here is useful in terms of what it means for WAL. The one in telemetry is arguably misleading even: it's only emitted by a follower, it's not "any outstanding logs" but just whatever the next batch the leader chose to send is (and there may be multiple in flight at once), and "since the last appendEntries..." is just wrong; it's the time from receiving an append entries RPC to responding to it. If the previous one was 30s ago it won't suddenly jump up to 30s... Maybe we should just fix the telemetry.mdx definition?
    * `consul.raft.replication.appendEntries.rpc` measures the time taken for each
    `AppendEntries` RPC from the leader's perspective. If this is significantly
In telemetry.mdx we use a different definition:
Measures the time taken by the append entries RFC, to replicate the log entries of a leader agent onto its follower agent(s)
    * `consul.raft.leader.dispatchLog` measures the time spent persisting logs to
    disk on the _leader_. It is only relevant if a WAL-enabled server becomes a
In telemetry.mdx we use a different definition:
Measures the time it takes for the leader to write log entries to disk.
    for by matching the `peer_id` label value to the server IDs listed by `consul
    operator raft list-peers`.

    * `consul.raft.compactLogs` measures the time take to truncate the logs after a
I could not find this in telemetry.mdx.
Should it be there too?
I've spent a few hours on this and still haven't made it all the way through. I have a few big things. First, there is too much information on the usage page; and maybe too much information in general. I wonder if you can offload a lot of this background information about our decision to create a new backend to a blog or white paper or some other channel that isn't core documentation. The second point is related to the first in that there are two content types in one page -- the conceptual information and the usage information. If we want to keep the conceptual information (first 100 lines) in the docs, we should put that into an overview page. To that end, placing this type of usage information in the Upgrade documentation seems like an odd choice. As far as I can tell, the configuration is on the Consul server agent, so I would probably put this with the Agent topics.
    - `raft_logstore` ((#raft_logstore)) This is a nested object that allows
    configuring options for Raft's LogStore component which is used to persist
    logs and crucial Raft state on disk during writes. This was added in Consul
    1.15.
Suggested change:
- - `raft_logstore` ((#raft_logstore)) This is a nested object that allows
-   configuring options for Raft's LogStore component which is used to persist
-   logs and crucial Raft state on disk during writes. This was added in Consul
-   1.15.
+ - `raft_logstore` ((#raft_logstore)) This is a nested object that allows
+   configuring options for Raft's LogStore component, which persists
+   logs and crucial Raft state on disk during writes.
I think we need to get out of the habit of including the minimum supported version number at this level. Doing so is appropriate for usage page where we have a dedicated Requirements section, but for reference and concept pages we should rely on the doc versioning mechanism. I think the deprecation notices are OK, though.
    - `backend` ((#raft_logstore_backend)) This allows selection of which storage
    engine to use to persist logs. Valid options are `boltdb` or `wal`. Default
    is `boltdb`. As of Consul 1.15, `wal` is a new and experimental backend that
    should be used with caution. See [our experimental WAL backend testing
    guide](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
    to learn how to safely evaluate it for your workload.
Suggested change:
- - `backend` ((#raft_logstore_backend)) This allows selection of which storage
-   engine to use to persist logs. Valid options are `boltdb` or `wal`. Default
-   is `boltdb`. As of Consul 1.15, `wal` is a new and experimental backend that
-   should be used with caution. See [our experimental WAL backend testing
-   guide](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
-   to learn how to safely evaluate it for your workload.
+ - `backend` ((#raft_logstore_backend)) Specifies which storage
+   engine to use to persist logs. Valid options are `boltdb` or `wal`. Default
+   is `boltdb`. The `wal` option specifies an experimental backend that
+   should be used with caution. Refer to [Testing the Experimental WAL LogStore Backend](/consul/docs/upgrading/instructions/testing-experimental-raft-backend) to learn how to safely evaluate it for your workload.
    - `disable_log_cache` ((#raft_logstore_disable_log_cache)) This allows
    disabling of the in-memory cache of recent logs. This exists mostly for
    performance testing purposes. In theory the log cache prevents disk reads
    for recent logs. In practice recent logs are still in OS page cache so tend
    not to be slow to read using either backend. We recommend leaving it enabled
    for now as we've not measured a significant improvement in any metric by
    disabling.
Suggested change:
- - `disable_log_cache` ((#raft_logstore_disable_log_cache)) This allows
-   disabling of the in-memory cache of recent logs. This exists mostly for
-   performance testing purposes. In theory the log cache prevents disk reads
-   for recent logs. In practice recent logs are still in OS page cache so tend
-   not to be slow to read using either backend. We recommend leaving it enabled
-   for now as we've not measured a significant improvement in any metric by
-   disabling.
+ - `disable_log_cache` ((#raft_logstore_disable_log_cache)) Boolean that disables in-memory caching for recent logs. Default is `true`. You can set this option to `false` to test backend performance.
+   The log cache prevents disk reads
+   for recent logs, but recent logs are still cached in the OS page. As a result, either backend reads the logs quickly. We recommend using the default setting.
I'm not sure how valuable the explanation of this setting is. If I understand it correctly, we're saying that it sort of doesn't work, anyway, because of the OS page cache. I would consider just deleting the part in the middle.
Honestly I was considering not documenting this at all but I feel equally icky about having undocumented settings. It's really there just so we can actually evaluate whether it even makes a difference or not. So far the answer seems "probably not" but it's less risky to leave it alone than change the behaviour since it doesn't help to remove it either.
Do you think this would be better:
- Undocumented (but still present for when we want to do perf testing)
- Documented as "don't use this it's just for internal testing"
- Documented as you proposed above, less verbose but with a general "you probably don't need to touch this ever" vibe?
    - `verification` ((#raft_logstore_verification)) This is a nested object that
    allows configuring online verification of the LogStore. Verification
    provides additional assurances that LogStore backends are correctly storing
    data. It imposes very low overhead on servers and is safe to run in
    production, however it's mostly useful when evaluating a new backend
    implementation.

    Verification must be enabled on the leader to have any effect and can be
    used with any backend. When enabled, the leader will periodically write a
    special "checkpoint" log message including checksums of all log entries
    written to Raft since the last checkpoint. Followers that have verification
    enabled will run a background task for each checkpoint that reads all logs
    directly from the LogStore and recomputes the checksum. A report is output
    as an INFO level log for each checkpoint.

    Checksum failure should never happen and indicate unrecoverable corruption
    on that server. The only correct response is to stop the server, remove its
    data directory, and restart so it can be caught back up with a correct
    server again. Please report verification failures including details about
    your hardware and workload via GitHub issues. See [our experimental WAL
    backend testing
    guide](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
    for more details on using these to evaluate a new backend.
Suggested change:
- - `verification` ((#raft_logstore_verification)) This is a nested object that
-   allows configuring online verification of the LogStore. Verification
-   provides additional assurances that LogStore backends are correctly storing
-   data. It imposes very low overhead on servers and is safe to run in
-   production, however it's mostly useful when evaluating a new backend
-   implementation.
-   Verification must be enabled on the leader to have any effect and can be
-   used with any backend. When enabled, the leader will periodically write a
-   special "checkpoint" log message including checksums of all log entries
-   written to Raft since the last checkpoint. Followers that have verification
-   enabled will run a background task for each checkpoint that reads all logs
-   directly from the LogStore and recomputes the checksum. A report is output
-   as an INFO level log for each checkpoint.
-   Checksum failure should never happen and indicate unrecoverable corruption
-   on that server. The only correct response is to stop the server, remove its
-   data directory, and restart so it can be caught back up with a correct
-   server again. Please report verification failures including details about
-   your hardware and workload via GitHub issues. See [our experimental WAL
-   backend testing
-   guide](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
-   for more details on using these to evaluate a new backend.
+ - `verification` ((#raft_logstore_verification)) Object that configures online verification of the LogStore.
+   Verification
+   provides additional assurances that LogStore backends are correctly storing
+   data. It imposes very low overhead on servers and is safe to run in
+   production. Verification is most useful for evaluating a new backend
+   implementation.
+   Verification must be enabled on the leader and can be
+   used with any backend. When enabled, the leader periodically writes a
+   special `checkpoint` log message that includes checksums of all log entries
+   written to Raft since the last checkpoint. Followers that have verification
+   enabled run a background task for each checkpoint that reads all logs
+   directly from the LogStore and recomputes the checksum. Consul outputs a report
+   as an `INFO` level log for each checkpoint.
+   Checksums should always pass, but if a failure occurs, it indicates unrecoverable corruption
+   on that Consul server. To resolve failure, stop the server, remove its
+   data directory, and restart so it can be caught back up with a correct
+   server again. Report verification failures including details about
+   your hardware and workload via GitHub issues. Refer to [Testing the Experimental WAL LogStore Backend](/consul/docs/upgrading/instructions/testing-experimental-raft-backend)
+   for more details on using these to evaluate a new backend.
There is a significant amount of usage information documented in this reference. While we expect there to be some overlap, we need to make sure that each content type serves its main purpose. I haven't gotten to the "Testing the Experimental WAL LogStore Backend" topic yet, but if the usage information is in both places, I think we should remove it from this section and just refer people to the usage page.
Yeah, I felt that. I found it hard to meaningfully describe what the config does without a baseline description of what "verification" is, and the other doc is more focussed on "This is how to test X" rather than "this is a thing called verification". If you think this content belongs somewhere else I'm happy with that choice too. One thing that we didn't have historically was a really good place/separation between concepts and background on subsystems and the actual reference for how to configure them. If we have a better place for the concepts now then I totally agree this should move there and just link to it.
    - `enabled` ((#raft_logstore_verification_enabled)) - Setting this to `true`
    will enable log verification checkpoints to be written (if leader) and
    verified on this server.
Suggested change:
- - `enabled` ((#raft_logstore_verification_enabled)) - Setting this to `true`
-   will enable log verification checkpoints to be written (if leader) and
-   verified on this server.
+ - `enabled` ((#raft_logstore_verification_enabled)) - Set to `true` to
+   allow this Consul server to write and verify log verification checkpoints when it is elected leader.
Per the style guide, use present tense.
    | `consul.raft.wal.stable_gets` | Counts how many calls to StableStore.Get or GetUint64. | calls | counter |
    | `consul.raft.wal.stable_sets` | Counts how many calls to StableStore.Set or SetUint64. | calls | counter |
    | `consul.raft.wal.tail_truncations` | Counts how many log entries have been truncated from the head - i.e. the newest entries. by graphing the rate of change over time you can see individual truncate calls as spikes. | logs entries truncated | counter |
Suggested change:
- | `consul.raft.wal.stable_gets` | Counts how many calls to StableStore.Get or GetUint64. | calls | counter |
- | `consul.raft.wal.stable_sets` | Counts how many calls to StableStore.Set or SetUint64. | calls | counter |
- | `consul.raft.wal.tail_truncations` | Counts how many log entries have been truncated from the head - i.e. the newest entries. by graphing the rate of change over time you can see individual truncate calls as spikes. | logs entries truncated | counter |
+ | `consul.raft.wal.stable_gets` | Counts how many calls to `StableStore.Get` or `GetUint64`. | calls | counter |
+ | `consul.raft.wal.stable_sets` | Counts how many calls to `StableStore.Set` or `SetUint64`. | calls | counter |
+ | `consul.raft.wal.tail_truncations` | Counts how many log entries have been truncated from the head. The newest entries are truncated first. You can graph the rate of change over time to see individual truncate calls as spikes. | logs entries truncated | counter |

Not sure if the words I put into backticks should be rendered as code, so feel free to ignore those suggestions.
heh, this (in my version too) is actually wrong and opposite!
    Consul 1.15 introduced a new experimental storage backend option. Learn how to
    configure and test it out in a safe way.
    ---

    # Testing the Experimental WAL LogStore Backend

    ## Introduction

    Consul 1.15 introduced a new experimental storage backend. This guide explains
    how to configure it and test it out safely. The new backend is called `wal` in
    configuration.
Suggested change:
- Consul 1.15 introduced a new experimental storage backend option. Learn how to
- configure and test it out in a safe way.
- ---
- # Testing the Experimental WAL LogStore Backend
- ## Introduction
- Consul 1.15 introduced a new experimental storage backend. This guide explains
- how to configure it and test it out safely. The new backend is called `wal` in
- configuration.
+ Learn how to safely configure and test the experimental WAL backend in your Consul deployment.
+ ---
+ # Enable the experimental WAL LogStore backend
+ This topic describes how to safely configure and test the WAL backend in your Consul deployment.
+ ## Introduction
+ Consul ships with an experimental storage backend called write-ahead log (WAL).
This is a usage page, so we should use the dedicated Requirements section to describe Consul version that includes WAL. Also, per recent changes to the style guide, article titles now use sentence style case.
    WAL is a acronym for "Write-Ahead Log". We called it this because it implements
    a traditional log with rotating, append-only log files. The current `LogStore`
    uses BoltDB which is a copy-on-write BTree which is less optimized for
    append-only workloads.

    ~> The `wal` backend is considered **experimental** in Consul 1.15. Please test
    it safely in pre-production first and verify on a subset of servers in
    production using this guide before fully enabling it on all servers.
Suggested change:
- WAL is a acronym for "Write-Ahead Log". We called it this because it implements
- a traditional log with rotating, append-only log files. The current `LogStore`
- uses BoltDB which is a copy-on-write BTree which is less optimized for
- append-only workloads.
- ~> The `wal` backend is considered **experimental** in Consul 1.15. Please test
- it safely in pre-production first and verify on a subset of servers in
- production using this guide before fully enabling it on all servers.
+ WAL implements a traditional log with rotating, append-only log files. The current `LogStore`
+ uses BoltDB, which is a copy-on-write BTree, which is less optimized for append-only workloads.
+ !> **Upgrade warning:** The WAL backend is experimental. You should test and verify the WAL backend on a subset of servers in pre-production before fully enabling it on all servers.
This is one of the designated uses for callouts per the style guide.
    it safely in pre-production first and verify on a subset of servers in
    production using this guide before fully enabling it on all servers.

    ## Why build a new backend?
Suggested change:
- ## Why build a new backend?
+ ### WAL versus BoltDB

I think this is still part of the introduction. In fact, most of the information in this section, I would argue, isn't documentation and would be better suited to a blog article.
I actually had `// TODO should this be in a blog?` right up until I posted the PR on this section. My main reasoning not to was just that I think this needs to live somewhere, and I don't think a blog will be ready on release day, so I wanted to land it somewhere at least initially so the content is available.
I'll go back and look at timing to see if we can make a blog happen in time.
    The WAL backend has been written to resolve some long-standing issues with the
    current BoltDB backend. The existing BoltDB log store has worked reliably for
    most users for years, however it is not the most efficient way to store
    append-only logs to disk since it was designed as a full key-value database. It
    was an expedient option when our raft library was first written and always
    assumed we'd replace it with something more purpose-built.

    Importantly, a BoltDB database is a single file that only ever grows. Deleting
    the oldest logs which we do regularly when we've made a new snapshots of the
    state, leaves free space in the file that needs to be tracked to be re-used on
    future writes. By contrast a simple segmented log can just delete the oldest log
    files from disk. With BoltDB uses as a log, sudden burst of writes say three
    times larger than the normal volume can suddenly cause the file to grow to about
    several times it's steady-state size. After the next snapshot is taken, and the
    oldest logs truncated again, the file is left as mostly empty space. Tracking
    this free space requires writing extra metadata proportional to the amount of
    free pages to disk with every write and so after such a burst, write latencies
    tend to increase - in some cases dramatically causing serious performance
    degradation to the cluster.

    Even if this has never happened to a catastrophic degree in a cluster, the fact
    that it's a risk has meant that Consul has erred on the side of never letting
    too many logs accumulate in the LogStore. Significantly larger BoltDB files are
    somewhat slower in general because it's a tree and so still has log(N) work to
    do n every write. But out user's experience showed that the larger the file, the
    more likely it is to have a large freelist or suddenly form one after a burst of
    writes. For this reason, the default options for how frequently we make a full
    snapshot and truncate the logs, and for how many logs we keep around have always
    been aggressively set towards keeping BoltDB small rather than using disk IO the
    most efficiently.
Suggested change:
- The WAL backend has been written to resolve some long-standing issues with the
- current BoltDB backend. The existing BoltDB log store has worked reliably for
- most users for years, however it is not the most efficient way to store
- append-only logs to disk since it was designed as a full key-value database. It
- was an expedient option when our raft library was first written and always
- assumed we'd replace it with something more purpose-built.
- Importantly, a BoltDB database is a single file that only ever grows. Deleting
- the oldest logs which we do regularly when we've made a new snapshots of the
- state, leaves free space in the file that needs to be tracked to be re-used on
- future writes. By contrast a simple segmented log can just delete the oldest log
- files from disk. With BoltDB uses as a log, sudden burst of writes say three
- times larger than the normal volume can suddenly cause the file to grow to about
- several times it's steady-state size. After the next snapshot is taken, and the
- oldest logs truncated again, the file is left as mostly empty space. Tracking
- this free space requires writing extra metadata proportional to the amount of
- free pages to disk with every write and so after such a burst, write latencies
- tend to increase - in some cases dramatically causing serious performance
- degradation to the cluster.
- Even if this has never happened to a catastrophic degree in a cluster, the fact
- that it's a risk has meant that Consul has erred on the side of never letting
- too many logs accumulate in the LogStore. Significantly larger BoltDB files are
- somewhat slower in general because it's a tree and so still has log(N) work to
- do n every write. But out user's experience showed that the larger the file, the
- more likely it is to have a large freelist or suddenly form one after a burst of
- writes. For this reason, the default options for how frequently we make a full
- snapshot and truncate the logs, and for how many logs we keep around have always
- been aggressively set towards keeping BoltDB small rather than using disk IO the
- most efficiently.
+ The WAL backend stores append-only logs to disk more efficiently than BoltDB because it is a full key-value database. A BoltDB database is a single file that continuously grows and deleting
+ the oldest logs leaves free space in the file that must be tracked to be re-used on
+ future writes. WAL is a simple segmented log that can delete the oldest log
+ files from disk.
+ With BoltDB, a burst of writes can cause the file to grow exponentially. After taking the next snapshot and the oldest logs are truncated, the file is mostly empty space. Tracking
+ the free space requires Consul to write extra metadata proportional to the amount of
+ free pages to disk with every write. After a burst of writes, write latencies
+ tend to increase that cause serious performance degradation to the cluster.
+ Consul never lets too many logs accumulate in the LogStore. Large BoltDB files are
+ slow as a result of its tree structure. It requires additional work for every write. But large files tend to either have a large freelist or suddenly form one after a burst of
+ writes. For this reason, the default options for how frequently we make a full
+ snapshot and truncate the logs, and for how many logs we keep around have always
+ been aggressively set toward keeping BoltDB small rather than using disk IO the
+ most efficiently.
I think this is too much detail for a usage doc. Can't we just say that the WAL backend is more efficient for writing logs than BoltDB because it has a truncating mechanism more suitable to rolling logs than BoltDB? We can move the details to either a blog about the change or to the release notes.
👍 Nothing blocking. Had a minor suggestion/question about whether we want to emit metrics for failed verifications in addition to logs.
@@ -62,12 +62,12 @@ func makeLogVerifyReportFn(logger hclog.Logger) verifier.ReportFn {

    if r.WrittenSum > 0 && r.WrittenSum != r.ExpectedSum {
        // The failure occurred before the follower wrote to the log so it
        // must be corrupted in flight from the leader!
        l2.Info("verification checksum FAILED: in-flight corruption",
Not for now... but would it be helpful to have metrics here for failures? Or is discovering these via aggregated logs enough?
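Purely to illustrate the question (the reply below notes the verifier already emits failure metrics itself), a hypothetical counter next to the log line could look something like this with go-metrics; the metric key here is invented and `r`/`l2` come from the surrounding report function:

```go
// Hypothetical sketch only, not the PR's code: r and l2 are from the
// surrounding makeLogVerifyReportFn closure, and this metric key is invented.
if r.WrittenSum > 0 && r.WrittenSum != r.ExpectedSum {
	// The failure occurred before the follower wrote to the log so it
	// must be corrupted in flight from the leader!
	metrics.IncrCounter([]string{"raft", "logstore", "verifier", "in_flight_corruptions"}, 1)
	l2.Info("verification checksum FAILED: in-flight corruption")
}
```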
We do already - they are emitted by the verifier directly - see the telemetry.mdx additions.
I think the documentation for a feature like this should include as much technical information about the conceptual design as possible. I'd have to respectfully disagree with @trujillo-adam about putting this in a white paper or blog, since this is a core feature that underpins the overall Consul system.
The functional parts of this PR were merged in #16388. The documentation was refactored by @im2nguyen and @trujillo-adam in the new PR #16387, so I'm going to close this one and continue with their suggestions there. Thanks everyone for the reviews, which have helped shape one or other of those PRs!
This PR includes the public documentation for the experimental WAL backend merged in #16176.
It also includes a couple of minor non-functional fixes discovered:
Feedback Requested
PR Checklist
- [ ] updated test coverage