release-19.2: storage: make queue timeouts controllable, snapshot sending queues dynamic #44952
Conversation
The background storage queues each carry a timeout. This timeout seems like a good idea to unstick a potentially stuck queue. I'm not going to speculate as to why these queues might find themselves getting stuck but, let's just say, maybe it happens and maybe having a timeout is a good idea. Unfortunately, sometimes these timeouts can come back to bite us, especially when things are unexpectedly slow or data sizes are unexpectedly large. Failure to make progress before a timeout expires in queue processing is a common cause of long-term outages. In those cases it's good to have an escape hatch.

Another concern on the horizon is the desire to have larger range sizes. Today our default queue processing timeout is 1m, and the raft snapshot queue does not override it. The default snapshot rate is 8 MB/s.

```
(512 MB / 8 MB/s) = 64s > 1m
```

This unfortunate fact means that ranges larger than 512 MB can never successfully receive a snapshot from the raft snapshot queue. The next commit will build on this change by adding a cluster setting to control the timeout of the raft snapshot queue.

This commit changes the constant per-queue timeout to a function which can consult both cluster settings and the Replica that is about to be processed.

Release note: None.
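As a rough illustration of the shape of that change, here is a minimal Go sketch. The names (`replicaInfo`, `processTimeoutFunc`, `snapshotTimeout`) and the slack factor are assumptions for illustration, not the actual implementation in the storage package:

```go
package main

import (
	"fmt"
	"time"
)

// replicaInfo is a hypothetical stand-in for the Replica about to be processed.
type replicaInfo struct {
	sizeBytes int64 // approximate size of the snapshot to send
}

// processTimeoutFunc replaces the old constant per-queue timeout: it can
// consult settings (here, plain parameters) and the replica itself.
type processTimeoutFunc func(minTimeout time.Duration, rateBytesPerSec int64, r replicaInfo) time.Duration

// snapshotTimeout scales the timeout with the expected transfer time
// (size / rate), padded by a slack factor, but never drops below minTimeout.
func snapshotTimeout(minTimeout time.Duration, rateBytesPerSec int64, r replicaInfo) time.Duration {
	const slack = 10 // generous multiple of the ideal transfer time (illustrative)
	if rateBytesPerSec <= 0 {
		return minTimeout
	}
	transfer := time.Duration(r.sizeBytes) * time.Second / time.Duration(rateBytesPerSec)
	if t := slack * transfer; t > minTimeout {
		return t
	}
	return minTimeout
}

func main() {
	// A 512 MB range at 8 MB/s needs 64s to transfer, so a fixed 1m timeout
	// can never succeed; the dynamic timeout grows with the snapshot size.
	r := replicaInfo{sizeBytes: 512 << 20}
	fmt.Println(snapshotTimeout(time.Minute, 8<<20, r)) // 10m40s
}
```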
…nd size

This commit adds a hidden setting to control the minimum timeout of raftSnapshotQueue processing. It is an escape hatch to deal with snapshots for large ranges. At the default send rate of 8 MB/s, a range must stay smaller than roughly 480 MB to be successfully sent before the default 1m timeout. Traditionally, when this limit was hit, it was mitigated by increasing the send rate, which may not always be desirable.

In addition to the minimum timeout, the change also computes the timeout on a per-Replica basis from the current snapshot rate limit and the size of the snapshot being sent. This should prevent large ranges with slow send rates from timing out.

Maybe there should be a release note, but because the setting is hidden I opted not to add one.

Release note: None.
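To make the arithmetic concrete, here is a tiny Go sketch with a made-up `maxSnapshotBytes` helper computing the largest snapshot that fits within a given timeout at a given send rate:

```go
package main

import (
	"fmt"
	"time"
)

// maxSnapshotBytes returns the largest snapshot that can finish within the
// given timeout at the given send rate.
func maxSnapshotBytes(timeout time.Duration, rateBytesPerSec int64) int64 {
	return rateBytesPerSec * int64(timeout/time.Second)
}

func main() {
	// With the default 1m queue timeout and 8 MB/s send rate, anything much
	// beyond ~480 MB cannot be sent in time.
	fmt.Println(maxSnapshotBytes(time.Minute, 8<<20)>>20, "MB") // 480 MB
}
```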
Before this commit we would permit rate cluster settings to be set to non-positive values. This would have caused crashes had anybody dared to try. Release note: None
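A hedged sketch of the kind of validation this implies, using a hypothetical `validatePositive` helper rather than CockroachDB's actual settings API (the setting name shown is only illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// validatePositive rejects non-positive values for a byte-rate setting so a
// zero or negative rate can never reach code that divides by it.
func validatePositive(name string, v int64) error {
	if v <= 0 {
		return errors.New(name + " must be a positive value")
	}
	return nil
}

func main() {
	fmt.Println(validatePositive("kv.snapshot_rebalance.max_rate", 0))     // error
	fmt.Println(validatePositive("kv.snapshot_rebalance.max_rate", 8<<20)) // <nil>
}
```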
LGTM. However:
- On the one hand, I appreciate that you don't want to advertise these changes as a user-facing feature in the release notes.
- OTOH, the motivation for back-porting this is that there are real problems in deployments that this is intended to solve. At the very least you should produce a blurb of explanation for the support engineers: what the symptoms of the problem are and how to apply your solution in practice.
Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 7 of 7 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained
Maybe put this explanation in the PR description? Both this one and the original one that got merged. I know that TSEs go and look at PRs when they want to understand a change.
Reviewable status: complete! 0 of 0 LGTMs obtained
@rmloveland what's the best way for me to document this for backport? Help on how to word the release note would additionally be appreciated.
If there is "a lot" of data in Sys{Bytes,Count}, then we are likely experiencing a large abort span. The abort span is not supposed to become that large, but it does happen and causes stability fallout, usually due to a combination of shortcomings: 1. there's no trigger for GC based on abort span size alone (before this commit) 2. transaction aborts tended to create unnecessary abort span entries, fixed (and 19.2-backported) in: cockroachdb#42765 3. aborting transactions in a busy loop: cockroachdb#38088 (and we suspect this also happens in user apps occasionally) 4. large snapshots would never complete due to the queue time limits (addressed in cockroachdb#44952). In an ideal world, we would factor in the abort span into this method directly, but until then the condition guarding this block will do. At worst, there is some other reason for SysBytes become that large while also incurring a large SysCount, but I'm not sure how this would happen. The only other things in this span are the versioned range descriptors (which follow regular GC and it's only ever adding a SysCount of one), or transaction records (which expire like abort span records). Release note (bug fix): Range garbage collection will now trigger based on a large abort span, adding defense-in-depth against ranges growing large (and eventually unstable).
ajwerner <notifications@github.com> writes:
@rmloveland what's the best way for me to document this for backport?
By "document this for backport" do you mean "get this added to the 19.2 docs"?
If so please create an issue and assign it to me at
https://github.com/cockroachdb/docs/issues/new
(Generally, an issue should be automatically created for me by a script that will run once the PR is merged - however, you need to add release note comments to the commit messages for that script to pick them up. Also, I'm not 100% confident in the freshness of the data provided by the GitHub API - I don't think it always picks up things that have been edited / changed.)
Help on how to word the release note would additionally be appreciated.
I took a shot at it based on scanning the commit messages. I don't know if two release note messages are needed, but I kind of read them as "separate but related" changes. However, that may not be true. You will probably need to edit these for technical accuracy:
- Added a cluster setting `XXX.XXX` to control the timeout for several storage system queues: the Raft snapshot queue, the replication queue, and the merge queue. Setting this to a higher value can be helpful in cases where a queue gets stuck due to e.g. very large ranges, very slow send rates, or a combination of both.
- Fixed a bug where large ranges with slow send rates would hit the timeout in several storage system queues by making the timeout dynamic based on the current rate limit and the size of the data being sent. This affects several storage system queues: the Raft snapshot queue, the replication queue, and the merge queue.
…namic

In cockroachdb#42686 we made the raft snapshot queue timeout dynamic and based on the size of the snapshot being sent. We also added an escape hatch to control the timeout of processing of that queue. This change generalizes that cluster setting to apply to all of the queues. It so happens that the replicate queue and the merge queue also sometimes need to send snapshots; this PR gives them similar treatment to the raft snapshot queue.

The previous cluster setting was never released and is reserved, so it does not need a release note.

Release note (bug fix): Fixed a bug where large ranges with slow send rates would hit the timeout in several storage system queues by making the timeout dynamic based on the current rate limit and the size of the data being sent. This affects several storage system queues: the Raft snapshot queue, the replication queue, and the merge queue.
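A minimal sketch of the generalization, with hypothetical names (`queue`, `sharedMinTimeout`); each queue keeps its own size estimate but consults the same escape-hatch setting:

```go
package main

import (
	"fmt"
	"time"
)

// sharedMinTimeout stands in for the single (reserved) setting that now
// applies to every snapshot-sending queue, not just the raft snapshot queue.
var sharedMinTimeout = time.Minute

// queue is a toy stand-in for the raft snapshot, replicate, and merge queues.
type queue struct {
	name            string
	rateBytesPerSec int64
	expectedBytes   func() int64 // per-queue estimate of how much data processing will send
}

// processTimeout applies the same dynamic rule to every queue: a generous
// multiple of the expected transfer time, floored at the shared setting.
func (q queue) processTimeout() time.Duration {
	const slack = 10 // illustrative slack factor
	if q.rateBytesPerSec <= 0 {
		return sharedMinTimeout
	}
	transfer := time.Duration(q.expectedBytes()) * time.Second / time.Duration(q.rateBytesPerSec)
	if t := slack * transfer; t > sharedMinTimeout {
		return t
	}
	return sharedMinTimeout
}

func main() {
	for _, q := range []queue{
		{"raftsnapshot", 8 << 20, func() int64 { return 512 << 20 }},
		{"replicate", 8 << 20, func() int64 { return 64 << 20 }},
		{"merge", 8 << 20, func() int64 { return 1024 << 20 }}, // merges may copy both ranges
	} {
		fmt.Println(q.name, q.processTimeout())
	}
}
```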
I added the second release note to the last commit. The setting is not public, so for the moment I'd like to punt on documenting it. I suspect that having it not be public isn't great, but it's an escape hatch we shouldn't need to use. If we find ourselves using it, then we should document it.
Force-pushed from 519eba8 to a526d55.
45573: storage: trigger GC based on SysCount/SysBytes r=ajwerner a=tbg

If there is "a lot" of data in Sys{Bytes,Count}, then we are likely experiencing a large abort span. The abort span is not supposed to become that large, but it does happen and causes stability fallout, usually due to a combination of shortcomings:

1. there's no trigger for GC based on abort span size alone (before this commit)
2. transaction aborts tended to create unnecessary abort span entries, fixed (and 19.2-backported) in #42765
3. aborting transactions in a busy loop: #38088 (and we suspect this also happens in user apps occasionally)
4. large snapshots would never complete due to the queue time limits (addressed in #44952)

In an ideal world, we would factor the abort span into this method directly, but until then the condition guarding this block will do. At worst, there is some other reason for SysBytes to become that large while also incurring a large SysCount, but I'm not sure how this would happen. The only other things in this span are the versioned range descriptors (which follow regular GC, and each only ever adds a SysCount of one) and transaction records (which expire like abort span records).

Release note (bug fix): Range garbage collection will now trigger based on a large abort span, adding defense-in-depth against ranges growing large (and eventually unstable).

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
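A hedged Go sketch of the kind of guard described above, with made-up thresholds (`sysBytesThreshold`, `sysCountThreshold`) and a toy `maybeQueueForGC`; the actual heuristic lives in the GC queue's scoring code and is not reproduced here:

```go
package main

import "fmt"

// Hypothetical thresholds; the real values are chosen by the GC queue's
// scoring heuristics, not hard-coded like this.
const (
	sysBytesThreshold = 64 << 20 // "a lot" of system data, e.g. a bloated abort span
	sysCountThreshold = 100_000
)

// maybeQueueForGC returns true when the range-local system keyspace
// (SysBytes/SysCount) is large enough that it is probably dominated by
// abort span entries and worth collecting.
func maybeQueueForGC(sysBytes, sysCount int64) bool {
	return sysBytes > sysBytesThreshold && sysCount > sysCountThreshold
}

func main() {
	fmt.Println(maybeQueueForGC(1<<20, 500))         // false: small system span
	fmt.Println(maybeQueueForGC(200<<20, 2_000_000)) // true: likely a large abort span
}
```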
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release