
storage: consider option to disable quota pool #20817

Closed
nvanbenschoten opened this issue Dec 18, 2017 · 11 comments
Labels
A-kv-replication: Relating to Raft, consensus, and coordination. C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).

Comments

@nvanbenschoten (Member)

The quotaPool plays an important role in throttling forward progress on a Range when one or more followers consistently fall behind. The pool allows the Range to move at the speed of the fastest quorum while all replicas are up-to-date "enough", but forces the Range to move at the speed of the slowest follower once a follower falls far enough behind.

This is an important policy for keeping ranges healthy in cases where all replicas are expected to be kept up-to-date. However, it is a behavioral deviation from what some might expect from a quorum system, so it might come as a surprise. It also may not be desired for "asymmetrical" deployments where some nodes are expected to fall behind constantly.

We should consider adding an option to disable the quota pool for these kinds of deployments. We should also improve the documentation on this topic so that its effects don't catch users by surprise, as they have in the past.
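
To make the throttling mechanism described above concrete, here is a minimal sketch of a proposal quota pool in Go. It is an illustration only, not CockroachDB's actual quotaPool: the package, type, and method names are hypothetical, and the release condition is simplified to "the tracked followers have caught up".

```go
// Package quotasketch is a minimal, self-contained sketch of the throttling
// mechanism described above. All names are hypothetical.
package quotasketch

import (
	"context"
	"sync"
)

// QuotaPool caps the number of proposal bytes that may be in flight to the
// followers the range is still waiting on.
type QuotaPool struct {
	mu        sync.Mutex
	capacity  int64
	available int64
	waiters   []chan struct{}
}

// NewQuotaPool returns a pool with the given capacity in bytes.
func NewQuotaPool(capacity int64) *QuotaPool {
	return &QuotaPool{capacity: capacity, available: capacity}
}

// Acquire blocks the proposer until size bytes of quota are available. If a
// follower has fallen far behind, quota stops being returned and new proposals
// stall here: the range now moves at the speed of that follower.
func (q *QuotaPool) Acquire(ctx context.Context, size int64) error {
	for {
		q.mu.Lock()
		if q.available >= size {
			q.available -= size
			q.mu.Unlock()
			return nil
		}
		ch := make(chan struct{})
		q.waiters = append(q.waiters, ch)
		q.mu.Unlock()

		select {
		case <-ch: // quota was released; retry the acquisition
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// Release returns quota once the tracked followers have durably appended the
// corresponding entries, waking any blocked proposers.
func (q *QuotaPool) Release(size int64) {
	q.mu.Lock()
	q.available += size
	if q.available > q.capacity {
		q.available = q.capacity
	}
	waiters := q.waiters
	q.waiters = nil
	q.mu.Unlock()
	for _, ch := range waiters {
		close(ch)
	}
}
```

In this sketch a proposer calls Acquire before handing a command to Raft and the pool calls Release as followers catch up; once a follower stops making progress, quota stops flowing back, Acquire blocks, and the range is throttled to that follower's speed.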

cc. @petermattis @tschottdorf

@a-robinson (Contributor)

What specific situations has this been an issue in?

In high-latency deployments you'd expect a node to persistently lag behind by a bit, but not to keep falling further and further behind. So the problem there is more about making sure the quota pool is large enough.
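
To put the sizing point in rough numbers: the quota pinned down by a steadily lagging (but not diverging) follower is roughly write throughput times the replication round trip, analogous to a bandwidth-delay product. The figures below are illustrative only, not measurements from any deployment.

```go
package main

import "fmt"

func main() {
	// Illustrative numbers only: a range sustaining 10 MiB/s of proposals with a
	// 150 ms round trip to its farthest follower keeps roughly
	// throughput * RTT bytes of proposals in flight at any moment.
	const (
		throughputBytesPerSec = 10 << 20 // 10 MiB/s
		rttSeconds            = 0.150
		quotaBytes            = 1 << 20 // the 1 MiB default mentioned later in the thread
	)
	inFlight := float64(throughputBytesPerSec) * rttSeconds
	fmt.Printf("in-flight bytes ~= %.0f, quota pool = %d\n", inFlight, quotaBytes)
	// ~1.5 MiB in flight against a 1 MiB pool: writes stall even though the
	// follower is keeping pace, which argues for a larger (or adaptive) pool
	// rather than no pool at all.
}
```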

In clusters where one node is just legitimately slower than others, the user is already in for a bad time because we don't take replica slowness into account in replica/lease placement (#17572), so some of the ranges they're using will probably have their lease on the slow node.

@bdarnell (Contributor)

> It also may not be desired for "asymmetrical" deployments where some nodes are expected to fall behind constantly.

It is designed specifically to prevent such deployments, as they do not have the advertised degree of fault-tolerance (and often get into unproductive and expensive snapshot loops).

If the quota pool is a problem, we may want to treat that as an input to the rebalancer to find a faster node to replace the slow follower. But that may not be possible with locality constraints. If no faster node that satisfies the constraints can be found, then slowing down write traffic is the right thing to do.

The quota pool may need tuning with the benefit of real-world usage experience. Some sort of dynamic self-tuning analogous to TCP window sizes might be best.
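
As an illustration of the "analogous to TCP window sizes" idea, a hypothetical controller could tune the pool capacity with additive increase and multiplicative decrease: grow when proposers block only because the round trip is long, shrink when a follower is genuinely falling behind. Nothing below exists in CockroachDB; it is a sketch of the self-tuning idea only.

```go
package quotasketch

// AdaptiveQuota is a hypothetical additive-increase/multiplicative-decrease
// controller for the quota pool capacity, loosely analogous to a TCP
// congestion window.
type AdaptiveQuota struct {
	capacity int64 // current pool capacity in bytes
	min, max int64
}

// OnLatencyBoundStall is called when proposers blocked on the pool even though
// every follower is keeping pace (the stall is just the network round trip):
// probe for more headroom by growing the pool additively.
func (a *AdaptiveQuota) OnLatencyBoundStall() {
	const step = 256 << 10 // 256 KiB per adjustment, an arbitrary illustrative step
	if a.capacity+step <= a.max {
		a.capacity += step
	}
}

// OnFollowerLagging is called when a follower is genuinely falling behind:
// back off multiplicatively so the range quickly throttles toward the speed
// of its slowest follower.
func (a *AdaptiveQuota) OnFollowerLagging() {
	a.capacity /= 2
	if a.capacity < a.min {
		a.capacity = a.min
	}
}
```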

@nvanbenschoten (Member, Author)

> In high-latency deployments you'd expect a node to persistently lag behind by a bit, but not to keep falling further and further behind.

I'd expect the combination of a high-latency deployment and a high-throughput workload is what would stress the quota pool the most. In that scenario, ranges will quickly be forced to move at the speed of their slowest follower. This is what we've seen in privately reported issues.

As we continue to improve snapshot efficiency, giving snapshots even more leverage over standard entry application, I can envision legitimate deployments that always want to move at the quorum speed and are willing to accept periodic snapshots. This may be better served by non-voting replicas, but even they will need to make a similar decision.

@nvanbenschoten (Member, Author)

> It is designed specifically to prevent such deployments, as they do not have the advertised degree of fault-tolerance.

Do you mind expanding on this? When an up-to-date replica fails, we'll still be able to maintain availability once the straggler is caught up.

@bdarnell (Contributor)

Yes, but you won't be able to commit any new writes (or extend a lease) until the straggler has caught up, so you're a single failure from unavailability.

@nvanbenschoten (Member, Author)

But that's the case regardless of whether the straggler needs a snapshot or not. The quota pool may just help to reduce the maximum straggler catch-up latency.

@tbg (Member) commented Dec 19, 2017

Besides, losing a non-straggler means unavailability, not data loss. That may be worth it for some.

That said, I'm not clear on the tradeoffs here. Definitely should hold off until we understand it better.

@bdarnell (Contributor)

> I'd expect the combination of a high-latency deployment and a high-throughput workload is what would stress the quota pool the most. In that scenario, ranges will quickly be forced to move at the speed of their slowest follower. This is what we've seen in privately reported issues.

There are two different kinds of high-latency deployments: Those that have sufficient bandwidth to handle the desired throughput, and those that don't. For the former, the problem is simply that the quota pool is too small, and we should enlarge it (possibly making it adaptive to network conditions) rather than disable it.

When the bandwidth to the third replica is insufficient, we have a problem. The only way this replica can be useful is to allow it to fall far enough behind that occasional snapshots are cheaper than streaming the log as it happens. (And even then the usefulness is questionable: if one of the fast replicas dies, the surviving fast replica will probably need to stream a new snapshot to the slow replica before it can spin up a new fast replica. This feels backwards, although since two-DC deployments don't work very well either, I'm not sure what we can do about it besides "get better networking to the third DC".)

Supporting this kind of low-bandwidth replica is trickier than just allowing it to ignore the quota pool. Currently, every range will try to keep up as much as it can, saturating the bandwidth that we've already shown to be insufficient. We'd need to get better about prioritization, both to ensure fairness across ranges and to prioritize those ranges where the slow replica is blocking the rest of the range.
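
A hedged sketch of what that prioritization could look like over a constrained link (all names hypothetical, not an existing CockroachDB mechanism): catch-up traffic for ranges whose proposers are currently blocked on the quota pool goes first, then ranges ordered by how far behind their slow replica is.

```go
package quotasketch

import "sort"

// catchupWork describes the backlog a slow replica of one range still needs.
// It is a hypothetical structure for illustrating catch-up prioritization.
type catchupWork struct {
	rangeID          int64
	proposalsBlocked bool  // is the quota pool currently stalling writes on this range?
	bytesBehind      int64 // how far behind the slow replica is
}

// prioritizeCatchup orders catch-up work so that ranges whose writes are
// blocked by the slow replica are served first, and otherwise the most
// lagging replicas go first.
func prioritizeCatchup(work []catchupWork) {
	sort.Slice(work, func(i, j int) bool {
		if work[i].proposalsBlocked != work[j].proposalsBlocked {
			return work[i].proposalsBlocked
		}
		return work[i].bytesBehind > work[j].bytesBehind
	})
}
```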

> Besides, losing a non-straggler means unavailability, not data loss. That may be worth it for some.

Unavailability can eventually become data loss. We lose data if we have too many failures in less time than it takes to recover failed replicas. If we have to do too much work to catch up slow followers before we can start new ones, that could dramatically increase the recovery time.

Let's make an option to change the size of the quota pool instead of an option to disable it completely. I think we'll always find that some finite value here is best instead of turning it off.
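
A minimal sketch of a size knob rather than an on/off switch, using a hypothetical COCKROACH_PROPOSAL_QUOTA environment variable; the only tuning path discussed in this thread is the indirect COCKROACH_RAFT_LOG_MAX_SIZE route mentioned below.

```go
package quotasketch

import (
	"os"
	"strconv"
)

// defaultProposalQuota mirrors the 1 MiB default discussed later in the thread.
const defaultProposalQuota = 1 << 20

// proposalQuotaSize returns the quota pool capacity in bytes, allowing an
// operator to override the default via a hypothetical COCKROACH_PROPOSAL_QUOTA
// environment variable (a value in bytes). This illustrates "make it a knob,
// not a switch"; it is not an actual CockroachDB setting.
func proposalQuotaSize() int64 {
	if v := os.Getenv("COCKROACH_PROPOSAL_QUOTA"); v != "" {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil && n > 0 {
			return n
		}
	}
	return defaultProposalQuota
}
```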

@tbg (Member) commented Dec 19, 2017

Do you think we should even introduce that option? Might be premature as well.

@bdarnell (Contributor)

AIUI, the issue that motivated this discussion was primarily about the inability to send snapshots for large ranges and only secondarily involved the quota pool. So I don't think it's urgent, but the current 1MiB quota pool is arbitrary and probably not ideal, especially for high-latency networks (OTOH, you can already tune it indirectly with COCKROACH_RAFT_LOG_MAX_SIZE).

@petermattis added the A-kv-replication and C-enhancement labels on Jul 21, 2018
@tbg added this to the Later milestone on Jul 22, 2018
@petermattis removed this from the Later milestone on Oct 5, 2018
@bdarnell (Contributor)

Two years later, I think it's safe to say we're more comfortable with the quota pool and won't be adding an option to disable it. Changing its size might make more sense, but since we haven't had any actual demand for this I'm going to just close this issue. (And we have the environment variable COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD to indirectly change it).
