stability: allow monitoring whether cluster is (at least partially) unavailable #19644
@tschottdorf wouldn't this be nice? Our live node count metric is flaky, and on top of that, it doesn't really represent cluster availability: it makes people think they might have some unavailable data when they probably don't if they only lost one node. I didn't realize ranges_unavailable would not capture ranges that had lost all three replicas. Interesting... On a more important note, though, I noticed that you marked this with a 1.2 milestone. Were you planning on taking this on?
@dianasaur323 this came up during discussion with one of our customers (same as #8473), which is why I tentatively marked it for 1.2. I have no immediate plans to work on it, though. Note that in a distributed system, authoritative answers are expensive. Perhaps the sweet spot is marking a cluster as "endangered" whenever even a single node is down (or, if we try to get fancier, look at all the zone configs and constraints and try to compute whether, with the available nodes, we can possibly satisfy quorums everywhere -- but this check is too optimistic and needs to be pessimized by juxtaposing it with …).
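For illustration, a minimal sketch of the "endangered" heuristic as a SQL query, assuming crdb_internal.gossip_nodes exposes an is_live column (treat the table and column names as assumptions; they vary across versions):

-- Hedged sketch: flag the cluster as "endangered" whenever any known node is not live.
-- This is the optimistic single-node-down check, not an authoritative availability answer.
SELECT count(*) AS dead_nodes,
       count(*) > 0 AS endangered
FROM crdb_internal.gossip_nodes
WHERE NOT is_live;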
@tschottdorf are we still going to get to this by 2.0? Perhaps it would make sense to move it out? I have it as part of our work to make CockroachDB easier to debug when a majority of nodes is lost, but I don't think it's a requirement for that work.
I don't think much will happen here in the 2.0 timeframe, and it's justifiable to move it out.
Perfect, moving out.
We've recently had customers run into this problem. A customer decided to shut down nodes via 'cockroach node decommission ' and wait for the replica count to go down to zero (or get stuck at one) before terminating the host. A range was reported as under-replicated at one point, but the old nodes were shut down anyway and that error went away. We need to make sure these errors are not lost!
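As a minimal sketch of how such under-replication could be surfaced before hosts are terminated (assuming crdb_internal.ranges_no_leases exposes a replicas array; names vary by version):

-- Hedged sketch: ranges currently carrying fewer than 3 replicas
-- (assumes the default 3x replication factor).
SELECT range_id, array_length(replicas, 1) AS replica_count
FROM crdb_internal.ranges_no_leases
WHERE array_length(replicas, 1) < 3;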
So I wanted to add more colour here, since this is important for customers who have to deal with data sovereignty. Let's take this basic setup: a 9-node cluster with three regions:

SELECT node_id, locality FROM crdb_internal.gossip_nodes;
node_id | locality
+---------+-----------------+
1 | {"region": "a"}
2 | {"region": "a"}
3 | {"region": "a"}
4 | {"region": "b"}
5 | {"region": "b"}
6 | {"region": "b"}
7 | {"region": "c"}
8 | {"region": "c"}
9 | {"region": "c"}
CREATE DATABASE testdb;

CREATE TABLE IF NOT EXISTS testdb.test (
    pk_region STRING(20),
    pk_thread STRING(20),
    column1 STRING(20),
    CONSTRAINT "primary" PRIMARY KEY (pk_region ASC, pk_thread ASC)
)
PARTITION BY LIST (pk_region) (
    PARTITION pa VALUES IN ('a'),
    PARTITION pb VALUES IN ('b'),
    PARTITION pc VALUES IN ('c'),
    PARTITION default VALUES IN (DEFAULT)
);
ALTER PARTITION pa OF TABLE testdb.test CONFIGURE ZONE USING constraints = '["+region=a"]';
ALTER PARTITION pb OF TABLE testdb.test CONFIGURE ZONE USING constraints = '["+region=b"]';
ALTER PARTITION pc OF TABLE testdb.test CONFIGURE ZONE USING constraints = '["+region=c"]';

With this setup, even without any data, you get this:

SHOW experimental_ranges FROM TABLE testdb.test;
start_key | end_key | range_id | replicas | lease_holder
+----------------+----------------+----------+----------+--------------+
NULL | /"a" | 21 | {1,6,8} | 6
/"a" | /"a"/PrefixEnd | 22 | {1,2,3} | 1
/"a"/PrefixEnd | /"b" | 23 | {1,6,8} | 1
/"b" | /"b"/PrefixEnd | 24 | {4,5,6} | 5
/"b"/PrefixEnd | /"c" | 25 | {3,4,7} | 7
/"c" | /"c"/PrefixEnd | 51 | {7,8,9} | 9
/"c"/PrefixEnd | NULL | 52 | {3,4,7} | 7 Now I'll kill nodes 7, 8 and 9. So all of And after 5 mins, the cluster settles into: Problem ranges returns that nothing is wrong and the range report for 51 is: Also, show experimental_ranges never finishes. Is there an issue for that? It should be able to return and show the unavailability. Of course, bring the nodes back online and voila. I also have no idea why the number of ranges increased then decreased. That's another issue that I bet is part of this as well. cc @drewdeally |
To monitor cluster availability, the current options are the live node count and the ranges_unavailable metric, but neither is authoritative. For example, the former may be too optimistic: three nodes with one node down can still hurt ranges that are for any reason only <3x replicated at the time of failure. Similarly, the ranges_unavailable metric won't report anything on a range that is completely unavailable (i.e. not a single replica is alive to complain).

We should expose a metric that comes as close as possible to reporting the "true" availability of the cluster. Naively, this could be achieved by scanning the meta ranges and intersecting them with the known live nodes, but this would be too expensive.
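For concreteness, a sketch of that naive intersection, with crdb_internal.ranges_no_leases standing in for the meta-range scan and crdb_internal.gossip_nodes for liveness (both names are assumptions, and the scan itself is exactly the expense in question):

-- Hedged sketch of the naive check: ranges with no replica on a live node.
SELECT r.range_id
FROM crdb_internal.ranges_no_leases AS r
WHERE NOT EXISTS (
    SELECT 1
    FROM crdb_internal.gossip_nodes AS n
    WHERE n.is_live AND n.node_id = ANY (r.replicas)
);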