
stability: allow monitoring whether cluster is (at least partially) unavailable #19644

Closed
tbg opened this issue Oct 30, 2017 · 8 comments

Labels: A-kv-client (Relating to the KV client and the KV interface.), A-monitoring, C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception), no-issue-activity, S-3-wrong-metadata (Issues causing erroneous metadata or monitoring stats to be returned.), T-kv (KV Team), X-stale

Comments

@tbg
Member

tbg commented Oct 30, 2017

To monitor cluster availability, the current options are

  1. watching the live node count and
  2. watching the ranges_unavailable metric,

but neither is authoritative. For example, the former may be too optimistic: in a three-node cluster with one node down, ranges that were (for whatever reason) replicated less than 3x at the time of failure can still be hurt. Similarly, the ranges_unavailable metric won't report anything for a range that is completely unavailable (i.e. not a single replica is alive to complain).

We should expose a metric that comes as close as possible to reporting the "true" availability of the cluster. Naively, this could be achieved by scanning the meta ranges and intersecting the result with the known live nodes, but that would be too expensive.
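
For illustration, a rough approximation of that check can be sketched against today's crdb_internal tables (which postdate this comment; note also that the replicas column may hold store IDs, which coincide with node IDs only when each node runs a single store). A sketch, not an authoritative implementation:

SELECT r.range_id, r.replicas
FROM crdb_internal.ranges_no_leases AS r
WHERE (
    -- count replicas that sit on live nodes; a range has lost quorum
    -- when fewer than a majority of its replicas are live
    SELECT count(*)
    FROM crdb_internal.gossip_nodes AS n
    WHERE n.is_live AND n.node_id = ANY (r.replicas)
) < array_length(r.replicas, 1) // 2 + 1;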

@tbg tbg added this to the 1.2 milestone Oct 30, 2017
@dianasaur323
Contributor

@tschottdorf wouldn't this be nice? Our live node count metric is flaky, and on top of that it doesn't really represent cluster availability, since it makes people think they might have unavailable data when, having lost only one node, they probably don't. I didn't realize ranges_unavailable would not capture ranges that had lost all three replicas. Interesting...

On a more important note though, I noticed that you marked this as a 1.2 milestone. Were you planning on taking this on?

@tbg
Member Author

tbg commented Nov 7, 2017

@dianasaur323 this came up during discussion with one of our customers (same as #8473), which is why I tentatively marked it for 1.2. I have no immediate plans to work on it, though.

Note that, this being a distributed system, authoritative answers are expensive. Perhaps the sweet spot is marking a cluster as "endangered" whenever even a single node is down. If we wanted to get fancier, we could look at all the zone configs and constraints and compute whether the available nodes can possibly satisfy quorum everywhere; that check is too optimistic on its own, but pessimizing it by juxtaposing it with the unavailable_replicas metric might make it pretty good. On top of that, we could offer a more expensive (but authoritative) manually triggered check that walks the meta ranges and cross-checks them against node liveness info to determine whether any replicas have lost quorum.
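
A minimal sketch of the cheap signal, assuming the is_live column that crdb_internal.gossip_nodes exposes in current versions:

-- "endangered" as soon as any node in the cluster is not live
SELECT EXISTS (
    SELECT 1 FROM crdb_internal.gossip_nodes WHERE NOT is_live
) AS endangered;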

@dianasaur323
Contributor

@tschottdorf are we still going to get to this by 2.0? Perhaps it makes sense to move it out? I have it as part of our work to make CockroachDB easier to debug when a majority of nodes is lost, but I don't think it's a requirement for that work.

@tbg
Member Author

tbg commented Jan 18, 2018

I don't think much will happen here in the 2.0 timeframe, so it's justifiable to move it out.

@dianasaur323
Contributor

Perfect, moving out

@dianasaur323 dianasaur323 modified the milestones: 2.0, 2.1 Jan 22, 2018
@knz knz added A-monitoring S-3-wrong-metadata Issues causing erroneous metadata or monitoring stats to be returned. labels Apr 27, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 15, 2018
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@petermattis petermattis removed this from the 2.1 milestone Oct 5, 2018
@awoods187
Contributor

We've recently had customers run into this problem. A customer decided to shut down nodes via 'cockroach node decommission ' and wait for the replica count to go down to zero (or get stuck at one) before terminating the host. A range was reported as under-replicated at one point, but the old nodes were shut down anyway and that error went away. We need to make sure these errors are not lost!
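
For instance, a hedged sketch of a pre-termination check (using the hypothetical node ID 9 for the decommissioning node, and assuming the replicas column of crdb_internal.ranges_no_leases lists node/store IDs, which coincide when each node has one store): any row returned means the node still holds replicas and should not be terminated yet.

-- node ID 9 is a stand-in for the decommissioning node
SELECT range_id, replicas
FROM crdb_internal.ranges_no_leases
WHERE 9 = ANY (replicas);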

@BramGruneir
Member

So I wanted to add more colour here, since this is important for customers who have to deal with data sovereignty.

Let's take this basic setup:

A 9 node cluster with three regions: a, b and c.

SELECT node_id, locality FROM crdb_internal.gossip_nodes;
  node_id |    locality
+---------+-----------------+
        1 | {"region": "a"}
        2 | {"region": "a"}
        3 | {"region": "a"}
        4 | {"region": "b"}
        5 | {"region": "b"}
        6 | {"region": "b"}
        7 | {"region": "c"}
        8 | {"region": "c"}
        9 | {"region": "c"}


CREATE DATABASE testdb;
CREATE TABLE IF NOT EXISTS testdb.test (
    pk_region STRING(20),
    pk_thread STRING(20),
    column1 STRING(20),
    CONSTRAINT "primary" PRIMARY KEY (pk_region ASC, pk_thread ASC)
)
    PARTITION BY LIST (pk_region) (
        PARTITION pa VALUES IN ('a'),
        PARTITION pb VALUES IN ('b'),
        PARTITION pc VALUES IN ('c'),
        PARTITION "default" VALUES IN (DEFAULT));
ALTER PARTITION pa OF TABLE testdb.test CONFIGURE ZONE USING constraints = '["+region=a"]';
ALTER PARTITION pb OF TABLE testdb.test CONFIGURE ZONE USING constraints = '["+region=b"]';
ALTER PARTITION pc OF TABLE testdb.test CONFIGURE ZONE USING constraints = '["+region=c"]';

With this setup, even without any data, you get this:

SHOW experimental_ranges FROM TABLE testdb.test;
    start_key    |    end_key     | range_id | replicas | lease_holder
+----------------+----------------+----------+----------+--------------+
  NULL           | /"a"           |       21 | {1,6,8}  |            6
  /"a"           | /"a"/PrefixEnd |       22 | {1,2,3}  |            1
  /"a"/PrefixEnd | /"b"           |       23 | {1,6,8}  |            1
  /"b"           | /"b"/PrefixEnd |       24 | {4,5,6}  |            5
  /"b"/PrefixEnd | /"c"           |       25 | {3,4,7}  |            7
  /"c"           | /"c"/PrefixEnd |       51 | {7,8,9}  |            9
  /"c"/PrefixEnd | NULL           |       52 | {3,4,7}  |            7

And if you look in the UI:
[Screenshot: 2019-06-18 13:26:52]

Now I'll kill nodes 7, 8 and 9. So all of c.

In the UI we get:
[Screenshot: 2019-06-18 13:28:15]

And after 5 mins, the cluster settles into:
[Screenshot: 2019-06-18 13:33:40]

The problem ranges page reports that nothing is wrong, and the range report for range 51 shows:
[Screenshot: 2019-06-18 13:36:08]

Also, SHOW experimental_ranges never finishes. Is there an issue for that? It should be able to return and show the unavailability.
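
(A speculative workaround, assuming crdb_internal.ranges_no_leases skips the lease lookup that blocks on unavailable ranges, and that it exposes a table_name column, as it did in versions of that era:)

SELECT range_id, start_pretty, end_pretty, replicas
FROM crdb_internal.ranges_no_leases
WHERE table_name = 'test';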

Of course, bring the nodes back online and voila.
[Screenshot: 2019-06-18 13:38:14]

I also have no idea why the number of ranges increased then decreased. That's another issue that I bet is part of this as well.

cc @drewdeally

@github-actions

github-actions bot commented Jun 8, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
