
stability: allow monitoring whether cluster is (at least partially) unavailable #19644

Closed
tbg opened this issue Oct 30, 2017 · 8 comments

Labels: A-kv-client (Relating to the KV client and the KV interface.), A-monitoring, C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception), no-issue-activity, S-3-wrong-metadata (Issues causing erroneous metadata or monitoring stats to be returned.), T-kv (KV Team), X-stale

Comments

@tbg
Member

tbg commented Oct 30, 2017

To monitor cluster availability, the current options are

  1. watching the live node count and
  2. watching the ranges_unavailable metric,

but neither is authoritative. For example, the former may be too optimistic: in a three-node cluster with one node down, ranges that were (for whatever reason) replicated less than 3x at the time of failure can still be hurt. Similarly, the ranges_unavailable metric won't report anything for a range that is completely unavailable (i.e. not a single replica is alive to complain).

We should expose a metric that comes as close as possible to reporting the "true" availability of the cluster. Naively, this could be achieved by scanning the meta ranges and intersecting the result with the known live nodes, but that would be too expensive.
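
For illustration, a rough approximation of that check can be sketched against today's crdb_internal tables (which postdate this comment; note also that the replicas column may hold store IDs, which coincide with node IDs only when each node runs a single store). A sketch, not an authoritative implementation:

SELECT r.range_id, r.replicas
FROM crdb_internal.ranges_no_leases AS r
WHERE (
    -- count replicas that sit on live nodes; a range has lost quorum
    -- when fewer than a majority of its replicas are live
    SELECT count(*)
    FROM crdb_internal.gossip_nodes AS n
    WHERE n.is_live AND n.node_id = ANY (r.replicas)
) < array_length(r.replicas, 1) // 2 + 1;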

@tbg tbg added this to the 1.2 milestone Oct 30, 2017
@dianasaur323
Contributor

@tschottdorf wouldn't this be nice? Our live node count metric is flaky, and on top of that it doesn't really represent cluster availability, since it makes people think they might have unavailable data when, having lost only one node, they probably don't. I didn't realize ranges_unavailable would not capture ranges that had lost all three replicas. Interesting...

On a more important note though, I noticed that you marked this as a 1.2 milestone. Were you planning on taking this on?

@tbg
Member Author

tbg commented Nov 7, 2017

@dianasaur323 this came up during discussion with one of our customers (same as #8473), which is why I tentatively marked it for 1.2. I have no immediate plans to work on it, though.

Note that, this being a distributed system, authoritative answers are expensive. Perhaps the sweet spot is marking a cluster as "endangered" whenever even a single node is down. If we wanted to get fancier, we could look at all the zone configs and constraints and compute whether the available nodes can possibly satisfy quorum everywhere; that check is too optimistic on its own, but pessimizing it by juxtaposing it with the unavailable_replicas metric might make it pretty good. On top of that, we could offer a more expensive (but authoritative) manually triggered check that walks the meta ranges and cross-checks them against node liveness info to determine whether any replicas have lost quorum.
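
A minimal sketch of the cheap signal, assuming the is_live column that crdb_internal.gossip_nodes exposes in current versions:

-- "endangered" as soon as any node in the cluster is not live
SELECT EXISTS (
    SELECT 1 FROM crdb_internal.gossip_nodes WHERE NOT is_live
) AS endangered;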

@dianasaur323
Contributor

@tschottdorf are we still going to get to this by 2.0? Perhaps it makes sense to move it out? I have it as part of our work to make CockroachDB easier to debug when a majority of nodes is lost, but I don't think it's a requirement for that work.

@tbg
Member Author

tbg commented Jan 18, 2018

I don't think much will happen here in the 2.0 timeframe, so it's justifiable to move it out.

@dianasaur323
Contributor

Perfect, moving out

@dianasaur323 dianasaur323 modified the milestones: 2.0, 2.1 Jan 22, 2018
@knz knz added A-monitoring S-3-wrong-metadata Issues causing erroneous metadata or monitoring stats to be returned. labels Apr 27, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 15, 2018
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@petermattis petermattis removed this from the 2.1 milestone Oct 5, 2018
@awoods187
Contributor

We've recently had customers run into this problem. A customer decided to shut down nodes via 'cockroach node decommission ' and wait for the replica count to go down to zero (or get stuck at one) before terminating the host. A range was reported as under-replicated at one point, but the old nodes were shut down anyway and that error went away. We need to make sure these errors are not lost!
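
For instance, a hedged sketch of a pre-termination check (using the hypothetical node ID 9 for the decommissioning node, and assuming the replicas column of crdb_internal.ranges_no_leases lists node/store IDs, which coincide when each node has one store): any row returned means the node still holds replicas and should not be terminated yet.

-- node ID 9 is a stand-in for the decommissioning node
SELECT range_id, replicas
FROM crdb_internal.ranges_no_leases
WHERE 9 = ANY (replicas);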

@BramGruneir
Member

So I wanted to add more colour here, since this is important for customers who have to deal with data sovereignty.

Let's take this basic setup:

A 9 node cluster with three regions: a, b and c.

SELECT node_id, locality FROM crdb_internal.gossip_nodes;
  node_id |    locality
+---------+-----------------+
        1 | {"region": "a"}
        2 | {"region": "a"}
        3 | {"region": "a"}
        4 | {"region": "b"}
        5 | {"region": "b"}
        6 | {"region": "b"}
        7 | {"region": "c"}
        8 | {"region": "c"}
        9 | {"region": "c"}


CREATE DATABASE testdb;
CREATE TABLE IF NOT EXISTS testdb.test (
    pk_region STRING(20),
    pk_thread STRING(20),
    column1 STRING(20),
    CONSTRAINT "primary" PRIMARY KEY (pk_region ASC, pk_thread ASC)
)
    PARTITION BY LIST (pk_region) (
        PARTITION pa VALUES IN ('a'),
        PARTITION pb VALUES IN ('b'),
        PARTITION pc VALUES IN ('c'),
        PARTITION "default" VALUES IN (DEFAULT));
ALTER PARTITION pa OF TABLE testdb.test CONFIGURE ZONE USING constraints = '["+region=a"]';
ALTER PARTITION pb OF TABLE testdb.test CONFIGURE ZONE USING constraints = '["+region=b"]';
ALTER PARTITION pc OF TABLE testdb.test CONFIGURE ZONE USING constraints = '["+region=c"]';

With this setup, even without any data, you get this:

SHOW experimental_ranges FROM TABLE testdb.test;
    start_key    |    end_key     | range_id | replicas | lease_holder
+----------------+----------------+----------+----------+--------------+
  NULL           | /"a"           |       21 | {1,6,8}  |            6
  /"a"           | /"a"/PrefixEnd |       22 | {1,2,3}  |            1
  /"a"/PrefixEnd | /"b"           |       23 | {1,6,8}  |            1
  /"b"           | /"b"/PrefixEnd |       24 | {4,5,6}  |            5
  /"b"/PrefixEnd | /"c"           |       25 | {3,4,7}  |            7
  /"c"           | /"c"/PrefixEnd |       51 | {7,8,9}  |            9
  /"c"/PrefixEnd | NULL           |       52 | {3,4,7}  |            7

And if you look in the UI:
[Screenshot: 2019-06-18 13:26:52]

Now I'll kill nodes 7, 8 and 9. So all of c.

In the UI we get:
[Screenshot: 2019-06-18 13:28:15]

And after 5 mins, the cluster settles into:
[Screenshot: 2019-06-18 13:33:40]

The problem ranges page reports that nothing is wrong, and the range report for range 51 shows:
[Screenshot: 2019-06-18 13:36:08]

Also, SHOW experimental_ranges never finishes. Is there an issue for that? It should be able to return and show the unavailability.
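
(A speculative workaround, assuming crdb_internal.ranges_no_leases skips the lease lookup that blocks on unavailable ranges, and that it exposes a table_name column, as it did in versions of that era:)

SELECT range_id, start_pretty, end_pretty, replicas
FROM crdb_internal.ranges_no_leases
WHERE table_name = 'test';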

Of course, bring the nodes back online and voila.
[Screenshot: 2019-06-18 13:38:14]

I also have no idea why the number of ranges increased then decreased. That's another issue that I bet is part of this as well.

cc @drewdeally

@github-actions

github-actions bot commented Jun 8, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
