Added `rpk cluster health` command #4295

mmaslankaprv · 2022-04-15T13:14:13Z

Cover letter

Added an admin API endpoint, `GET /v1/cluster/health_overview`, that returns a simple overview of cluster health. The endpoint is intended as a simple indicator of overall cluster health: it provides an `is_healthy` flag, and the additional fields in the overview should make debugging potential health issues easier.
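As a rough illustration of how a client might consume this endpoint, here is a minimal Go sketch. It is not the rpk implementation. Only the path and the `is_healthy` field come from this PR; the type and function names are hypothetical, and the admin API address is taken from the command line (something like `http://localhost:9644` is an assumption about a local setup).

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// healthOverview holds the one field the cover letter names explicitly.
// Any additional fields in the real response are ignored by this sketch.
type healthOverview struct {
	IsHealthy bool `json:"is_healthy"`
}

// parseOverview decodes a health overview JSON document.
func parseOverview(data []byte) (healthOverview, error) {
	var ov healthOverview
	err := json.Unmarshal(data, &ov)
	return ov, err
}

// fetchOverview issues GET <adminURL>/v1/cluster/health_overview.
func fetchOverview(adminURL string) (healthOverview, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(adminURL + "/v1/cluster/health_overview")
	if err != nil {
		return healthOverview{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return healthOverview{}, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return healthOverview{}, err
	}
	return parseOverview(body)
}

func main() {
	// The admin API base URL is supplied by the caller (hypothetical usage).
	if len(os.Args) < 2 {
		fmt.Println("usage: health <admin-api-url>")
		return
	}
	ov, err := fetchOverview(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("is_healthy:", ov.IsHealthy)
}
```

Keeping the JSON decoding in `parseOverview` separate from the HTTP call makes the decoding easy to check without a running cluster.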

Fixes: #4526

Release notes

Features

Added `GET /v1/cluster/health_overview` admin API endpoint.

dotnwat

very short review. looks good on a first pass.

I think we can close the other PR with the 5 commits which add the health summary? I'm building on top of this PR for other stuff, too.

dotnwat · 2022-04-15T21:59:14Z

src/v/cluster/health_monitor_backend.cc

+    auto ec = co_await maybe_refresh_cluster_health(
+      force_refresh::no, deadline);


Is the idea here that any node can serve a health report, and that it will be refreshed if it is stale? I wonder if we should always retrieve health reports from the controller?

I see that `ec` is used to determine `is_healthy`. Presumably it says the cluster isn't healthy if we couldn't refresh?

Exactly, I assumed that if we cannot refresh, the cluster is in bad condition.

hmm I'm not very familiar with this code so please tell me if I have some wrong assumptions.

But I think if the cluster was healthy but we couldn't refresh, we may end up with a `cluster_health_overview` that shows no nodes down and all partitions having leaders, yet `is_healthy` is false, which may be confusing.

Maybe we should add a field like obsolete cluster state to the cluster_health_overview?
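The concern in this thread can be sketched in a few lines of Go. This is not the Redpanda code: the `overview` type, the `Stale` field (standing in for the proposed "obsolete cluster state" field), `errRefresh`, and `buildOverview` are all hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errRefresh is a hypothetical error representing a failed health refresh.
var errRefresh = errors.New("refresh timed out")

// overview is a simplified, hypothetical stand-in for cluster_health_overview.
type overview struct {
	IsHealthy            bool
	Stale                bool // hypothetical: true when the refresh failed
	NodesDown            []int
	LeaderlessPartitions []string
}

// buildOverview derives the flags from the refresh result and the cached data.
// A failed refresh makes the cluster unhealthy, as described in the thread,
// but Stale lets a reader tell that case apart from a real outage.
func buildOverview(refreshErr error, nodesDown []int, leaderless []string) overview {
	stale := refreshErr != nil
	return overview{
		IsHealthy:            !stale && len(nodesDown) == 0 && len(leaderless) == 0,
		Stale:                stale,
		NodesDown:            nodesDown,
		LeaderlessPartitions: leaderless,
	}
}

func main() {
	// Cached data looks fine, but the refresh failed.
	ov := buildOverview(errRefresh, nil, nil)
	fmt.Printf("healthy=%t stale=%t nodes_down=%d\n",
		ov.IsHealthy, ov.Stale, len(ov.NodesDown))
	// prints: healthy=false stale=true nodes_down=0
}
```

With such a field, an unhealthy overview that lists no down nodes and no leaderless partitions would no longer be ambiguous.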

src/go/rpk/pkg/cli/cmd/redpanda/admin/admin.go

src/go/rpk/pkg/cli/cmd/redpanda/admin/cluster/cluster.go

dotnwat · 2022-04-19T02:17:31Z

I factored out part of this PR into a commit I depended on: 487ab46. Maybe cherry-pick it into this PR? It's the first commit in PR #4319.

Introduced the `cluster_health_overview` type, which represents a simple view of cluster health. The type provides a flag indicating whether the cluster is healthy and, if it is not, some additional information that should make it easier to understand what makes the cluster unhealthy. Signed-off-by: Michal Maslanka <michal@vectorized.io>

Implemented logic that goes over all the available node health reports to build the cluster health overview. The `get_cluster_health_overview` method should be a place where more complicated health overview logic may be added in future. Signed-off-by: Michal Maslanka <michal@vectorized.io>

Signed-off-by: Michal Maslanka <michal@vectorized.io>

Added `GET /v1/cluster/health_overview` path to redpanda admin server. Signed-off-by: Michal Maslanka <michal@vectorized.io>

Signed-off-by: Michal Maslanka <michal@vectorized.io>

src/go/rpk/pkg/cli/cmd/cluster/health.go

src/go/rpk/pkg/api/admin/api_cluster.go

src/go/rpk/pkg/cli/cmd/cluster/health.go

Signed-off-by: Michal Maslanka <michal@vectorized.io>

twmb

cc @dotnwat and @jcsp: Are we planning to have any other command related to health / extend health?

At the moment this reserves rpk cluster health -- do we have a guess as to how many commands we are going to stuff under rpk cluster? Currently it's config, maintenance, and health, I'm wondering if we have some list that we plan to add for the next 3 to 6mo. Mostly, I'm wary of rpk cluster becoming a dumping ground for 30 commands.

From what I can tell, some of what's under rpk redpanda admin can / should live next to these cluster commands: rpk redpanda admin brokers {list,decommission,recommission} potentially (since these are commands on the cluster), the soon-to-be-merged rpk redpanda admin partitions list (and in the future, move). Perhaps even rpk redpanda admin config {log-level,print} should be deleted (should this be merged with rpk cluster config?)

If we can get a few thoughts here documented, I think I'd be more comfortable merging this (and more comfortable with the recent maintenance -- it wasn't that long ago that I was proposing deprecating rpk cluster entirely).

src/go/rpk/pkg/cli/cmd/cluster/health.go

jcsp · 2022-04-29T09:16:12Z

> cc @dotnwat and @jcsp: Are we planning to have any other command related to health / extend health?

Yes.

> At the moment this reserves rpk cluster health -- do we have a guess as to how many commands we are going to stuff under rpk cluster? Currently it's config, maintenance, and health, I'm wondering if we have some list that we plan to add for the next 3 to 6mo. Mostly, I'm wary of rpk cluster becoming a dumping ground for 30 commands.

I don't think we'll get to 30, but there will probably be a bunch. Globally, rpk is going to have lots of commands between the on-prem tooling, the cloud client stuff, the BYOC client stuff, and the kafka client stuff -- grouping the commands that are about managing clusters within a prefix up-front feels like the right thing to do, and I think we've got to get away from the confusing "redpanda" prefix that to us means "the daemon" but to a user is just the name of our company+product.

> From what I can tell, some of what's under rpk redpanda admin can / should live next to these cluster commands: rpk redpanda admin brokers {list,decommission,recommission} potentially (since these are commands on the cluster), the soon-to-be-merged rpk redpanda admin partitions list (and in the future, move). Perhaps even rpk redpanda admin config {log-level,print} should be deleted (should this be merged with rpk cluster config?)

"redpanda admin config log-level" is a per-node thing, I would be inclined to put that under a "node" or "daemon" prefix if we can settle on that as a category for commands that operate directly on a single daemon. It's also just a kind of weird command, because we don't provide true configuration for log levels (it's one of the things people have to mess with systemd to set permanently). Once we add real config for log levels, that would be just another property in "cluster config", but I think we'd keep the node-level ephemeral log settings too: it's a useful pattern in support to turn up logging but just briefly.

> If we can get a few thoughts here documented, I think I'd be more comfortable merging this (and more comfortable with the recent maintenance -- it wasn't that long ago that I was proposing deprecating rpk cluster entirely).

This is what I think it should look like:
https://docs.google.com/document/d/1gZfgeWJbtCSOnB92re6HcjMEc6KJDuj_dsrO-XPgCU8/edit

Basically this is two namespaces that logically belong to the core team: "cluster" and "node". These are all things that an FMC user wouldn't ever touch, and would naturally break off into a "redpandactl" tool for administrators, if we ever departed from the single CLI binary model.

LenaAn · 2022-04-29T14:25:50Z

src/v/cluster/health_monitor_frontend.h

+     *  Health overview is based on the information available in health monitor.
+     *  Cluster is considered as healthy when follwing conditions are met:
+     *
+     * - all nodes that are are responding
+     * - all partitions have leaders
+     * - cluster controller is present (_raft0 leader)
+     */


Also, this comment doesn't mention that the cluster is not healthy if we couldn't refresh health information (does that mean that some nodes are not responding?)

This is implicitly covered by "all cluster nodes are responding". I've changed the comment to fix the error in the first item.
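The three conditions listed in the header comment can be expressed as a small predicate. The following Go sketch is illustrative only and is not how the C++ health monitor is written; `nodeState`, `clusterState`, and the convention that a negative `ControllerID` means "no raft0 leader" are assumptions made for the example.

```go
package main

import "fmt"

// nodeState is a hypothetical, simplified per-node health report.
type nodeState struct {
	ID         int
	Responding bool
}

// clusterState is a hypothetical input for the health check.
type clusterState struct {
	Nodes                []nodeState
	LeaderlessPartitions []string
	ControllerID         int // assumption: negative when no raft0 leader is known
}

// unhealthyReasons returns one message per violated condition:
// all nodes responding, all partitions have leaders, controller present.
func unhealthyReasons(c clusterState) []string {
	var reasons []string
	for _, n := range c.Nodes {
		if !n.Responding {
			reasons = append(reasons, fmt.Sprintf("node %d is not responding", n.ID))
		}
	}
	if n := len(c.LeaderlessPartitions); n > 0 {
		reasons = append(reasons, fmt.Sprintf("%d partition(s) have no leader", n))
	}
	if c.ControllerID < 0 {
		reasons = append(reasons, "no cluster controller (raft0 leader)")
	}
	return reasons
}

// isHealthy reports whether every condition holds.
func isHealthy(c clusterState) bool {
	return len(unhealthyReasons(c)) == 0
}

func main() {
	c := clusterState{
		Nodes:        []nodeState{{ID: 0, Responding: true}, {ID: 1, Responding: false}},
		ControllerID: 0,
	}
	fmt.Println("healthy:", isHealthy(c))
	for _, r := range unhealthyReasons(c) {
		fmt.Println(" -", r)
	}
}
```

Returning the reasons rather than only a boolean mirrors the intent stated in the PR that the overview should help explain why a cluster is unhealthy.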

Added `rpk redpanda cluster health` command allowing users to query the cluster health overview. The command provides two flags: `-w`/`--watch`, which blocks and prints every cluster health change, and `-e`/`--exit-on-healthy`, which, when passed together with `--watch`, exits once the cluster becomes healthy. Signed-off-by: Michal Maslanka <michal@vectorized.io>
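The interplay of `--watch` and `--exit-on-healthy` described in this commit message can be sketched as a polling loop. This is a hedged illustration in Go, not the code in `health.go`: `watchHealth`, its parameters, and the `maxPolls` bound are hypothetical, and the health source is injected as a function so the loop can run without a cluster.

```go
package main

import (
	"fmt"
	"time"
)

// watchHealth polls fetch every interval and calls onChange whenever the
// health flag differs from the previous poll (and once for the first poll).
// With exitOnHealthy set, it returns as soon as a poll reports healthy.
// maxPolls bounds the loop; 0 means poll until an exit condition or error.
func watchHealth(
	fetch func() (bool, error),
	interval time.Duration,
	exitOnHealthy bool,
	maxPolls int,
	onChange func(healthy bool),
) error {
	first := true
	last := false
	for i := 0; maxPolls == 0 || i < maxPolls; i++ {
		healthy, err := fetch()
		if err != nil {
			return err
		}
		if first || healthy != last {
			onChange(healthy)
			first, last = false, healthy
		}
		if exitOnHealthy && healthy {
			return nil
		}
		time.Sleep(interval)
	}
	return nil
}

func main() {
	// Simulated sequence of health states: unhealthy twice, then healthy.
	states := []bool{false, false, true}
	next := 0
	fetch := func() (bool, error) {
		s := states[next]
		next++
		return s, nil
	}
	err := watchHealth(fetch, time.Millisecond, true, len(states), func(h bool) {
		fmt.Println("healthy:", h)
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

Only changes are reported, so the repeated unhealthy poll in the simulated sequence produces no second line of output.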

twmb · 2022-05-03T00:57:25Z

src/go/rpk/pkg/cli/cmd/cluster/health.go

+	cmd := cobra.Command{
+		Use:   "health",
+		Short: "Queries cluster for health overview.",
+		Long: `


Extraneous newline to begin this long text -- how about

Long: `Queries health overview.

Health overview is created based on the health reports collected periodically
from all nodes in the cluster. A cluster is considered healthy when the
following conditions are met:

* all cluster nodes are responding
* all partitions have leaders
* the cluster controller is present
`,

twmb

lgtm pending a long help text fix; preemptively approving now to remove my approval block.

I like the thoughts above on rpk cluster vs rpk node. We can then eventually do away with rpk redpanda.

src/go/rpk/pkg/cli/cmd/cluster/health.go

Signed-off-by: Noah Watkins <noah@redpanda.com>

dotnwat · 2022-05-03T05:53:25Z

> lgtm pending a long help text fix; preemptively approving now to remove my approval block.

I fixed this for Michal.

dotnwat · 2022-05-03T19:14:58Z

/backport v22.1.x

vbotbuildovich · 2022-05-03T19:15:50Z

Failed to run cherry-pick command. see workflow
I executed the below command:

git cherry-pick -x a7ba00d81618f24e9ba6d954aaea4c5bb64c04e1 6f64a1bd4a2b17692d3447f96556f04e842c5cca c3dcd9c25b4eb445b976072415080c2ba93039d6 8766ee31cacce814852d0b299c9be582edcea37a 62432ac03d6279f488f8efcffb478b5ed04dd99e e43a5a2953552adeb52cf298dd5a85590dd9afd4 0319db51eb94576f2829402b2c4aa23bf1c9b152 7f864d0af7dcfcda3d91d0c118a18ff4385f5bc3 4ddb6b2a794831be09761b0b222185d25dee5ffc

[v22.1.x] backport #4295

github-actions bot added area/redpanda area/rpk labels Apr 15, 2022

mmaslankaprv force-pushed the rpk-health-overview branch from d76444b to 69588e2 Compare April 15, 2022 13:16

dotnwat reviewed Apr 15, 2022

dotnwat reviewed Apr 16, 2022

src/go/rpk/pkg/cli/cmd/redpanda/admin/admin.go Outdated

dotnwat reviewed Apr 16, 2022

src/go/rpk/pkg/cli/cmd/redpanda/admin/cluster/cluster.go Outdated

mmaslankaprv force-pushed the rpk-health-overview branch 2 times, most recently from 46e725f to 011fc22 Compare April 21, 2022 06:19

mmaslankaprv marked this pull request as ready for review April 21, 2022 06:57

mmaslankaprv requested review from NyaliaLui, twmb, 0x5d, LenaAn, ztlpn and VadimPlh as code owners April 21, 2022 06:57

twmb assigned r-vasquez and twmb Apr 22, 2022

mmaslankaprv requested a review from dotnwat April 28, 2022 06:17

mmaslankaprv added 5 commits April 28, 2022 21:06

c/health_frontend: returning cluster_health_overview

c3dcd9c

Signed-off-by: Michal Maslanka <michal@vectorized.io>

admin: added get cluster health overview endpoint

8766ee3

Added `GET /v1/cluster/health_overview` path to redpanda admin server. Signed-off-by: Michal Maslanka <michal@vectorized.io>

tests: added cluster health overview endpoint test

62432ac

Signed-off-by: Michal Maslanka <michal@vectorized.io>

mmaslankaprv force-pushed the rpk-health-overview branch from 011fc22 to c5724bb Compare April 28, 2022 19:07

mmaslankaprv requested a review from r-vasquez as a code owner April 28, 2022 19:07

r-vasquez reviewed Apr 29, 2022

src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated

src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated

twmb requested changes Apr 29, 2022

src/go/rpk/pkg/api/admin/api_cluster.go Outdated

src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated

src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated

rpk: added cluster overview admin api definition

e43a5a2

Signed-off-by: Michal Maslanka <michal@vectorized.io>

mmaslankaprv force-pushed the rpk-health-overview branch from c5724bb to c31eca7 Compare April 29, 2022 06:02

mmaslankaprv requested review from twmb and r-vasquez April 29, 2022 06:02

twmb changed the title ~~Added rpk admin cluster health command~~ Added rpk cluster health command Apr 29, 2022

twmb requested changes Apr 29, 2022

src/go/rpk/pkg/cli/cmd/cluster/health.go

src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated

mmaslankaprv force-pushed the rpk-health-overview branch from c31eca7 to bd6ce9d Compare April 29, 2022 07:14

mmaslankaprv requested a review from twmb April 29, 2022 08:32

LenaAn reviewed Apr 29, 2022

mmaslankaprv force-pushed the rpk-health-overview branch from bd6ce9d to 0319db5 Compare May 2, 2022 06:57

twmb reviewed May 3, 2022

twmb previously approved these changes May 3, 2022

r-vasquez reviewed May 3, 2022

src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated

dotnwat added 2 commits May 2, 2022 22:48

rpk: fix typo in help text

7f864d0

Signed-off-by: Noah Watkins <noah@redpanda.com>

rpk: improve cluster health long help message

4ddb6b2

Signed-off-by: Noah Watkins <noah@redpanda.com>

dotnwat dismissed twmb’s stale review via 4ddb6b2 May 3, 2022 05:51

r-vasquez approved these changes May 3, 2022

dotnwat merged commit ad4a107 into redpanda-data:dev May 3, 2022

dotnwat mentioned this pull request May 3, 2022

[v22.1.x] backport #4295 #4528

Merged

dotnwat added a commit that referenced this pull request May 4, 2022

Merge pull request #4528 from dotnwat/v22.1.x-4295

2914a5a

[v22.1.x] backport #4295

RafalKorepta mentioned this pull request Jun 19, 2024

Do not perform Redpanda decommission based on annotation redpanda-data/redpanda-operator#161

Merged
