Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added rpk cluster health command #4295

Merged
merged 9 commits into from
May 3, 2022

Conversation

mmaslankaprv
Copy link
Member

@mmaslankaprv mmaslankaprv commented Apr 15, 2022

Cover letter

Added an admin API endpoint GET /v1/cluster/health_overview returning simple overview of cluster health. The endpoint is intended to be used as a simple indicator of overall cluster health. It provides is_healthy flag that is a cluster health indicator. Additional fields available in the overview should make debugging potential health issues easier

Fixes: #4526

Release notes

Features

added GET /v1/cluster/health_overview admin api endpoint

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very short review. looks good on a first pass.

I think we can close the other PR with the 5 commits which add the health summary? I'm building on top of this PR for other stuff, too.

Comment on lines +694 to +724
auto ec = co_await maybe_refresh_cluster_health(
force_refresh::no, deadline);
Copy link
Member

@dotnwat dotnwat Apr 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea here that any node can serve a health report, and that it will be refreshed if it is stale? I wonder if we should always retrieve health reports from the controller?

I see that ec is used to determine is_healthly. Presumably it said the cluster isn't healthy if we couldn't refresh?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly, i assumed that if we can not refresh cluster is in bad condition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I'm not very familiar with this code so please tell me if I have some wrong assumptions.

but I think if the cluster was healthy, but we couldn't refresh, then we may have cluster_health_overview with no nodes down, all nodes have leaders, but is_healthy is false, which may be confusing.

Maybe we should add a field like obsolete cluster state to the cluster_health_overview?

@dotnwat
Copy link
Member

dotnwat commented Apr 19, 2022

I factored out part of this PR into a commit I depended on: 487ab46 maybe cherry-pick it into this PR? It's the first commit in the PR #4319

@mmaslankaprv mmaslankaprv force-pushed the rpk-health-overview branch 2 times, most recently from 46e725f to 011fc22 Compare April 21, 2022 06:19
@mmaslankaprv mmaslankaprv marked this pull request as ready for review April 21, 2022 06:57
Introduced `cluster_health_overview` type that represents a simple
view of cluster health. The type provides a simple flag indicating if
cluster is healthy and if not it provides some additional information
that should make it easier to understand what makes the cluster
unhealthy.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
Implemented logic that goes over all the available node health reports
to build the cluster health overview. The `get_cluster_health_overview`
method should be a place where more complicated health overview logic
may be added in future.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
Signed-off-by: Michal Maslanka <michal@vectorized.io>
Added `GET /v1/cluster/health_overview` path to redpanda admin server.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
Signed-off-by: Michal Maslanka <michal@vectorized.io>
src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated Show resolved Hide resolved
src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated Show resolved Hide resolved
src/go/rpk/pkg/api/admin/api_cluster.go Outdated Show resolved Hide resolved
src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated Show resolved Hide resolved
src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated Show resolved Hide resolved
Signed-off-by: Michal Maslanka <michal@vectorized.io>
@twmb twmb changed the title Added rpk admin cluster health command Added rpk cluster health command Apr 29, 2022
Copy link
Contributor

@twmb twmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @dotnwat and @jcsp: Are we planning to have any other command related to health / extend health?

At the moment this reserves rpk cluster health -- do we have a guess as to how many commands we are going to stuff under rpk cluster? Currently it's config, maintenance, and health, I'm wondering if we have some list that we plan to add for the next 3 to 6mo. Mostly, I'm wary of rpk cluster becoming a dumping ground for 30 commands.

From what I can tell, some of what's under rpk redpanda admin can / should live next to these cluster commands: rpk redpanda admin brokers {list,decommission,recommission} potentially (since these are commands on the cluster), the soon-to-be-merged rpk redpanda admin partitions list (and in the future, move). Perhaps even rpk redpanda admin config {log-level,print} should be deleted (should this be merged with rpk cluster config?)

If we can get a few thoughts here documented, I think I'd be more comfortable merging this (and more comfortable with the recent maintenance -- it wasn't that long ago that I was proposing deprecating rpk cluster entirely).

src/go/rpk/pkg/cli/cmd/cluster/health.go Show resolved Hide resolved
src/go/rpk/pkg/cli/cmd/cluster/health.go Outdated Show resolved Hide resolved
@jcsp
Copy link
Contributor

jcsp commented Apr 29, 2022

cc @dotnwat and @jcsp: Are we planning to have any other command related to health / extend health?

Yes.

At the moment this reserves rpk cluster health -- do we have a guess as to how many commands we are going to stuff under rpk cluster? Currently it's config, maintenance, and health, I'm wondering if we have some list that we plan to add for the next 3 to 6mo. Mostly, I'm wary of rpk cluster becoming a dumping ground for 30 commands.

I don't think we'll get to 30, but there will probably be a bunch. Globally, rpk is going to have lots of commands between the on-prem tooling, the cloud client stuff, the BYOC client stuff, and the kafka client stuff -- grouping the commands that are about managing clusters within a prefix up-front feels like the right thing to do, and I think we've got to get away from the confusing "redpanda" prefix that to us means "the daemon" but to a user is just the name of our company+product.

From what I can tell, some of what's under rpk redpanda admin can / should live next to these cluster commands: rpk redpanda admin brokers {list,decommission,recommission} potentially (since these are commands on the cluster), the soon-to-be-merged rpk redpanda admin partitions list (and in the future, move). Perhaps even rpk redpanda admin config {log-level,print} should be deleted (should this be merged with rpk cluster config?)

"redpanda admin config log-level" is a per-node thing, I would be inclined to put that under a "node" or "daemon" prefix if we can settle on that as a category for commands that operate directly on a single daemon. It's also just a kind of weird command, because we don't provide true configuration for log levels (it's one of the things people have to mess with systemd to set permanently). Once we add real config for log levels, that would be just another property in "cluster config", but I think we'd keep the node-level ephemeral log settings too: it's a useful pattern in support to turn up logging but just briefly.

If we can get a few thoughts here documented, I think I'd be more comfortable merging this (and more comfortable with the recent maintenance -- it wasn't that long ago that I was proposing deprecating rpk cluster entirely).

This is what I think it should look like:
https://docs.google.com/document/d/1gZfgeWJbtCSOnB92re6HcjMEc6KJDuj_dsrO-XPgCU8/edit

Basically this is two namespaces that logically belong to the core team: "cluster" and "node". These are all things that an FMC user wouldn't ever touch, and would naturally break off into a "redpandactl" tool for administrators, if we ever departed from the single CLI binary model.

Comment on lines +64 to +70
* Health overview is based on the information available in health monitor.
* Cluster is considered as healthy when follwing conditions are met:
*
* - all nodes that are are responding
* - all partitions have leaders
* - cluster controller is present (_raft0 leader)
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this comment doesn't mention that cluster is not healthy if we couldn't refresh health information (does that mean that some nodes are not responding?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implicitly covered in all cluster nodes are responding, I've changed the comment to fix error in the fist item

Added `rpk redpanda cluster health` command allowing users to
query for cluster health overview, the command provides two flags `-w
--watch` that blocks and prints out all the cluster health changes and
`-e --exit-on-healthy` which when passed in with `--watch` exits after
cluster becomes healthy.

Signed-off-by: Michal Maslanka <michal@vectorized.io>
cmd := cobra.Command{
Use: "health",
Short: "Queries cluster for health overview.",
Long: `
Copy link
Contributor

@twmb twmb May 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extraneous newline to begin this long text -- how about

Long: `Queries health overview.

Health overview is created based on the health reports collected periodically
from all nodes in the cluster. A cluster is considered healthy when the
following conditions are met:

* all cluster nodes are responding
* all partitions have leaders
* the cluster controller is present
`,

twmb
twmb previously approved these changes May 3, 2022
Copy link
Contributor

@twmb twmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm pending a long help text fix; preemptively approving now to remove my approval block.

I like the thoughts above on rpk cluster vs rpk node. We can then eventually do away with rpk redpanda.

Signed-off-by: Noah Watkins <noah@redpanda.com>
Signed-off-by: Noah Watkins <noah@redpanda.com>
@dotnwat
Copy link
Member

dotnwat commented May 3, 2022

lgtm pending a long help text fix; preemptively approving now to remove my approval block.

I fixed this for Michal.

@dotnwat dotnwat merged commit ad4a107 into redpanda-data:dev May 3, 2022
@dotnwat
Copy link
Member

dotnwat commented May 3, 2022

/backport v22.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to run cherry-pick command. see workflow
I executed the below command:

git cherry-pick -x a7ba00d81618f24e9ba6d954aaea4c5bb64c04e1 6f64a1bd4a2b17692d3447f96556f04e842c5cca c3dcd9c25b4eb445b976072415080c2ba93039d6 8766ee31cacce814852d0b299c9be582edcea37a 62432ac03d6279f488f8efcffb478b5ed04dd99e e43a5a2953552adeb52cf298dd5a85590dd9afd4 0319db51eb94576f2829402b2c4aa23bf1c9b152 7f864d0af7dcfcda3d91d0c118a18ff4385f5bc3 4ddb6b2a794831be09761b0b222185d25dee5ffc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rpk: Added rpk cluster health command
7 participants