KEP-22: Diagnostic Bundle Collection #1506

vemelin-epm · 2020-05-12T00:49:39Z

What this PR does / why we need it:

Add support for collecting diagnostic data from the KUDO manager and an installed operator instance.

Includes:

the resources that are created by KUDO and therefore marked by the instance-specific set of labels
logs of deployed pods
KUDO manager resources
KUDO manager logs

All collected resources are stored in a directory structure

Fixes: D2IQ-63414

vemelin-epm · 2020-05-12T00:50:21Z

This is a simplified diagnostics version
Collect only the resources created by KUDO (filter by labels)
Flat structure under the instance
WIP status is due to missing tests
Current output:

diag
├── kudo
│   ├── pod_kudo-controller-manager-0
│   │   ├── kudo-controller-manager-0.log.gz
│   │   └── kudo-controller-manager-0.yaml
│   ├── serviceaccountlist.yaml
│   ├── servicelist.yaml
│   └── statefulsetlist.yaml
├── operator_zookeeper
│   ├── instance_zk
│   │   ├── pod_zk-zookeeper-0
│   │   │   ├── zk-zookeeper-0.log.gz
│   │   │   └── zk-zookeeper-0.yaml
│   │   ├── pod_zk-zookeeper-1
│   │   │   ├── zk-zookeeper-1.log.gz
│   │   │   └── zk-zookeeper-1.yaml
│   │   ├── pod_zk-zookeeper-2
│   │   │   ├── zk-zookeeper-2.log.gz
│   │   │   └── zk-zookeeper-2.yaml
│   │   ├── servicelist.yaml
│   │   ├── statefulsetlist.yaml
│   │   └── zk.yaml
│   ├── operatorversion_zookeeper-0.3.0
│   │   └── zookeeper-0.3.0.yaml
│   └── zookeeper.yaml
├── settings.yaml
└── version.yaml

ANeumann82

Generally much nicer! It's less complicated, still not easy (for me) to read, but acceptable. Maybe it's just a complex problem :)

I've added some nits and suggestions, my main issues at the moment:

Error handling: I've started to add suggestions, but generally a bare return err should be avoided. There are some cases where it's OK, but usually errors should carry a bit more context; otherwise it's hard to find where an error originated, as stack traces are not printed by default.
Documentation: A lot more detail about the intent of the interfaces, and a bit more detail on the structs and funcs, would go a long way toward helping to understand the code.
Some naming, although I'm not always sure there's a better naming scheme. I've added suggestions where I felt strongly about it.

go.mod

pkg/kudoctl/cmd/diagnostics/builder.go

pkg/kudoctl/cmd/diagnostics/diagnostics.go

pkg/kudoctl/cmd/diagnostics/print.go

go.mod

nfnt

This is already in great shape; it was much easier for me to grasp what's going on. I agree with what @ANeumann82 said: please provide good documentation around the various abstractions used here. That said, while I agree with some abstractions, e.g. the Printable interface, others seem a bit too much, though I probably would have come up with the same ones for this task 😆. We just have to keep in mind that function pointers, while very convenient, are hard to follow through the code. And by that I don't mean your code but any code. Hence, it would be great to describe the usage of the various function pointers in detail.

pkg/kudoctl/cmd/diagnostics/builder.go

pkg/kudoctl/cmd/diagnostics/diagnostics.go

zen-dog

First pass. Mostly wondering about the need for Printable and PrintableList types (see my comment)

pkg/kudoctl/cmd/diagnostics/diagnostics.go

pkg/kudoctl/cmd/diagnostics/collectors.go

pkg/kudoctl/cmd/diagnostics/print.go

vemelin-epm · 2020-05-14T13:34:07Z

The updates made to address this request from @zen-dog (#1506 (comment)) brought substantial code changes.

Printable abstraction removed
calls to collection processing context and resource configuration exposed instead of being implicitly provided via the builder
builder simplified to a plain sequential runner
callbacks modifying the shared context are removed from resource loader and made explicit
documentation added
error handling updated

Error treatment explained in brief:

resource provider errors that are explicitly marked as fatal in a runner sequence break the sequence; a missing resource is also treated as an error in this case
resource provider errors are printed where the requested resource or log would otherwise be
fatal errors and printing errors are grouped and returned from the "main" Collect function

Example output when a client returned an error for pods request:

diag
├── operator_zookeeper
│   ├── instance_zk
│   │   ├── pod.err
│   │   ├── servicelist.yaml
│   │   ├── statefulsetlist.yaml
│   │   └── zk.yaml
│   ├── operatorversion_zookeeper-0.3.0
│   │   └── zookeeper-0.3.0.yaml
│   └── zookeeper.yaml
...

zen-dog

Nice work and quite an improvement from the previous version. However, I'm not sold on the generic ResourceCollector that can be configured to collect any kind of resource. The interface feels clunky, you need runtime casting and even had to copy type Object interface from the apimachinery internals.
It seems that we could rather have a bunch of dedicated collectors, e.g. RbacCollector or StatefulSetCollector. Yes, it will be less generic, and, frankly, I don't mind. I'd rather live with some (hopefully minor) code repetition than have one generic rule-them-all interface. And all shared information (like path generation from the instance name) can live in a context.

pkg/kudoctl/cmd/diagnostics.go

zen-dog · 2020-05-14T19:47:14Z

pkg/kudoctl/cmd/diagnostics.go

+	cmd := &cobra.Command{
+		Use:   "diagnostics",
+		Short: "diagnostics",
+		Long:  "diagnostics command has sub-commands to collect and analyze diagnostics data",


"diagnostics command has sub-commands"? Not sure what you mean.

Pattern taken from here https://github.com/kudobuilder/kudo/blob/master/pkg/kudoctl/cmd/plan.go#L30

Ah, I see what you mean. Not every pattern found in the code base is good 😉 Also, it currently has only one sub-command.

But consistency should be more important here, as this is user-facing. Getting rid of bad patterns we inherit from is a separate issue.

@zen-dog @nfnt I am stuck here.

pkg/kudoctl/cmd/diagnostics.go

pkg/kudoctl/cmd/diagnostics/print.go

pkg/kudoctl/cmd/diagnostics/collectors.go

pkg/kudoctl/cmd/diagnostics/runner.go

zen-dog · 2020-05-14T22:13:41Z

pkg/kudoctl/cmd/diagnostics/processing_context.go

+	return fmt.Sprintf("%s/instance_%s", ctx.attachToOperator(), ctx.instanceName)
+}
+
+func (ctx *processingContext) mustSetOperatorNameFromOperatorVersion(o runtime.Object) {


Lol, this method name is too long, even by Java's standards.

It's a trade-off for the generic runtime.Object. I chose a fully descriptive name for a callback. It seemed better to me than adding more code.

Suggested change

func (ctx *processingContext) mustSetOperatorNameFromOperatorVersion(o runtime.Object) {

func (ctx *processingContext) setOperatorNameFromOperatorVersion(o runtime.Object) {

Same here: I don't understand the "must" prefix, having just the "set..." part makes it a lot clearer on invocation:

callback: ctx.setOperatorNameFromOperatorVersion,

This tells me: Ah, the callback will set the operator name from the operator version. Nice. :)

btw, I don't mind long descriptive names

pkg/kudoctl/cmd/diagnostics/resource_funcs.go

pkg/kudoctl/cmd/diagnostics/collectors.go

pkg/kudoctl/cmd/diagnostics/processing_context.go

pkg/kudoctl/cmd/diagnostics/diagnostics.go

vemelin-epm · 2020-05-27T16:28:43Z

A rough sketch of further development, adding diagnostics for dependencies once KEP-29 is implemented, can be found here
appended collection sequence
collector
The branch actually works if Kafka is installed according to the instructions and the Zookeeper Instance is then updated with an OwnerReference to the Kafka instance
@zen-dog especially FYI
POC branch (not this branch) output for Kafka with a dependency on Zookeeper

diag
├── kudo
│   ├── pod_kudo-controller-manager-0
│   │   ├── kudo-controller-manager-0.yaml
│   │   └── manager.log.gz
│   ├── serviceaccountlist.yaml
│   ├── servicelist.yaml
│   └── statefulsetlist.yaml
├── operator_kafka
│   ├── instance_kafka-instance
│   │   ├── kafka-instance.yaml
│   │   ├── operator_zookeeper
│   │   │   ├── instance_zookeeper-instance
│   │   │   │   ├── pod_zookeeper-instance-zookeeper-0
│   │   │   │   │   ├── kubernetes-zookeeper.log.gz
│   │   │   │   │   └── zookeeper-instance-zookeeper-0.yaml
│   │   │   │   ├── pod_zookeeper-instance-zookeeper-1
│   │   │   │   │   ├── kubernetes-zookeeper.log.gz
│   │   │   │   │   └── zookeeper-instance-zookeeper-1.yaml
│   │   │   │   ├── pod_zookeeper-instance-zookeeper-2
│   │   │   │   │   ├── kubernetes-zookeeper.log.gz
│   │   │   │   │   └── zookeeper-instance-zookeeper-2.yaml
│   │   │   │   ├── servicelist.yaml
│   │   │   │   ├── statefulsetlist.yaml
│   │   │   │   └── zookeeper-instance.yaml
│   │   │   ├── operatorversion_zookeeper-0.3.0
│   │   │   │   └── zookeeper-0.3.0.yaml
│   │   │   └── zookeeper.yaml
│   │   ├── pod_kafka-instance-kafka-0
│   │   │   ├── k8skafka.log.gz
│   │   │   ├── kafka-instance-kafka-0.yaml
│   │   │   └── kafka-node-exporter.log.gz
│   │   ├── pod_kafka-instance-kafka-1
│   │   │   ├── k8skafka.log.gz
│   │   │   ├── kafka-instance-kafka-1.yaml
│   │   │   └── kafka-node-exporter.log.gz
│   │   ├── pod_kafka-instance-kafka-2
│   │   │   ├── k8skafka.log.gz
│   │   │   ├── kafka-instance-kafka-2.yaml
│   │   │   └── kafka-node-exporter.log.gz
│   │   ├── rolebindinglist.yaml
│   │   ├── rolelist.yaml
│   │   ├── serviceaccountlist.yaml
│   │   ├── servicelist.yaml
│   │   └── statefulsetlist.yaml
│   ├── kafka.yaml
│   └── operatorversion_kafka-1.2.1
│       └── kafka-1.2.1.yaml
├── settings.yaml
└── version.yaml

zmalik · 2020-05-27T17:41:19Z

pkg/kudoctl/cmd/diagnostics/collectors.go

+				c.printer.printError(err, filepath.Join(c.parentDir(), fmt.Sprintf("pod_%s", pod.Name)), fmt.Sprintf("%s.log", container.Name))
+			} else {
+				c.printer.printLog(log, filepath.Join(c.parentDir(), fmt.Sprintf("pod_%s", pod.Name)), container.Name)
+				_ = log.Close()


why the blank identifier here?

would it be good enough to return the error? if it's nil, it's fine, and otherwise we will have the error in the return value

Error treatment in my code is probably not quite transparent.
The problem is that we have:

errors after which we have to stop the collection sequence (note that we're also likely to have embedded sequences with dependency instances), e.g. no OperatorVersion.

errors when retrieving less critical resources: print to file and continue

errors while printing: I chose to accumulate them in the printer

In addition, we want the errors for resources to be printed where the failed resources would be.
I put some explanation here #1506 (comment), though I think it's still not clear enough.
So, in this case: the error with logs probably cannot be fatal, so I think this collector should always return nil.
Should we print the error from closing the log stream? I just thought this error was irrelevant.

ANeumann82 · 2020-05-28T09:00:27Z

pkg/kudoctl/cmd/diagnostics.go

+	cmd.Flags().StringVar(&instance, "instance", "", "The instance name.")
+	cmd.Flags().DurationVar(&logSince, "log-since", 0, "Only return logs newer than a relative duration like 5s, 2m, or 3h. Defaults to all logs.")


I just noticed: We should probably have an (optional) parameter for the target output directory, right? It can be added in a later PR, though.

ANeumann82

LGTM. The only blocker for me is the change to the kudo.Client. I don't really see a good reason to inline the kubeClientset.

pkg/kudoctl/util/kudo/kudo.go

pkg/kudoctl/cmd/diagnostics/collectors.go

zen-dog · 2020-06-03T12:21:48Z

pkg/kudoctl/cmd/diagnostics/collectors.go

+	loadResourceFn func() (runtime.Object, error)
+	name           string               // object kind used to describe the error
+	parentDir      stringGetter         // parent dir to attach the printer's output
+	failOnError    bool                 // define whether the collector should return the error


Do we need the failOnError parameter? The only case where a collector should fail on error is when collecting an Instance resource. Can we just hardcode it?

pkg/kudoctl/cmd/diagnostics/collectors.go

zen-dog · 2020-06-03T12:30:35Z

pkg/kudoctl/cmd/diagnostics/collectors.go

@@ -0,0 +1,132 @@
+package diagnostics


I would probably extract each collector in its own small file and put them all into the diagnostics/collectors. Wdyt?

Generally: Yes :) But at the moment, the whole code is in a single package, this would require a full restructuring because of circular dependencies...

zen-dog · 2020-06-03T12:45:39Z

pkg/kudoctl/cmd/diagnostics/runner_helper.go

+func runnerForInstance(ir *resourceFuncsConfig, ctx *processingContext) *runner {
+	r := &runner{}
+
+	instance := resourceCollectorGroup{[]resourceCollector{


I still think that we would be better off with a dedicated high-level collector like InstanceCollector than with a generic resourceCollectorGroup. But this might be a topic for a later refactoring

zen-dog · 2020-06-03T12:47:37Z

pkg/kudoctl/cmd/diagnostics/runner_helper.go

+		name:           "role",
+		parentDir:      ctx.instanceDirectory,
+		printMode:      RuntimeObject})
+	r.addCollector(&logsCollector{


This is another reason why a dedicated PodCollector might be better: right now the complexity is hidden in the fact that logsCollector has to be last and that it depends on ctx.podList. A PodCollector would collect both pod.yaml and pod.log as an "atomic" unit.

zen-dog

I left a few comments, none of them are blocking (though dedicated collectors are at the top of my personal wish list). Nice work @vemelin-epm (and @ANeumann82)!

Please give this PR a changelog-worthy title and description. This is a nice feature and we should definitely highlight it in the release log.

nfnt

LGTM!

vemelin-epm requested review from alenkacz, gerred, kensipe, nfnt and zen-dog as code owners May 12, 2020 00:49

vemelin-epm requested review from porridge, ANeumann82 and zmalik and removed request for alenkacz May 12, 2020 00:51

vemelin-epm changed the title ~~simplified diagnostics~~ WIP: simplified diagnostics May 12, 2020

ANeumann82 requested changes May 12, 2020

nfnt reviewed May 12, 2020

nfnt requested changes May 12, 2020

zen-dog reviewed May 12, 2020

vemelin-epm force-pushed the ve/simplified-diagnostics branch from d160a0a to 20296de Compare May 12, 2020 12:58

vemelin-epm requested a review from zen-dog May 14, 2020 01:41

ANeumann82 requested changes May 14, 2020

zen-dog requested changes May 14, 2020

vemelin-epm mentioned this pull request May 16, 2020

WIP: diagnostics bundle #1436

Closed

zen-dog mentioned this pull request May 16, 2020

[KEP-29]: Extend diagnostics bundle to support operator dependencies #1511

Closed

vemelin-epm force-pushed the ve/simplified-diagnostics branch from 082294c to bb0c7d3 Compare May 18, 2020 11:16

nfnt requested changes May 18, 2020

vemelin-epm added 6 commits May 19, 2020 15:39

diagnostics bundle: initial mvp design

1b76766

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

refactoring: custom Object, resource holders

4be4c17

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

MVP: tree of dependent resources, processing context as map

671139b

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

refactor: isolate hierarchy

8766317

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

rename builder

73ef196

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

afero, logs since, refactor

2a28dd4

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

vemelin-epm added 2 commits May 27, 2020 14:52

refactor print

f8ce16a

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

check diag directory, rename collecting helpers

dbebfc6

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

vemelin-epm requested review from nfnt, zen-dog and ANeumann82 May 27, 2020 16:31

zmalik reviewed May 27, 2020

ANeumann82 reviewed May 28, 2020

ANeumann82 requested changes May 28, 2020

vemelin-epm and others added 8 commits May 28, 2020 15:31

use sigs yaml, fix linting

077fb53

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

PR update: respond to code review

688af9c

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

remove ghodss from mod file

5e5f32a

Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>

Small cleanup from code review

fd9864a

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>

Reworked collectors and runner

2a439b9

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>

Another small cleanup

71a20f1

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>

Merge branch 'master' into ve/simplified-diagnostics

a253535

Fixed test

fdc6dd2

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>

ANeumann82 approved these changes Jun 2, 2020

vemelin-epm commented Jun 2, 2020

Fixed missing return

0be876c

Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>

zen-dog reviewed Jun 3, 2020

zen-dog reviewed Jun 3, 2020

zen-dog approved these changes Jun 3, 2020

ANeumann82 changed the title ~~simplified diagnostics~~ KEP-22: Diagnostic Bundle Collection Jun 3, 2020

nfnt approved these changes Jun 9, 2020

ANeumann82 merged commit a0a7e34 into master Jun 9, 2020

ANeumann82 deleted the ve/simplified-diagnostics branch June 9, 2020 10:14
