Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-22: Diagnostic Bundle Collection #1506

Merged
merged 48 commits into from
Jun 9, 2020
Merged

Conversation

vemelin-epm
Copy link
Contributor

@vemelin-epm vemelin-epm commented May 12, 2020

What this PR does / why we need it:

Add support for collecting diagnostic data from the KUDO manager and an installed operator instance.

Includes:

  • the resources that are created by KUDO and therefore marked by the instance-specific set of labels
  • logs of deployed pods
  • KUDO manager resources
  • KUDO manager logs

All collected resources are stored in a directory structure

Fixes: D2IQ-63414

@vemelin-epm
Copy link
Contributor Author

vemelin-epm commented May 12, 2020

This is a simplified diagnostics version
Collect only the resources created by KUDO (filter by labels)
Flat structure under the instance
WIP status is due to missing tests
Current output:

diag
├── kudo
│   ├── pod_kudo-controller-manager-0
│   │   ├── kudo-controller-manager-0.log.gz
│   │   └── kudo-controller-manager-0.yaml
│   ├── serviceaccountlist.yaml
│   ├── servicelist.yaml
│   └── statefulsetlist.yaml
├── operator_zookeeper
│   ├── instance_zk
│   │   ├── pod_zk-zookeeper-0
│   │   │   ├── zk-zookeeper-0.log.gz
│   │   │   └── zk-zookeeper-0.yaml
│   │   ├── pod_zk-zookeeper-1
│   │   │   ├── zk-zookeeper-1.log.gz
│   │   │   └── zk-zookeeper-1.yaml
│   │   ├── pod_zk-zookeeper-2
│   │   │   ├── zk-zookeeper-2.log.gz
│   │   │   └── zk-zookeeper-2.yaml
│   │   ├── servicelist.yaml
│   │   ├── statefulsetlist.yaml
│   │   └── zk.yaml
│   ├── operatorversion_zookeeper-0.3.0
│   │   └── zookeeper-0.3.0.yaml
│   └── zookeeper.yaml
├── settings.yaml
└── version.yaml

@vemelin-epm vemelin-epm requested review from porridge, ANeumann82 and zmalik and removed request for alenkacz May 12, 2020 00:51
@vemelin-epm vemelin-epm changed the title simplified diagnostics WIP: simplified diagnostics May 12, 2020
Copy link
Member

@ANeumann82 ANeumann82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally much nicer! It's less complicated, still not easy (for me) to read, but acceptable. Maybe it's just a complex problem :)

I've added some nits and suggestions, my main issues at the moment:

  • Error handling: I've started to add suggestions, but generally, return err should not happen. There are some cases where it's ok, but usually they should provide a bit more context, otherwise it's hard to find where an error originated, as stacktraces are not printed by default.
  • Documentation: A lot more details about the intention on interfaces, a bit more details on the structs and Fns would go a long way helping to understand the code.
  • Some naming, although I'm not always sure there's a better naming scheme. I've added some suggestions where I thought strongly about it.

go.mod Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/builder.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/diagnostics.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/diagnostics.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/print.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/print.go Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/print.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/print.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/print.go Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/print.go Outdated Show resolved Hide resolved
go.mod Outdated Show resolved Hide resolved
Copy link
Member

@nfnt nfnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already in a great shape, it was much easier for me to grasp what's going on. I agree with what @ANeumann82 said: Please provide good documentation around the various abstractions used here. That said, while I agree with some abstractions, e.g. the Printable interface, others seem a bit too much; though I probably would come up with the same ones for this task 😆. We just have to keep in mind that function pointers, while very convenient, are hard to follow through the code. And with that I don't mean your code but any code. Hence, it would be great to describe the usage of the various function pointers in detail.

pkg/kudoctl/cmd/diagnostics/builder.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/diagnostics.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/diagnostics.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/diagnostics.go Outdated Show resolved Hide resolved
Copy link
Contributor

@zen-dog zen-dog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass. Mostly wondering about the need for Printable and PrintableList types (see my comment)

pkg/kudoctl/cmd/diagnostics/diagnostics.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/diagnostics.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/diagnostics.go Outdated Show resolved Hide resolved
@vemelin-epm vemelin-epm force-pushed the ve/simplified-diagnostics branch from d160a0a to 20296de Compare May 12, 2020 12:58
@vemelin-epm vemelin-epm requested a review from zen-dog May 14, 2020 01:41
pkg/kudoctl/cmd/diagnostics/collectors.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/print.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/print.go Outdated Show resolved Hide resolved
@vemelin-epm
Copy link
Contributor Author

vemelin-epm commented May 14, 2020

Updates aiming to meet this request from @zen-dog #1506 (comment) brought essential code changes.

  • Printable abstraction removed
  • calls to collection processing context and resource configuration exposed instead of being implicitly provided via the builder
  • builder simplified to a plain sequential runner
  • callbacks modifying the shared context are removed from resource loader and made explicit
  • documentation added
  • error handling updated

Error treatment explained in brief:

  • resource provider errors, required explicitly to be fatal in a runner sequence are breaking the sequence. missing resource in this case is also treated as an error
  • resource provider errors are printed where otherwise the requested resource or log would be
  • fatal errors and printing errors are grouped and returned from the "main" Collect function

Example output when a client returned an error for pods request:

diag
├── operator_zookeeper
│   ├── instance_zk
│   │   ├── pod.err
│   │   ├── servicelist.yaml
│   │   ├── statefulsetlist.yaml
│   │   └── zk.yaml
│   ├── operatorversion_zookeeper-0.3.0
│   │   └── zookeeper-0.3.0.yaml
│   └── zookeeper.yaml
...

Copy link
Contributor

@zen-dog zen-dog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work and quite an improvement from the previous version. However, I'm not sold on the generic ResourceCollector that can be configured to collect any kind of resource. The interface feels clunky, you need runtime casting and even had to copy type Object interface from the apimachinery internals.
It seems that we could rather have a bunch of dedicated collectors e.g. RbacCollector or StatefulSetCollector. Yes, it will be less generic, and, frankly, I don't mind. I'd rather live with some (hopefully minor) code repetition than having one generic rule-them-all-interface. And all shared information (like path generation from the instance name) can leave in a context.

pkg/kudoctl/cmd/diagnostics.go Outdated Show resolved Hide resolved
cmd := &cobra.Command{
Use: "diagnostics",
Short: "diagnostics",
Long: "diagnostics command has sub-commands to collect and analyze diagnostics data",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"diagnostics command has sub-commands"? Not sure what you mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you mean. Not every pattern found in the code base is good 😉 Also, it currently has only one sub-command.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But consistency should be more important here, as this is user-facing. Getting rid of bad patterns we inherit from is a separate issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zen-dog @nfnt I am stuck here.

pkg/kudoctl/cmd/diagnostics.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/print.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/collectors.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/runner.go Outdated Show resolved Hide resolved
return fmt.Sprintf("%s/instance_%s", ctx.attachToOperator(), ctx.instanceName)
}

func (ctx *processingContext) mustSetOperatorNameFromOperatorVersion(o runtime.Object) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, this method name is too long, even by Java's standards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a trade-off for generic runtime.Object.I chose a fully descriptive name for a callback. It seemed to me better than adding more code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (ctx *processingContext) mustSetOperatorNameFromOperatorVersion(o runtime.Object) {
func (ctx *processingContext) setOperatorNameFromOperatorVersion(o runtime.Object) {

Same here: I don't understand the "must" prefix, having just the "set..." part makes it a lot clearer on invocation:

callback:       ctx.setOperatorNameFromOperatorVersion,

This tells me: Ah, the callback will set the operator name from the operator version. Nice. :)

btw, I don't mind long descriptive names

pkg/kudoctl/cmd/diagnostics/resource_funcs.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/collectors.go Outdated Show resolved Hide resolved
pkg/kudoctl/cmd/diagnostics/diagnostics.go Outdated Show resolved Hide resolved
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
@vemelin-epm
Copy link
Contributor Author

vemelin-epm commented May 27, 2020

An approximate further development to add diagnostics for dependencies after KEP-29 is implemented can be found here
appended collection sequence
collector
branch actually works if Kafka is installed according to the instruction and then Zookeeper Instance is updated with OwnerReference to Kafka instance
@zen-dog especially FYI
POC branch (not this branch) output for Kafka with dependency on zookeeper

diag
├── kudo
│   ├── pod_kudo-controller-manager-0
│   │   ├── kudo-controller-manager-0.yaml
│   │   └── manager.log.gz
│   ├── serviceaccountlist.yaml
│   ├── servicelist.yaml
│   └── statefulsetlist.yaml
├── operator_kafka
│   ├── instance_kafka-instance
│   │   ├── kafka-instance.yaml
│   │   ├── operator_zookeeper
│   │   │   ├── instance_zookeeper-instance
│   │   │   │   ├── pod_zookeeper-instance-zookeeper-0
│   │   │   │   │   ├── kubernetes-zookeeper.log.gz
│   │   │   │   │   └── zookeeper-instance-zookeeper-0.yaml
│   │   │   │   ├── pod_zookeeper-instance-zookeeper-1
│   │   │   │   │   ├── kubernetes-zookeeper.log.gz
│   │   │   │   │   └── zookeeper-instance-zookeeper-1.yaml
│   │   │   │   ├── pod_zookeeper-instance-zookeeper-2
│   │   │   │   │   ├── kubernetes-zookeeper.log.gz
│   │   │   │   │   └── zookeeper-instance-zookeeper-2.yaml
│   │   │   │   ├── servicelist.yaml
│   │   │   │   ├── statefulsetlist.yaml
│   │   │   │   └── zookeeper-instance.yaml
│   │   │   ├── operatorversion_zookeeper-0.3.0
│   │   │   │   └── zookeeper-0.3.0.yaml
│   │   │   └── zookeeper.yaml
│   │   ├── pod_kafka-instance-kafka-0
│   │   │   ├── k8skafka.log.gz
│   │   │   ├── kafka-instance-kafka-0.yaml
│   │   │   └── kafka-node-exporter.log.gz
│   │   ├── pod_kafka-instance-kafka-1
│   │   │   ├── k8skafka.log.gz
│   │   │   ├── kafka-instance-kafka-1.yaml
│   │   │   └── kafka-node-exporter.log.gz
│   │   ├── pod_kafka-instance-kafka-2
│   │   │   ├── k8skafka.log.gz
│   │   │   ├── kafka-instance-kafka-2.yaml
│   │   │   └── kafka-node-exporter.log.gz
│   │   ├── rolebindinglist.yaml
│   │   ├── rolelist.yaml
│   │   ├── serviceaccountlist.yaml
│   │   ├── servicelist.yaml
│   │   └── statefulsetlist.yaml
│   ├── kafka.yaml
│   └── operatorversion_kafka-1.2.1
│       └── kafka-1.2.1.yaml
├── settings.yaml
└── version.yaml

c.printer.printError(err, filepath.Join(c.parentDir(), fmt.Sprintf("pod_%s", pod.Name)), fmt.Sprintf("%s.log", container.Name))
} else {
c.printer.printLog(log, filepath.Join(c.parentDir(), fmt.Sprintf("pod_%s", pod.Name)), container.Name)
_ = log.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the blank identifier here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be good enough to return the error? if its nil, its fine and otherwise we will have the error in return

Copy link
Contributor Author

@vemelin-epm vemelin-epm May 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error treatment in my code is probably not quite transparent.
The problem is that we have:

  • errors after which we have to stop the collection sequence (note that we're also likely to have embedded sequences with dependency instances), e.g. no OperatorVersion.
  • errors when retrieving not so critical resources: print to file and continue
  • errors while printing: I chose to accumulate them into the printer

In addition, we want the errors for resources to be printed where the failed resources would be.
I put some explanation here #1506 (comment), though I think it's still not clear enough.
So, in this case: the error with logs probably cannot be fatal, so I think this collector should always return nil.
Should we print error of closing the log stream? I just thought this error is irrelevant.

Comment on lines +46 to +47
cmd.Flags().StringVar(&instance, "instance", "", "The instance name.")
cmd.Flags().DurationVar(&logSince, "log-since", 0, "Only return logs newer than a relative duration like 5s, 2m, or 3h. Defaults to all logs.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed: We should probably have an (optional) parameter for the target output directory, right? Can be added in a later PR though..

Copy link
Member

@ANeumann82 ANeumann82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. The only blocker for me is the change to the kudo.Client. I don't really see a good reason to inline the kubeClientset

pkg/kudoctl/util/kudo/kudo.go Outdated Show resolved Hide resolved
vemelin-epm and others added 8 commits May 28, 2020 15:31
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
loadResourceFn func() (runtime.Object, error)
name string // object kind used to describe the error
parentDir stringGetter // parent dir to attach the printer's output
failOnError bool // define whether the collector should return the error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need failOnError parameter? The only case when collector that should fail on error is when collecting an Instance resource. Can we just hardcode it?

@@ -0,0 +1,132 @@
package diagnostics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably extract each collector in its own small file and put them all into the diagnostics/collectors. Wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally: Yes :) But at the moment, the whole code is in a single package, this would require a full restructuring because of circular dependencies...

func runnerForInstance(ir *resourceFuncsConfig, ctx *processingContext) *runner {
r := &runner{}

instance := resourceCollectorGroup{[]resourceCollector{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think that we would be better off with a dedicated high-level collector like InstanceCollector than with a generic resourceCollectorGroup. But this might be a topic for a later refactoring

name: "role",
parentDir: ctx.instanceDirectory,
printMode: RuntimeObject})
r.addCollector(&logsCollector{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another reason why a dedicated PodCollector might be better: right now the complexity is hidden in the fact that logCollector has to be last and it depends on the ctx.podList. A PodCollector would collect both: pod.yaml and pod.log as an "atomic" unit.

Copy link
Contributor

@zen-dog zen-dog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments, none of them are blocking (though dedicated collectors are at the top of my personal wish list). Nice work @vemelin-epm (and @ANeumann82)!

Please give this PR a changelog worthy title and description - this is a nice feature and we definitely should highlight it in the release log.

@ANeumann82 ANeumann82 changed the title simplified diagnostics KEP-22: Diagnostic Bundle Collection Jun 3, 2020
Copy link
Member

@nfnt nfnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ANeumann82 ANeumann82 merged commit a0a7e34 into master Jun 9, 2020
@ANeumann82 ANeumann82 deleted the ve/simplified-diagnostics branch June 9, 2020 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants