-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-22: Diagnostic Bundle Collection #1506
Conversation
This is a simplified diagnostics version
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally much nicer! It's less complicated, still not easy (for me) to read, but acceptable. Maybe it's just a complex problem :)
I've added some nits and suggestions, my main issues at the moment:
- Error handling: I've started to add suggestions, but generally,
return err
should not happen. There are some cases where it's ok, but usually they should provide a bit more context, otherwise it's hard to find where an error originated, as stacktraces are not printed by default. - Documentation: A lot more details about the intention on interfaces, a bit more details on the structs and Fns would go a long way helping to understand the code.
- Some naming, although I'm not always sure there's a better naming scheme. I've added some suggestions where I thought strongly about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already in a great shape, it was much easier for me to grasp what's going on. I agree with what @ANeumann82 said: Please provide good documentation around the various abstractions used here. That said, while I agree with some abstractions, e.g. the Printable
interface, others seem a bit too much; though I probably would come up with the same ones for this task 😆. We just have to keep in mind that function pointers, while very convenient, are hard to follow through the code. And with that I don't mean your code but any code. Hence, it would be great to describe the usage of the various function pointers in detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass. Mostly wondering about the need for Printable
and PrintableList
types (see my comment)
d160a0a
to
20296de
Compare
Updates aiming to meet this request from @zen-dog #1506 (comment) brought essential code changes.
Error treatment explained in brief:
Example output when a client returned an error for pods request:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work and quite an improvement from the previous version. However, I'm not sold on the generic ResourceCollector
that can be configured to collect any kind of resource. The interface feels clunky, you need runtime casting and even had to copy type Object interface
from the apimachinery internals.
It seems that we could rather have a bunch of dedicated collectors e.g. RbacCollector
or StatefulSetCollector
. Yes, it will be less generic, and, frankly, I don't mind. I'd rather live with some (hopefully minor) code repetition than having one generic rule-them-all-interface. And all shared information (like path generation from the instance name) can leave in a context.
pkg/kudoctl/cmd/diagnostics.go
Outdated
cmd := &cobra.Command{ | ||
Use: "diagnostics", | ||
Short: "diagnostics", | ||
Long: "diagnostics command has sub-commands to collect and analyze diagnostics data", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"diagnostics command has sub-commands"? Not sure what you mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pattern taken from here https://github.com/kudobuilder/kudo/blob/master/pkg/kudoctl/cmd/plan.go#L30
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what you mean. Not every pattern found in the code base is good 😉 Also, it currently has only one sub-command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But consistency should be more important here, as this is user-facing. Getting rid of bad patterns we inherit from is a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Sprintf("%s/instance_%s", ctx.attachToOperator(), ctx.instanceName) | ||
} | ||
|
||
func (ctx *processingContext) mustSetOperatorNameFromOperatorVersion(o runtime.Object) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol, this method name is too long, even by Java's standards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a trade-off for generic runtime.Object.I chose a fully descriptive name for a callback. It seemed to me better than adding more code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (ctx *processingContext) mustSetOperatorNameFromOperatorVersion(o runtime.Object) { | |
func (ctx *processingContext) setOperatorNameFromOperatorVersion(o runtime.Object) { |
Same here: I don't understand the "must" prefix, having just the "set..." part makes it a lot clearer on invocation:
callback: ctx.setOperatorNameFromOperatorVersion,
This tells me: Ah, the callback will set the operator name from the operator version. Nice. :)
btw, I don't mind long descriptive names
082294c
to
bb0c7d3
Compare
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
An approximate further development to add diagnostics for
|
c.printer.printError(err, filepath.Join(c.parentDir(), fmt.Sprintf("pod_%s", pod.Name)), fmt.Sprintf("%s.log", container.Name)) | ||
} else { | ||
c.printer.printLog(log, filepath.Join(c.parentDir(), fmt.Sprintf("pod_%s", pod.Name)), container.Name) | ||
_ = log.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the blank identifier here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be good enough to return the error? if its nil, its fine and otherwise we will have the error in return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error treatment in my code is probably not quite transparent.
The problem is that we have:
- errors after which we have to stop the collection sequence (note that we're also likely to have embedded sequences with dependency instances), e.g. no OperatorVersion.
- errors when retrieving not so critical resources: print to file and continue
- errors while printing: I chose to accumulate them into the printer
In addition, we want the errors for resources to be printed where the failed resources would be.
I put some explanation here #1506 (comment), though I think it's still not clear enough.
So, in this case: the error with logs probably cannot be fatal, so I think this collector should always return nil.
Should we print error of closing the log stream? I just thought this error is irrelevant.
cmd.Flags().StringVar(&instance, "instance", "", "The instance name.") | ||
cmd.Flags().DurationVar(&logSince, "log-since", 0, "Only return logs newer than a relative duration like 5s, 2m, or 3h. Defaults to all logs.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed: We should probably have an (optional) parameter for the target output directory, right? Can be added in a later PR though..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. The only blocker for me is the change to the kudo.Client
. I don't really see a good reason to inline the kubeClientset
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Vasilii Emelin <vasilii_emelin@epam.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
Signed-off-by: Andreas Neumann <aneumann@mesosphere.com>
loadResourceFn func() (runtime.Object, error) | ||
name string // object kind used to describe the error | ||
parentDir stringGetter // parent dir to attach the printer's output | ||
failOnError bool // define whether the collector should return the error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need failOnError
parameter? The only case when collector that should fail on error is when collecting an Instance
resource. Can we just hardcode it?
@@ -0,0 +1,132 @@ | |||
package diagnostics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably extract each collector in its own small file and put them all into the diagnostics/collectors
. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally: Yes :) But at the moment, the whole code is in a single package, this would require a full restructuring because of circular dependencies...
func runnerForInstance(ir *resourceFuncsConfig, ctx *processingContext) *runner { | ||
r := &runner{} | ||
|
||
instance := resourceCollectorGroup{[]resourceCollector{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think that we would be better off with a dedicated high-level collector like InstanceCollector
than with a generic resourceCollectorGroup
. But this might be a topic for a later refactoring
name: "role", | ||
parentDir: ctx.instanceDirectory, | ||
printMode: RuntimeObject}) | ||
r.addCollector(&logsCollector{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another reason why a dedicated PodCollector
might be better: right now the complexity is hidden in the fact that logCollector
has to be last and it depends on the ctx.podList
. A PodCollector
would collect both: pod.yaml
and pod.log
as an "atomic" unit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments, none of them are blocking (though dedicated collectors are at the top of my personal wish list). Nice work @vemelin-epm (and @ANeumann82)!
Please give this PR a changelog worthy title and description - this is a nice feature and we definitely should highlight it in the release log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
What this PR does / why we need it:
Add support for collecting diagnostic data from the KUDO manager and an installed operator instance.
Includes:
All collected resources are stored in a directory structure
Fixes: D2IQ-63414