
RFC: structured, contextual logging #639

Closed
pohly wants to merge 2 commits from the structured-logging branch

Conversation

pohly
Contributor

@pohly pohly commented May 31, 2021

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

Embracing go-logr as logger has several advantages:

  • adding a certain prefix and/or set of objects can be done consistently with less code duplication
  • it's even possible for the caller to do that, something that doesn't work when the prefix is part of the format string
  • in Go tests, all output can be associated with the currently running test via RFC: testinglogger: per-test, structured logging kubernetes/klog#240

The latter was needed to debug #638
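
To make the first two points concrete, here is a minimal sketch (not code from this PR) of the pattern: the caller attaches a name and key/value pairs once with WithName/WithValues, and everything logged further down the call chain carries that context automatically. The stdr backend is only an assumption to keep the sketch self-contained; any logr.Logger works.

    package main

    import (
        "log"
        "os"

        "github.com/go-logr/logr"
        "github.com/go-logr/stdr"
    )

    // addWorkItem does not know who called it; it only logs "enqueuing".
    func addWorkItem(logger logr.Logger, item string) {
        logger.Info("enqueuing", "workitem", item)
    }

    func onSCAddOrUpdate(logger logr.Logger, scName string) {
        // The prefix and the storage class are attached once here instead of
        // being repeated in every format string.
        logger = logger.WithName("onSCAddOrUpdate").WithValues("storageclass", scName)
        logger.Info("updated or added")
        addWorkItem(logger, "item-for-"+scName)
    }

    func main() {
        logger := stdr.New(log.New(os.Stderr, "", log.LstdFlags))
        onSCAddOrUpdate(logger, "triple-sc")
    }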

Does this PR introduce a user-facing change?:

Log messages are structured (fixed message, additional information in key/value pairs).

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 31, 2021
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 31, 2021
@pohly pohly force-pushed the structured-logging branch 2 times, most recently from a3d1606 to 6f3e4f8 on May 31, 2021 11:30
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2021
pkg/capacity/capacity.go Outdated
go.mod Outdated
@@ -61,3 +61,6 @@ replace k8s.io/component-base => k8s.io/component-base v0.21.0
replace k8s.io/component-helpers => k8s.io/component-helpers v0.21.0

replace k8s.io/csi-translation-lib => k8s.io/csi-translation-lib v0.21.0

// WIP
replace k8s.io/klog/v2 => github.com/pohly/klog/v2 v2.4.1-0.20210527141230-ac596814502c
Contributor

What's the plan for this?

Contributor Author

I proposed to have the code in klog: kubernetes/klog#240

It's currently on hold because the logr API changes need to be dealt with first.


klog.Info("Started Capacity Controller")
logger.Info("started controller")
Contributor

What's the reason for starting an info log message with lower case?

Contributor Author

It's not a full sentence, so an initial capital letter looked odd. I don't know whether there is any guidance on this.

It's also consistent with error messages. For those the official guidance is to start with lower case because the error might get wrapped.

Contributor

I'd like to see if the structured logging KEP has any official guidance on this. The official guidance for error messages only refers to returned errors, not error messages in the logs.

Contributor Author

The logr example uses lower case, incidentally also with a "starting" message:
https://github.com/go-logr/logr#typical-usage

But the Kubernetes documentation says "Start from a capital letter": https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/migration-to-structured-logging.md#remove-string-formatting-from-log-message

It also has some other recommendations. Once we agree to go further with this and the klog PR is merged, I'll revisit the log messages and update them accordingly.


klog.V(3).Infof("Capacity Controller: storage class %s was removed", sc.Name)
logger.V(3).Info("removed")
Contributor

Can you show an example of the output of this changed log msg vs the original log msg?

Contributor Author

@pohly pohly Jun 8, 2021

Looks like I have a coverage gap in capacity_test.go - onSCDelete is not called. Will fix.

In the meantime, here's the corresponding output from onSCAddOrUpdate, as printed by go test. With these changes:

capacity.go:373: INFO onSCAddOrUpdate: updated or added storageclass="triple-sc"
...
capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="triple-sc" workitem={segment:0x27ef430 storageClassName:triple-sc}

Without them:

I0608 17:48:40.721547  866697 capacity.go:361] Capacity Controller: storage class triple-sc was updated or added
...
I0608 17:48:40.721526  866697 capacity.go:468] Capacity Controller: enqueuing {segment:0x27e7350 storageClassName:triple-sc}

Note that the "enqueuing" messages without this PR lacks context. It's not clear why addWorkItem was called. With contextual logging, the onSCAddOrUpdate function name and the storage class get passed down and are added to the log message.

This is important once things start to happen in parallel. When everything is sequential, one can read the log from top to bottom and remember which values were logged earlier. But when run in parallel, it's not clear whether log output actually follows the one printed directly above it.

This is a problem in our driver logs where it is hard to associate a gRPC error response with the corresponding call.
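
As a rough illustration of how contextual logging helps with that, here is a sketch assuming the klog.FromContext/klog.NewContext helpers introduced by the contextual logging work; the handler and names below are made up, not actual driver code:

    package main

    import (
        "context"

        "k8s.io/klog/v2"
    )

    // A gRPC-style handler stores a per-call logger in the context ...
    func handleCreateVolume(ctx context.Context, volumeName string) {
        logger := klog.FromContext(ctx).WithName("CreateVolume").WithValues("volume", volumeName)
        ctx = klog.NewContext(ctx, logger)
        doWork(ctx)
    }

    // ... and code deep in the call chain retrieves it, so its messages can be
    // associated with the originating call.
    func doWork(ctx context.Context) {
        klog.FromContext(ctx).Info("provisioning failed, will retry")
    }

    func main() {
        defer klog.Flush()
        handleCreateVolume(context.Background(), "pvc-123")
    }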

Contributor Author

Here's the log output for a complete testcase:

=== RUN   TestRefresh
=== RUN   TestRefresh/truncated_topology
=== PAUSE TestRefresh/truncated_topology
=== CONT  TestRefresh/truncated_topology
    capacity.go:338: INFO onTopologyChanges: topology changed added=[0x27ef430 = layer0: foo+ layer1: X+ layer2: A 0x27ef450 = layer0: foo+ layer1: X+ layer2: B] removed=[]
    capacity.go:373: INFO onSCAddOrUpdate: updated or added storageclass="direct-sc"
    capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="direct-sc" workitem={segment:0x27ef430 storageClassName:direct-sc}
    capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="direct-sc" workitem={segment:0x27ef450 storageClassName:direct-sc}
    capacity.go:373: INFO onSCAddOrUpdate: updated or added storageclass="triple-sc"
    capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="triple-sc" workitem={segment:0x27ef430 storageClassName:triple-sc}
    capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="triple-sc" workitem={segment:0x27ef450 storageClassName:triple-sc}
    capacity.go:338: INFO onTopologyChanges: topology changed added=[0x27ef430 = layer0: foo+ layer1: X+ layer2: A 0x27ef450 = layer0: foo+ layer1: X+ layer2: B] removed=[]
    capacity.go:481: INFO onTopologyChanges: enqueuing workitem={segment:0x27ef430 storageClassName:direct-sc}
    capacity.go:481: INFO onTopologyChanges: enqueuing workitem={segment:0x27ef450 storageClassName:direct-sc}
    capacity.go:481: INFO onTopologyChanges: enqueuing workitem={segment:0x27ef430 storageClassName:triple-sc}
    capacity.go:481: INFO onTopologyChanges: enqueuing workitem={segment:0x27ef450 storageClassName:triple-sc}
    capacity.go:276: INFO prepare: initial state topology segments=2 storage classes=2 potential CSIStorageCapacity objects=4
    capacity.go:288: INFO prepare: checking for existing CSIStorageCapacity objects
--- PASS: TestRefresh (0.00s)

Note that this would be impossible to do without this PR because log output from different test cases would be mixed.
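
For reference, a minimal sketch of how such a per-test logger can be set up with the ktesting helper that grew out of kubernetes/klog#240 (the exact package name and API were still under discussion at the time, so treat the details as an assumption):

    package capacity_test

    import (
        "testing"

        "k8s.io/klog/v2/ktesting"
    )

    func TestRefresh(t *testing.T) {
        // The logger writes through t.Log, so every message is attributed to
        // this test even when tests run in parallel.
        logger, ctx := ktesting.NewTestContext(t)
        logger.Info("prepare: initial state", "topology segments", 2, "storage classes", 2)
        _ = ctx // would normally be passed into the controller under test
    }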

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 7, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 7, 2021
@pohly
Contributor Author

pohly commented Oct 7, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 7, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 4, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 23, 2022
The k8s.io/component-base/logs API is used to add several new command line
flags and the corresponding implementation:

   --feature-gates:
         ContextualLogging=true|false (ALPHA - default=true)
         LoggingAlphaOptions=true|false (ALPHA - default=false)
         LoggingBetaOptions=true|false (BETA - default=true)

   --log-flush-frequency duration              Maximum number of seconds between log flushes (default 5s)
   --log-json-info-buffer-size quantity        [Alpha] In JSON format with split output streams, the info messages can be buffered for a while to increase performance. The default value of zero bytes disables buffering. The size can be specified as number of bytes (512), multiples of 1000 (1K), multiples of 1024 (2Ki), or powers of those (3M, 4G, 5Mi, 6Gi). Enable the LoggingAlphaOptions feature gate to use this.
   --log-json-split-stream                     [Alpha] In JSON format, write error messages to stderr and info messages to stdout. The default is to write a single stream to stdout. Enable the LoggingAlphaOptions feature gate to use this.
   --logging-format string                     Sets the log format. Permitted formats: "json" (gated by LoggingBetaOptions), "text". (default "text")

In contrast to the defaults in the (pretty conservative) Kubernetes components,
contextual logging gets enabled by default. That has the advantage that code
can be rewritten with the assumption that WithValues and WithName calls really
have an effect. Users can still disable the feature, but logs will be less
informative in that case.
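
For reference, the wiring for these flags typically looks roughly like the sketch below, assuming the k8s.io/component-base/logs/api/v1 helpers (the actual code in this PR may differ in the details):

    package main

    import (
        "os"

        "github.com/spf13/pflag"
        "k8s.io/component-base/featuregate"
        logsapi "k8s.io/component-base/logs/api/v1"
        _ "k8s.io/component-base/logs/json/register" // makes the "json" format available
        "k8s.io/klog/v2"
    )

    func main() {
        fg := featuregate.NewFeatureGate()
        c := logsapi.NewLoggingConfiguration()

        // Registers ContextualLogging, LoggingAlphaOptions, LoggingBetaOptions.
        if err := logsapi.AddFeatureGates(fg); err != nil {
            klog.ErrorS(err, "adding logging feature gates")
            os.Exit(1)
        }

        fs := pflag.CommandLine
        fg.AddFlag(fs)          // --feature-gates
        logsapi.AddFlags(c, fs) // --logging-format, --log-flush-frequency, ...
        pflag.Parse()

        // Applies the configuration, respecting the feature gates.
        if err := logsapi.ValidateAndApply(c, fg); err != nil {
            klog.ErrorS(err, "invalid logging configuration")
            os.Exit(1)
        }
    }
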
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 22, 2023
@pohly
Contributor Author

pohly commented Jan 22, 2023

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 22, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2023
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 4, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 3, 2023
@pohly
Contributor Author

pohly commented Jun 5, 2023

/remove-lifecycle rotten

I was looking for a volunteer to continue with this, but so far without luck. I'll probably finish this myself.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 5, 2023
@bells17
Contributor

bells17 commented Aug 12, 2023

Hi @pohly

I was looking for a volunteer to continue with this, but so far without luck. I'll probably finish this myself.

Would it be acceptable if I were to take on this development task?

@pohly
Contributor Author

pohly commented Aug 21, 2023

@bells17: help with this would be very welcome. Feel free to take my branch, rebase it and continue in a new PR.

I think with this PR and kubernetes-csi/node-driver-registrar#259 it is technically clear how to use component-base/logs. The rest of the conversion can be done as described in https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/migration-to-structured-logging.md

@k8s-ci-robot
Contributor

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-csi-external-provisioner-1-22-on-kubernetes-1-22 c4a1889 link true /test pull-kubernetes-csi-external-provisioner-1-22-on-kubernetes-1-22
pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-21 c4a1889 link true /test pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-21
pull-kubernetes-csi-external-provisioner-1-23-on-kubernetes-1-23 c4a1889 link false /test pull-kubernetes-csi-external-provisioner-1-23-on-kubernetes-1-23
pull-kubernetes-csi-external-provisioner-1-21-on-kubernetes-1-21 c4a1889 link true /test pull-kubernetes-csi-external-provisioner-1-21-on-kubernetes-1-21
pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-23 c4a1889 link true /test pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-23
pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-24 c4a1889 link true /test pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-24
pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-25 c4a1889 link true /test pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-25
pull-kubernetes-csi-external-provisioner-1-27-on-kubernetes-1-27 a2d6a4d link true /test pull-kubernetes-csi-external-provisioner-1-27-on-kubernetes-1-27

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • release-note Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.