
KEP-3077: contextual logging: promotion to beta in 1.30 #4219

Merged (3 commits) Feb 7, 2024

Conversation

@pohly (Contributor) commented Sep 18, 2023

  • One-line PR description: update to reflect changes in the ecosystem

  • Issue link: contextual logging #3077

  • Other comments: some of the PRs linked here are not merged yet because support for slog is still new; they are expected to be merged before 1.29

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Sep 18, 2023
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 18, 2023
@@ -870,7 +912,7 @@ logging.

#### Beta

- All of kube-scheduler (in-tree) and CSI external-provisioner (out-of-tree) converted
- [All of kube-controller-manager](https://github.com/kubernetes/kubernetes/pull/119250) and some [parts of kube-scheduler](https://github.com/kubernetes/kubernetes/pull/115588) converted (in-tree), conversion of out-of-tree components possible, whether they use pflag ([external-provisioner](https://github.com/kubernetes-csi/external-provisioner/pull/639)) or plain Go flags ([node-driver-registrar](https://github.com/kubernetes-csi/node-driver-registrar/pull/259))
@pohly (Contributor, Author) commented Sep 18, 2023

We ended up focusing on kube-controller-manager instead of kube-scheduler because that work was easier to split up. Conversion of external components is technically feasible, but was delayed by lack of developer time. Contributors are willing to take on that part now and have already started (1, 2).

Regarding the next points (cannot comment on it in GitHub):

> Gathered feedback from developers and surveys

There has been steady interest in this, as seen from questions on Slack and good attendance of the KubeCon maintainer track sessions about logging. We've not done a survey, and I am not sure what exactly I would ask in it. May I perhaps remove that part?

> New APIs in k8s.io/klog/v2 no longer marked as experimental

EDIT: Done in klog update for 1.29.

A Member commented:

Just sharing some experience from using contextual logging with Cluster API.

In Cluster API we use component-base/logs to set up the log flags and the text/JSON loggers. Then we pass the logger from klog.Background() into controller-runtime.

controller-runtime ensures that this logger (plus a few additional k/v pairs) is available in all contexts passed into our controllers and webhooks. In the controllers/webhooks we use go-logr/logr.FromContext/NewContext to pass around the logger (controller-runtime has a thin wrapper around these funcs).
In Cluster API contextual logging is always enabled. Because we're using the CR FromContext/NewContext wrappers there is also no way to turn it off.

So we use large parts of the implementation and concepts of upstream Kubernetes contextual logging and align as much as we can to upstream conventions.

We have been using this in Cluster API for a while and it has worked pretty well for us, without any issues.
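
For illustration, here is a minimal sketch of that setup with a typical controller-runtime reconciler. The names (MachineReconciler, reconcileMachine) are placeholders, not Cluster API's actual code:

```go
package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"

	"k8s.io/klog/v2"
)

// MachineReconciler is a hypothetical reconciler, used only to illustrate the
// pattern described above.
type MachineReconciler struct{}

func (r *MachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// log.FromContext is controller-runtime's thin wrapper around logr.FromContext.
	// It returns the logger seeded in main plus any key/value pairs added upstream.
	logger := log.FromContext(ctx, "machine", req.NamespacedName)

	// log.IntoContext wraps logr.NewContext: callees receive the enriched logger.
	ctx = log.IntoContext(ctx, logger)

	logger.Info("Reconciling")
	reconcileMachine(ctx)
	return ctrl.Result{}, nil
}

func reconcileMachine(ctx context.Context) {
	// Anywhere downstream, the logger (with the "machine" key) comes back out of the context.
	log.FromContext(ctx).Info("doing the actual work")
}

func main() {
	// Pass klog's background logger into controller-runtime, as described above.
	ctrl.SetLogger(klog.Background())
}
```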

A Contributor commented:

Do you have any estimates around how much work is left in kube-scheduler? Also what about kubelet and kube-apiserver and cloud-controller-manager? I'd like to see explicit information about each at beta.

@pohly (Contributor, Author):

I'm tracking that in the description of #3077.

Shall I copy a snapshot of that information into the KEP?

kube-scheduler and kube-controller-manager are done and only need to be touched again when we make API changes elsewhere, for example in component-base or client-go.

kubelet has around 2000 calls which need to be updated, kube-proxy 400, kube-apiserver 660.

I've not checked cmd/cloud-controller-manager, but it seems to be small.

A Contributor commented:

That's fine, I didn't look there. No need to put that information in the KEP in that case.

@dashpole (Contributor):
/assign @logicalhan

@pohly pohly changed the title KEP-3077: contextual logging: bump to beta KEP-3077: contextual logging: slog update for 1.29 Oct 5, 2023
@pohly (Contributor, Author) commented Oct 5, 2023

This PR is no longer targeting beta in 1.29. Let's try instead for 1.30.

Not sure whether it needs to be merged before the freeze. We can also hold it until some of the pending slog PRs are merged and then merge it as a documentation update.

@pohly (Contributor, Author) commented Nov 13, 2023

Let's refocus on promotion to beta, with on by default, in 1.30. In this comment I am going to check the usefulness and performance overhead of enabling contextual logging based on current master (= 5ce0bd9, shortly before v1.29.0-alpha.3).

First, use-cases. kube-controller-manager and kube-scheduler are converted to contextual logging and use WithName and WithValues to generate more useful logs.

kube-scheduler:

I1113 08:43:37.029524 87144 default_binder.go:53] "Attempting to bind pod to node" logger="Bind.DefaultBinder" pod="kube-system/coredns-69cbfb9798-ms4pq" node="127.0.0.1"

The immediate benefit is that the operation and plugin are visible in the logger key. pod and node already get logged as parameters in individual log calls in kube-scheduler code. Once contextual logging is supported by more packages outside of kube-scheduler (for example, client-go), those values will also be visible there. Once the feature is GA, log calls can be simplified.
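
As a rough sketch of how such a named logger comes about (simplified, not the actual scheduler framework code; runBindPlugin and bind are made-up names):

```go
package main

import (
	"context"

	"k8s.io/klog/v2"
)

// The framework derives a named logger for the extension point and the plugin
// before calling into the plugin.
func runBindPlugin(ctx context.Context) {
	logger := klog.LoggerWithName(klog.FromContext(ctx), "Bind")
	logger = klog.LoggerWithName(logger, "DefaultBinder")
	ctx = klog.NewContext(ctx, logger)
	bind(ctx)
}

func bind(ctx context.Context) {
	// The plugin still passes pod and node explicitly per call; the chained names
	// show up automatically as logger="Bind.DefaultBinder" in the output.
	klog.FromContext(ctx).Info("Attempting to bind pod to node",
		"pod", "kube-system/coredns-69cbfb9798-ms4pq", "node", "127.0.0.1")
}

func main() {
	runBindPlugin(context.Background())
}
```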

kube-controller-manager:

I1113 08:43:29.284360 87141 graph_builder.go:285] "garbage controller monitor not synced: no monitors" logger="garbage-collector-controller"

The logger="garbage-collector-controller" was added by the kube-controller-manager core when instantiating that controller and appears in all of its log entries, as long as the code being called supports contextual logging.
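
A minimal sketch of that instantiation pattern, with made-up function names rather than the real kube-controller-manager code:

```go
package main

import (
	"context"

	"k8s.io/klog/v2"
)

// startGarbageCollectorController is a hypothetical stand-in for the code that
// instantiates a controller in kube-controller-manager.
func startGarbageCollectorController(ctx context.Context) {
	// WithName is called only once, when the controller is instantiated, so the
	// cost is negligible. Every log entry emitted via this context then carries
	// logger="garbage-collector-controller".
	logger := klog.LoggerWithName(klog.FromContext(ctx), "garbage-collector-controller")
	ctx = klog.NewContext(ctx, logger)
	monitorSync(ctx)
}

func monitorSync(ctx context.Context) {
	// Any callee that supports contextual logging inherits the named logger.
	klog.FromContext(ctx).Info("garbage controller monitor not synced: no monitors")
}

func main() {
	startGarbageCollectorController(context.Background())
}
```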

Second, performance. I am using test/integration/scheduler_perf to run go test -bench comparisons with benchstat:

go test -timeout=0 -count=5 '-bench=BenchmarkPerfScheduling'  -benchtime=1ns -run=xxx ./test/integration/scheduler_perf -args -perf-scheduling-label-filter= -v=3 -feature-gate=ContextualLogging=false

That benchmark is completely CPU-bound because log output gets buffered in memory by go test, so any impact of additional code is very visible. When we used WithName too much, results got worse, so we toned down the instrumentation a bit to avoid performance regressions.

Most test cases in test/integration/scheduler_perf are relatively stable. I used -count=10 initially, but running everything that often took most of a day, so I switched to -count=5. My test system is a Skylake workstation, with no UI and thus little background activity.

When doing that comparison with kube-scheduler code which supports contextual logging (i.e. uses FromContext) but has no WithName and no WithValues, there was no significant difference. Significant overhead only occurs when calling WithValues and NewContext for call chains which are very fast (for example, plugin calls which don't do anything), because both calls must allocate memory. The use of WithName in kube-controller-manager is no problem because it's only done once during controller instantiation.
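
To make the allocation cost concrete, here is a hedged sketch of the per-call pattern that causes it, plus one possible way to hoist the work out of a hot loop when the key/value pairs do not change per iteration (an illustration, not necessarily what kube-scheduler does):

```go
package main

import (
	"context"

	"k8s.io/klog/v2"
)

// Enriching per call: both LoggerWithValues and NewContext allocate, which shows
// up when the callee itself does almost nothing.
func callPluginPerCall(ctx context.Context, node string, plugin func(context.Context)) {
	logger := klog.LoggerWithValues(klog.FromContext(ctx), "node", node)
	plugin(klog.NewContext(ctx, logger))
}

// Possible mitigation: enrich the context once, outside the loop, then reuse it.
func callPluginsHoisted(ctx context.Context, node string, plugins []func(context.Context)) {
	ctx = klog.NewContext(ctx, klog.LoggerWithValues(klog.FromContext(ctx), "node", node))
	for _, plugin := range plugins {
		plugin(ctx)
	}
}

func main() {
	noop := func(ctx context.Context) { klog.FromContext(ctx).V(5).Info("fast plugin, nothing to do") }
	callPluginPerCall(context.Background(), "127.0.0.1", noop)
	callPluginsHoisted(context.Background(), "127.0.0.1", []func(context.Context){noop, noop})
}
```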

kube-scheduler is more interesting. The goal was to cause no performance regression at -v3 or lower. Higher log levels are for debugging and slowdowns become acceptable, in particular when the resulting log output is then more useful for debugging (as it is, in the example above).

At -v3, benchstat reports no significant difference in most test cases, so that goal has been achieved. Some even ran a bit faster (see benchstat-v3.log).

For higher log levels, let's look at -v5 and SchedulingBasic/5000Nodes (simple pods, the most common case) and SchedulingCSIPVs/5000Nodes (produces more output in the volume binding plugin). SchedulingBasic throughput is 27% lower. There's no impact on SchedulingCSIPVs. See benchstat-v5.log.

@pohly pohly changed the title KEP-3077: contextual logging: slog update for 1.29 KEP-3077: contextual logging: promotion to beta in 1.30 Nov 13, 2023
@logicalhan (Member):

> Let's refocus on promotion to beta, with on by default, in 1.30. […]

Thanks for doing this analysis!

@pohly (Contributor, Author) commented Nov 16, 2023

@logicalhan: are you now okay with promotion to beta in 1.30 as described here?

On chat we had also discussed doing some kind of scalability comparison. I asked about that in #sig-scalability on Slack and got zero responses - it looks like interest and bandwidth for such an experiment are low. Instead, my preference would be to promote the feature early in the 1.30 cycle, watch the normal scalability tests for a regression, then back it out again if there is one. That'll be simpler.

@logicalhan (Member):

> @logicalhan: are you now okay with promotion to beta in 1.30 as described here? […]

Yeah I'm okay with that plan. Thanks for doing the due diligence!

@pohly pohly mentioned this pull request Nov 20, 2023
@logicalhan (Member):

/milestone v1.30

@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Nov 20, 2023
@logicalhan logicalhan added the lead-opted-in Denotes that an issue has been opted in to a release label Nov 20, 2023
pohly added a commit to pohly/kubernetes that referenced this pull request Jan 4, 2024
As discussed in
kubernetes/enhancements#4219 (comment)
benchstat reports no significant difference in most test cases for
scheduler_perf for -v3. At -v5, there is some slowdown, but that is justified
because the output becomes more useful for debugging.

Enabling the feature by default ensures that log output becomes better
regardless of whether users know and remember to enable it.
@logicalhan (Member) left a comment:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2024
@@ -4,3 +4,5 @@
kep-number: 3077
alpha:
approver: "@ehashman"
beta:
approver: "@logicalhan"
@pohly (Contributor, Author):

@johnbelamaric: should (may?) I list you here?

I picked @logicalhan as a replacement for ehashman because both are in SIG Instrumentation, but now I noticed that this should be a PRR reviewer.

@logicalhan added LGTM on behalf of the SIG. This PR is now ready for PRR approval.

/assign @johnbelamaric

@soltysh (Contributor) left a comment:

#prr-shadow
Several questions from PRR shadow, before pulling in PRR approver.

```go
logger.Info("Done", "pod", klog.KObj(pod))
```

Starting with beta, the feature gate will be enabled and code can be written
A Contributor commented:

Does this require another rewrite during beta, or will we leave it as is for now, in case someone decides to turn the feature off and thus falls back to the previous mechanism, which as described above requires that duplication?

@pohly (Contributor, Author):

I need to update this section. We decided to keep the duplication as long as the feature can still be turned off, which is the opposite of what I was proposing here.
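
To make that duplication concrete, here is a sketch of what a log call looks like while the feature can still be turned off (an assumed example, not a quote from the KEP; syncPod is a made-up function):

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// The caller may have already attached "pod" via klog.LoggerWithValues, but as
// long as the ContextualLogging feature gate can be disabled, klog.FromContext
// may return the global logger without those values. The key/value pair is
// therefore still passed explicitly in every call - that is the duplication
// being kept for now.
func syncPod(ctx context.Context, pod *v1.Pod) {
	logger := klog.FromContext(ctx)
	logger.Info("Done", "pod", klog.KObj(pod))
}

func main() {
	pod := &v1.Pod{}
	pod.Name, pod.Namespace = "coredns-69cbfb9798-ms4pq", "kube-system"
	syncPod(context.Background(), pod)
}
```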

@@ -1037,6 +1079,11 @@ Revert commits that changed log calls.

## Implementation History

* Kubernetes 1.24: initial alpha
A Contributor commented:

I can't comment on unmodified text, so a few questions here. They concern Feature Enablement and Rollback and Version Skew Strategy; both gain in importance when the feature is enabled by default.

  1. What will happen if a user starts to use new flags/options which start logging to JSON and rolls the component back? Are there any new flags for setting up different log output at all?
  2. You mentioned the Performance overhead risk in the risks section; how can a cluster administrator notice that? Was this measured as part of this effort to point out which components are affected more and which less with that impact? You mentioned a flat 2%, but in the risk section you called out that there will be places where this might be more or less; that's why I'm asking about per-component data.
  3. Pod scheduling (= the "startup latency of schedulable stateless pods" SLI) might become slightly worse - have you measured it by how much? This is similar to the above question.
  4. Failure modes: in the risks section you described the Uninitialized logger failure; is there a way we can prevent it from happening in CI?
  5. In "What steps should be taken if SLOs are not being met to determine the problem?": I doubt the cluster operator can revert commits; you need to define this more in terms of a cluster admin rolling back to a previous working version.

@pohly (Contributor, Author):

> What will happen if a user starts to use new flags/options which start logging to JSON and rolls the component back?

JSON output is not part of this feature. That's structured logging.

> Are there any new flags for setting up different log output at all?

This feature here has no separate command line flags or config options.

> Was this measured as part of this effort to point out which components are affected more and which less with that impact?

We only have performance benchmarks for kube-scheduler. See #4219 (comment) for an evaluation of that.

> have you measured it by how much? This is similar to the above question.

I observed no relevant difference in the kube-scheduler benchmark.

> Failure modes: in the risks section you described the Uninitialized logger failure; is there a way we can prevent it from happening in CI?

This is mostly a non-issue because the Go API was designed such that all code which asks for a logr.Logger gets one. A Logger value is never uninitialized. A pointer might be nil, but this is rarely used.

To catch mistakes in the CI, the modified code must be executed.
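
As an illustration of why this is a non-issue in practice (a sketch, not production code):

```go
package main

import (
	"context"

	"k8s.io/klog/v2"
)

// klog.FromContext never returns an invalid Logger: if nothing was stored in
// the context (or contextual logging is disabled), it falls back to klog's
// global logger, so the call below is always safe.
func handle(ctx context.Context) {
	logger := klog.FromContext(ctx)
	logger.Info("this works even with a bare context.Background()")
}

func main() {
	handle(context.Background())
}
```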

> I doubt the cluster operator can revert commits; you need to define this more in terms of a cluster admin rolling back to a previous working version.

True. Let me double-check what I wrote there.

* Kubernetes 1.27: parts of kube-controller-manager converted
* Kubernetes 1.28: kube-controller-manager converted completely, relationship with log/slog in Go 1.21 clarified

A Contributor commented:

What about 1.30?

@pohly (Contributor, Author):

We finished the kube-scheduler in 1.29. Let me add that.

For 1.30, we are focusing on client-go. We had a volunteer for kubelet and @tallclair wanted to do something with the apiserver, but it's unclear whether both will happen.

It shouldn't be in "Implementation History" anyway, should it? It's not history, it's the future...

@pohly (Contributor, Author):

@soltysh: I added a "Status and next steps" section. I hope that also addresses your question about "estimates around how much work is left" - can you take another look?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 24, 2024
We need to be more conservative than originally planned and continue to produce
the same informative log output as before when contextual logging is disabled.

To help reviewers understand the scope of the work, a new "Status and next
steps" section summarizes where we are with the conversion.
@soltysh (Contributor) left a comment:

#prr-shadow
prr lgtm
/assign @johnbelamaric

@johnbelamaric (Member):

👍

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, logicalhan, pohly, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 6, 2024
@pohly (Contributor, Author) commented Feb 7, 2024

@soltysh (Contributor) commented Feb 7, 2024

Re-applying lgtm based on @logicalhan's comment #4219 (review)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2024
@k8s-ci-robot k8s-ci-robot merged commit 97465c4 into kubernetes:master Feb 7, 2024
4 checks passed
k8s-publishing-bot pushed a commit to kubernetes/component-base that referenced this pull request Feb 27, 2024 (same commit message as above; Kubernetes-commit: 9c6b42af46ed9f541fefa65a8a1e175ecd0a7c34)

Jeffwan pushed a commit to Jeffwan/kubernetes that referenced this pull request Mar 6, 2024

dinhxuanvu pushed a commit to dinhxuanvu/kubernetes that referenced this pull request Mar 28, 2024

ah8ad3 pushed a commit to ah8ad3/kubernetes that referenced this pull request Apr 6, 2024