Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[noderesourcetopology] move logging to the logr interface #710

Merged
merged 2 commits into from
Apr 19, 2024

Conversation

ffromani
Copy link
Contributor

@ffromani ffromani commented Apr 10, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implement the changes proposed in #709 to review logging and make it consistent, easy to extend and to get right, and for bonus points easy to replace should it be needed.

Worth pointing out this PR goes in the direction of KEP 3077 (contextual logging). The code becomes almost completely ready for a switch to the contextual logger, with only few selected pieces to change in the future.

Which issue(s) this PR fixes:

Fixes #709

Special notes for your reviewer:

PR split in many smallish commit to make it easier to review (and bisect)
Excluding the obvious intended logging fixes/improvements, no intended changes in behavior.
Ultimately, this PR is not user-facing because the users should see no difference except for more consistent logs, easier to grep and/or process in general.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 10, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 10, 2024
@ffromani
Copy link
Contributor Author

/cc @PiotrProkop (for discardednodes and least_numa)

Copy link

netlify bot commented Apr 10, 2024

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

Name Link
🔨 Latest commit 045a43d
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/66225d1eadbe400008b871cb

@ffromani
Copy link
Contributor Author

/hold

The PR is reviewable but I'm still polishing it a bit and validating the changes (slicing the logs with grep mostly)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2024
pkg/noderesourcetopology/cache/overreserve.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/cache/overreserve.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/cache/passthrough.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/config.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/filter.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/podprovider/podprovider.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/podprovider/podprovider.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/podprovider/podprovider.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/podprovider/podprovider.go Outdated Show resolved Hide resolved
pkg/noderesourcetopology/podprovider/podprovider.go Outdated Show resolved Hide resolved
@ffromani
Copy link
Contributor Author

the test failures are known. It seems better to me to push some code down the stack: k8stopologyawareschedwg/noderesourcetopology-api#35

@ffromani ffromani force-pushed the nrt-klog-to-logr branch 4 times, most recently from 655b387 to 23910b8 Compare April 15, 2024 08:54
@ffromani
Copy link
Contributor Author

/hold cancel

Bar fixes and review comments, I think this PR has all the content we need for this cleanup.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 15, 2024
@ffromani ffromani force-pushed the nrt-klog-to-logr branch 2 times, most recently from ea82f02 to 4960b2c Compare April 16, 2024 13:26
@ffromani
Copy link
Contributor Author

/cc @swatisehgal

@PiotrProkop
Copy link
Contributor

PiotrProkop commented Apr 16, 2024

Should we squash history to 2 commits?

  • bump nrt-api
  • rework logging?

@ffromani
Copy link
Contributor Author

Should we squash history to 2 commits?

* bump nrt-api

* rework logging?

Consider how the project seems to prefer, I'm in favor of squashing once reviewers are happy.

@ffromani
Copy link
Contributor Author

/hold

need to squash once the reviewers are happy

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2024
@ffromani
Copy link
Contributor Author

/hold cancel

squashed as requested

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2024
@ffromani
Copy link
Contributor Author

ffromani commented Apr 19, 2024

/hold

need to review the integration with the scheduler framework >= 1.29

EDIT: point in case: https://github.com/kubernetes/kubernetes/blob/v1.29.4/pkg/scheduler/framework/runtime/framework.go#L819

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024
@ffromani
Copy link
Contributor Author

/hold cancel

proper integration with framework and ongoing work to fully support the contextual logging (KEP-3077) would need a followup PR.
Let's stop with this already huge PR right here, I feel this is a massive improvement (I know I'm biased here :) ) already.

changelog:

  1. added constants for well-known keys for structured logging
  2. annotated all the plugin entry points like
	lh.V(4).Info(logging.FlowBegin)
	defer lh.V(4).Info(logging.FlowEnd)

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024
@ffromani
Copy link
Contributor Author

now I call this PR done :)

Bump NRT API package to v0.1.2; there is no API change,
but we have now a better replacement for the internal
`getID` helper, which we can now remove.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Move to the logr.Logger interface everywhere, instead of the
global `klog` instance. This enable named logger, presetting values
for simple and automatic consistency, enables pluggable loggers
and comes for free since we already depend on the logr package
and klog has a native logr integration.

In addition, add minimal support to make it easy to
replace the logr reference, to help integrators of
this code. The default is still (and will still be)
klog for backward compatibility and ecosystem integration.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani
Copy link
Contributor Author

rebased on top of the 1.29 rebase

@PiotrProkop
Copy link
Contributor

Looks good to me 👍
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2024
@k8s-ci-robot k8s-ci-robot merged commit 51d27b6 into kubernetes-sigs:master Apr 19, 2024
10 checks passed
@ffromani ffromani deleted the nrt-klog-to-logr branch April 22, 2024 07:55
ffromani added a commit to ffromani/scheduler-plugins that referenced this pull request Apr 30, 2024
After more review and conversations, we have a better
understanding of how integration with contextual logging
should loom like.
First and foremost, injecting loggers would conflict with
the very goals of contextual logging.
So, let's drop this code we added in kubernetes-sigs#710.

The contextual logger doesn't do key/values deduplication.
This is let to (some) backends. To avoid log clutter,
trim down the extra key/value pairs and add only those we
really need to ensure a good troubleshooting experience.

Still let's make sure to add critical key/value
pairs in the relevant entries, at cost of a possible
duplication.

When reporting the current assumed resources, the current
representation is neither concise nor very human friendly.
Additionally, multi-line log entries are harder to process
and should be avoided.
So let's move to a more concise representation, which turns
out not obviously less human friendly and is no longer multiline.

Review verbosiness of log entries.
Move down to verbose=2 logs which are really key to understand
the behavior. We should set a hard limit to log entries to minimize
the log spam while keeping at least some observability without
requiring v=4 or greater.

The level v=4 is usually/often the highest-not-spammy log.
When debug logs are needed we often set v=4, and higher verbosity
levels are often used only in desperate times.
Thus, promote to v=4 the debug logs we should really see.

Everywhere else in the kubernetes ecosystem, and most
notably in the scheduler, the pod namespace/name pair is called
"pod", while we called it "logID".
We do it to use the same name for all the flows, being the
cache resync (which is driven by time, not by an object) the
odd one.

It seems better to be externally consistent (with the ecosystem)
rather than internally consistent (all the flows in the same plugin),
so we rename "logID" to "pod" in the log entries.

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani added a commit to ffromani/scheduler-plugins that referenced this pull request May 6, 2024
After more review and conversations, we have a better
understanding of how integration with contextual logging
should loom like.
First and foremost, injecting loggers would conflict with
the very goals of contextual logging.
So, let's drop this code we added in kubernetes-sigs#710.

The contextual logger doesn't do key/values deduplication.
This is let to (some) backends. To avoid log clutter,
trim down the extra key/value pairs and add only those we
really need to ensure a good troubleshooting experience.

Still let's make sure to add critical key/value
pairs in the relevant entries, at cost of a possible
duplication.

When reporting the current assumed resources, the current
representation is neither concise nor very human friendly.
Additionally, multi-line log entries are harder to process
and should be avoided.
So let's move to a more concise representation, which turns
out not obviously less human friendly and is no longer multiline.

Review verbosiness of log entries.
Move down to verbose=2 logs which are really key to understand
the behavior. We should set a hard limit to log entries to minimize
the log spam while keeping at least some observability without
requiring v=4 or greater.

The level v=4 is usually/often the highest-not-spammy log.
When debug logs are needed we often set v=4, and higher verbosity
levels are often used only in desperate times.
Thus, promote to v=4 the debug logs we should really see.

Everywhere else in the kubernetes ecosystem, and most
notably in the scheduler, the pod namespace/name pair is called
"pod", while we called it "logID".
We do it to use the same name for all the flows, being the
cache resync (which is driven by time, not by an object) the
odd one.

It seems better to be externally consistent (with the ecosystem)
rather than internally consistent (all the flows in the same plugin),
so we rename "logID" to "pod" in the log entries.

Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani ffromani mentioned this pull request May 22, 2024
4 tasks
ffromani added a commit to ffromani/scheduler-plugins that referenced this pull request May 27, 2024
before kubernetes-sigs#710 and kubernetes-sigs#725, we logged the container being processed alongside
the pod (identified by namespace/name pair).
It was dropped by mistake and not deliberately.
This is useful information when troubleshooting, so let's add it back.

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani added a commit to ffromani/scheduler-plugins that referenced this pull request May 27, 2024
before kubernetes-sigs#710 and kubernetes-sigs#725, we logged the container being processed alongside
the pod (identified by namespace/name pair).
It was dropped by mistake and not deliberately.
This is useful information when troubleshooting, so let's add it back.

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani added a commit to ffromani/scheduler-plugins that referenced this pull request May 29, 2024
before kubernetes-sigs#710 and kubernetes-sigs#725, we logged the container being processed alongside
the pod (identified by namespace/name pair).
It was dropped by mistake and not deliberately.
This is useful information when troubleshooting, so let's add it back.

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani added a commit to openshift-kni/scheduler-plugins that referenced this pull request May 29, 2024
After more review and conversations, we have a better
understanding of how integration with contextual logging
should loom like.
First and foremost, injecting loggers would conflict with
the very goals of contextual logging.
So, let's drop this code we added in kubernetes-sigs#710.

The contextual logger doesn't do key/values deduplication.
This is let to (some) backends. To avoid log clutter,
trim down the extra key/value pairs and add only those we
really need to ensure a good troubleshooting experience.

Still let's make sure to add critical key/value
pairs in the relevant entries, at cost of a possible
duplication.

When reporting the current assumed resources, the current
representation is neither concise nor very human friendly.
Additionally, multi-line log entries are harder to process
and should be avoided.
So let's move to a more concise representation, which turns
out not obviously less human friendly and is no longer multiline.

Review verbosiness of log entries.
Move down to verbose=2 logs which are really key to understand
the behavior. We should set a hard limit to log entries to minimize
the log spam while keeping at least some observability without
requiring v=4 or greater.

The level v=4 is usually/often the highest-not-spammy log.
When debug logs are needed we often set v=4, and higher verbosity
levels are often used only in desperate times.
Thus, promote to v=4 the debug logs we should really see.

Everywhere else in the kubernetes ecosystem, and most
notably in the scheduler, the pod namespace/name pair is called
"pod", while we called it "logID".
We do it to use the same name for all the flows, being the
cache resync (which is driven by time, not by an object) the
odd one.

It seems better to be externally consistent (with the ecosystem)
rather than internally consistent (all the flows in the same plugin),
so we rename "logID" to "pod" in the log entries.

Signed-off-by: Francesco Romani <fromani@redhat.com>
(cherry picked from commit 7098181)
ffromani added a commit to openshift-kni/scheduler-plugins that referenced this pull request May 29, 2024
before kubernetes-sigs#710 and kubernetes-sigs#725, we logged the container being processed alongside
the pod (identified by namespace/name pair).
It was dropped by mistake and not deliberately.
This is useful information when troubleshooting, so let's add it back.

Signed-off-by: Francesco Romani <fromani@redhat.com>
(cherry picked from commit 7a8afdf)
ichbinblau pushed a commit to ichbinblau/scheduler-plugins that referenced this pull request Jun 24, 2024
After more review and conversations, we have a better
understanding of how integration with contextual logging
should loom like.
First and foremost, injecting loggers would conflict with
the very goals of contextual logging.
So, let's drop this code we added in kubernetes-sigs#710.

The contextual logger doesn't do key/values deduplication.
This is let to (some) backends. To avoid log clutter,
trim down the extra key/value pairs and add only those we
really need to ensure a good troubleshooting experience.

Still let's make sure to add critical key/value
pairs in the relevant entries, at cost of a possible
duplication.

When reporting the current assumed resources, the current
representation is neither concise nor very human friendly.
Additionally, multi-line log entries are harder to process
and should be avoided.
So let's move to a more concise representation, which turns
out not obviously less human friendly and is no longer multiline.

Review verbosiness of log entries.
Move down to verbose=2 logs which are really key to understand
the behavior. We should set a hard limit to log entries to minimize
the log spam while keeping at least some observability without
requiring v=4 or greater.

The level v=4 is usually/often the highest-not-spammy log.
When debug logs are needed we often set v=4, and higher verbosity
levels are often used only in desperate times.
Thus, promote to v=4 the debug logs we should really see.

Everywhere else in the kubernetes ecosystem, and most
notably in the scheduler, the pod namespace/name pair is called
"pod", while we called it "logID".
We do it to use the same name for all the flows, being the
cache resync (which is driven by time, not by an object) the
odd one.

It seems better to be externally consistent (with the ecosystem)
rather than internally consistent (all the flows in the same plugin),
so we rename "logID" to "pod" in the log entries.

Signed-off-by: Francesco Romani <fromani@redhat.com>
ichbinblau pushed a commit to ichbinblau/scheduler-plugins that referenced this pull request Jun 24, 2024
before kubernetes-sigs#710 and kubernetes-sigs#725, we logged the container being processed alongside
the pod (identified by namespace/name pair).
It was dropped by mistake and not deliberately.
This is useful information when troubleshooting, so let's add it back.

Signed-off-by: Francesco Romani <fromani@redhat.com>
AlleNeri pushed a commit to AlleNeri/scheduler-plugins that referenced this pull request Jul 26, 2024
before kubernetes-sigs#710 and kubernetes-sigs#725, we logged the container being processed alongside
the pod (identified by namespace/name pair).
It was dropped by mistake and not deliberately.
This is useful information when troubleshooting, so let's add it back.

Signed-off-by: Francesco Romani <fromani@redhat.com>
AlleNeri pushed a commit to AlleNeri/scheduler-plugins that referenced this pull request Jul 26, 2024
After more review and conversations, we have a better
understanding of how integration with contextual logging
should loom like.
First and foremost, injecting loggers would conflict with
the very goals of contextual logging.
So, let's drop this code we added in kubernetes-sigs#710.

The contextual logger doesn't do key/values deduplication.
This is let to (some) backends. To avoid log clutter,
trim down the extra key/value pairs and add only those we
really need to ensure a good troubleshooting experience.

Still let's make sure to add critical key/value
pairs in the relevant entries, at cost of a possible
duplication.

When reporting the current assumed resources, the current
representation is neither concise nor very human friendly.
Additionally, multi-line log entries are harder to process
and should be avoided.
So let's move to a more concise representation, which turns
out not obviously less human friendly and is no longer multiline.

Review verbosiness of log entries.
Move down to verbose=2 logs which are really key to understand
the behavior. We should set a hard limit to log entries to minimize
the log spam while keeping at least some observability without
requiring v=4 or greater.

The level v=4 is usually/often the highest-not-spammy log.
When debug logs are needed we often set v=4, and higher verbosity
levels are often used only in desperate times.
Thus, promote to v=4 the debug logs we should really see.

Everywhere else in the kubernetes ecosystem, and most
notably in the scheduler, the pod namespace/name pair is called
"pod", while we called it "logID".
We do it to use the same name for all the flows, being the
cache resync (which is driven by time, not by an object) the
odd one.

It seems better to be externally consistent (with the ecosystem)
rather than internally consistent (all the flows in the same plugin),
so we rename "logID" to "pod" in the log entries.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[NodeResourceTopology] review and overhaul internal logging
4 participants