[receiver/kubeletstat] Review `cpu.utilization` naming #27885

TylerHelmuth · 2023-10-20T14:24:44Z

Component(s)

receiver/kubeletstats

Is your feature request related to a problem? Please describe.

The Kubeletestats Receiver currently uses *.cpu.utilization as the name for cpu metrics that report the CPUStats UsageNanoCores value.

I believe that UsageNanoCores reports the actual amount of cpu being used not the ratio of the amount being used out of a total limit. If this is true, then our use of utilization is not meeting semantic convention exceptions.

I would like to have a discussion about what exactly UsageNanoCores represents and if our metric naming needs updating.

Related to discussion that started in #24905

The text was updated successfully, but these errors were encountered:

TylerHelmuth · 2023-10-20T14:25:20Z

/cc @jinja2 @dmitryax @povilasv

povilasv · 2023-10-23T10:54:56Z

Did some digging:

Kubernetes Docs state:

// Total CPU usage (sum of all cores) averaged over the sample window.
// The "core" unit can be interpreted as CPU core-nanoseconds per second.
// +optional
UsageNanoCores *uint64 json:"usageNanoCores,omitempty"

Looks like it's getting these metrics from CRI and if CRI doesn't have stats it's computing using this formula:

		nanoSeconds := newStats.Timestamp - cachedStats.Timestamp

		usageNanoCores := uint64(float64(newStats.UsageCoreNanoSeconds.Value-cachedStats.UsageCoreNanoSeconds.Value) /
			float64(nanoSeconds) * float64(time.Second/time.Nanosecond))

Ref: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/stats/cri_stats_provider.go#L791-L800

Where:

// Cumulative CPU usage (sum across all cores) since object creation.
UsageCoreNanoSeconds *UInt64Value protobuf:"bytes,2,opt,name=usage_core_nano_seconds,json=usageCoreNanoSeconds,proto3" json:"usage_core_nano_seconds,omitempty"

🤔

Playing a bit with the formula:

Limit - is total available cpu time.

Let's say we collect every 1 second, and app uses total available cpu time so 1 second.

nanoSeconds := now() - (now() - 1s) = 1s = 1,000,000,000 nanoseconds

UsaeNanocores := (2,000,000,000 - 1,000,000,000)  / 1,000,000,000  * 1,000,000,000 = 1,000,000,000 
or simplified:

UsageNanocores := (2s - 1s) / 1s * float64(time.Second/time.Nanosecond)) =  unit64(1 * float64(time.Second/time.Nanosecond)))  =  1,000,000,000

Based on this example, the result is actual usage of 1,000,000,000 nano seconds or 1second.

So this metricunit seems to be nanoseconds, not percentage.

If my calculations are correct, I think we should rename to cpu.usage with proper unit (nanoseconds)?

TylerHelmuth · 2023-10-23T14:45:01Z

@povilasv thank you!

github-actions · 2023-12-25T03:29:16Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

receiver/kubeletstats: @dmitryax @TylerHelmuth

See Adding Labels via Comments if you do not have permissions to add labels yourself.

**Description:** Starts the name change processor for `*.cpu.utilization` metrics. **Link to tracking Issue:** Related to #24905 Related to #27885

…elemetry#25901) **Description:** Starts the name change processor for `*.cpu.utilization` metrics. **Link to tracking Issue:** Related to open-telemetry#24905 Related to open-telemetry#27885

github-actions · 2024-03-11T03:30:32Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

receiver/kubeletstats: @dmitryax @TylerHelmuth

See Adding Labels via Comments if you do not have permissions to add labels yourself.

ChrsMark · 2024-03-27T13:30:05Z

FYI @TylerHelmuth @povilasv In SemConv we have merged open-telemetry/semantic-conventions#282 which adds the container.cpu.time metric for now.

For the dockerstats receiver we have #31649 which will try to align the implementation with the added SemConvs.

Do we have a summary so far for what is missing from the kubeletstats receiver in terms of naming changes (like this current issue)?

Shall we try to adopt the kubeletstats receiver with #31649? Happy to help with that.

At the moment the implementation of the receiver provides the following:

container.cpu.time: (sum) Total cumulative CPU time (sum of all cores) spent by the container/pod/node since its creation
container.cpu.utilization: (gauge) Container CPU utilization
container.cpu.usage: (gauge) Total CPU usage (sum of all cores per second) averaged over the sample window

Are we planning to keep them all?
Are those all allgined with https://github.com/open-telemetry/semantic-conventions/blob/71c2e8072596fb9a4ceb68303c83f5389e0beb5f/docs/general/metrics.md#instrument-naming?

From https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/25901/files#diff-3343de7bfda986546ce7cb166e641ae88c0b0aecadd016cb253cd5a0463ff464R352-R353 I see we are going to remove/deprecate container.cpu.utilization?
Could we keep it instead as optional metric and find a proper way to calculate it? I see it was mentioned at #25901 (comment) but not sure how it was resolved. I guess that would be possible by adding a cpuNodeLimit (retrieved from the Node Resource) at

opentelemetry-collector-contrib/receiver/kubeletstatsreceiver/internal/kubelet/metadata.go

Line 71 in 80bbf5e

cpuLimit: r.Limits.Cpu().AsApproximateFloat64(),

similarly with set resource limit. I drafted a very WiP implementation for this to illustrate the point -> ChrsMark@27ce769

TylerHelmuth · 2024-04-02T19:56:48Z

@ChrsMark in my opinion yes to all questions. We want to be aligned with the spec (although I'd love to reduce the number of iterations in the receivers to gain that alignment, how long till we're stable lol).

I don't have a lot of time to dedicate to getting kubeletstatsreceiver update-to-date with the non-stable spec. At this point I was planning to wait for things to stabilize before making any more changes besides the work we started in this issue.

ChrsMark · 2024-04-03T07:39:01Z

Thank's @TylerHelmuth, I see the point of not chasing after an unstable schema/spec.

Just to clarify regarding the container.cpu.utilization though: shall we abort its deprecation and try to calculate this properly as it was mentioned at #25901 (comment)? This can happen based on the cpuNodeLimit, and seems to be doable based on a quick research I did: ChrsMark@27ce769

TylerHelmuth · 2024-04-04T17:26:21Z

@ChrsMark yes I'd be fine with keeping the metric if we can calculate it correctly. We'd still need to go through some sort of feature-gate processor to make it clear to users that the metric has changed and that if they want the old value they need to use the new .usage metric.

ChrsMark · 2024-04-10T11:35:29Z

@TylerHelmuth @povilasv I have drafted a patch to illustrate the point at #32295. My sightings look promising :).

If we agree on the idea I can move the PR forward to fix the details and open it for review. Let me know what you think.

TylerHelmuth · 2024-04-10T20:56:06Z

Seems reasonable. @jinja2 please take a look

povilasv · 2024-04-15T07:58:39Z

This looks reasonable for me as well. I would add the informer too so we don't call get Node everytime we scrape data, in practice Node cpu capacity doesn't change

ChrsMark · 2024-04-15T13:37:15Z

Thank's for the feedback folks!
I have adjusted #32295 to use an informer instead of getting the Node on every scrape. Something I noticed is that we have several places where there are different informer based implementations, like in k8sclusterreceiver and k8sattributesprocessor. Maybe it would make sense to extract that common logic into a common lib and re-use it from the various receivers and processors (even in resourcedetectorprocessor), but we can file a different issue for this.
For now we can continue any discussion at #32295.

ChrsMark · 2024-07-15T09:43:47Z

I wonder though if we should first deprecate the metrics and re-introduce this specific one back as it was pointed out at #27885 (comment).

Otherwise we can fix this one directly.

povilasv · 2024-07-16T11:52:35Z

I personally, would be okay if we just fix it, but not sure what others think?

ChrsMark · 2024-07-22T11:58:34Z

The thing is that change the way it is calculated requires enabling the NodeInformer. So if we change this now as is, any existing users' configurations will become invalid and the metric will being reported as 0. This is highlighted by the failing tests at #34191.

So what I would suggest here is:

Update the README file, add a warning in the release notes, and update the printed warning that the value will "swap" in a specific future release. Maybe write a blog post too.
Make the *.cpu.usage metrics enabled by default: [receiver/kubeletstats] Enable by default the .cpu.usage metrics #34217
Make the *.cpu.utilization metrics disabled by default.
Remove entirely the k8s.node.cpu.utilization, k8s.pod.cpu.utilization and container.cpu.utilization metrics.
Add back the k8s.node.cpu.utilization with [receiver/kubeletstats] Emit k8s.node.cpu.utilization as ratio against node's capacity #34191, which will be disabled by default.

The above should be done in concrete/standalone PRs and can be split across different releases to ensure a gradual switch process for the users.

WDYT?

/cc @TylerHelmuth @dmitryax

TylerHelmuth · 2024-07-22T20:37:31Z

So if we change this now as is, any existing users' configurations will become invalid and the metric will being reported (will be 0).

Ya that'd be bad.

I like this plan as it works toward the original goal (using the proper name) and allows us to work on an actual utilization metric separately. My only concern is that I really didn't want to do breaking semantic convention changes to the k8s components until the k8s semconv was stable, so that all the changes could be hidden behind one feature flag.

That isn't happening anytime soon tho. I think starting with enabling cpu.usage by default is a good next step.

dmitryax · 2024-08-07T04:14:46Z

@ChrsMark, I like that outlined plan. Thanks for putting it together! However, I'd prefer combining 1 and 2 in one release so we don't change the cardinality. cc @TylerHelmuth

TylerHelmuth · 2024-08-07T05:01:01Z

@dmitryax If we do 1 and 2 in one release we need more evangelism around the change. The usage metrics and their warnings have been around for a long time, but the change wasn't very public and I worry users aren't adopting it.

I think we'd need to update to the README, add a warning in the release notes, and update the printed warning that the value will "swap" in a specific future release. Maybe write a blog post too.

ChrsMark · 2024-08-08T09:37:26Z

I think we'd need to update to the README, add a warning in the release notes, and update the printed warning that the value will "swap" in a specific future release. Maybe write a blog post too.

@TylerHelmuth @dmitryax Should we then first do those in a step 0 in order to later combine step 1 and step 2? I'd be fine with that and taking care of it.

TylerHelmuth · 2024-08-08T12:45:19Z

Sure

ChrsMark · 2024-08-12T07:21:22Z

@TylerHelmuth cool! Any preferences on what the target release of this change/deprecation should be? Based on the release-schedule I would go with v0.111.0 (2024-10-07).

TylerHelmuth · 2024-08-12T18:02:24Z

Is there a way for use to add a feature gate or something that causes the collector to fail on start if utilization is used? Breaking changes from semantics are so difficult because a user could upgrade the collector and not realize it just started emitting new telemetry and in this case it is a metric that has been a core metric from this receiver for 4+ years.

It would be great it we can add a set where the collector fails to start if utilization is in use, and the user could disable the feature gate to get back to a running state, but at least they'd have to acknowledge the change is coming on a specific date.

ChrsMark · 2024-08-13T06:54:08Z

It would be great it we can add a set where the collector fails to start if utilization is in use, and the user could disable the feature gate to get back to a running state, but at least they'd have to acknowledge the change is coming on a specific date.

Hmm having the Collector to fail completely for one metric sounds a bit concerning to me tbh. Has this approach been used in other components too?

How about making the switch completely behind a feature gate then?

This way we enable users to switch whenever they want prior to a specific release and then after we have make the switch the feature gate to continue using the utilization will still be there for some period. We can define a plan like the following:

Introduce it in v0.x.0 as alpha - disabled by default.
Move it to beta in v0.y.0 - enabled by default. (for configs that still enable the utilization it can fail)
Move it to stable in v0.z.0 - cannot be disabled.
Remove it three releases after stable.

(we have something similar in hostmetricsreceiver)

Wouldn't that cover most of what we have discussed here?

TylerHelmuth · 2024-08-13T17:27:59Z

Yes thats what I'm thinking. If the feature gate is enabled and a utilization metric is enabled, the collector will fail to start.

ChrsMark · 2024-08-14T08:07:42Z

Thank's @TylerHelmuth! I will be off for the following weeks, but I will pick it up once I'm back.

ChrsMark · 2024-09-17T06:59:10Z

@TylerHelmuth @dmitryax I have tried the feature gate approach at #35139. Please have a look once you get the chance. You can find more details about the feature gate's plan in the PR's description.

… deprecation (#35139) **Description:** This PR adds a feature gate as discussed at #27885 (comment). If the feature gate is enabled the `container.cpu.utilization`, `k8s.pod.cpu.utilization` and `k8s.node.cpu.utilization` metrics will be not be disabled being replaced by the `container.cpu.usage`, `k8s.pod.cpu.usage` and `k8s.node.cpu.usage`. ### Feature gate schedule - alpha: when enabled it makes the .cpu.usage metrics enabled by default - beta: .cpu.usage metrics are enabled by default and any configuration enabling the deprecated .cpu.utilization metrics will be failing. Explicitly disabling the feature gate provides the old (deprecated) behavior. - stable: .cpu.usage metrics are enabled by default and the deprecated metrics are completely removed. - Removed three releases after `stable`. **Documentation:** ### How to test this 1. Using the following configuration ```yaml mode: daemonset presets: kubeletMetrics: enabled: true image: repository: otelcontribcol-dev tag: "latest" pullPolicy: IfNotPresent command: name: otelcontribcol extraArgs: [--feature-gates=receiver.kubeletstats.enableCPUUsageMetrics] config: exporters: debug: verbosity: normal receivers: kubeletstats: collection_interval: 10s auth_type: 'serviceAccount' endpoint: '${env:K8S_NODE_NAME}:10250' insecure_skip_verify: true service: pipelines: metrics: receivers: [kubeletstats] processors: [batch] exporters: [debug] ``` 2. Ensure that only the `.cpu.usage` metrics are reported. 3. Disable the feature gate and check that only the `.cpu.utilization` metrics are reported. --------- Signed-off-by: ChrsMark <chrismarkou92@gmail.com>

… deprecation (open-telemetry#35139) **Description:** This PR adds a feature gate as discussed at open-telemetry#27885 (comment). If the feature gate is enabled the `container.cpu.utilization`, `k8s.pod.cpu.utilization` and `k8s.node.cpu.utilization` metrics will be not be disabled being replaced by the `container.cpu.usage`, `k8s.pod.cpu.usage` and `k8s.node.cpu.usage`. ### Feature gate schedule - alpha: when enabled it makes the .cpu.usage metrics enabled by default - beta: .cpu.usage metrics are enabled by default and any configuration enabling the deprecated .cpu.utilization metrics will be failing. Explicitly disabling the feature gate provides the old (deprecated) behavior. - stable: .cpu.usage metrics are enabled by default and the deprecated metrics are completely removed. - Removed three releases after `stable`. **Documentation:** ### How to test this 1. Using the following configuration ```yaml mode: daemonset presets: kubeletMetrics: enabled: true image: repository: otelcontribcol-dev tag: "latest" pullPolicy: IfNotPresent command: name: otelcontribcol extraArgs: [--feature-gates=receiver.kubeletstats.enableCPUUsageMetrics] config: exporters: debug: verbosity: normal receivers: kubeletstats: collection_interval: 10s auth_type: 'serviceAccount' endpoint: '${env:K8S_NODE_NAME}:10250' insecure_skip_verify: true service: pipelines: metrics: receivers: [kubeletstats] processors: [batch] exporters: [debug] ``` 2. Ensure that only the `.cpu.usage` metrics are reported. 3. Disable the feature gate and check that only the `.cpu.utilization` metrics are reported. --------- Signed-off-by: ChrsMark <chrismarkou92@gmail.com>

github-actions · 2024-11-18T03:37:41Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

receiver/kubeletstats: @dmitryax @TylerHelmuth @ChrsMark

See Adding Labels via Comments if you do not have permissions to add labels yourself.

TylerHelmuth added enhancement New feature or request needs triage New item requiring triage priority:p2 Medium discussion needed Community discussion needed receiver/kubeletstats and removed enhancement New feature or request needs triage New item requiring triage labels Oct 20, 2023

TylerHelmuth mentioned this issue Oct 20, 2023

[receiver/kubeletstats] Start name change for cpu.utilization #25901

Merged

github-actions bot mentioned this issue Oct 24, 2023

Weekly Report: 2023-10-17 - 2023-10-24 #28557

Closed

github-actions bot added the Stale label Dec 25, 2023

TylerHelmuth removed the Stale label Jan 8, 2024

dmitryax pushed a commit that referenced this issue Jan 12, 2024

[receiver/kubeletstats] Start name change for cpu.utilization (#25901)

74294cb

**Description:** Starts the name change processor for `*.cpu.utilization` metrics. **Link to tracking Issue:** Related to #24905 Related to #27885

a-thaler mentioned this issue Feb 29, 2024

Fix and review cpu utilization metrics kyma-project/telemetry-manager#838

Closed

github-actions bot added the Stale label Mar 11, 2024

TylerHelmuth removed the Stale label Mar 11, 2024

ChrsMark mentioned this issue Apr 10, 2024

[receiver/kubeletstats] Add k8s.container.cpu.node.utilization metric #32295

Merged

github-actions bot removed the Stale label Jul 13, 2024

ChrsMark mentioned this issue Jul 15, 2024

Adding first draft of [OTEL Kubernetes] Cluster Overview Dashboard elastic/integrations#10443

Closed

4 tasks

ChrsMark mentioned this issue Jul 22, 2024

[receiver/kubeletstats] Emit k8s.node.cpu.utilization as ratio against node's capacity #34191

Closed

ChrsMark mentioned this issue Jul 23, 2024

[receiver/kubeletstats] Enable by default the .cpu.usage metrics #34217

Closed

ChrsMark mentioned this issue Aug 6, 2024

Define semantic conventions for k8s metrics open-telemetry/semantic-conventions#1032

Open

ChrsMark mentioned this issue Aug 12, 2024

[receiver/kubeletstats] Add note about metrics deprecation on specific release #34602

Closed

ChrsMark mentioned this issue Sep 11, 2024

[receiver/kubeletstats] Add feature gate for cpu utilization metrics' deprecation #35139

Merged

ChrsMark mentioned this issue Oct 3, 2024

[Kubernetest OTEL] Follow up enhancements elastic/integrations#11310

Merged

4 tasks

ChrsMark mentioned this issue Oct 17, 2024

[k8s] Define semantic conventions for k8s cpu metrics open-telemetry/semantic-conventions#1489

Open

github-actions bot added the Stale label Nov 18, 2024

ChrsMark added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Nov 18, 2024

mx-psi added the waiting-for:semantic-conventions Waiting on something on semantic-conventions to be stabilized label Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[receiver/kubeletstat] Review `cpu.utilization` naming #27885

[receiver/kubeletstat] Review `cpu.utilization` naming #27885

TylerHelmuth commented Oct 20, 2023 •

edited

Loading

TylerHelmuth commented Oct 20, 2023

povilasv commented Oct 23, 2023 •

edited

Loading

TylerHelmuth commented Oct 23, 2023

github-actions bot commented Dec 25, 2023

github-actions bot commented Mar 11, 2024

ChrsMark commented Mar 27, 2024 •

edited

Loading

TylerHelmuth commented Apr 2, 2024 •

edited

Loading

ChrsMark commented Apr 3, 2024

TylerHelmuth commented Apr 4, 2024

ChrsMark commented Apr 10, 2024

TylerHelmuth commented Apr 10, 2024

povilasv commented Apr 15, 2024

ChrsMark commented Apr 15, 2024

ChrsMark commented Jul 15, 2024 •

edited

Loading

povilasv commented Jul 16, 2024

ChrsMark commented Jul 22, 2024 •

edited

Loading

TylerHelmuth commented Jul 22, 2024

dmitryax commented Aug 7, 2024 •

edited

Loading

TylerHelmuth commented Aug 7, 2024 •

edited

Loading

ChrsMark commented Aug 8, 2024

TylerHelmuth commented Aug 8, 2024

ChrsMark commented Aug 12, 2024

TylerHelmuth commented Aug 12, 2024

ChrsMark commented Aug 13, 2024

TylerHelmuth commented Aug 13, 2024 •

edited

Loading

ChrsMark commented Aug 14, 2024

ChrsMark commented Sep 17, 2024

github-actions bot commented Nov 18, 2024

[receiver/kubeletstat] Review cpu.utilization naming #27885

[receiver/kubeletstat] Review cpu.utilization naming #27885

Comments

TylerHelmuth commented Oct 20, 2023 • edited Loading

Component(s)

Is your feature request related to a problem? Please describe.

TylerHelmuth commented Oct 20, 2023

povilasv commented Oct 23, 2023 • edited Loading

TylerHelmuth commented Oct 23, 2023

github-actions bot commented Dec 25, 2023

github-actions bot commented Mar 11, 2024

ChrsMark commented Mar 27, 2024 • edited Loading

TylerHelmuth commented Apr 2, 2024 • edited Loading

ChrsMark commented Apr 3, 2024

TylerHelmuth commented Apr 4, 2024

ChrsMark commented Apr 10, 2024

TylerHelmuth commented Apr 10, 2024

povilasv commented Apr 15, 2024

ChrsMark commented Apr 15, 2024

ChrsMark commented Jul 15, 2024 • edited Loading

povilasv commented Jul 16, 2024

ChrsMark commented Jul 22, 2024 • edited Loading

TylerHelmuth commented Jul 22, 2024

dmitryax commented Aug 7, 2024 • edited Loading

TylerHelmuth commented Aug 7, 2024 • edited Loading

ChrsMark commented Aug 8, 2024

TylerHelmuth commented Aug 8, 2024

ChrsMark commented Aug 12, 2024

TylerHelmuth commented Aug 12, 2024

ChrsMark commented Aug 13, 2024

TylerHelmuth commented Aug 13, 2024 • edited Loading

ChrsMark commented Aug 14, 2024

ChrsMark commented Sep 17, 2024

github-actions bot commented Nov 18, 2024

[receiver/kubeletstat] Review `cpu.utilization` naming #27885

[receiver/kubeletstat] Review `cpu.utilization` naming #27885

TylerHelmuth commented Oct 20, 2023 •

edited

Loading

povilasv commented Oct 23, 2023 •

edited

Loading

ChrsMark commented Mar 27, 2024 •

edited

Loading

TylerHelmuth commented Apr 2, 2024 •

edited

Loading

ChrsMark commented Jul 15, 2024 •

edited

Loading

ChrsMark commented Jul 22, 2024 •

edited

Loading

dmitryax commented Aug 7, 2024 •

edited

Loading

TylerHelmuth commented Aug 7, 2024 •

edited

Loading

TylerHelmuth commented Aug 13, 2024 •

edited

Loading