[receiver/kubeletstat] Review cpu.utilization naming #27885
Did some digging. The Kubernetes docs state that the kubelet gets these metrics from the CRI, and if the CRI doesn't provide the stats, the kubelet computes `UsageNanoCores` itself from the change in cumulative CPU time (`UsageCoreNanoSeconds`) between two samples.
🤔 Playing a bit with the formula: the limit is the total available CPU time. Let's say we collect every 1 second, and the app uses the total available CPU time, so 1 second.
Based on this example, the result is an actual usage of 1,000,000,000 nanoseconds, or 1 second. So this metric's unit seems to be nanoseconds, not a percentage. If my calculations are correct, I think we should rename to `*.cpu.usage`.
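For clarity, here is a minimal sketch of the computation described above, assuming the kubelet derives `UsageNanoCores` from the delta of cumulative CPU time (`UsageCoreNanoSeconds`) between two samples. Function and variable names are illustrative, not the kubelet's actual identifiers.

```go
package main

import "fmt"

// usageNanoCores derives an average CPU usage in nanocores from two samples of
// cumulative CPU time (in nanoseconds) taken elapsedNanoSeconds apart.
func usageNanoCores(prevUsageCoreNanoSeconds, curUsageCoreNanoSeconds, elapsedNanoSeconds uint64) uint64 {
	if elapsedNanoSeconds == 0 {
		return 0
	}
	return uint64(float64(curUsageCoreNanoSeconds-prevUsageCoreNanoSeconds) /
		float64(elapsedNanoSeconds) * 1e9)
}

func main() {
	// Example from the comment above: samples taken 1 second apart, and the app
	// used the full second of CPU time -> 1,000,000,000 nanocores (one full core).
	fmt.Println(usageNanoCores(0, 1_000_000_000, 1_000_000_000))
}
```

This is why the reported value reads as an absolute usage rather than a utilization ratio.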
@povilasv thank you!
…elemetry#25901) **Description:** Starts the name change process for `*.cpu.utilization` metrics. **Link to tracking Issue:** Related to open-telemetry#24905 Related to open-telemetry#27885
FYI @TylerHelmuth @povilasv: in SemConv we have merged open-telemetry/semantic-conventions#282, which adds the relevant CPU metrics to the semantic conventions. Do we have a summary so far for what is missing from the receiver, and shall we try to adopt the new conventions? At the moment the implementation of the receiver provides the following:
Are we planning to keep them all? From https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/25901/files#diff-3343de7bfda986546ce7cb166e641ae88c0b0aecadd016cb253cd5a0463ff464R352-R353 I see we are going to remove/deprecate them (see opentelemetry-collector-contrib/receiver/kubeletstatsreceiver/internal/kubelet/metadata.go, line 71 at 80bbf5e).
@ChrsMark in my opinion, yes to all questions. We want to be aligned with the spec (although I'd love to reduce the number of iterations in the receivers to gain that alignment; how long till we're stable, lol). I don't have a lot of time to dedicate to getting kubeletstatsreceiver up to date with the non-stable spec. At this point I was planning to wait for things to stabilize before making any more changes besides the work we started in this issue.
Thanks @TylerHelmuth, I see the point of not chasing after an unstable schema/spec. Just to clarify regarding the `*.cpu.utilization` metrics: would we want to keep them if they could be calculated properly as a ratio of the available capacity?
@ChrsMark yes, I'd be fine with keeping the metric if we can calculate it correctly. We'd still need to go through some sort of feature-gate process to make it clear to users that the metric has changed and that, if they want the old value, they need to use the new `*.cpu.usage` metrics.
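For illustration, "calculating it correctly" could look roughly like the sketch below: divide the reported usage in nanocores by the node's CPU capacity to get a 0-1 ratio. This is only an assumption about the intended computation, not the receiver's actual code; `cpuUtilization` and `nodeCPUCores` are hypothetical names.

```go
// cpuUtilization is a hypothetical helper: it turns an absolute usage value
// (in nanocores) into a utilization ratio against the node's CPU capacity.
func cpuUtilization(usageNanoCores uint64, nodeCPUCores float64) float64 {
	if nodeCPUCores <= 0 {
		return 0 // capacity unknown; avoid dividing by zero
	}
	// One fully used core equals 1e9 nanocores, so capacity in nanocores is
	// nodeCPUCores * 1e9; the result falls in the range [0, 1].
	return float64(usageNanoCores) / (nodeCPUCores * 1e9)
}
```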
@TylerHelmuth @povilasv I have drafted a patch to illustrate the point at #32295. My findings look promising :). If we agree on the idea I can move the PR forward to fix the details and open it for review. Let me know what you think.
Seems reasonable. @jinja2 please take a look
This looks reasonable to me as well. I would add the informer too, so we don't call Get Node every time we scrape data; in practice, a Node's CPU capacity doesn't change.
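As a rough illustration of the informer suggestion (assuming client-go and an already configured clientset; none of these names come from the receiver itself), Node objects could be watched and cached locally instead of fetched from the API server on every scrape:

```go
package kubeletscraper

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchNodes keeps onUpdate informed about Node changes via a shared informer,
// so scrapes can read CPU capacity from a local cache instead of calling the
// API server every collection interval.
func watchNodes(client kubernetes.Interface, stopCh <-chan struct{}, onUpdate func(*v1.Node)) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { onUpdate(obj.(*v1.Node)) },
		UpdateFunc: func(_, newObj interface{}) { onUpdate(newObj.(*v1.Node)) },
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}
```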
Thanks for the feedback, folks!
I wonder, though, if we should first deprecate the metrics and re-introduce this specific one as pointed out at #27885 (comment). Otherwise we can fix this one directly.
Personally, I would be okay if we just fix it, but I'm not sure what others think.
The thing is that changing the way it is calculated requires the receiver to know the node's CPU capacity, and silently swapping the computation would change the metric's meaning for existing users. So what I would suggest here is:

1. Deprecate the `*.cpu.utilization` metrics and make the `*.cpu.usage` metrics the defaults.
2. Re-introduce `*.cpu.utilization` as a proper ratio calculated against the available CPU capacity.

The above should be done in concrete/standalone PRs and can be split across different releases to ensure a gradual switch process for the users. WDYT?
Ya, that'd be bad. I like this plan as it works toward the original goal (using the proper name) and allows us to work on an actual utilization metric separately. My only concern is that I really didn't want to make breaking semantic-convention changes to the k8s components until the k8s semconv was stable, so that all the changes could be hidden behind one feature flag. That isn't happening anytime soon, though. I think starting with enabling the `*.cpu.usage` metrics is a good first step.
@ChrsMark, I like that outlined plan. Thanks for putting it together! However, I'd prefer combining 1 and 2 in one release so we don't change the cardinality. cc @TylerHelmuth
@dmitryax If we do 1 and 2 in one release we need more evangelism around the change. I think we'd need to update the README, add a warning in the release notes, and update the printed warning to say that the value will "swap" in a specific future release. Maybe write a blog post too.
@TylerHelmuth @dmitryax Should we then first do those as a step 0, in order to later combine step 1 and step 2? I'd be fine with that and with taking care of it.
Sure
@TylerHelmuth cool! Any preferences on what the target release of this change/deprecation should be? Based on the release schedule I would go with
Is there a way for us to add a feature gate or something that causes the collector to fail on start if the deprecated metrics are enabled? It would be great if we could add a stage where the collector fails to start if a utilization metric is in use; the user could disable the feature gate to get back to a running state, but at least they'd have to acknowledge the change is coming on a specific date.
Hmm, having the Collector fail completely over one metric sounds a bit concerning to me, tbh. Has this approach been used in other components too? How about making the switch completely behind a feature gate then? This way we enable users to switch whenever they want prior to a specific release, and after we have made the switch they can disable the feature gate to keep using the deprecated `*.cpu.utilization` metrics for a while longer (we have something similar in other components). Wouldn't that cover most of what we have discussed here?
Yes, that's what I'm thinking. If the feature gate is enabled and a deprecated `*.cpu.utilization` metric is still enabled in the configuration, the collector should fail to start.
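A sketch of how such a gate could be wired up with the Collector's `featuregate` package; the gate ID matches the one that appears later in this thread, but the validation function and error text are illustrative rather than the actual code from the PR:

```go
package kubeletstatsreceiver

import (
	"errors"

	"go.opentelemetry.io/collector/featuregate"
)

// enableCPUUsageMetricsGate switches the receiver from the deprecated
// *.cpu.utilization metrics to the *.cpu.usage ones.
var enableCPUUsageMetricsGate = featuregate.GlobalRegistry().MustRegister(
	"receiver.kubeletstats.enableCPUUsageMetrics",
	featuregate.StageAlpha,
	featuregate.WithRegisterDescription("Replaces the deprecated *.cpu.utilization metrics with *.cpu.usage"),
)

// validateMetricsConfig is a hypothetical start-up check: if the gate is on
// but a deprecated *.cpu.utilization metric is still enabled in the config,
// refuse to start so users have to acknowledge the upcoming change.
func validateMetricsConfig(utilizationMetricsEnabled bool) error {
	if enableCPUUsageMetricsGate.IsEnabled() && utilizationMetricsEnabled {
		return errors.New("*.cpu.utilization metrics are deprecated; switch to *.cpu.usage or disable the receiver.kubeletstats.enableCPUUsageMetrics feature gate")
	}
	return nil
}
```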
Thanks @TylerHelmuth! I will be off for the next few weeks, but I will pick it up once I'm back.
@TylerHelmuth @dmitryax I have tried the feature gate approach at #35139. Please have a look once you get the chance. You can find more details about the feature gate's plan in the PR's description.
… deprecation (#35139)

**Description:** This PR adds a feature gate as discussed at #27885 (comment). If the feature gate is enabled, the `container.cpu.utilization`, `k8s.pod.cpu.utilization` and `k8s.node.cpu.utilization` metrics will be disabled, being replaced by `container.cpu.usage`, `k8s.pod.cpu.usage` and `k8s.node.cpu.usage`.

### Feature gate schedule

- alpha: when enabled, the `.cpu.usage` metrics become enabled by default.
- beta: `.cpu.usage` metrics are enabled by default and any configuration enabling the deprecated `.cpu.utilization` metrics will fail. Explicitly disabling the feature gate provides the old (deprecated) behavior.
- stable: `.cpu.usage` metrics are enabled by default and the deprecated metrics are completely removed.
- Removed three releases after `stable`.

**Documentation:**

### How to test this

1. Use the following configuration:

```yaml
mode: daemonset

presets:
  kubeletMetrics:
    enabled: true

image:
  repository: otelcontribcol-dev
  tag: "latest"
  pullPolicy: IfNotPresent

command:
  name: otelcontribcol
  extraArgs: [--feature-gates=receiver.kubeletstats.enableCPUUsageMetrics]

config:
  exporters:
    debug:
      verbosity: normal
  receivers:
    kubeletstats:
      collection_interval: 10s
      auth_type: 'serviceAccount'
      endpoint: '${env:K8S_NODE_NAME}:10250'
      insecure_skip_verify: true
  service:
    pipelines:
      metrics:
        receivers: [kubeletstats]
        processors: [batch]
        exporters: [debug]
```

2. Ensure that only the `.cpu.usage` metrics are reported.
3. Disable the feature gate and check that only the `.cpu.utilization` metrics are reported.

---------

Signed-off-by: ChrsMark <chrismarkou92@gmail.com>
Component(s)
receiver/kubeletstats
Is your feature request related to a problem? Please describe.
The Kubeletstats Receiver currently uses `*.cpu.utilization` as the name for CPU metrics that report the CPUStats `UsageNanoCores` value. I believe that `UsageNanoCores` reports the actual amount of CPU being used, not the ratio of the amount being used out of a total limit. If this is true, then our use of `utilization` is not meeting semantic convention expectations.

I would like to have a discussion about what exactly `UsageNanoCores` represents and whether our metric naming needs updating.

Related to the discussion that started in #24905.