Feature/remove k8s cache #32539

gsantoro · 2022-07-28T16:44:55Z

What does this PR do?

Replaced the time-expiring cache in the Kubernetes module with a dictionary that never expires.

Some metrics in this internal cache were set only once at startup and not regularly updated, so they were no longer available once their cache entries expired. Replacing the internal cache with a dictionary that never expires keeps them available.
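
As a rough illustration of the idea described above (not the code in this PR), a non-expiring store can be as simple as a map guarded by a mutex. The type and method names below (`metricsStore`, `Set`, `Get`, `Delete`) and the key format are assumptions made for this sketch only.

```go
package main

import (
	"fmt"
	"sync"
)

// metricsStore is a hypothetical, simplified stand-in for the module's store:
// a plain map guarded by a mutex. Entries have no TTL and stay until deleted.
type metricsStore struct {
	mu     sync.RWMutex
	values map[string]float64
}

func newMetricsStore() *metricsStore {
	return &metricsStore{values: make(map[string]float64)}
}

// Set stores or overwrites a value; nothing ever expires it.
func (s *metricsStore) Set(key string, value float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[key] = value
}

// Get returns the value and whether the key is present.
func (s *metricsStore) Get(key string) (float64, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.values[key]
	return v, ok
}

// Delete removes a key explicitly, for example when a pod or node goes away.
func (s *metricsStore) Delete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.values, key)
}

func main() {
	store := newMetricsStore()
	store.Set("node-a/cpu.cores", 4) // set once at startup
	cores, ok := store.Get("node-a/cpu.cores")
	fmt.Println(cores, ok)
}
```

The trade-off of dropping expiry is that entries must be removed explicitly, otherwise the map grows as pods and nodes come and go.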

Why is it important?

It fixes an issue that caused the metrics kubernetes.container.cpu.usage.limit.pct, kubernetes.container.memory.usage.limit.pct, kubernetes.pod.cpu.usage.limit.pct and kubernetes.pod.memory.usage.limit.pct to be missing at times:

If the state_node and node metricsets are disabled:
- container with limits set
  - kubernetes.container.* metrics are present
- container without limits set
  - kubernetes.container.* metrics are missing. This is CORRECT since limits are not set on containers and we cannot use node metrics as backup option
- pod with all container limits set
  - kubernetes.pod.* metrics are missing (Bug to be fixed here)
- pod with all containers without limits set
  - kubernetes.pod.* metrics are missing. This is CORRECT since limits are not set on containers and we cannot use node metrics as backup option
If the state_node and node metricsets are enabled:
- container/pod with limits set
  - all 4 metrics are present
- container/pod without limits set
  - all 4 metrics are present but they use node limits instead of container limits. This is CORRECT since limits are not set on containers and we use node metrics as backup option
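
The cases above can be sketched as a small function, assuming the limit falls back to the node value when no container limit is set, and that the metric is omitted when neither is known. The function name `limitPct` and its parameters are hypothetical and are not taken from the module's code.

```go
package main

import "fmt"

// limitPct returns usage as a fraction of the effective limit.
// The second return value is false when no limit is known, in which
// case the caller would omit the *.usage.limit.pct metric entirely.
// All names here are illustrative assumptions.
func limitPct(usage, containerLimit, nodeCapacity float64) (float64, bool) {
	limit := containerLimit
	if limit <= 0 {
		// No limit set on the container: use the node value as backup,
		// which is only available when node/state_node metricsets run.
		limit = nodeCapacity
	}
	if limit <= 0 {
		// Neither a container limit nor a node value: nothing to report.
		return 0, false
	}
	return usage / limit, true
}

func main() {
	// Container limit set: the node value is not needed.
	fmt.Println(limitPct(0.5, 2, 0))
	// No container limit, node value known: fall back to the node.
	fmt.Println(limitPct(0.5, 0, 4))
	// No container limit and no node value: metric is omitted.
	fmt.Println(limitPct(0.5, 0, 0))
}
```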

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

If the state_node and node metricsets are disabled:
- container/pod with limits set
  - all 4 metrics are present
- container/pod without limits set
  - all 4 metrics are missing. This is CORRECT since limits are not set on containers and we cannot use node metrics as a backup option
If the state_node and node metricsets are enabled:
- container/pod with limits set
  - all 4 metrics are present
- container/pod without limits set
  - all 4 metrics are present, but they use node limits instead of container limits. This is CORRECT since limits are not set on containers and we use node metrics as a backup option

How to test this PR locally

Run Metricbeat on Kubernetes, using either Kind with multiple nodes or a k8s cluster on GCP/AWS.
Add/remove test pods and watch the metrics appear/disappear for those pods, using this manifest:
nginx-daemons.yaml.zip
Import the following dashboard.ndjson.zip and check the metrics against the author's checklist.

Related issues

Closes Metricbeat: Missing kubernetes.pod cpu and memory usage percentage on non-leader nodes #32232

Use cases

Screenshots

When a pod is deleted (as in the following screenshot at 11:09:37), you might see the container still reporting CPU/memory usage but not limit.pct. When the pod is added back (as in the following screenshot at 11:12:31), you might see the container with CPU usage = 0 and limit.pct = 0 for a few seconds, until CPU usage is reported again and limit.pct becomes non-zero.

Logs

elasticmachine · 2022-07-28T17:07:57Z

💚 Build Succeeded

Build stats

Start Time: 2022-08-11T14:42:46.014+0000
Duration: 57 min 49 sec

Test stats 🧪

Test	Results
Failed	0
Passed	3630
Skipped	873
Total	4503

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

dev-tools/kubernetes/metricbeat/manifest.run.dev.yaml

metricbeat/module/kubernetes/util/metrics_storage.go

metricbeat/module/kubernetes/container/data.go

metricbeat/module/kubernetes/util/metrics_storage.go

ChrsMark

Thanks for the changes! I have already left some thoughts about how to structure the "repo"/"storage" to tackle some of the open questions. I'm not sure if my suggestion is sane, but it may be worth checking.

dev-tools/kubernetes/metricbeat/manifest.debug.dev.yaml

metricbeat/module/kubernetes/util/metrics_storage.go

metricbeat/module/kubernetes/pod/data.go

metricbeat/module/kubernetes/container/data.go

gsantoro · 2022-08-01T13:41:57Z

@ChrsMark

type MetricsRepo struct {
	sync.RWMutex
	metrics *MetricsStore
}

// MetricsStore are indexed per NodeName
type MetricsStore struct {
	Store map[key]nodeStore
}

type nodeStore struct {
	nodeMetrics     NodeMetrics
	containersStore map[key]containerMetrics // key is Container name in combination with Pod name and namespace
	podStore        map[key]podMetrics       // key is Pod name, which is unique per namespace
}

type NodeMetrics struct {
        metricX int
        metricY int
}

type containerMetrics struct {
        metricZ int
        metricW int
}

type podMetrics struct {
        metricP int
        metricT int
}

Assuming the comments on the containersStore and podStore were actually swapped. Should be correct in the above code.

Thanks for the suggestion, I have a couple of counter points:

In theory I like the hierarchical structure of metrics in (node_name, (nodeMetrics, containerMetrics, podMetrics)) so that if you delete a node you delete all the metrics associated. Unfortunately this is not fully hierarchical since if I delete a node, I need to delete the associated key for that node but also all the container metrics for that pod. I am wondering if we should have 3 levels of nesting node/pod/container

ChrsMark · 2022-08-01T14:10:58Z

@ChrsMark

type MetricsRepo struct {
	sync.RWMutex
	metrics *MetricsStore
}

// MetricsStore are indexed per NodeName
type MetricsStore struct {
	Store map[key]nodeStore
}

type nodeStore struct {
	nodeMetrics     NodeMetrics
	containersStore map[key]containerMetrics // key is Container name in combination with Pod name and namespace
	podStore        map[key]podMetrics       // key is Pod name, which is unique per namespace
}

type NodeMetrics struct {
        metricX int
        metricY int
}

type containerMetrics struct {
        metricZ int
        metricW int
}

type podMetrics struct {
        metricP int
        metricT int
}

Assuming the comments on the containersStore and podStore were actually swapped. Should be correct in the above code.

Yeap, my bad for the swapped comments 🤦🏼‍♂️ .

Thanks for the suggestion, I have a couple of counter points:

In theory I like the hierarchical structure of metrics in (node_name, (nodeMetrics, containerMetrics, podMetrics)) so that if you delete a node you delete all the metrics associated. Unfortunately this is not fully hierarchical since if I delete a node, I need to delete the associated key for that node but also all the container metrics for that pod. I am wondering if we should have 3 levels of nesting node/pod/container

Hmm it depends on how you store the metrics. When you store the container level metrics do you know the NodeName? I guess yes. In that case can't you attach those container metrics under a container key which is under a node key?

For example:

{
  nodeA: {
     nodeMetrics: {...},
     containersStore: {
         containerA: {...},
         containerB: {...},
         ......
     },
     podStore: {
         podA: {...},
         podB: {...},
     },
  }, nodeB: {...}
}

Would something like the above work?

Actually having a 3-level nesting could also work. It should be faster in case of Pod deletion where you would skip looping over the containers and you would directly delete the attached containers' metrics for the given Pod, right?
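
A minimal sketch of the three-level idea (node, then pod, then container), assuming pods are keyed by namespace/name. It only illustrates why pod deletion needs no loop over containers; the names `nestedRepo`, `SetContainer`, `DeletePod` and `ContainerCount` are hypothetical and do not come from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

type containerMetrics struct {
	cpuLimit float64
	memLimit float64
}

// podEntry holds the containers that belong to one pod.
type podEntry struct {
	containers map[string]containerMetrics
}

// nodeEntry holds the pods scheduled on one node.
type nodeEntry struct {
	pods map[string]*podEntry
}

// nestedRepo is a hypothetical node -> pod -> container store.
type nestedRepo struct {
	mu    sync.RWMutex
	nodes map[string]*nodeEntry
}

func newNestedRepo() *nestedRepo {
	return &nestedRepo{nodes: make(map[string]*nodeEntry)}
}

// SetContainer creates the intermediate node and pod levels on demand.
func (r *nestedRepo) SetContainer(node, pod, container string, m containerMetrics) {
	r.mu.Lock()
	defer r.mu.Unlock()
	n, ok := r.nodes[node]
	if !ok {
		n = &nodeEntry{pods: make(map[string]*podEntry)}
		r.nodes[node] = n
	}
	p, ok := n.pods[pod]
	if !ok {
		p = &podEntry{containers: make(map[string]containerMetrics)}
		n.pods[pod] = p
	}
	p.containers[container] = m
}

// DeletePod drops the pod and, with it, all of its container metrics
// in a single map delete.
func (r *nestedRepo) DeletePod(node, pod string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if n, ok := r.nodes[node]; ok {
		delete(n.pods, pod)
	}
}

// ContainerCount returns how many containers are stored under a node.
func (r *nestedRepo) ContainerCount(node string) int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	count := 0
	if n, ok := r.nodes[node]; ok {
		for _, p := range n.pods {
			count += len(p.containers)
		}
	}
	return count
}

func main() {
	repo := newNestedRepo()
	repo.SetContainer("nodeA", "default/podA", "containerA", containerMetrics{cpuLimit: 1})
	repo.SetContainer("nodeA", "default/podA", "containerB", containerMetrics{cpuLimit: 2})
	repo.DeletePod("nodeA", "default/podA")
	fmt.Println(repo.ContainerCount("nodeA"))
}
```

Deleting a node works the same way one level up: removing the node key releases every pod and container stored beneath it.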

tetianakravchenko

Could you please also update the documentation at
https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-kubernetes.html#_cpu_7
to note that node.pct might be missing in some cases?

metricbeat/module/kubernetes/util/metrics_repo.go

metricbeat/module/kubernetes/container/data.go

metricbeat/module/kubernetes/pod/pod_test.go

metricbeat/module/kubernetes/util/kubernetes.go

metricbeat/module/kubernetes/util/metrics_repo.go

ChrsMark · 2022-08-04T13:46:22Z

Btw we should make sure that this feature supports cases like the following:

- module: kubernetes
  metricsets:
    - state_pod
  period: 10s
  hosts: ["0.0.0.0:8081"]
  add_metadata: false

- module: kubernetes
  metricsets:
    - state_container
  period: 10s
  hosts: ["0.0.0.0:8081"]
  add_metadata: false

This is what Elastic Agent will produce.

See the description of #25640 for more details and #25640 (comment).

dev-tools/kubernetes/Tiltfile

mergify · 2022-08-09T10:11:31Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature/remove_k8s_cache upstream/feature/remove_k8s_cache
git merge upstream/main
git push upstream feature/remove_k8s_cache

gsantoro · 2022-08-11T15:54:01Z

/package

* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled
(cherry picked from commit 5503761)
# Conflicts:
# metricbeat/module/kubernetes/container/container.go
# metricbeat/module/kubernetes/container/container_test.go
# metricbeat/module/kubernetes/container/data.go
# metricbeat/module/kubernetes/pod/data.go
# metricbeat/module/kubernetes/pod/pod.go
# metricbeat/module/kubernetes/pod/pod_test.go
# metricbeat/module/kubernetes/state_cronjob/state_cronjob.go
# metricbeat/module/kubernetes/state_daemonset/state_daemonset.go
# metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
# metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
# metricbeat/module/kubernetes/util/kubernetes.go

* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled
(cherry picked from commit 5503761)
# Conflicts:
# metricbeat/module/kubernetes/container/container.go
# metricbeat/module/kubernetes/container/container_test.go
# metricbeat/module/kubernetes/container/data.go
# metricbeat/module/kubernetes/pod/data.go
# metricbeat/module/kubernetes/pod/pod.go
# metricbeat/module/kubernetes/pod/pod_test.go
# metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
# metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go

* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled
(cherry picked from commit 5503761)
# Conflicts:
# metricbeat/module/kubernetes/container/data.go
# metricbeat/module/kubernetes/pod/data.go

* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled
(cherry picked from commit 5503761)

gsantoro · 2022-08-15T12:25:17Z

@Mergifyio backport 8.1.0

mergify · 2022-08-15T12:25:20Z

backport 8.1.0

❌ No backport have been created

Backport to branch 8.1.0 failed: Branch not found

gsantoro · 2022-08-15T12:27:29Z

@Mergifyio backport 8.1

gsantoro · 2022-08-15T12:28:00Z

@Mergifyio backport 8.0

mergify · 2022-08-15T12:28:23Z

backport 8.1

✅ Backports have been created

#32690 Feature/remove k8s cache (backport #32539) has been created for branch 8.1

* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled
(cherry picked from commit 5503761)
# Conflicts:
# metricbeat/module/kubernetes/container/container.go
# metricbeat/module/kubernetes/container/container_test.go
# metricbeat/module/kubernetes/container/data.go
# metricbeat/module/kubernetes/pod/data.go
# metricbeat/module/kubernetes/pod/pod.go
# metricbeat/module/kubernetes/pod/pod_test.go
# metricbeat/module/kubernetes/state_cronjob/state_cronjob.go
# metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
# metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
# metricbeat/module/kubernetes/util/kubernetes.go

mergify · 2022-08-15T12:29:07Z

backport 8.0

✅ Backports have been created

#32691 [8.0](backport #32539) Feature/remove k8s cache has been created for branch 8.0

gsantoro · 2022-08-15T13:59:33Z

@Mergifyio backport 8.1

mergify · 2022-08-15T13:59:48Z

backport 8.1

✅ Backports have been created

#32690 Feature/remove k8s cache (backport #32539) has been created for branch 8.1

* Feature/remove k8s cache (elastic#32539)
Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>

* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled
(cherry picked from commit 5503761)
Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>

* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled

gsantoro added the bug and Team:Cloudnative-Monitoring labels Jul 28, 2022

gsantoro requested a review from a team July 28, 2022 16:44

gsantoro self-assigned this Jul 28, 2022

gsantoro requested review from a team as code owners July 28, 2022 16:44

gsantoro requested review from belimawr and fearful-symmetry and removed request for a team July 28, 2022 16:44

botelastic bot added and then removed the needs_team label Jul 28, 2022

ChrsMark previously requested changes Jul 29, 2022

gsantoro requested review from ChrsMark and tetianakravchenko August 1, 2022 08:33

ChrsMark reviewed Aug 1, 2022

tetianakravchenko reviewed Aug 1, 2022

gsantoro requested review from ChrsMark and tetianakravchenko August 2, 2022 18:08

ChrsMark reviewed Aug 3, 2022

metricbeat/module/kubernetes/util/metrics_repo.go (outdated)

ChrsMark reviewed Aug 3, 2022

metricbeat/module/kubernetes/util/metrics_repo.go (outdated)

ChrsMark reviewed Aug 3, 2022

metricbeat/module/kubernetes/util/metrics_repo.go (outdated)

gsantoro commented Aug 4, 2022

dev-tools/kubernetes/Tiltfile (outdated)

gsantoro requested a review from ChrsMark August 10, 2022 16:46

gsantoro added 2 commits August 11, 2022 12:19

coresLimit = nodeCores only if nodeCores > 0 & coresLimit > nodeCores

3fed429

tmp cache expiration to 1000h

67fa04d

gsantoro added the backport-7.17 and backport-v8.4.0 labels Aug 11, 2022

gsantoro mentioned this pull request Aug 11, 2022

Improve scalability of Kubernetes module in metricbeat #32662

Open

gsantoro merged commit 5503761 into elastic:main Aug 12, 2022

gsantoro deleted the feature/remove_k8s_cache branch August 12, 2022 11:10

mergify bot mentioned this pull request Aug 12, 2022

[7.17](backport #32539) Feature/remove k8s cache #32667

Merged

mergify bot mentioned this pull request Aug 12, 2022

[8.2](backport #32539) Feature/remove k8s cache #32668

Closed

mergify bot mentioned this pull request Aug 12, 2022

[8.3](backport #32539) Feature/remove k8s cache #32669

Closed

mergify bot mentioned this pull request Aug 12, 2022

[8.4](backport #32539) Feature/remove k8s cache #32670

Merged

mergify bot mentioned this pull request Aug 15, 2022

[8.1](backport #32539) Feature/remove k8s cache #32690

Closed

mergify bot mentioned this pull request Aug 15, 2022

[8.0](backport #32539) Feature/remove k8s cache #32691

Closed

gsantoro mentioned this pull request Aug 18, 2022

added extra docs on metrics for cpu/memory for pod/container in the kubernetes module #32733

Merged

2 tasks

v1v pushed a commit to v1v/beats that referenced this pull request Aug 22, 2022

[7.17](backport elastic#32539) Feature/remove k8s cache (elastic#32667)

3fdb372

* Feature/remove k8s cache (elastic#32539) Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>
