Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/remove k8s cache #32539

Merged
merged 41 commits into from
Aug 12, 2022
Merged

Conversation

gsantoro
Copy link
Contributor

@gsantoro gsantoro commented Jul 28, 2022

What does this PR do?

Replaced time expiring cache in the Kubernetes module with a dictionary that never expires.

Since some metrics in this internal cache were only set once at startup and not regularly updated, we needed to replace the internal cache with a dictionary that never expires.

Why is it important?

It fixes an issue that caused some metrics kubernetes.container.cpu.usage.limit.pct and kubernetes.container.memory.usage.limit.pct, kubernetes.pod.memory.usage.limit.pct and kubernetes.pod.cpu.usage.limit.pct to be missing at times:

  • if state_node, node metricsets are disabled,
    • container with limits set
      • kubernetes.container.* metrics are present
    • container without limits set
      • kubernetes.container.* metrics are missing. This is CORRECT since limits are not set on containers and we cannot use node metrics as backup option
    • pod with all container limits set
      • kubernetes.pod.* metrics are missing (Bug to be fixed here)
    • pod with all containers without limits set
      • kubernetes.pod.* metrics are missing. This is CORRECT since limits are not set on containers and we cannot use node metrics as backup option
  • if state_node, node metricsets are enabled,
    • container/pod with limits set
      • all 4 metrics are present
    • container/pod without limits set
      • all 4 metrics are present but they use node limits instead of container limits. This is CORRECT since limits are not set on containers and we use node metrics as backup option

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • if state_node, node metricsets are disabled,
    • container/pod with limit sets
      • all 4 metrics are present
    • container/pod without limit sets
      • all 4 metrics metrics are missing. This is CORRECT since limits are not set on containers and we cannot use node metrics as backup option
  • if state_node, node metricsets are enabled,
    • container/pod with limit sets
      • all 4 metrics metrics are present
    • container/pod without limit sets
      • all 4 metrics metrics are present but they use node limits instead of container limits. This is CORRECT since limits are not set on containers and we use node metrics as backup option

How to test this PR locally

  • Run metricbeat on kubernetes with either Kind with multiple nodes or a k8s cluster on GCP/AWS
  • Add/remove test pods and see metrics appear/disappear for those pods using this manifest
    nginx-daemons.yaml.zip
  • import the following dashboard.ndjson.zip and check for metrics according to author's checklist

Related issues

Use cases

Screenshots

When a pod is deleted (like in the following screenshot at 11:09:37) you might see the container still reporting a cpu/memory usage but not the limit.pct. When the pod is added back (like in the following screenshot at 11:12:31) you might see the container with cpu usage = 0 and limit.pct = 0 for a few seconds until the cpu is reported again and limit.pct to be different from 0.

Screenshot 2022-08-01 at 11 13 21

Logs

@gsantoro gsantoro added bug Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team labels Jul 28, 2022
@gsantoro gsantoro requested a review from a team July 28, 2022 16:44
@gsantoro gsantoro self-assigned this Jul 28, 2022
@gsantoro gsantoro requested review from a team as code owners July 28, 2022 16:44
@gsantoro gsantoro requested review from belimawr and fearful-symmetry and removed request for a team July 28, 2022 16:44
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jul 28, 2022
@elasticmachine
Copy link
Collaborator

elasticmachine commented Jul 28, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-08-11T14:42:46.014+0000

  • Duration: 57 min 49 sec

Test stats 🧪

Test Results
Failed 0
Passed 3630
Skipped 873
Total 4503

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

dev-tools/kubernetes/metricbeat/manifest.run.dev.yaml Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/metrics_storage.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/metrics_storage.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/metrics_storage.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/metrics_storage.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/metrics_storage.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/metrics_storage.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/metrics_storage.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/container/data.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/metrics_storage.go Outdated Show resolved Hide resolved
Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! I left some thoughts already about how to struct the "repo"/"storage" to tackle some of the open questions. Not sure if my suggestion is sane but maybe it worths checking it.

dev-tools/kubernetes/metricbeat/manifest.debug.dev.yaml Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/metrics_storage.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/pod/data.go Outdated Show resolved Hide resolved
@gsantoro
Copy link
Contributor Author

gsantoro commented Aug 1, 2022

@ChrsMark

type MetricsRepo struct {
	sync.RWMutex
	metrics *MetricsStore
}

// MetricsStore are indexed per NodeName
type MetricsStore struct {
	Store map[key]nodeStore
}

type nodeStore struct {
   nodeMetrics NodeMetrics
   containersStore map[key]containerMetrics  // key is Container name in combination with Pod name and namespace
   podStore          map[key]podMetrics.           // key is Pod name, which is unique per namespace}

type NodeMetrics struct {
        metricX int
        metricY int
}

type containerMetrics struct {
        metricZ int
        metricW int
}

type podMetrics struct {
        metricP int
        metricT int
}

Assuming the comments on the containersStore and podStore were actually swapped. Should be correct in the above code.

Thanks for the suggestion, I have a couple of counter points:

  • In theory I like the hierarchical structure of metrics in (node_name, (nodeMetrics, containerMetrics, podMetrics)) so that if you delete a node you delete all the metrics associated. Unfortunately this is not fully hierarchical since if I delete a node, I need to delete the associated key for that node but also all the container metrics for that pod. I am wondering if we should have 3 levels of nesting node/pod/container

@ChrsMark
Copy link
Member

ChrsMark commented Aug 1, 2022

@ChrsMark

type MetricsRepo struct {
	sync.RWMutex
	metrics *MetricsStore
}

// MetricsStore are indexed per NodeName
type MetricsStore struct {
	Store map[key]nodeStore
}

type nodeStore struct {
   nodeMetrics NodeMetrics
   containersStore map[key]containerMetrics  // key is Container name in combination with Pod name and namespace
   podStore          map[key]podMetrics.           // key is Pod name, which is unique per namespace}

type NodeMetrics struct {
        metricX int
        metricY int
}

type containerMetrics struct {
        metricZ int
        metricW int
}

type podMetrics struct {
        metricP int
        metricT int
}

Assuming the comments on the containersStore and podStore were actually swapped. Should be correct in the above code.

Yeap, my bad for the swapped comments 🤦🏼‍♂️ .

Thanks for the suggestion, I have a couple of counter points:

  • In theory I like the hierarchical structure of metrics in (node_name, (nodeMetrics, containerMetrics, podMetrics)) so that if you delete a node you delete all the metrics associated. Unfortunately this is not fully hierarchical since if I delete a node, I need to delete the associated key for that node but also all the container metrics for that pod. I am wondering if we should have 3 levels of nesting node/pod/container

Hmm it depends on how you store the metrics. When you store the container level metrics do you know the NodeName? I guess yes. In that case can't you attach those container metrics under a container key which is under a node key?

For example:

{
  nodeA: {
     nodeMetrics: {...},
     containersStore: {
         containerA: {...},
         containerB: {...},
         ......
     },
     podStore: {
         podA: {...},
         podB: {...},
     },
  }, nodeB: {...}
}

Would sth like the above work?

Actually having a 3-level nesting could also work. It should be faster in case of Pod deletion where you would skip looping over the containers and you would directly delete the attached containers' metrics for the given Pod, right?

Copy link
Contributor

@tetianakravchenko tetianakravchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please as well update documentation:
https://www.elastic.co/guide/en/beats/metricbeat/current/exported-fields-kubernetes.html#_cpu_7
with info that node.pct might be missed in some cases

metricbeat/module/kubernetes/util/metrics_repo.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/container/data.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/pod/pod_test.go Outdated Show resolved Hide resolved
metricbeat/module/kubernetes/util/kubernetes.go Outdated Show resolved Hide resolved
@ChrsMark
Copy link
Member

ChrsMark commented Aug 4, 2022

Btw we should make sure that this feature supports cases like the following:

- module: kubernetes
  metricsets:
    - state_pod
  period: 10s
  hosts: ["0.0.0.0:8081"]
  add_metadata: false

- module: kubernetes
metricsets:
- state_container
period: 10s
hosts: ["0.0.0.0:8081"]
add_metadata: false

This is what Elastic Agent will produce.

See the description of #25640 for more details and #25640 (comment).

@mergify
Copy link
Contributor

mergify bot commented Aug 9, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature/remove_k8s_cache upstream/feature/remove_k8s_cache
git merge upstream/main
git push upstream feature/remove_k8s_cache

@gsantoro gsantoro requested a review from ChrsMark August 10, 2022 16:46
@gsantoro gsantoro added backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.4.0 Automated backport with mergify labels Aug 11, 2022
@gsantoro
Copy link
Contributor Author

/package

@gsantoro gsantoro merged commit 5503761 into elastic:main Aug 12, 2022
@gsantoro gsantoro deleted the feature/remove_k8s_cache branch August 12, 2022 11:10
mergify bot pushed a commit that referenced this pull request Aug 12, 2022
* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled

(cherry picked from commit 5503761)

# Conflicts:
#	metricbeat/module/kubernetes/container/container.go
#	metricbeat/module/kubernetes/container/container_test.go
#	metricbeat/module/kubernetes/container/data.go
#	metricbeat/module/kubernetes/pod/data.go
#	metricbeat/module/kubernetes/pod/pod.go
#	metricbeat/module/kubernetes/pod/pod_test.go
#	metricbeat/module/kubernetes/state_cronjob/state_cronjob.go
#	metricbeat/module/kubernetes/state_daemonset/state_daemonset.go
#	metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
#	metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
#	metricbeat/module/kubernetes/util/kubernetes.go
mergify bot pushed a commit that referenced this pull request Aug 12, 2022
* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled

(cherry picked from commit 5503761)

# Conflicts:
#	metricbeat/module/kubernetes/container/container.go
#	metricbeat/module/kubernetes/container/container_test.go
#	metricbeat/module/kubernetes/container/data.go
#	metricbeat/module/kubernetes/pod/data.go
#	metricbeat/module/kubernetes/pod/pod.go
#	metricbeat/module/kubernetes/pod/pod_test.go
#	metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
#	metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
mergify bot pushed a commit that referenced this pull request Aug 12, 2022
* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled

(cherry picked from commit 5503761)

# Conflicts:
#	metricbeat/module/kubernetes/container/data.go
#	metricbeat/module/kubernetes/pod/data.go
mergify bot pushed a commit that referenced this pull request Aug 12, 2022
* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled

(cherry picked from commit 5503761)
@gsantoro
Copy link
Contributor Author

@Mergifyio backport 8.1.0

@mergify
Copy link
Contributor

mergify bot commented Aug 15, 2022

backport 8.1.0

❌ No backport have been created

  • Backport to branch 8.1.0 failed: Branch not found

@gsantoro
Copy link
Contributor Author

@Mergifyio backport 8.1

@gsantoro
Copy link
Contributor Author

@Mergifyio backport 8.0

@mergify
Copy link
Contributor

mergify bot commented Aug 15, 2022

backport 8.1

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Aug 15, 2022
* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled

(cherry picked from commit 5503761)

# Conflicts:
#	metricbeat/module/kubernetes/container/container.go
#	metricbeat/module/kubernetes/container/container_test.go
#	metricbeat/module/kubernetes/container/data.go
#	metricbeat/module/kubernetes/pod/data.go
#	metricbeat/module/kubernetes/pod/pod.go
#	metricbeat/module/kubernetes/pod/pod_test.go
#	metricbeat/module/kubernetes/state_cronjob/state_cronjob.go
#	metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
#	metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
#	metricbeat/module/kubernetes/util/kubernetes.go
mergify bot pushed a commit that referenced this pull request Aug 15, 2022
* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled

(cherry picked from commit 5503761)

# Conflicts:
#	metricbeat/module/kubernetes/container/container.go
#	metricbeat/module/kubernetes/container/container_test.go
#	metricbeat/module/kubernetes/container/data.go
#	metricbeat/module/kubernetes/pod/data.go
#	metricbeat/module/kubernetes/pod/pod.go
#	metricbeat/module/kubernetes/pod/pod_test.go
#	metricbeat/module/kubernetes/state_cronjob/state_cronjob.go
#	metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
#	metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
#	metricbeat/module/kubernetes/util/kubernetes.go
@mergify
Copy link
Contributor

mergify bot commented Aug 15, 2022

backport 8.0

✅ Backports have been created

@gsantoro
Copy link
Contributor Author

@Mergifyio backport 8.1

@mergify
Copy link
Contributor

mergify bot commented Aug 15, 2022

backport 8.1

✅ Backports have been created

v1v pushed a commit to v1v/beats that referenced this pull request Aug 22, 2022
* Feature/remove k8s cache (elastic#32539)

Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>
cmacknz pushed a commit that referenced this pull request Aug 24, 2022
* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled

(cherry picked from commit 5503761)

Co-authored-by: Giuseppe Santoro <giuseppe.santoro@elastic.co>
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* replaced internal expiring cache with non expiring dictionary in memory
* fixed a bug that prevented to export pod/container metrics when node/state_node metricsets were disabled
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.0.0 Automated backport with mergify backport-v8.1.0 Automated backport with mergify backport-v8.2.0 Automated backport with mergify backport-v8.3.0 Automated backport with mergify backport-v8.4.0 Automated backport with mergify bug Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metricbeat: Missing kubernetes.pod cpu and memory usage percentage on non-leader nodes
4 participants