Use working set bytes if usage bytes is zero #25428

brianharwell · 2021-04-29T13:17:43Z

What does this PR do?

This will use workingSetBytes for the memory usage if usageMem is zero

Why is it important?

We have Windows containers running in Kubernetes but the kubelet only reports the working set bytes for a pod and not the memory usage bytes. As a result the field kubernetes.pod.memory.usage.limit.pct is reported as 0 even though the pod has a memory limit.

This is important because without kubernetes.pod.memory.usage.limit.pct we cannot alert or monitor based on how close the pod's memory usage is to its memory limit.

The memory usage bytes are reported for Linux pods.

Here is a sample from the kubelet. Metricbeat uses this JSON to report pod metrics.

{
   "podRef":{
      "name":"metricbeat-node-kkl29",
      "namespace":"logging",
      "uid":"5be7cd99-712f-47f0-84e1-eba75f00f671"
   },
   "startTime":"2021-04-21T19:56:15Z",
   "containers":[
      {
         "name":"metricbeat",
         "startTime":"2021-04-21T19:56:17Z",
         "cpu":{
            "time":"2021-04-22T18:34:31Z",
            "usageNanoCores":21051518,
            "usageCoreNanoSeconds":367953125000
         },
         "memory":{
            "time":"2021-04-22T18:34:31Z",
            "workingSetBytes":196808704
         },
         "rootfs":{
            "time":"2021-04-22T18:34:31Z",
            "availableBytes":43572236288,
            "capacityBytes":106846744576,
            "usedBytes":0
         },
         "logs":{
            "time":"2021-04-22T18:34:32Z",
            "availableBytes":43572236288,
            "capacityBytes":106846744576,
            "usedBytes":0,
            "inodesUsed":0
         }
      }
   ],
   "cpu":{
      "time":"2021-04-21T19:56:17Z",
      "usageNanoCores":21051518,
      "usageCoreNanoSeconds":367953125000
   },
   "memory":{
      "time":"2021-04-22T18:34:31Z",
      "availableBytes":0,
      "usageBytes":0,
      "workingSetBytes":196808704,
      "rssBytes":0,
      "pageFaults":0,
      "majorPageFaults":0
   },
   "network":{
      "time":"2021-04-22T18:34:32Z",
      "name":"e94f1b6bc8728316148684734375542f05a93e9a5df43fa8392f08af9bf68e1b_cbr0",
      "rxBytes":302076010,
      "txBytes":70445444,
      "interfaces":[
         {
            "name":"e94f1b6bc8728316148684734375542f05a93e9a5df43fa8392f08af9bf68e1b_cbr0",
            "rxBytes":302076010,
            "txBytes":70445444
         }
      ]
   },
   "volume":[
      {
         "time":"2021-04-21T19:56:26Z",
         "availableBytes":43691851776,
         "capacityBytes":106846744576,
         "usedBytes":5229,
         "inodesFree":0,
         "inodes":0,
         "inodesUsed":0,
         "name":"config"
      },
      {
         "time":"2021-04-21T19:56:26Z",
         "availableBytes":43691851776,
         "capacityBytes":106846744576,
         "usedBytes":6050,
         "inodesFree":0,
         "inodes":0,
         "inodesUsed":0,
         "name":"metricbeat-token-dnf9s"
      }
   ],
   "ephemeral-storage":{
      "time":"2021-04-22T18:34:32Z",
      "availableBytes":43572236288,
      "capacityBytes":106846744576,
      "usedBytes":5229,
      "inodesUsed":0
   }
}
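
As a rough illustration of the fallback described above, the following Go sketch decodes the pod-level memory object from a kubelet document like this one and picks workingSetBytes when usageBytes is zero. The type and function names (podMemory, podStats, effectiveMemory, parsePodMemory) are hypothetical and are not taken from Metricbeat's code; only the JSON field names come from the sample.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podMemory mirrors two fields of the pod-level "memory" object in the sample.
type podMemory struct {
	UsageBytes      uint64 `json:"usageBytes"`
	WorkingSetBytes uint64 `json:"workingSetBytes"`
}

// podStats holds only the part of the pod document this sketch needs.
type podStats struct {
	Memory podMemory `json:"memory"`
}

// parsePodMemory (hypothetical helper) decodes the pod-level memory object.
func parsePodMemory(raw []byte) (podMemory, error) {
	var s podStats
	if err := json.Unmarshal(raw, &s); err != nil {
		return podMemory{}, err
	}
	return s.Memory, nil
}

// effectiveMemory (hypothetical helper) returns usageBytes when it is
// reported as non-zero and falls back to workingSetBytes otherwise.
func effectiveMemory(m podMemory) uint64 {
	if m.UsageBytes > 0 {
		return m.UsageBytes
	}
	return m.WorkingSetBytes
}

func main() {
	// Trimmed copy of the pod-level memory object from the sample above.
	raw := []byte(`{"memory":{"usageBytes":0,"workingSetBytes":196808704}}`)
	m, err := parsePodMemory(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(effectiveMemory(m)) // prints 196808704
}
```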

Checklist

I have zero experience with Go. I didn't want to ask someone else to make the change because this seemed rather simple.

[x] My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

Logic is correct

How to test this PR locally

Test using a Linux container and a Windows container. I can assist with both of these.

Related issues

Elastic Support Ticket #00711427

Use cases

N/A

Screenshots

Logs

elasticmachine · 2021-04-29T13:22:17Z

💚 Build Succeeded

Build stats

Build Cause: jsoriano commented: /test
Start Time: 2021-04-30T15:44:31.671+0000
Duration: 89 min 35 sec
Commit: f5bad30

Test stats 🧪

Test	Results
Failed	0
Passed	8200
Skipped	2350
Total	10550

💚 Flaky test report

Tests succeeded.

elasticmachine · 2021-04-29T14:01:07Z

Pinging @elastic/integrations (Team:Integrations)

jsoriano · 2021-04-29T14:01:26Z

/test

jsoriano · 2021-04-29T14:02:39Z

Hey @brianharwell, thanks for opening this PR! Could you please also add a changelog entry in CHANGELOG.next.asciidoc?

brianharwell · 2021-04-29T14:08:58Z

@jsoriano Done.

I can help test if you want.

jsoriano · 2021-04-29T16:15:32Z

@brianharwell I have been thinking about this and I have conflicting thoughts about what to do with these fields. I don't think that we can consider working set to be the same as the usage memory. These are not reported as the same by docker or kubernetes and Metricbeat is not reporting it as the same in this case or the docker case. We had similar discussions when adding these fields to the docker module, in #12172.

But, if a pod is killed when its working set grows over its limit, I guess that we can use this value to calculate the percentage.

Do you know if that is the case? Is a pod killed if its working set grows over its limit?

jsoriano · 2021-04-29T16:16:20Z

CHANGELOG.asciidoc

@@ -46,6 +46,7 @@ https://github.com/elastic/beats/compare/v7.12.0...v7.12.1[View commits]
 *Metricbeat*

 - Ignore unsupported derive types for filesystem metricset. {issue}22501[22501] {pull}24502[24502]
+- Use working set bytes when memory usage is not reported. {pull}25428[25428]


This line should be added in CHANGELOG.next.asciidoc.

Whoops! Sorry!

brianharwell · 2021-04-29T19:42:55Z

@brianharwell I have been thinking about this and I have conflicting thoughts about what to do with these fields. I don't think that we can consider working set to be the same as the usage memory. These are not reported as the same by docker or kubernetes and Metricbeat is not reporting it as the same in this case or the docker case. We had similar discussions when adding these fields to the docker module, in #12172.

I read through that discussion and I understand your point. This comment stuck out to me: #12172 (comment)

Kubernetes supports limits for Windows pods.

But, if a pod is killed when its working set grows over its limit, I guess that we can use this value to calculate the percentage.

Do you know if that is the case? Is a pod killed if its working set grows over its limit?

Yes, I have test cases that prove that Kubernetes will terminate and restart a pod when the pod goes over a memory threshold.

I created a console application that adds 1M GUIDs to a list, waits a few seconds, adds another 1M GUIDs, and continues until it consumes as much memory as possible.

It appears that the pod is terminated when the kubernetes.pod.memory.working_set.bytes gets to about 72% of the memory limit. I tried this with a 3,000Mi memory limit and a 6,000Mi memory limit.

Here are the console logs of my test app...

And here is the memory usage chart from Kibana...

And here is output from describing the pod...

Containers:
  memory:
    Container ID:  docker://79337d3f0ce1881213ee78aec5d6067f99bcc367356efc8a4df02637b26ba002
    Image:         brianharwell/memorystress:1809
    Image ID:      docker-pullable://brianharwell/memorystress@sha256:272efaef128ad3d3e1ee1ba18e7dff9f8e6958dc662a2453bbba7a2b25a570cb
    Port:          <none>
    Host Port:     <none>
    Command:
    State:          Running
      Started:      Thu, 29 Apr 2021 13:39:25 -0500
    Last State:     Terminated
      Exit Code:    -532462766
      Started:      Thu, 29 Apr 2021 13:37:10 -0500
      Finished:     Thu, 29 Apr 2021 13:39:17 -0500
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     500m
      memory:  3000Mi
    Requests:
      cpu:     500m
      memory:  3000Mi

Based on this, on Windows the working set bytes may not be synonymous with usage bytes. But the biggest issue, and the reason for this PR, is that I have no way to chart or alert on memory usage compared to the limit. So I am open to ideas.

jsoriano · 2021-04-30T08:55:57Z

Thanks for the investigation, then I think that it makes sense to calculate the percentages based on the working set.

I would propose to do the following:

- Keep the OS-specific metrics as they are, as this is how they are reported by docker and kubernetes and they don't mean exactly the same.
- But don't report the zero-valued ones, this way it'd be less confusing to see so many zeroed values. It'd be to conditionally set these values, something like this:

if usageMem > 0 {
	// This is a Linux container.
	podEvent["memory"] = common.MapStr{
		"usage": common.MapStr{
			"bytes": usageMem,
		},
		"available": common.MapStr{
			"bytes": availMem,
		},
		"rss": common.MapStr{
			"bytes": rss,
		},
		"page_faults":       pageFaults,
		"major_page_faults": majorPageFaults,
	}
}
if workingSet > 0 {
	// This is a Windows container.
	podEvent["memory"] = common.MapStr{
		"working_set": common.MapStr{
			"bytes": workingSet,
		},
	}
}

- Calculate the percentages with the OS-specific values depending on the case. It'd be something like this:

if usageMem > 0 {
	if nodeMem > 0 {
		podEvent.Put("memory.usage.node.pct", float64(usageMem)/nodeMem)
	}
	if memLimit > 0 {
		podEvent.Put("memory.usage.limit.pct", float64(usageMem)/memLimit)
	}
}
if workingSet > 0 {
	if nodeMem > 0 {
		podEvent.Put("memory.usage.node.pct", float64(workingSet)/nodeMem)
	}
	if memLimit > 0 {
		podEvent.Put("memory.usage.limit.pct", float64(workingSet)/memLimit)
	}
}

brianharwell · 2021-04-30T12:23:56Z

Keep the OS-specific metrics as they are, as this is how they are reported by docker and kubernetes and they don't mean exactly the same.

I agree.

But don't report the zero-valued ones, this way it'd be less confusing to see so many zeroed values. It'd be to conditionally set these values, something like this:

I am not sure about this. Ideally we would be able to tell if this was a Linux pod or a Windows pod and then show fields based on that. That would make things clearer and less confusing. If (or when) a change is made on the Kubernetes side some of these values may start to appear. For example, usageMem and availMem may be reported but rss being a linux concept would not be reported so the value will show up as 0 but then workingSet would disappear. For me, I think a consistent document with notes in the documentation that some fields are not reported for Windows pods would be better.

Isn't this a breaking change? Would the removal of these fields potentially break usages in charts, queries, and watchers?

jsoriano · 2021-04-30T12:30:44Z

I am not sure about this. Ideally we would be able to tell if this was a Linux pod or a Windows pod and then show fields based on that. That would make things clearer and less confusing. If (or when) a change is made on the Kubernetes side some of these values may start to appear. For example, usageMem and availMem may be reported but rss being a linux concept would not be reported so the value will show up as 0 but then workingSet would disappear. For me, I think a consistent document with notes in the documentation that some fields are not reported for Windows pods would be better.

Isn't this a breaking change? Would the removal of these fields potentially break usages in charts, queries, and watchers?

Well, other modules report metrics only when they are available, to differentiate this from zero values. But you are right, this could be breaking for some cases, and in any case this change wouldn't be needed to solve the lack of percentages, so we can leave this for now.
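
One way to tell a field the kubelet did not send from one it sent as zero is to decode into pointer fields, sketched below. memoryStats and reportedFields are hypothetical names, not Metricbeat code. With the sample from the PR description, the container-level memory object omits usageBytes entirely, while the pod-level object sends it as 0, so this technique would only hide the field in the first case.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memoryStats uses pointers so that a missing JSON key stays nil,
// while a key that is present with value 0 becomes a pointer to 0.
type memoryStats struct {
	UsageBytes      *uint64 `json:"usageBytes"`
	WorkingSetBytes *uint64 `json:"workingSetBytes"`
}

// reportedFields (hypothetical helper) returns only the fields that were
// present in the JSON. The keys follow the Metricbeat field naming used
// in this thread.
func reportedFields(raw []byte) (map[string]uint64, error) {
	var m memoryStats
	if err := json.Unmarshal(raw, &m); err != nil {
		return nil, err
	}
	out := map[string]uint64{}
	if m.UsageBytes != nil {
		out["usage.bytes"] = *m.UsageBytes
	}
	if m.WorkingSetBytes != nil {
		out["working_set.bytes"] = *m.WorkingSetBytes
	}
	return out, nil
}

func main() {
	// Container-level object from the sample: usageBytes is absent.
	container := []byte(`{"workingSetBytes":196808704}`)
	// Pod-level object from the sample: usageBytes is present but zero.
	pod := []byte(`{"usageBytes":0,"workingSetBytes":196808704}`)

	c, err := reportedFields(container)
	if err != nil {
		panic(err)
	}
	p, err := reportedFields(pod)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(c), len(p)) // prints "1 2"
}
```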

brianharwell · 2021-04-30T12:35:55Z

I see you assigned it to yourself, does that mean you are going to make the required changes? Or do the changes I made cover this?

jsoriano · 2021-04-30T12:44:52Z

I see you assigned it to yourself, does that mean you are going to make the required changes? Or do the changes I made cover this?

No, sorry for the confusion, we sometimes do this to indicate to the team that someone is helping and reviewing community contributions 🙂 I have assigned it to you too.

I can follow up on this. The pending thing I see would be to not override usageMem on Windows, and to calculate the percentage based on the workingSet, as proposed in the third bullet point in #25428 (comment)

brianharwell · 2021-04-30T12:47:00Z

I can follow up on this. The pending thing I see would be to not override usageMem on Windows, and to calculate the percentage based on the workingSet, as proposed in the third bullet point in #25428 (comment)

Yeah I like that better, I'm on it

brianharwell · 2021-04-30T12:53:00Z

I made a slight variation...

if workingSet > 0 && usageMem == 0 {
	if nodeMem > 0 {
		podEvent.Put("memory.usage.node.pct", float64(workingSet)/nodeMem)
	}
	if memLimit > 0 {
		podEvent.Put("memory.usage.limit.pct", float64(workingSet)/memLimit)
	}
}

I added && usageMem == 0 because I think usageMem is the more appropriate field and if it is reported I think it should take precedence. What do you think?
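
The precedence rule discussed here can be sketched as a small standalone function. This is only an illustration under assumptions: memoryPcts is a hypothetical helper that returns a plain map instead of writing to podEvent, and it folds the usageMem and workingSet branches into one. The field keys and variable names follow the snippets in this thread.

```go
package main

import "fmt"

// memoryPcts (hypothetical helper) computes the two percentage fields.
// usageMem takes precedence; workingSet is used only when usageMem is zero.
// A percentage is omitted when its denominator is not known (zero).
func memoryPcts(usageMem, workingSet uint64, nodeMem, memLimit float64) map[string]float64 {
	out := map[string]float64{}

	mem := usageMem
	if mem == 0 {
		// Windows case from this thread: only the working set is reported.
		mem = workingSet
	}
	if mem == 0 {
		return out
	}
	if nodeMem > 0 {
		out["memory.usage.node.pct"] = float64(mem) / nodeMem
	}
	if memLimit > 0 {
		out["memory.usage.limit.pct"] = float64(mem) / memLimit
	}
	return out
}

func main() {
	const mi = 1024 * 1024
	// Illustrative numbers only: a 1,500Mi working set, no usage bytes,
	// a 6,000Mi node and a 3,000Mi pod limit.
	pcts := memoryPcts(0, 1500*mi, 6000*mi, 3000*mi)
	fmt.Println(pcts["memory.usage.node.pct"], pcts["memory.usage.limit.pct"])
}
```

Folding both branches into one keeps a single place where the percentages are written, so a pod that reports both values can never have its usage-based percentage overwritten by the working-set one.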

jsoriano

I added && usageMem == 0 because I think usageMem is the more appropriate field and if it is reported I think it should take precedence. What do you think?

Looks good to me.

If current code solves the problem you were having I think we can go on with it.

metricbeat/module/kubernetes/pod/data.go

mergify · 2021-04-30T15:13:52Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b windows-memory-usage-bytes-take-2 upstream/windows-memory-usage-bytes-take-2
git merge upstream/master
git push upstream windows-memory-usage-bytes-take-2

jsoriano

Thanks!

jsoriano · 2021-04-30T15:44:12Z

/test

In Windows native containers usage memory is reported using the workingSet. Use this value to calculate memory usage percentage. (cherry picked from commit 381e062)

brianharwell · 2021-04-30T17:52:21Z

What version will this be in? 7.14?

jsoriano · 2021-04-30T17:56:51Z

What version will this be in? 7.14?

We are not accepting new features in 7.13. But I think this can be considered a bugfix, let me backport this to 7.13 too.

brianharwell · 2021-04-30T18:19:07Z

Sweet thanks! I'd love to implement as soon as I can. 😀

…#25470) In Windows native containers usage memory is reported using the workingSet. Use this value to calculate memory usage percentage. (cherry picked from commit 381e062) Co-authored-by: Brian Harwell <brianharwell@gmail.com> Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>

…#25473) In Windows native containers usage memory is reported using the workingSet. Use this value to calculate memory usage percentage. (cherry picked from commit 381e062) Co-authored-by: Brian Harwell <brianharwell@gmail.com> Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>

brianharwell · 2021-05-10T13:20:07Z

@jsoriano Any idea when the v7.13 release will be cut?

jsoriano · 2021-05-10T13:31:45Z

@jsoriano Any idea when the v7.13 release will be cut?

@brianharwell we don't have fixed dates for releases. But as an estimate, 7.13.0 will likely happen in two or three weeks.

jsoriano · 2021-05-10T13:33:15Z

@brianharwell if you want to try with your own build, the 7.13 branch is already open.

jsoriano · 2021-05-11T11:28:05Z

A similar issue has been reported with container metrics: #25657

Use working set bytes if usage bytes is zero

0e8ce05

botelastic bot added the needs_team label Apr 29, 2021

brianharwell mentioned this pull request Apr 29, 2021

Use WorkingSetBytes if memory usage is not reported #25407

Closed

6 tasks

jsoriano assigned jsoriano and unassigned jsoriano Apr 29, 2021

jsoriano added the Team:Integrations label and removed the needs_team label Apr 29, 2021

Update CHANGELOG.asciidoc

94dea08

jsoriano reviewed Apr 29, 2021

jsoriano self-assigned this Apr 30, 2021

Add comment to CHANGELOG.next.asciidoc

f54606c

jsoriano assigned brianharwell Apr 30, 2021

Moved comment to bug fixes

e713226

Calculate pod memory pct based on field existence

dc4b6ab

Updated description of change

8f65818

jsoriano reviewed Apr 30, 2021

metricbeat/module/kubernetes/pod/data.go

Remove unneeded lines

5da426f

jsoriano reviewed Apr 30, 2021

metricbeat/module/kubernetes/pod/data.go

brianharwell added 2 commits April 30, 2021 09:51

I removed too many lines

0112df2

More removals

cf07d3d

Merge branch 'master' of https://github.com/elastic/beats into window…

f5bad30

…s-memory-usage-bytes-take-2

jsoriano approved these changes Apr 30, 2021

jsoriano added the backport-v7.14.0 label Apr 30, 2021

jsoriano merged commit 381e062 into elastic:master Apr 30, 2021

mergify bot mentioned this pull request Apr 30, 2021

[7.x](backport #25428) Use working set bytes if usage bytes is zero #25470

Merged

jsoriano added the backport-v7.13.0 label Apr 30, 2021

mergify bot mentioned this pull request Apr 30, 2021

[7.13](backport #25428) Use working set bytes if usage bytes is zero #25473

Merged

jsoriano mentioned this pull request May 11, 2021

Metricbeat 7.12.1 kubernetes.container.memory.usage.limit.pct is calculated incorrectly #25657

Closed

jsoriano mentioned this pull request May 28, 2021

Common Fields for Container Inventory Schema #22179

Open
