
Metric container_memory_working_set_bytes includes slab reclaimable memory #3081

Open
cyrus-mc opened this issue Mar 17, 2022 · 7 comments

@cyrus-mc

I ran into a somewhat unique situation in which a pod had very high slab memory - high as in 1.1GB worth. In terms of anonymous and active file memory, usage was only around 25MB. The working set calculation shows 1.1GB because slab reclaimable memory isn't subtracted when computing workingSet.

Working set calculation

  ret.Memory.Usage = s.MemoryStats.Usage.Usage
  // Working set = total usage minus inactive file cache; slab memory is not subtracted.
  workingSet := ret.Memory.Usage
  if v, ok := s.MemoryStats.Stats[inactiveFileKeyName]; ok {
    if workingSet < v {
      workingSet = 0
    } else {
      workingSet -= v
    }
  }
  ret.Memory.WorkingSet = workingSet

Where MemoryStats.Usage.Usage is the value from memory.current (cgroup v2) or memory.usage_in_bytes (cgroup v1). The memory statistics file (memory.stat) contains the following fields:

anon 663552
file 10313728
kernel_stack 49152
...
inactive_anon 573440
active_anon 32768
inactive_file 5066752
active_file 5246976
unevictable 0
slab_reclaimable 1232589368
slab_unreclaimable 128408
slab 1232717776
...

Here slab_reclaimable is memory that can be reclaimed by the OS when needed. Should we be subtracting this value when calculating workingSet?
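
For illustration, a minimal sketch of what the calculation could look like if slab reclaimable memory were also subtracted. slabReclaimableKeyName is a hypothetical constant here (cgroup v2 reports the field as slab_reclaimable), not an existing cAdvisor identifier:

  ret.Memory.Usage = s.MemoryStats.Usage.Usage
  workingSet := ret.Memory.Usage
  // Subtract both inactive file cache and reclaimable slab, clamping at zero.
  for _, key := range []string{inactiveFileKeyName, slabReclaimableKeyName} {
    if v, ok := s.MemoryStats.Stats[key]; ok {
      if workingSet < v {
        workingSet = 0
      } else {
        workingSet -= v
      }
    }
  }
  ret.Memory.WorkingSet = workingSet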

@bwplotka

Good question! I don't want to overcrowd this issue, but why don't we subtract inactive_anon as well?

Rationale: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt#:~:text=inactive_anon%09%2D%20%23%20of%20bytes%20of%20anonymous%20and%20swap%20cache%20memory%20on%20inactive%0A%09%09LRU%20list.

@cyrus-mc
Author

cyrus-mc commented Apr 1, 2022

@bwplotka I can understand why inactive_anon isn't subtracted: most container clusters don't run with swap, so that memory can't be swapped out and is therefore part of the working set.

Is slab_reclaimable the same? Since it is more of a cache, it can be reclaimed by the OS when it needs memory.

@bwplotka

bwplotka commented Aug 8, 2022

Yeah, agreed, something is off, but for me it's not really slab. I can reproduce this problem with a large number of open file descriptors. The WSS shows quite large memory usage:

WSS: [screenshot]

(file_mapped = 0)

RSS: [screenshot]

Stat file:

sudo cat /sys/fs/cgroup/system.slice/docker-40dc294092fde3c01f9c715c20a224aa34ff13e1efdb99526f93ec70c25533c7.scope/memory.stat
anon 20172800
file 3391561728
kernel_stack 311296
pagetables 282624
percpu 504
sock 4096
vmalloc 8192
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 30097408
active_anon 4096
inactive_file 3391561728
active_file 0
unevictable 0
slab_reclaimable 102486032
slab_unreclaimable 501088
slab 102987120
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgfault 177528
pgmajfault 0
pgrefill 0
pgscan 0
pgsteal 0
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 58
thp_collapse_alloc 44

Now, what's interesting: dropping all cache pages on the host machine using sudo sysctl -w vm.drop_caches=1 brings WSS down to almost RSS 🙃 Which kind of tells us it's reclaimable, no?

[screenshot]

Simply dropping the cache from the WSS calculation is a no-go, as the cache is extremely large yet still kind of affects the WSS:

[screenshot]

Stats after:

sudo cat /sys/fs/cgroup/system.slice/docker-40dc294092fde3c01f9c715c20a224aa34ff13e1efdb99526f93ec70c25533c7.scope/memory.stat
anon 20291584
file 0
kernel_stack 311296
pagetables 282624
percpu 504
sock 4096
vmalloc 8192
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 30216192
active_anon 4096
inactive_file 0
active_file 0
unevictable 0
slab_reclaimable 1661888
slab_unreclaimable 496376
slab 2158264
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgfault 211983
pgmajfault 0
pgrefill 0
pgscan 0
pgsteal 0
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 58
thp_collapse_alloc 44
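
For reference, a minimal standalone sketch (not cAdvisor code) that reads a cgroup v2 memory.current and memory.stat and prints the working set as it is computed today (usage minus inactive_file) next to the variant discussed in this issue, with slab_reclaimable also subtracted. The cgroup path is a placeholder:

// Standalone sketch: compare the current working-set calculation with a
// variant that also subtracts slab_reclaimable, for one cgroup v2 cgroup.
package main

import (
  "bufio"
  "fmt"
  "os"
  "strconv"
  "strings"
)

// sub clamps at zero, mirroring the cAdvisor calculation quoted above.
func sub(a, b uint64) uint64 {
  if a < b {
    return 0
  }
  return a - b
}

func readUint(path string) uint64 {
  b, err := os.ReadFile(path)
  if err != nil {
    panic(err)
  }
  v, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
  if err != nil {
    panic(err)
  }
  return v
}

func readStat(path string) map[string]uint64 {
  f, err := os.Open(path)
  if err != nil {
    panic(err)
  }
  defer f.Close()
  stats := map[string]uint64{}
  sc := bufio.NewScanner(f)
  for sc.Scan() {
    fields := strings.Fields(sc.Text())
    if len(fields) != 2 {
      continue
    }
    if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
      stats[fields[0]] = v
    }
  }
  return stats
}

func main() {
  cg := "/sys/fs/cgroup/system.slice/docker-<id>.scope" // placeholder path
  usage := readUint(cg + "/memory.current")
  stats := readStat(cg + "/memory.stat")

  wss := sub(usage, stats["inactive_file"])           // current calculation
  wssMinusSlab := sub(wss, stats["slab_reclaimable"]) // variant discussed here
  fmt.Printf("usage=%d inactive_file=%d slab_reclaimable=%d\n",
    usage, stats["inactive_file"], stats["slab_reclaimable"])
  fmt.Printf("working_set=%d working_set_minus_slab_reclaimable=%d\n",
    wss, wssMinusSlab)
}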

@saswatac

Why is the Active(file) memory not subtracted as well? I observe on my containers that the Active(file) memory stays high even after a memory-heavy job ends, and container_memory_working_set_bytes remains at a high value.

From what I understand, even the Active(file) memory is reclaimable, although at a lower priority than Inactive(file).

@cyrus-mc
Author

@saswatac you don't want to subtract active file, because the working set is meant to give you a metric for the active memory your container needs. Every app needs some file cache; if you just ignore that when assigning memory requests for your app, performance will degrade.

@astronaut0131

astronaut0131 commented Aug 28, 2024

[screenshot]
After the pod accepted a large number of socket connections, container_memory_working_set_bytes remained around 2GB and never dropped, even when there were no new requests. Meanwhile, container_memory_rss was only a bit over 200MB.

Before dropping the cache, I examined the memory stats:

# cat /sys/fs/cgroup/memory/memory.stat 
cache 0
rss 218734592
rss_huge 109051904
shmem 0
mapped_file 0
dirty 0
writeback 0
swap 0
pgpgin 14037837
pgpgout 14021095
pgfault 14053809
pgmajfault 0
inactive_anon 2554662912
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 17179869184
hierarchical_memsw_limit 17179869184
total_cache 0
total_rss 218734592
total_rss_huge 109051904
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 14037837
total_pgpgout 14021095
total_pgfault 14053809
total_pgmajfault 0
total_inactive_anon 2554662912
total_active_anon 0
total_inactive_file 0
total_active_file 0
total_unevictable 0
I also checked the current memory usage:
# cat /sys/fs/cgroup/memory/memory.usage_in_bytes 
2556129280

To free reclaimable slab objects (which include dentries and inodes):

echo 2 > /proc/sys/vm/drop_caches

After executing the drop_caches command:

# cat /sys/fs/cgroup/memory/memory.stat 
cache 1081344
rss 219250688
rss_huge 109051904
shmem 0
mapped_file 540672
dirty 0
writeback 0
swap 0
pgpgin 14038926
pgpgout 14592264
pgfault 14055426
pgmajfault 0
inactive_anon 219336704
active_anon 0
inactive_file 540672
active_file 675840
unevictable 0
hierarchical_memory_limit 17179869184
hierarchical_memsw_limit 17179869184
total_cache 1081344
total_rss 219250688
total_rss_huge 109051904
total_shmem 0
total_mapped_file 540672
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 14038926
total_pgpgout 14592264
total_pgfault 14055426
total_pgmajfault 0
total_inactive_anon 219336704
total_active_anon 0
total_inactive_file 540672
total_active_file 675840
total_unevictable 0

I checked the memory usage again:

# cat /sys/fs/cgroup/memory/memory.usage_in_bytes 
221339648

This issue is affecting the decisions made by the Horizontal Pod Autoscaler (HPA) regarding memory.
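
For reference, plugging the numbers above into the working-set formula quoted at the top of this issue (usage minus inactive file, i.e. total_inactive_file on cgroup v1) gives roughly:

before drop_caches: 2556129280 - 0      = 2556129280 bytes ≈ 2.38 GiB  (total_rss ≈ 209 MiB)
after  drop_caches:  221339648 - 540672 =  220798976 bytes ≈ 211 MiB

So almost the entire gap between the working set and RSS here was memory the kernel could reclaim on demand, and that gap is what the HPA ends up reacting to.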

@CharlieR-o-o-t

CharlieR-o-o-t commented Nov 21, 2024

I think that "container_memory_working_set_bytes" should contain only unreclaimable mem, only in this way it'll be possible to effectively use this metric for scaling and as alert before OOM.

In my example, I have 1GB of memory in slab_reclaimable, while the application itself consumes only 100MB.

Is there any update on this? Can I make a PR to fix it?
