
Set soft memory limits to 80% of hard limits #4284

Merged · 1 commit · Sep 30, 2024

Conversation

@OhmSpectator (Member) commented Sep 24, 2024

By default, we now set the soft memory limits to 80% of the hard memory limits for EVE cgroups. This adjustment allows the kernel to start reclaiming memory earlier, giving processes a chance to free up memory before reaching the hard limit. Updated the default values for dom0_mem, eve_mem, and ctrd_mem in the documentation and configuration files to reflect this change.
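
For illustration, a minimal Go sketch of the idea: compute the soft limit as 80% of an existing cgroup v1 hard limit. This is not the actual EVE implementation (which sets the values via configuration files such as pkg/grub/rootfs.cfg), and the cgroup path below is an assumption:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// setSoftLimit reads the hard limit of a cgroup v1 memory controller
// and sets the soft limit to 80% of it.
func setSoftLimit(cgroupDir string) error {
	hardRaw, err := os.ReadFile(filepath.Join(cgroupDir, "memory.limit_in_bytes"))
	if err != nil {
		return err
	}
	hard, err := strconv.ParseInt(strings.TrimSpace(string(hardRaw)), 10, 64)
	if err != nil {
		return err
	}
	soft := hard * 80 / 100 // 80% of the hard limit
	return os.WriteFile(filepath.Join(cgroupDir, "memory.soft_limit_in_bytes"),
		[]byte(strconv.FormatInt(soft, 10)), 0644)
}

func main() {
	// The cgroup path is an assumption for illustration.
	if err := setSoftLimit("/sys/fs/cgroup/memory/eve"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}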

This change is inspired by PR #4273 by @rouming.

To be merged after #4300

pkg/grub/rootfs.cfg: review comment (outdated, resolved)
@europaul (Contributor) left a comment


Trigger Eden tests.

@rouming (Contributor) commented Sep 24, 2024

I'm afraid the kernel's soft limit is not what we might expect from the similar "soft limit" in the Golang GC runtime. Memory reclaim always happens before the drastic measures (the OOM killer) are taken when the hard limit is hit, and if memory was reclaimed successfully, then the OOM killer is not invoked (because we still stay below the hard limit; we just reused a page taken from another place). The soft limit is more about efficient memory utilization: for example, the soft limit is reached, and that means this cgroup will be chosen for memory reclamation in case of a global OOM under memory pressure (and not a hard limit hit). Or there are several cgroups competing for memory and one has a lower soft limit, which again makes it a target for memory reclamation in favor of another cgroup that is under memory pressure.

But I'm afraid we can't expect that when the soft limit is reached, the kernel does some magic and suddenly more free pages appear (which is the case for the Golang garbage collector). Yes, we can expect memory to be reclaimed from file-backed pages, but that should happen anyway when the hard limit is hit.

So if your PR aims to help the kernel decide which cgroup yields memory first in case of OOM (no more memory on the host), then perfect. If your PR aims to fix OOMs triggered by the hard limit, then it won't help, unfortunately.

@OhmSpectator (Member, Author)

I'm afraid the kernel's soft limit is not what we might expect from the similar "soft limit" in the Golang GC runtime.

I understand that. The idea of this PR was not to replace the Golang GC's direct setting, as your PR does, but rather to help the kernel trigger similar functionality in other processes.
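
For reference, the Go runtime's soft memory limit can be set directly from within the process via runtime/debug.SetMemoryLimit (Go 1.19+); a minimal sketch, with the limit value chosen arbitrarily for illustration (not taken from PR #4273):

package main

import (
	"runtime/debug"
)

func main() {
	// Set the Go runtime's soft memory limit to 800 MiB (an arbitrary
	// value for illustration). Unlike the kernel's cgroup soft limit,
	// the Go GC actively works to keep heap usage below this value.
	debug.SetMemoryLimit(800 << 20)

	// ... application code ...
}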

Memory reclaim always happens before the drastic measures (the OOM killer) are taken when the hard limit is hit, and if memory was reclaimed successfully, then the OOM killer is not invoked

Yeah, I got it. But I guess the problem is that the kernel starts to reclaim memory too late, for example, when a lot of new allocations happen while we are already close to the limit. Reclaiming memory close to the hard limit is fine when it's not a repetitive allocation of many chunks. Otherwise, it can be too late.

From the documentation:

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

Please note that soft limits is a best-effort feature; it comes with
no guarantees, but it does its best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).

So, the soft limit does trigger memory reclamation. And I think it would be helpful to do it a little in advance, before we get close to the hard limit.

@OhmSpectator (Member, Author)

We can test the approach to understand what exactly it will mean for us.

@OhmSpectator (Member, Author) commented Sep 24, 2024

Now I'm confused...
Does the soft limit not trigger immediate memory reclamation within the cgroup when it's reached?
And even if that's the case, wouldn't it still help that the soft limit sets the target down to which memory is reclaimed when reclaim finally happens? Reclaiming several KB is not the same as reclaiming 20%...

@rouming (Contributor) commented Sep 24, 2024

So, the soft limit does trigger memory reclamation.

Only if there is memory contention, i.e. no memory on the host (global OOM) and multiple cgroups competing for memory (this is what the doc says). If one cgroup bloats and reaches its soft limit, nothing happens until there is a system-wide OOM. If there is no system-wide OOM, then the hard limit is reached, and a reclaim attempt happens anyway.

The soft limit is all about saying "I promise not to allocate above this value; if I lie, please reclaim memory from me in case of a global OOM", which is not the hard limit case.

@OhmSpectator (Member, Author)

The soft limit is all about saying "I promise not to allocate above this value; if I lie, please reclaim memory from me in case of a global OOM", which is not the hard limit case.

Are you 100% sure it happens only when a global OOM is coming? While testing the memory monitor, I saw many memory pressure events generated by the kernel even when the system was far from OOM. I had to adapt my threshold settings accordingly. These events are generated even when a regular reclaim of caches happens. If memory balancing according to soft limits can be triggered by these events (which is what I would expect), it can still be helpful.

And yeah, it is also helpful to set an expectation for how much to reclaim.

In any case, I want the soft limits decreased. It will not hurt and may help. But I want to understand what exactly it means, so I can reflect it in the documentation properly.
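
For reference, a minimal Go sketch of subscribing to those memory pressure notifications through the cgroup v1 eventfd interface; the cgroup path is an assumption:

package main

import (
	"encoding/binary"
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// watchPressure subscribes to cgroup v1 memory pressure notifications
// for the given cgroup directory and level ("low", "medium", "critical").
func watchPressure(cgroupDir, level string) error {
	pressure, err := os.Open(cgroupDir + "/memory.pressure_level")
	if err != nil {
		return err
	}
	defer pressure.Close()

	efd, err := unix.Eventfd(0, 0)
	if err != nil {
		return err
	}
	defer unix.Close(efd)

	// Registration format: "<event_fd> <pressure_level_fd> <level>".
	ctl := fmt.Sprintf("%d %d %s", efd, pressure.Fd(), level)
	if err := os.WriteFile(cgroupDir+"/cgroup.event_control", []byte(ctl), 0); err != nil {
		return err
	}

	// Each read of the eventfd returns an 8-byte event counter
	// (host byte order; little-endian assumed here).
	buf := make([]byte, 8)
	for {
		if _, err := unix.Read(efd, buf); err != nil {
			return err
		}
		fmt.Printf("memory pressure (%s): %d event(s)\n",
			level, binary.LittleEndian.Uint64(buf))
	}
}

func main() {
	// The cgroup path is an assumption for illustration.
	if err := watchPressure("/sys/fs/cgroup/memory/eve", "low"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}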

@rouming (Contributor) commented Sep 24, 2024

The soft limit is all about saying "I promise not to allocate above this value; if I lie, please reclaim memory from me in case of a global OOM", which is not the hard limit case.

Are you 100% sure it happens only when a global OOM is coming?

This is what the doc you've posted says, and it is what I see in the sources. In mm/vmscan.c, mem_cgroup_soft_limit_reclaim() is invoked from:

  • balance_pgdat - called from kswapd(), when balancing happens
  • shrink_zones - called from do_try_to_free_pages(); it contains the following comment:
			/*
			 * This steals pages from memory cgroups over softlimit
			 * and returns the number of reclaimed pages and
			 * scanned pages. This works for global memory pressure
			 * and balancing, not for a memcg's limit.
			 */
			nr_soft_scanned = 0;
			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
						sc->order, sc->gfp_mask,
						&nr_soft_scanned);

Kind of explicit, and it aligns with what the doc says.

do_try_to_free_pages() contains this comment:

/*
 * This is the main entry point to direct page reclaim.
 *
 * If a full scan of the inactive list fails to free enough memory then we
 * are "out of memory" and something needs to be killed.
 *
...

While testing the memory monitor, I saw many memory pressure events generated by the kernel even when the system was far from OOM. I had to adapt my threshold settings accordingly. These events are generated even when a regular reclaim of caches happens.

I assume you can set up what actually triggers those. Cache reclaim can happen on a timer, depending on what you mean by cache; slab.c, for example, calls cache_reap() from a timer.

What you are saying is all valid, but not for the hard limit case. That's my point.

@rouming
Copy link
Contributor

rouming commented Sep 24, 2024

Also this call stack is possible:

__alloc_pages()
   get_page_from_freelist()
      node_reclaim()
         /*
          * Node reclaim reclaims unmapped file backed pages and
          * slab pages if we are over the defined limits.
          *
          * A small portion of unmapped file backed pages is needed for
          * file I/O otherwise pages read by file I/O will be immediately
          * thrown out if the node is overallocated. So we do not reclaim
          * if less than a specified percentage of the node is used by
          * unmapped file backed pages.
          */
         ...
         __node_reclaim()
             shrink_node()
                  vmpressure()  <<<< generates vmpressure event

This means you can get vmpressure events on the regular allocation path, when the fast allocation path fails (zone_watermark_fast) and reclaim is called, but that does not necessarily mean you are experiencing a global OOM.

@OhmSpectator (Member, Author)

Off-topic, but an interesting finding: do_try_to_free_pages() is called not only when the system is low on memory, but also when the system prepares to create a snapshot, as it helps to free memory before a memory-consuming operation:
https://github.com/torvalds/linux/blob/abf2050f51fdca0fd146388f83cddd95a57a008d/kernel/power/hibernate.c#L390
https://github.com/torvalds/linux/blob/abf2050f51fdca0fd146388f83cddd95a57a008d/kernel/power/snapshot.c#L1923

Another option: it's called when someone writes to the "memory.force_empty" file of the cgroup:
https://github.com/torvalds/linux/blob/abf2050f51fdca0fd146388f83cddd95a57a008d/mm/memcontrol-v1.c#L2912
https://github.com/torvalds/linux/blob/abf2050f51fdca0fd146388f83cddd95a57a008d/mm/memcontrol-v1.c#L2380

@OhmSpectator (Member, Author) commented Sep 24, 2024

There is also a bunch of try_to_free_mem_cgroup_pages() calls in the https://github.com/torvalds/linux/blob/master/mm/memcontrol.c file... They also lead to a call to do_try_to_free_pages().

Ah, it's cgroup v2. Not our case.

@OhmSpectator (Member, Author)

  • balance_pgdat - called from kswapd(), when balancing happens

It is also interesting when that happens. As far as I remember, the daemon is woken when free memory drops below the "low" watermark and reclaims until the "high" watermark is restored. But it would be interesting to understand the details.
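
For the record, the per-zone watermarks that drive kswapd can be inspected via /proc/zoneinfo; a small Go sketch:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Print the min/low/high watermarks per zone from /proc/zoneinfo.
// kswapd is typically woken when free pages drop below the "low"
// watermark and goes back to sleep once "high" is restored.
func main() {
	f, err := os.Open("/proc/zoneinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "Node") ||
			strings.HasPrefix(line, "min ") ||
			strings.HasPrefix(line, "low ") ||
			strings.HasPrefix(line, "high ") {
			fmt.Println(line)
		}
	}
}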

@OhmSpectator OhmSpectator marked this pull request as draft September 26, 2024 15:04
@OhmSpectator (Member, Author)

I just found that the soft limit is the one that Pillar uses to calculate its memory requirement for EVE.

EveMemoryLimitFile = "/hostfs/sys/fs/cgroup/memory/eve/memory.soft_limit_in_bytes"

So, before merging this PR, it's better to change that logic to use the hard limit.
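
A sketch of what that change could look like on the Pillar side, pointing the constant at the hard limit file instead (the parsing helper is hypothetical):

package main

import (
	"os"
	"strconv"
	"strings"
)

// Hypothetical change: read the cgroup v1 hard limit instead of
// memory.soft_limit_in_bytes.
const EveMemoryLimitFile = "/hostfs/sys/fs/cgroup/memory/eve/memory.limit_in_bytes"

// readMemoryLimit parses a cgroup limit file into a byte count.
func readMemoryLimit(path string) (uint64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}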

@OhmSpectator (Member, Author)

Updated the soft limit for the kubevirt flavor.
Updated the doc and the commit message to reflect the real impact of the soft limit.

By default, we now set the soft memory limits to 80% of the hard memory limits
for EVE cgroups. This adjustment sets the target values for memory reclamation
when it's triggered by the kernel.
Updated the default values for dom0_mem, eve_mem, and ctrd_mem in the
documentation and configuration files to reflect this change.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
@OhmSpectator OhmSpectator marked this pull request as ready for review September 27, 2024 20:17
@eriknordmark (Contributor) left a comment


@rouming are you OK with this? I didn't understand all the details in your discussion.

Run the tests

@OhmSpectator OhmSpectator self-assigned this Sep 30, 2024
@eriknordmark eriknordmark merged commit da2c37b into lf-edge:master Sep 30, 2024
38 checks passed