
Set soft memory limits to 80% of hard limits #4284

Merged · 1 commit · Sep 30, 2024

Conversation

@OhmSpectator (Member) commented Sep 24, 2024

By default, we now set the soft memory limits to 80% of the hard memory limits for EVE cgroups. This adjustment allows the kernel to start reclaiming memory earlier, giving processes a chance to free up memory before reaching the hard limit. Updated the default values for dom0_mem, eve_mem, and ctrd_mem in the documentation and configuration files to reflect this change.
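
For illustration, a minimal Go sketch of the idea: compute the soft limit as 80% of an existing cgroup v1 hard limit. This is not the actual EVE implementation (which sets the values via configuration files such as pkg/grub/rootfs.cfg), and the cgroup path below is an assumption:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// setSoftLimit reads the hard limit of a cgroup v1 memory controller
// and sets the soft limit to 80% of it.
func setSoftLimit(cgroupDir string) error {
	hardRaw, err := os.ReadFile(filepath.Join(cgroupDir, "memory.limit_in_bytes"))
	if err != nil {
		return err
	}
	hard, err := strconv.ParseInt(strings.TrimSpace(string(hardRaw)), 10, 64)
	if err != nil {
		return err
	}
	soft := hard * 80 / 100 // 80% of the hard limit
	return os.WriteFile(filepath.Join(cgroupDir, "memory.soft_limit_in_bytes"),
		[]byte(strconv.FormatInt(soft, 10)), 0644)
}

func main() {
	// The cgroup path is an assumption for illustration.
	if err := setSoftLimit("/sys/fs/cgroup/memory/eve"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}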

This change is inspired by PR #4273 by @rouming.

To be merged after #4300

pkg/grub/rootfs.cfg: review comment (outdated, resolved)
@europaul (Contributor) left a comment


Trigger Eden tests.

@rouming (Contributor) commented Sep 24, 2024

I'm afraid the kernel's soft limit is not what we might expect from the similar "soft limit" in the Golang GC runtime. Memory reclaim always happens before the drastic measures (the OOM killer) are taken when the hard limit is hit, and if memory was reclaimed successfully, then the OOM killer is not invoked (because we still stay below the hard limit; we just reused a page taken from another place). The soft limit is more about efficient memory utilization: for example, the soft limit is reached, and that means this cgroup will be chosen for memory reclamation in case of a global OOM under memory pressure (and not a hard limit hit). Or there are several cgroups competing for memory and one has a lower soft limit, which again makes it a target for memory reclamation in favor of another cgroup that is under memory pressure.

But I'm afraid we can't expect that when the soft limit is reached, the kernel does some magic and suddenly more free pages appear (which is the case for the Golang garbage collector). Yes, we can expect memory to be reclaimed from file-backed pages, but that should happen anyway when the hard limit is hit.

So if your PR aims to help the kernel decide which cgroup yields memory first in case of OOM (no more memory on the host), then perfect. If your PR aims to fix OOMs triggered by the hard limit, then it won't help, unfortunately.

@OhmSpectator (Member, Author)

I'm afraid the kernel's soft limit is not what we might expect from the similar "soft limit" in the Golang GC runtime.

I understand that. The idea of this PR was not to replace the Golang GC's direct setting, as your PR does, but rather to help the kernel trigger similar functionality in other processes.
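
For reference, the Go runtime's soft memory limit can be set directly from within the process via runtime/debug.SetMemoryLimit (Go 1.19+); a minimal sketch, with the limit value chosen arbitrarily for illustration (not taken from PR #4273):

package main

import (
	"runtime/debug"
)

func main() {
	// Set the Go runtime's soft memory limit to 800 MiB (an arbitrary
	// value for illustration). Unlike the kernel's cgroup soft limit,
	// the Go GC actively works to keep heap usage below this value.
	debug.SetMemoryLimit(800 << 20)

	// ... application code ...
}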

Memory reclaim always happens before the drastic measures (the OOM killer) are taken when the hard limit is hit, and if memory was reclaimed successfully, then the OOM killer is not invoked

Yeah, I got it. But I guess the problem is that the kernel starts to reclaim memory too late, for example, when a lot of new allocations happen while we are already close to the limit. Reclaiming memory close to the hard limit is fine when it's not a repetitive allocation of many chunks. Otherwise, it can be too late.

From the documentation:

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

Please note that soft limits is a best-effort feature; it comes with
no guarantees, but it does its best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).

So, the soft limit does trigger memory reclamation. And I think it would be helpful to do it a little in advance, before we get close to the hard limit.

@OhmSpectator (Member, Author)

We can test the approach to understand what exactly it will mean for us.

@OhmSpectator (Member, Author) commented Sep 24, 2024

Now I'm confused...
Does the soft limit not trigger immediate memory reclamation within the cgroup when it's reached?
And even if that's the case, wouldn't it still help that the soft limit sets the target down to which memory is reclaimed when reclaim finally happens? Reclaiming several KB is not the same as reclaiming 20%...

@rouming (Contributor) commented Sep 24, 2024

So, the soft limit does trigger memory reclamation.

Only if there is memory contention, i.e. no memory on the host (global OOM) and multiple cgroups competing for memory (this is what the doc says). If one cgroup bloats and reaches its soft limit, nothing happens until there is a system-wide OOM. If there is no system-wide OOM, then the hard limit is reached, and a reclaim attempt happens anyway.

The soft limit is all about saying "I promise not to allocate above this value; if I lie, please reclaim memory from me in case of a global OOM", which is not the hard limit case.

@OhmSpectator (Member, Author)

The soft limit is all about saying "I promise not to allocate above this value; if I lie, please reclaim memory from me in case of a global OOM", which is not the hard limit case.

Are you 100% sure it happens only when a global OOM is coming? While testing the memory monitor, I saw many memory pressure events generated by the kernel even when the system was far from OOM. I had to adapt my threshold settings accordingly. These events are generated even when a regular reclaim of caches happens. If memory balancing according to soft limits can be triggered by these events (which is what I would expect), it can still be helpful.

And yeah, it is also helpful to set an expectation for how much to reclaim.

In any case, I want the soft limits decreased. It will not hurt and may help. But I want to understand what exactly it means, so I can reflect it in the documentation properly.
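
For reference, a minimal Go sketch of subscribing to those memory pressure notifications through the cgroup v1 eventfd interface; the cgroup path is an assumption:

package main

import (
	"encoding/binary"
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// watchPressure subscribes to cgroup v1 memory pressure notifications
// for the given cgroup directory and level ("low", "medium", "critical").
func watchPressure(cgroupDir, level string) error {
	pressure, err := os.Open(cgroupDir + "/memory.pressure_level")
	if err != nil {
		return err
	}
	defer pressure.Close()

	efd, err := unix.Eventfd(0, 0)
	if err != nil {
		return err
	}
	defer unix.Close(efd)

	// Registration format: "<event_fd> <pressure_level_fd> <level>".
	ctl := fmt.Sprintf("%d %d %s", efd, pressure.Fd(), level)
	if err := os.WriteFile(cgroupDir+"/cgroup.event_control", []byte(ctl), 0); err != nil {
		return err
	}

	// Each read of the eventfd returns an 8-byte event counter
	// (host byte order; little-endian assumed here).
	buf := make([]byte, 8)
	for {
		if _, err := unix.Read(efd, buf); err != nil {
			return err
		}
		fmt.Printf("memory pressure (%s): %d event(s)\n",
			level, binary.LittleEndian.Uint64(buf))
	}
}

func main() {
	// The cgroup path is an assumption for illustration.
	if err := watchPressure("/sys/fs/cgroup/memory/eve", "low"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}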

@rouming (Contributor) commented Sep 24, 2024

The soft limit is all about saying "I promise not to allocate above this value; if I lie, please reclaim memory from me in case of a global OOM", which is not the hard limit case.

Are you 100% sure it happens only when a global OOM is coming?

This is what the doc you've posted says, and it is what I see in the sources. In mm/vmscan.c, mem_cgroup_soft_limit_reclaim() is invoked from:

  • balance_pgdat - called from kswapd(), when balancing happens
  • shrink_zones - called from do_try_to_free_pages(); it contains the following comment:
			/*
			 * This steals pages from memory cgroups over softlimit
			 * and returns the number of reclaimed pages and
			 * scanned pages. This works for global memory pressure
			 * and balancing, not for a memcg's limit.
			 */
			nr_soft_scanned = 0;
			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
						sc->order, sc->gfp_mask,
						&nr_soft_scanned);

Kind of explicit, and it aligns with what the doc says.

do_try_to_free_pages() contains this comment:

/*
 * This is the main entry point to direct page reclaim.
 *
 * If a full scan of the inactive list fails to free enough memory then we
 * are "out of memory" and something needs to be killed.
 *
...

While testing the memory monitor, I saw many memory pressure events generated by the kernel even when the system was far from OOM. I had to adapt my threshold settings accordingly. These events are generated even when a regular reclaim of caches happens.

I assume you can set up what actually triggers those. Cache reclaim can happen on a timer, depending on what you mean by cache; slab.c, for example, calls cache_reap() from a timer.

What you are saying is all valid, but not for the hard limit case. That's my point.

@rouming
Copy link
Contributor

rouming commented Sep 24, 2024

Also this call stack is possible:

__alloc_pages()
   get_page_from_freelist()
      node_reclaim()
         /*
          * Node reclaim reclaims unmapped file backed pages and
          * slab pages if we are over the defined limits.
          *
          * A small portion of unmapped file backed pages is needed for
          * file I/O otherwise pages read by file I/O will be immediately
          * thrown out if the node is overallocated. So we do not reclaim
          * if less than a specified percentage of the node is used by
          * unmapped file backed pages.
          */
         ...
         __node_reclaim()
             shrink_node()
                  vmpressure()  <<<< generates vmpressure event

This means you can get vmpressure events on the regular allocation path, when the fast allocation path fails (zone_watermark_fast) and reclaim is called, but that does not necessarily mean you are experiencing a global OOM.

@OhmSpectator (Member, Author)

Off-topic, but an interesting finding: do_try_to_free_pages() is called not only when the system is low on memory, but also when the system prepares to create a snapshot, as it helps to free memory before a memory-consuming operation:
https://github.com/torvalds/linux/blob/abf2050f51fdca0fd146388f83cddd95a57a008d/kernel/power/hibernate.c#L390
https://github.com/torvalds/linux/blob/abf2050f51fdca0fd146388f83cddd95a57a008d/kernel/power/snapshot.c#L1923

Another option: it's called when someone writes to the "memory.force_empty" file of the cgroup:
https://github.com/torvalds/linux/blob/abf2050f51fdca0fd146388f83cddd95a57a008d/mm/memcontrol-v1.c#L2912
https://github.com/torvalds/linux/blob/abf2050f51fdca0fd146388f83cddd95a57a008d/mm/memcontrol-v1.c#L2380

@OhmSpectator (Member, Author) commented Sep 24, 2024

There is also a bunch of try_to_free_mem_cgroup_pages() calls in the https://github.com/torvalds/linux/blob/master/mm/memcontrol.c file... They also lead to a call to do_try_to_free_pages().

Ah, it's cgroup v2. Not our case.

@OhmSpectator (Member, Author)

  • balance_pgdat - called from kswapd(), when balancing happens

It is also interesting when that happens. As far as I remember, the daemon is woken when free memory drops below the "low" watermark and reclaims until the "high" watermark is restored. But it would be interesting to understand the details.
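
For the record, the per-zone watermarks that drive kswapd can be inspected via /proc/zoneinfo; a small Go sketch:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Print the min/low/high watermarks per zone from /proc/zoneinfo.
// kswapd is typically woken when free pages drop below the "low"
// watermark and goes back to sleep once "high" is restored.
func main() {
	f, err := os.Open("/proc/zoneinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "Node") ||
			strings.HasPrefix(line, "min ") ||
			strings.HasPrefix(line, "low ") ||
			strings.HasPrefix(line, "high ") {
			fmt.Println(line)
		}
	}
}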

@OhmSpectator OhmSpectator marked this pull request as draft September 26, 2024 15:04
@OhmSpectator (Member, Author)

I just found that the soft limit is the one that Pillar uses to calculate its memory requirement for EVE.

EveMemoryLimitFile = "/hostfs/sys/fs/cgroup/memory/eve/memory.soft_limit_in_bytes"

So, before merging this PR, it's better to change that logic to use the hard limit.
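
A sketch of what that change could look like on the Pillar side, pointing the constant at the hard limit file instead (the parsing helper is hypothetical):

package main

import (
	"os"
	"strconv"
	"strings"
)

// Hypothetical change: read the cgroup v1 hard limit instead of
// memory.soft_limit_in_bytes.
const EveMemoryLimitFile = "/hostfs/sys/fs/cgroup/memory/eve/memory.limit_in_bytes"

// readMemoryLimit parses a cgroup limit file into a byte count.
func readMemoryLimit(path string) (uint64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}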

@OhmSpectator (Member, Author)

Updated the soft limit for the kubevirt flavor.
Updated the doc and the commit message to reflect the real impact of the soft limit.

By default, we now set the soft memory limits to 80% of the hard memory limits
for EVE cgroups. This adjustment sets the target values for memory reclamation
when it's triggered by the kernel.
Updated the default values for dom0_mem, eve_mem, and ctrd_mem in the
documentation and configuration files to reflect this change.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
@OhmSpectator OhmSpectator marked this pull request as ready for review September 27, 2024 20:17
@eriknordmark (Contributor) left a comment


@rouming are you OK with this? I didn't understand all the details in your discussion.

Run the tests

@OhmSpectator OhmSpectator self-assigned this Sep 30, 2024
@eriknordmark eriknordmark merged commit da2c37b into lf-edge:master Sep 30, 2024
38 checks passed