Set initial arc_c to arc_c_min instead of arc_c_max. #10437

Merged 1 commit on Jun 17, 2020

amotin
Member

@amotin amotin commented Jun 10, 2020

For at least 15 years, since OpenSolaris, arc_c was set by default to
arc_c_max and only later decreased under memory pressure. I've noticed
that if arc_c was set high enough to cause memory pressure as
considered by ZFS, setting arc_no_grow to TRUE in arc_reap_cb_check()
has no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms)
return. All that time ZFS can continue increasing its effective ARC
size, causing more memory pressure, potentially up to the point where
the OS low-memory handler activates and reduces arc_c, requesting fast
reclamation of just-allocated memory.

The problem seems to be more serious on FreeBSD and, I guess, Linux,
since neither of them implements/uses asynchronous kmem reclamation,
so arc_kmem_reap_soon() can take more time. On older FreeBSD 11, which
does not support multiple memory domains, a system with lots of RAM
can become completely unresponsive for minutes due to heavy lock
contention between the ARC reclamation and pagedaemon kmem reclamation
threads. With this change to a more conservative arc_c value, the ARC
stops growing just in time and does not need later reclamation.
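
To make the effect above concrete, here is a toy user-space model. It is
not ZFS code: every toy_* name, the unit-sized inserts and the numbers are
assumptions made up for illustration. The target may grow only while
no_grow is clear, and the slow reap that would eventually lower the target
is deliberately left out, so the loop represents the window before
arc_kmem_reap_soon() and delay() return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical machine: total RAM, non-ARC usage, free-memory reserve. */
typedef struct toy_host {
	int64_t ram;
	int64_t other;
	int64_t reserve;
} toy_host_t;

/* Toy cache state; fields loosely mirror arc_c, ARC size, arc_no_grow. */
typedef struct toy_arc {
	int64_t c;		/* current target size */
	int64_t c_max;		/* upper limit of the target */
	int64_t size;		/* units currently cached */
	bool no_grow;		/* set once memory pressure is seen */
} toy_arc_t;

static int64_t
toy_free_memory(const toy_host_t *h, const toy_arc_t *a)
{
	return (h->ram - h->other - a->size);
}

/* Try to cache one unit; the target grows only while no_grow is clear. */
static void
toy_arc_insert(const toy_host_t *h, toy_arc_t *a)
{
	if (!a->no_grow && a->size == a->c && a->c < a->c_max)
		a->c++;
	if (a->size < a->c && toy_free_memory(h, a) > 0)
		a->size++;
}

/* Run `steps` insert attempts from `initial_c`; return the final size. */
int64_t
toy_arc_fill(const toy_host_t *h, int64_t initial_c, int64_t c_max, int steps)
{
	toy_arc_t a = { initial_c, c_max, 0, false };

	assert(initial_c >= 0 && initial_c <= c_max);
	for (int i = 0; i < steps; i++) {
		/* Pressure check: free memory dropped below the reserve. */
		if (toy_free_memory(h, &a) < h->reserve)
			a.no_grow = true;
		toy_arc_insert(h, &a);
	}
	return (a.size);
}
```

With a host of 100 units of RAM, 40 used elsewhere and a reserve of 10,
starting the target at c_max = 80 lets the cache run until free memory is
exhausted, while starting at a low target stops growth as soon as the
reserve is first breached. In this model the flag is set in both cases;
it simply has nothing left to restrain when the target is already high.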

Also while here, since growing arc_c is now a more frequent situation,
use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt()
to reduce lock contention. This also brings it in sync with the code in
arc_get_data_impl().
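
As a rough illustration of why reading an upper bound is cheaper than an
exact comparison, here is a simplified, single-threaded model of a
bucketed counter. It is a sketch under assumptions, not the ZFS aggsum
implementation: the toy_* names, the fixed borrow size and the `flushes`
counter (standing in for per-bucket lock traffic) are all invented for
this example.

```c
#include <assert.h>
#include <stdint.h>

#define	TOY_BUCKETS	4
#define	TOY_BORROW	8	/* slack a bucket may hold privately */

typedef struct toy_aggsum {
	int64_t lower;			/* true sum is >= lower */
	int64_t upper;			/* true sum is <= upper */
	int64_t delta[TOY_BUCKETS];	/* unpublished per-bucket changes */
	int64_t borrowed[TOY_BUCKETS];	/* slack currently held per bucket */
	unsigned flushes;		/* proxy for bucket lock acquisitions */
} toy_aggsum_t;

/* Publish a bucket's delta and give back its slack, tightening bounds. */
static void
toy_flush_bucket(toy_aggsum_t *as, int b)
{
	as->lower += as->delta[b] + as->borrowed[b];
	as->upper += as->delta[b] - as->borrowed[b];
	as->delta[b] = 0;
	as->borrowed[b] = 0;
	as->flushes++;
}

void
toy_aggsum_add(toy_aggsum_t *as, int b, int64_t d)
{
	int64_t nd = as->delta[b] + d;

	if (nd <= as->borrowed[b] && nd >= -as->borrowed[b]) {
		as->delta[b] = nd;	/* fast path: bounds untouched */
		return;
	}
	/* Slow path: publish everything, then borrow fresh slack. */
	as->delta[b] = nd;
	toy_flush_bucket(as, b);
	as->borrowed[b] = TOY_BORROW;
	as->lower -= TOY_BORROW;
	as->upper += TOY_BORROW;
}

/* Cheap: a single read, possibly an overestimate of the true sum. */
int64_t
toy_aggsum_upper_bound(const toy_aggsum_t *as)
{
	return (as->upper);
}

/* Exact: may have to flush buckets until the answer is decidable. */
int
toy_aggsum_compare(toy_aggsum_t *as, int64_t target)
{
	if (as->upper < target)
		return (-1);
	if (as->lower > target)
		return (1);
	for (int b = 0; b < TOY_BUCKETS; b++) {
		toy_flush_bucket(as, b);
		if (as->upper < target)
			return (-1);
		if (as->lower > target)
			return (1);
	}
	return (0);
}
```

The trade-off in this model is that the upper bound can overstate the
true sum by up to the total borrowed slack, so a caller that uses it in a
threshold test errs on the side of treating the counter as larger than it
is, in exchange for never touching the buckets.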

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored-By:	iXsystems, Inc.
@codecov

codecov bot commented Jun 11, 2020

Codecov Report

Merging #10437 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master   #10437   +/-   ##
=======================================
  Coverage   79.43%   79.44%           
=======================================
  Files         391      391           
  Lines      123866   123867    +1     
=======================================
+ Hits        98397    98406    +9     
+ Misses      25469    25461    -8     
Flag       Coverage Δ
#kernel    80.00% <100.00%> (-0.04%) ⬇️
#user      65.69% <100.00%> (+0.22%) ⬆️

Impacted Files                          Coverage Δ
module/zfs/arc.c                        89.40% <100.00%> (-0.19%) ⬇️
module/os/linux/spl/spl-kmem-cache.c    75.58% <0.00%> (-8.50%) ⬇️
cmd/ztest/ztest.c                       74.69% <0.00%> (-2.47%) ⬇️
module/zfs/vdev_raidz_math.c            76.57% <0.00%> (-2.26%) ⬇️
lib/libzpool/util.c                     78.12% <0.00%> (-1.05%) ⬇️
module/os/linux/zfs/zfs_file_os.c       84.15% <0.00%> (-1.00%) ⬇️
module/zfs/vdev_queue.c                 94.62% <0.00%> (-0.90%) ⬇️
module/os/linux/zfs/vdev_disk.c         84.43% <0.00%> (-0.78%) ⬇️
module/os/linux/zfs/zfs_znode.c         84.93% <0.00%> (-0.57%) ⬇️
module/zfs/bpobj.c                      90.61% <0.00%> (-0.54%) ⬇️
... and 52 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update feff3f6...e38c5ba.

@ghost ghost left a comment

This makes sense to me.

@adamdmoss
Contributor

I can't say whether it's correct or not but I've been running with this for a couple of days with no appreciable ill effects...

Contributor

@snajpa snajpa left a comment

This makes absolutely no sense to me. How is setting anything at boot going to change the system's behavior down the road?

The system starts with arc_c as arc_c_max for a reason - to cache all it possibly reads during the boot, so it doesn't have to read it twice. If it started with arc_c_min, it would potentially have to re-read a lot of stuff during the boot, which would only slow it down.

I don't see a problem with having arc_c set to arc_c_max at boot. What occupies your RAM from the start during boot? I always have tons of RAM free, I'd love for it to be put to good use, as it is being done now.

The problem you're trying to solve is in the reclaim - but it doesn't have anything in common with the system's boot.

Reclaim kicks in way later - in our systems, we see reclaim after at least 3-4 hours after boot (~100ish 4G-RAM sized containers on a 256G host, with arc_c set to 128G at boot). Hey, we're not buying tons of RAM just to have it sitting around unused.

I just don't get the reasoning, how is this in any way beneficial. For a server storage solution like ZFS? How?

Btw, over time, ARC would still grow to the arc_c_max you're complaining about (and btw2, it is really, really, never full right from the system's boot). I must have missed something...

@allanjude
Contributor

The default for arc_min is 1/4 of arc_max. I don't expect your boot process is going to consume nearly that much memory. And as you say, if there is no memory pressure, the ARC will grow towards arc_max anyway, so it makes no difference what the initial value is set to.

This patch changes the default behaviour to 'let the ARC grow if there is no shortage', from 'We will shrink the ARC if there is a shortage'. As @amotin explained, ZFS is waiting for arc_kmem_reap_soon() to return, and for delay(reap_retry_ms) to expire. This can cause the system to run critically short of memory during that time.

@snajpa
Contributor

snajpa commented Jun 13, 2020

> This patch changes the default behaviour to 'let the ARC grow if there is no shortage', from 'We will shrink the ARC if there is a shortage'. As @amotin explained, ZFS is waiting for arc_kmem_reap_soon() to return, and for delay(reap_retry_ms) to expire. This can cause the system to run critically short of memory during that time.

Oh yeah, I understand the first part, explained by the first sentence of that paragraph. But how is it in any way related to the second sentence? My point is that the second sentence is the actual problem statement, but the change described by the first sentence (and this PR) doesn't really offer a solution for that.

We should be going after the "ZFS is waiting for arc_kmem_reap_soon() to return, and for delay(reap_retry_ms) to expire" bit :-)

@amotin
Member Author

amotin commented Jun 14, 2020

> We should be going after the "ZFS is waiting for arc_kmem_reap_soon() to return, and for delay(reap_retry_ms) to expire" bit :-)

We have to wait for reclamation to complete before we get free memory and are able to estimate a new arc_c value. It may actually happen that there is no memory pressure after arc_kmem_reap_soon() completes, and the ARC will be allowed to grow further. The delay() gives the caches some time to minimally recover and reach a slightly steadier state. But the ARC should not grow while we are deciding all this, and we can't really skip it, or the kmem caches will press the ARC to the minimum over time.
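
The decision step described above can be sketched as a small pure function. This is a guess at the shape of the logic, not code taken from the thread or from ZFS: the toy_retarget name, the shrink_shift parameter and the exact shortfall rule are assumptions. The function is meant to be called only after the reap and the settling delay have finished, with the free memory measured at that point (negative meaning a shortage).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a new target after reclamation has finished.  `c` is the current
 * target, `c_min` its floor, and `free_after_reap` the free memory
 * measured once the caches have settled.  The wanted headroom is assumed
 * to be c >> shrink_shift; the target is lowered only by the shortfall.
 */
uint64_t
toy_retarget(uint64_t c, uint64_t c_min, unsigned shrink_shift,
    int64_t free_after_reap)
{
	assert(c >= c_min);

	int64_t to_free = (int64_t)(c >> shrink_shift) - free_after_reap;

	if (to_free <= 0)
		return (c);		/* no shortage left: keep target */
	if ((uint64_t)to_free >= c - c_min)
		return (c_min);		/* never go below the floor */
	return (c - (uint64_t)to_free);
}
```

Measuring only after the reap is what allows the first branch to be taken: if the reap alone freed enough, the target is left untouched and the ARC may resume growing.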

@snajpa
Contributor

snajpa commented Jun 15, 2020

I need to try this out on a live system, but after reading more through where arc_adapt() is actually called, it looks like my worries about re-reading things are absolutely unsubstantiated. Actually, this looks okay, but I'm still not sure it'll do much about the wacky behavior when out of memory and in an arc_no_grow situation. Just an idea: couldn't we use atomic access to a mostly read-only variable, so it can be modified lock-free and take effect immediately on change?
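
For readers unfamiliar with the idea floated here, a minimal user-space sketch of a lock-free, mostly-read flag using C11 atomics might look like the following. It only illustrates the general technique; the toy_* names are hypothetical and nothing here reflects how arc_no_grow is actually declared or accessed in ZFS.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Zero-initialized, so growth is allowed until the flag is set. */
static atomic_bool toy_no_grow;

/* Writer side: rare, e.g. when memory pressure is detected or clears. */
void
toy_set_no_grow(bool v)
{
	atomic_store_explicit(&toy_no_grow, v, memory_order_release);
}

/* Reader side: frequent, takes no lock; returns the new target size. */
uint64_t
toy_grow_target(uint64_t c, uint64_t c_max, uint64_t bytes)
{
	assert(c <= c_max);

	if (atomic_load_explicit(&toy_no_grow, memory_order_acquire))
		return (c);
	return (c_max - c < bytes ? c_max : c + bytes);
}
```

A flag like this stops further growth of the target as soon as a reader observes it, but by itself it does nothing about a target that is already high, which is the situation the PR description is concerned with.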

@amotin
Member Author

amotin commented Jun 15, 2020

This patch just makes the first ARC warmup work the same as the later ones. If there is something to address in steady-state operation, then it is a different problem.

Contributor

@behlendorf behlendorf left a comment

I think switching the initial value to arc_c_min has the additional benefit that it makes this code a little easier to reason about. Rather than assuming all of the memory is available to us and then adapting down, the ARC can initially grow as long as there is memory available.

> The problem seems to be more serious on FreeBSD and I guess Linux, since neither of them implement/use asynchronous kmem reclamation, so arc_kmem_reap_soon() can take more time

This may be slightly less of an issue on Linux because there is a mechanism to asynchronously reclaim from the kmem caches. The Linux spl layer registers a callback (spl_kmem_cache_generic_shrinker) with the kernel which is called when memory is starting to get low. Its behavior is identical to kmem_reap().

@snajpa any testing you can offer would definitely be welcome.

@snajpa
Contributor

snajpa commented Jun 15, 2020

@behlendorf yeah don't wait for me or anything, I'll just silently deploy this to staging and we'll see :)

I'll open up an issue/PR if there's anything I come up with along the way.

Thanks and sorry for not taking the time to read through the codepath properly on the first go :)

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jun 15, 2020
@behlendorf behlendorf merged commit 17ca301 into openzfs:master Jun 17, 2020
lundman referenced this pull request in openzfsonosx/openzfs Jun 19, 2020
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #10437
DeHackEd pushed a commit to DeHackEd/zfs that referenced this pull request Aug 19, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
@amotin amotin deleted the arc_c branch August 24, 2021 20:17