Set initial arc_c to arc_c_min instead of arc_c_max. #10437
For at least 15 years, since OpenSolaris, arc_c has been set by default to arc_c_max and later decreased under memory pressure. I've noticed that if arc_c is set high enough to cause memory pressure as considered by ZFS, setting arc_no_grow to TRUE in arc_reap_cb_check() has no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms) return. All that time ZFS can continue increasing its effective ARC size, causing more memory pressure, potentially up to the point when the OS low-memory handler activates and reduces arc_c, requesting fast reclamation of just-allocated memory.

The problem seems to be more serious on FreeBSD and, I guess, Linux, since neither of them implements/uses asynchronous kmem reclamation, so arc_kmem_reap_soon() can take more time. On older FreeBSD 11, which does not support multiple memory domains, a system with lots of RAM can become completely unresponsive for minutes due to heavy lock congestion between the ARC reclamation and pagedaemon kmem reclamation threads. With this change to a more conservative arc_c value, the ARC stops growing just in time and does not need later reclamation.

Also, while there, since growing arc_c is now a more frequent situation, use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt() to reduce lock congestion. This also brings it in sync with the code in arc_get_data_impl().

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Codecov Report
@@ Coverage Diff @@
## master #10437 +/- ##
=======================================
Coverage 79.43% 79.44%
=======================================
Files 391 391
Lines 123866 123867 +1
=======================================
+ Hits 98397 98406 +9
+ Misses 25469 25461 -8
This makes sense to me.
I can't say whether it's correct or not, but I've been running with this for a couple of days with no appreciable ill effects...
This makes absolutely no sense to me. How is setting anything at boot going to change the system's behavior down the road?
The system starts with arc_c as arc_c_max for a reason - to cache all it possibly reads during the boot, so it doesn't have to read it twice. If it started with arc_c_min, it would potentially have to re-read a lot of stuff during the boot, which would only slow it down.
I don't see a problem with having arc_c set to arc_c_max at boot. What occupies your RAM from the start during the boot? I always have tons of RAM free; I'd love for it to be put to good use, as it is being done now.
The problem you're trying to solve is in the reclaim - but it doesn't have anything in common with the system's boot.
Reclaim kicks in way later - in our systems, we see reclaim after at least 3-4 hours after boot (~100ish 4G-RAM sized containers on a 256G host, with arc_c set to 128G at boot). Hey, we're not buying tons of RAM just to have it sitting around unused.
I just don't get the reasoning, how is this in any way beneficial. For a server storage solution like ZFS? How?
Btw, over time, ARC would still grow to the arc_c_max you're complaining about (and btw2, it is really, really never full right from the system's boot). I must have missed something...
The default for arc_min is 1/4 of arc_max. I don't expect your boot process is going to consume nearly that much memory. And as you say, if there is no memory pressure, the ARC will grow towards arc_max anyway, so it makes no difference what the initial value is set to. This patch changes the default behaviour to 'let the ARC grow if there is no shortage', from 'we will shrink the ARC if there is a shortage'. As @amotin explained, ZFS is waiting for
Oh yeah, I understand the first part, explained by the first sentence of that paragraph. But how is it in any way related to the second sentence? My point is that the second sentence is the actual problem statement, but the change described by the first sentence (and this PR) doesn't really offer a solution for it. We should be going after the "ZFS is waiting for
We have to wait for reclamation to complete before we get free memory and can estimate a new arc_c value. It may actually happen that there is no memory pressure after arc_kmem_reap_soon() completes, and the ARC will be allowed to grow further. The delay() gives the caches some time to minimally recover and reach a somewhat steadier state. But the ARC should not grow while we are deciding all this, and we can't really skip it, or the kmem caches will press the ARC to the minimum over time.
I need to try this out on a live system, but after reading more through where
This patch just makes the first ARC warmup work the same way as later ones. If there is something to address in steady-state operation, then it is a different problem.
I think switching the initial value to arc_c_min has the additional benefit that it makes this code a little easier to reason about. Rather than assuming all of the memory is available to us and then adapting down, the ARC can initially grow as long as there is memory available.
The problem seems to be more serious on FreeBSD and I guess Linux, since neither of them implement/use asynchronous kmem reclamation, so arc_kmem_reap_soon() can take more time
This may be slightly less of an issue on Linux because there is a mechanism to asynchronously reclaim from the kmem caches. The Linux spl layer registers a callback (spl_kmem_cache_generic_shrinker) with the kernel, which is called when memory is starting to get low. Its behavior is identical to kmem_reap().
@snajpa any testing you can offer would definitely be welcome.
@behlendorf yeah, don't wait for me or anything, I'll just silently deploy this to staging and we'll see :) I'll open up an issue/PR if there's anything I come up with along the way. Thanks, and sorry for not taking the time to read through the code path properly on the first go :)
For at least 15 years, since OpenSolaris, arc_c has been set by default to arc_c_max and later decreased under memory pressure. I've noticed that if arc_c is set high enough to cause memory pressure as considered by ZFS, setting arc_no_grow to TRUE in arc_reap_cb_check() has no effect until both arc_kmem_reap_soon() and delay(reap_retry_ms) return. All that time ZFS can continue increasing its effective ARC size, causing more memory pressure, potentially up to the point when the OS low-memory handler activates and reduces arc_c, requesting fast reclamation of just-allocated memory.

The problem seems to be more serious on FreeBSD and, I guess, Linux, since neither of them implements/uses asynchronous kmem reclamation, so arc_kmem_reap_soon() can take more time. On older FreeBSD 11, which does not support multiple memory domains, a system with lots of RAM can become completely unresponsive for minutes due to heavy lock congestion between the ARC reclamation and page daemon kmem reclamation threads. With this change to a more conservative arc_c value, the ARC stops growing just in time and does not need later reclamation.

Also, while there, since growing arc_c is now a more frequent situation, use aggsum_upper_bound() instead of aggsum_compare() in arc_adapt() to reduce lock congestion. This also brings it in sync with the code in arc_get_data_impl().

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #10437