ARC grows over zfs_arc_max (700M) and causes OOM panic #2840
Comments
@thegreatgazoo If you're using a kernel >= 3.12, you need the patch in #2837 (and, potentially, openzfs/spl#403) for this to work properly.
This sounds like a metadata workload. I've run into something similar a lot in my time. Rsync or other big file operations? I have a bug report open at #1932. You can try this patch to see if it improves things: DeHackEd/zfs@2827b75. It's one line and can easily be applied by hand for a manual recompile. It's not a solution, but it may make the problem much milder for reasonable workloads.
@dweeezil It's CentOS 6.5, kernel 2.6.32-431.29.2.el6_lustre.g9835a2a.x86_64. Currently I can't find the source to verify, but I'd tend to believe there wasn't a backport of a 3.12 feature.
@DeHackEd It's not metadata heavy. It's some Lustre/osd-zfs tests; no ZPL involved, just DMU. I didn't capture full arcstats, but from the arcstat.py output the metadata reads were about 3% of total ARC reads.
@thegreatgazoo You're right, the 2.6 EL kernels definitely have the old shrinker interface. I suppose it might be interesting to sample at least the ARC reclaim counters.
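A minimal sketch of one way to sample those counters on Linux, assuming the usual /proc/spl/kstat/zfs/arcstats kstat location; the counter names in WATCH are only examples and vary between releases, so adjust them to whatever your arcstats file actually exposes:

```python
#!/usr/bin/env python
# Sketch: periodically dump selected ARC counters from the arcstats kstat.
# Field names in WATCH are examples only; check your own arcstats output.
import time

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"
WATCH = ["size", "c", "c_max", "memory_throttle_count", "evict_skip", "mutex_miss"]

def read_arcstats():
    stats = {}
    with open(ARCSTATS) as f:
        for line in f.readlines()[2:]:      # skip the two kstat header lines
            name, _kind, value = line.split()
            stats[name] = int(value)
    return stats

while True:
    s = read_arcstats()
    print("  ".join("%s=%s" % (k, s.get(k, "n/a")) for k in WATCH))
    time.sleep(2)
```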
@dweeezil Yes, there's swap directly on /dev/vda2, no ZVOL or anything ZFS involved. But it's on the same disk as the partitions of the zpools. I'll capture arcstats once I have a chance to rerun those tests.
It would be interesting to see if #2826 helps. We've seen considerable contention on these locks in the ARC and I wonder if the contention might end up having some subtle side effects like overly aggressive reclaim. It could be good for performance too.
That's an interesting idea. There are certainly times when that may make some sense, for example when
Actually, Lustre does end up being pretty metadata heavy from a ZFS perspective due to all the xattrs. They're large enough to force a spill block for every object, which causes additional I/O and additional metadata objects in the ARC. This is one of the reasons I'm so keen on the large dnode work; I expect it to help considerably.
@behlendorf I'll try #2826 once I have a chance. I think prefetching was the troublemaker here. The number of prefetch reads was almost always more than 90% of total reads, as shown here (edited from arcstat.py output):
The prefetcher kept adding data to the ARC, despite that:
The OSD lu_cache_nr was at the default 10240, which might have prevented the ARC from shrinking quickly. But when ARC shrinking began at 22:11:25, the ARC size was already twice its limit due to prefetching.
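For reference, the prefetch share can also be estimated directly from the cumulative arcstats counters rather than from arcstat.py samples; a rough sketch, assuming the prefetch_*/misses counter names present in ZoL 0.6.x:

```python
# Sketch: estimate what fraction of ARC misses were triggered by prefetch,
# using cumulative counters from arcstats (names assumed for ZoL 0.6.x).
def prefetch_miss_fraction(stats):
    prefetch = (stats.get("prefetch_data_misses", 0) +
                stats.get("prefetch_metadata_misses", 0))
    total = stats.get("misses", 0)
    return float(prefetch) / total if total else 0.0

# Reusing read_arcstats() from the earlier sketch:
# print("prefetch share of misses: %.1f%%" % (100 * prefetch_miss_fraction(read_arcstats())))
```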
@thegreatgazoo Could you grab
We were seeing something similar to this on our larger high-I/O-load file servers (36 TB RAID10, dual 10G Intel NICs). In the end it was the crappy Adaptec kernel driver not supporting MSI-X that seems to have caused the kernel panics and runaway ARC size. Patching our kernels with a newer driver that properly supported MSI-X stopped the OOMs, the 300% max ARC growth, and the CPU stalls. A single interrupt on a single CPU for your drive subsystem is BAD :)
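Tangentially, one quick way to spot that single-interrupt situation is to see how a driver's interrupt counts are spread across CPUs in /proc/interrupts; a small sketch (the "aacraid" match string is just an example for an Adaptec HBA, substitute your own driver name):

```python
# Sketch: report how many CPUs are actually servicing a device's interrupts.
# The match string is an example; use your HBA/NIC driver name.
def irq_spread(match="aacraid"):
    with open("/proc/interrupts") as f:
        cpus = f.readline().split()              # header row: CPU0 CPU1 ...
        for line in f:
            if match in line:
                fields = line.split()
                counts = [int(x) for x in fields[1:1 + len(cpus)]]
                busy = sum(1 for c in counts if c > 0)
                print("%s: %d of %d CPUs handling interrupts"
                      % (fields[0].rstrip(":"), busy, len(cpus)))

irq_spread()
```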
https://github.com/DeHackEd/zfs/blob/2827b75ae1fd5c0c1627dd46eac0b81992bf4afd/include/sys/zfs_vfsops.h to a higher value, going with an experimental value of 2048
This reverts commit f3f5ece.
Closing. Since this issue was last updated, multiple improvements have been made to the ARC to address this kind of issue.
This is almost 100% reproducible with ZFS v0.6.3-1 on a test VM with 1.8G memory:
[root@eagle-44vm1 ~]# arcstat.py 2
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
......
22:10:48 1.0K 1.0K 98 32 66 992 100 0 0 700M 700M
22:10:50 682 672 98 21 66 651 100 0 0 700M 700M
22:10:52 715 704 98 22 66 682 100 0 0 700M 700M
22:10:54 801 789 98 25 66 764 100 0 0 700M 700M
22:10:56 638 632 99 32 85 600 100 13 71 722M 700M
22:10:58 495 495 100 30 100 465 100 15 100 728M 700M
22:11:00 396 396 100 24 100 372 100 12 100 774M 700M
22:11:02 579 579 100 35 100 544 100 17 100 858M 700M
22:11:05 278 278 100 16 100 262 100 8 100 888M 700M
22:11:07 331 331 100 21 100 310 100 11 100 929M 700M
22:11:09 296 296 99 17 97 279 100 8 100 974M 700M
22:11:11 330 330 100 20 100 310 100 10 100 1017M 700M
22:11:13 264 264 100 16 100 248 100 8 100 1.0G 700M
22:11:15 330 330 100 20 100 310 100 10 100 1.1G 700M
22:11:17 278 278 99 18 97 260 100 9 100 1.1G 700M
22:11:19 231 228 98 14 84 214 100 7 100 1.1G 700M
22:11:21 279 274 98 18 78 256 100 10 100 1.2G 700M
22:11:23 278 276 99 15 88 261 100 7 100 1.2G 700M
22:11:25 827 827 100 52 100 775 100 26 100 1.4G 633M
22:11:27 314 314 99 40 98 274 100 29 100 1.4G 208M
22:11:29 625 621 99 54 92 566 100 37 100 1.5G 4.0M
At 22:10:54 the ARC size began to go out of control and grew over the 700M limit, and eventually caused an OOM panic.
At 22:11:23, the system seemed to begin trying to shrink the ARC. It seemed overly aggressive to set the ARC target size to only 4.0M. Also, once the shrink began, there were still way too many prefetch requests (in fact, the majority of the requests were prefetches). It seemed to me that once we began shrinking the ARC, it would make sense to limit prefetching and leave the precious ARC space to demand reads.
So in short there appeared to be 3 problems here:
Then I disabled prefetching and ran the test again (with the ARC limit increased to 800M, all else the same). The test completed successfully, and the ARC size never grew over the limit. So prefetching seemed to be the troublemaker here.
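For anyone trying to reproduce this, the two knobs used above are ZFS on Linux module parameters; a hedged sketch of reading and setting them through sysfs (paths assume ZoL, writes require root, and how quickly a new zfs_arc_max takes effect differs between releases):

```python
# Sketch: inspect/set the module parameters used in this test via sysfs.
# Assumes ZFS on Linux; writing requires root.
PARAMS = "/sys/module/zfs/parameters/"

def get_param(name):
    with open(PARAMS + name) as f:
        return f.read().strip()

def set_param(name, value):
    with open(PARAMS + name, "w") as f:
        f.write(str(value))

print("zfs_arc_max = " + get_param("zfs_arc_max"))
print("zfs_prefetch_disable = " + get_param("zfs_prefetch_disable"))
# set_param("zfs_prefetch_disable", "1")   # disable prefetch, as done in this test
```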
As this is very reproducible, if any debugging information is needed, please let me know.