ARC grows over zfs_arc_max (700M) and causes OOM panic #2840
Comments
@thegreatgazoo If you're using a kernel >= 3.12, you need the patch in #2837 (and, potentially, openzfs/spl#403) for this to work properly.
This sounds like a metadata workload. I've run into something similar a lot in my time. Rsync or other big file operations? I have a bug report open at #1932. You can try this patch to see if it improves things: DeHackEd/zfs@2827b75. It's one line and can easily be applied by hand for a manual recompile. It's not a solution, but it may make the problem much milder for reasonable workloads.
@dweeezil It's CentOS 6.5, kernel 2.6.32-431.29.2.el6_lustre.g9835a2a.x86_64. Currently I can't find the source to verify, but I'd tend to believe there wasn't a backport of a 3.12 feature.
@DeHackEd It's not metadata heavy. It's some Lustre/osd-zfs tests; no ZPL involved, just DMU. I didn't capture full arcstats, but from the arcstat.py output the metadata reads were about 3% of total ARC reads.
@thegreatgazoo You're right, the 2.6 EL kernels definitely have the old shrinker interface. I suppose it might be interesting to sample at least the ARC reclaim counters.
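A minimal sketch of one way to sample those counters on Linux, assuming the usual /proc/spl/kstat/zfs/arcstats kstat location; the counter names in WATCH are only examples and vary between releases, so adjust them to whatever your arcstats file actually exposes:

```python
#!/usr/bin/env python
# Sketch: periodically dump selected ARC counters from the arcstats kstat.
# Field names in WATCH are examples only; check your own arcstats output.
import time

ARCSTATS = "/proc/spl/kstat/zfs/arcstats"
WATCH = ["size", "c", "c_max", "memory_throttle_count", "evict_skip", "mutex_miss"]

def read_arcstats():
    stats = {}
    with open(ARCSTATS) as f:
        for line in f.readlines()[2:]:      # skip the two kstat header lines
            name, _kind, value = line.split()
            stats[name] = int(value)
    return stats

while True:
    s = read_arcstats()
    print("  ".join("%s=%s" % (k, s.get(k, "n/a")) for k in WATCH))
    time.sleep(2)
```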
@dweeezil Yes, there's swap directly on /dev/vda2, no ZVOL or anything ZFS involved. But it's on the same disk as the partitions of the zpools. I'll capture arcstats once I have a chance to rerun those tests.
It would be interesting to see if #2826 helps. We've seen considerable contention on these locks in the ARC and I wonder if the contention might end up having some subtle side effects like overly aggressive reclaim. It could be good for performance too.
That's an interesting idea. There are certainly times when that may make some sense, for example when
Actually, Lustre does end up being pretty metadata heavy from a ZFS perspective due to all the xattrs. They're large enough to force a spill block for every object, which causes additional I/O and additional metadata objects in the ARC. This is one of the reasons I'm so keen on the large dnode work; I expect it to help considerably.
@behlendorf I'll try #2826 once I have a chance. I think prefetching was the troublemaker here. The number of prefetch reads was almost always more than 90% of total reads, as shown here (edited from arcstat.py output):
The prefetcher kept adding data to the ARC, despite that:
The OSD lu_cache_nr was at the default 10240, which might have prevented the ARC from shrinking quickly. But when ARC shrinking began at 22:11:25, the ARC size was already twice its limit due to prefetching.
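For reference, the prefetch share can also be estimated directly from the cumulative arcstats counters rather than from arcstat.py samples; a rough sketch, assuming the prefetch_*/misses counter names present in ZoL 0.6.x:

```python
# Sketch: estimate what fraction of ARC misses were triggered by prefetch,
# using cumulative counters from arcstats (names assumed for ZoL 0.6.x).
def prefetch_miss_fraction(stats):
    prefetch = (stats.get("prefetch_data_misses", 0) +
                stats.get("prefetch_metadata_misses", 0))
    total = stats.get("misses", 0)
    return float(prefetch) / total if total else 0.0

# Reusing read_arcstats() from the earlier sketch:
# print("prefetch share of misses: %.1f%%" % (100 * prefetch_miss_fraction(read_arcstats())))
```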
@thegreatgazoo Could you grab
We were seeing something similar to this on our larger high-I/O-load file servers (36 TB RAID10, dual 10G Intel NICs). In the end it was the crappy Adaptec kernel driver not supporting MSI-X that seems to have caused the kernel panics and runaway ARC size. Patching our kernels with a newer driver that properly supported MSI-X stopped the OOMs, the 300% max ARC growth, and the CPU stalls. A single interrupt on a single CPU for your drive subsystem is BAD :)
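Tangentially, one quick way to spot that single-interrupt situation is to see how a driver's interrupt counts are spread across CPUs in /proc/interrupts; a small sketch (the "aacraid" match string is just an example for an Adaptec HBA, substitute your own driver name):

```python
# Sketch: report how many CPUs are actually servicing a device's interrupts.
# The match string is an example; use your HBA/NIC driver name.
def irq_spread(match="aacraid"):
    with open("/proc/interrupts") as f:
        cpus = f.readline().split()              # header row: CPU0 CPU1 ...
        for line in f:
            if match in line:
                fields = line.split()
                counts = [int(x) for x in fields[1:1 + len(cpus)]]
                busy = sum(1 for c in counts if c > 0)
                print("%s: %d of %d CPUs handling interrupts"
                      % (fields[0].rstrip(":"), busy, len(cpus)))

irq_spread()
```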
https://github.com/DeHackEd/zfs/blob/2827b75ae1fd5c0c1627dd46eac0b81992bf4afd/include/sys/zfs_vfsops.h to a higher value, going with an experimental value of 2048
This reverts commit f3f5ece.
Closing. Since this issue was last updated, multiple improvements have been made to the ARC to address this kind of issue.
This is almost 100% reproducible with ZFS v0.6.3-1 on a test VM with 1.8G memory:
[root@eagle-44vm1 ~]# arcstat.py 2
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
......
22:10:48 1.0K 1.0K 98 32 66 992 100 0 0 700M 700M
22:10:50 682 672 98 21 66 651 100 0 0 700M 700M
22:10:52 715 704 98 22 66 682 100 0 0 700M 700M
22:10:54 801 789 98 25 66 764 100 0 0 700M 700M
22:10:56 638 632 99 32 85 600 100 13 71 722M 700M
22:10:58 495 495 100 30 100 465 100 15 100 728M 700M
22:11:00 396 396 100 24 100 372 100 12 100 774M 700M
22:11:02 579 579 100 35 100 544 100 17 100 858M 700M
22:11:05 278 278 100 16 100 262 100 8 100 888M 700M
22:11:07 331 331 100 21 100 310 100 11 100 929M 700M
22:11:09 296 296 99 17 97 279 100 8 100 974M 700M
22:11:11 330 330 100 20 100 310 100 10 100 1017M 700M
22:11:13 264 264 100 16 100 248 100 8 100 1.0G 700M
22:11:15 330 330 100 20 100 310 100 10 100 1.1G 700M
22:11:17 278 278 99 18 97 260 100 9 100 1.1G 700M
22:11:19 231 228 98 14 84 214 100 7 100 1.1G 700M
22:11:21 279 274 98 18 78 256 100 10 100 1.2G 700M
22:11:23 278 276 99 15 88 261 100 7 100 1.2G 700M
22:11:25 827 827 100 52 100 775 100 26 100 1.4G 633M
22:11:27 314 314 99 40 98 274 100 29 100 1.4G 208M
22:11:29 625 621 99 54 92 566 100 37 100 1.5G 4.0M
At 22:10:54 the ARC size began to go out of control and grew over the 700M limit, and eventually caused an OOM panic.
At 22:11:23, the system seemed to begin trying to shrink the ARC. It seemed overly aggressive to set the ARC target size to only 4.0M. Also, once the shrink began, there were still way too many prefetch requests (in fact, the majority of the requests were prefetches). It seemed to me that once we began shrinking the ARC, it would make sense to limit prefetching and leave the precious ARC space to demand reads.
So in short there appeared to be 3 problems here:
Then I disabled prefetching and ran the test again (with the ARC limit increased to 800M, all else the same). The test completed successfully, and the ARC size never grew over the limit. So prefetching seemed to be the troublemaker here.
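For anyone trying to reproduce this, the two knobs used above are ZFS on Linux module parameters; a hedged sketch of reading and setting them through sysfs (paths assume ZoL, writes require root, and how quickly a new zfs_arc_max takes effect differs between releases):

```python
# Sketch: inspect/set the module parameters used in this test via sysfs.
# Assumes ZFS on Linux; writing requires root.
PARAMS = "/sys/module/zfs/parameters/"

def get_param(name):
    with open(PARAMS + name) as f:
        return f.read().strip()

def set_param(name, value):
    with open(PARAMS + name, "w") as f:
        f.write(str(value))

print("zfs_arc_max = " + get_param("zfs_arc_max"))
print("zfs_prefetch_disable = " + get_param("zfs_prefetch_disable"))
# set_param("zfs_prefetch_disable", "1")   # disable prefetch, as done in this test
```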
As this is very reproducible, if any debugging information is needed, please let me know.