
Ordinary dd from a disk device can kill the ARC even after it finishes #3680

Closed
siebenmann opened this issue Aug 11, 2015 · 12 comments

@siebenmann
Contributor

I wanted to see the raw speed of my L2ARC SSD, so I did the obvious thing:

dd if=/dev/sdd of=/dev/null bs=1024k

To my surprise, this wound up destroying my ZFS ARC; its size shrank to 32 MB and stayed there even after the dd finished. In the end I had to do 'echo 1 >/proc/sys/vm/drop_caches' and then reset zfs_arc_max in order to get the ARC to recover (although it looks like a slower recovery would have happened over time anyway).

What seems to be happening here is that the straightforward dd floods the Linux page cache. This drives free memory down to the point where arc_no_grow turns on and ZFS starts shrinking the ARC. Unfortunately the page cache is voracious and will eat all the memory that ZFS frees up, sometimes driving the ARC all the way down to its 32 MB minimum size on my machine (and when the ARC stays above this, it is non-data that remains; data size goes to zero). When the dd finishes, the machine is still low on free memory, so arc_no_grow doesn't turn off; ZFS never tries to grow the ARC and thus never puts pressure on the page cache to evict all of those uselessly cached pages from the dd. To unlock the ARC I can wind up needing to explicitly purge the page cache; in the meantime, ARC performance is terrible. Even once arc_no_grow goes back to zero, ZFS is very reluctant to grow the ARC for data, even in the face of demand for it.

(e.g. 'arcstat.py 1' shows terrible ARC hit rates, and data_size remains almost zero.)

All of this is with the latest git tip, 6bec435, as I file this report.

@dweeezil
Contributor

@siebenmann As of 11f552f, arc_available_memory() is using nr_free_pages(), which of course will drop as other parts of the system use the page cache, buffers, etc. There may be a better calculation we can use which takes evictable memory into account, but in any case, this will need to be rethought.
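
A minimal sketch of the shape of that check (this is not the actual SPL code, and the function name below is made up for illustration):

    /* Sketch only, not the actual SPL source; it just illustrates
     * the failure mode described above. */
    #include <linux/swap.h>    /* nr_free_pages() */

    static int64_t
    arc_available_memory_sketch(void)
    {
            /*
             * nr_free_pages() counts only genuinely free pages, so a
             * page cache filled by a large sequential read (the dd
             * above) pushes this toward zero and keeps arc_no_grow
             * set, even though those cached pages could be dropped
             * on demand.
             */
            return ((int64_t)nr_free_pages() * PAGE_SIZE);
    }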

@ronnyegner

@siebenmann You want to use Direct I/O with dd, which bypasses the page cache and prevents this kind of issue. Since it is the read from /dev/sdd that floods the cache, the flag belongs on the input side: specify "iflag=direct".
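
For the reproducer above that would be, e.g.:

dd if=/dev/sdd of=/dev/null bs=1024k iflag=direct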

@dweeezil
Contributor

@ronnyegner The problem is a bit deeper than that. In the current implementation, a simple md5sum /ext4mntpoint/bigfiles/*, for example (reading lots of large files on an ext4-mounted filesystem), will consume the page cache and cause the ARC to collapse. The problem @siebenmann noted could be worked around by using nr_free_pages() + nr_blockdev_pages() for freemem, but that wouldn't solve the bigger issue of other page cache consumers dominating the whole system.

The current scheme under Linux w.r.t. the page cache appears to be that filesystems are supposed to use it for their cache but are also supposed to yield to user programs if possible. However, the scheme has problems when there are multiple filesystems competing for the page cache (which one wins?). In the case of ZFS, at least we can set zfs_arc_min. Even when ABD is fully integrated, a system running multiple page cache-using filesystems may exhibit awful performance in tight memory situations as the various filesystems fight for the memory. Again, at least in ZFS we have some tunables to, say, prevent the ARC from consuming all of memory, whereas ext4 (and likely all the other native filesystems) have no such cap on the amount of page cache they might use.
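
As a concrete illustration of those tunables (the values below are arbitrary examples, not recommendations), they can be set as ZFS module options, e.g. in /etc/modprobe.d/zfs.conf:

    # Example only: pin the ARC between 1 GiB and 12 GiB.
    options zfs zfs_arc_min=1073741824 zfs_arc_max=12884901888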

EDIT: And, nr_blockdev_pages() isn't exported so there's no good way to get what we need (which would also include global_page_state(NR_INACTIVE_FILE)).

EDIT[2]: Actually, it seems NR_INACTIVE_FILE might be enough to hack around the problem.

@dweeezil
Contributor

A quick test shows that https://github.com/dweeezil/spl/tree/freemem works around most issues related to this.

EDIT: This may very well be incompatible with ABD.

@dweeezil
Contributor

I've added a couple more sources of freeable memory in dweeezil/spl@08807f8 (https://github.com/dweeezil/spl/tree/freemem branch).
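
The branch itself isn't quoted in this thread, but judging from the counters named here (NR_INACTIVE_FILE above, and NR_SLAB_RECLAIMABLE mentioned below), the calculation is presumably along these lines; this is a guess at the shape, not the actual commit:

    /* Sketch only; an approximation of the approach, not the commit. */
    #include <linux/swap.h>      /* nr_free_pages() */
    #include <linux/vmstat.h>    /* global_page_state() */

    static int64_t
    freemem_sketch(void)
    {
            int64_t pages = nr_free_pages();

            /* Clean file-backed pages (e.g. a dd-flooded page cache)
             * can be dropped on demand, so count them as available. */
            pages += global_page_state(NR_INACTIVE_FILE);

            /* Reclaimable slab can be shrunk too, though counting all
             * of it may be overly aggressive (see the discussion
             * below). */
            pages += global_page_state(NR_SLAB_RECLAIMABLE);

            return (pages * PAGE_SIZE);
    }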

@kernelOfTruth
Contributor

> EDIT: This may very well be incompatible with ABD.

@dweeezil What makes you say that? Please explain (I'm eager to learn about the internals :) )

If it compiles with ABD, I'll give it a try, thanks!

@dweeezil
Contributor

@kernelOfTruth Under ABD, we'd like to exclude ABD's own reclaimable pages from the freemem calculation if they are accounted for in global_page_state(NR_INACTIVE_FILE).

EDIT: Thinking about this a bit more, it might not matter. I've not looked at ABD in a while.

@siebenmann
Contributor Author

@dweeezil Also, including NR_SLAB_RECLAIMABLE only works (to the extent that it works) due to #3679. Including it may be overly aggressive in general. Sadly, all of this smells a little bit of heuristics.

@behlendorf added this to the 0.6.5 milestone Aug 18, 2015
@behlendorf
Contributor

At the moment ABD isn't integrated with the page cache (it's a first step), but once it is we'll need to revisit this. As for this being a heuristic, that's definitely the case. But FWIW, the entire Linux VM is just a collection of evolved heuristics that happen to work well for many workloads. @dweeezil's proposed patch looks like a very sensible way to handle this case for now. Let me run it through some additional testing and get it merged.

@behlendorf
Contributor

Thanks guys, I've merged the fix.

@kernelOfTruth
Contributor

@dweeezil I've been running your fix with ABD for some time now and have observed no issues.

Thanks!

@grizzlyfred

I have ZoL 0.6.5-3 on Mint from the PPA; I'm not sure whether that is too old to have the patch. I too have seen a shrinking ARC, in conjunction with creating a sparse 1 TB zvol and putting XFS over LUKS on it.

Now I am observing a gradual shrinking of the ARC while copying data from another pool to the zvol: from the initial 12 GB (of the machine's 32 GB) down towards the 1 GB limit I set. At some fraction of the read rate (around 50 MB per second) the ARC shrinks, then recovers, but never back to its maximum: 12, 11.9, 11.8, 11.7 ... 11 ... 11.5, 11.4, 11.3 ... 11.1, 10.9 ..., like a sine wave attenuating until it fades. I used to just send the zvol to the backup pool, but that resulted in non-sparseness, so I thought I'd give simply copying the files over a chance, in order to get the sparse zvol back down to a sane size.

As soon as I issue the "drop caches", the 12 GB are quickly reached again (at the source pool's read rate).
For now I have put drop_caches in cron.hourly for lack of a better solution (there is only a btrfs for the system OS on an SSD, so I don't much care about the block cache). I am "only" using 12 GB because I need space for VMs as well.

EDIT: I just discovered (closed) #548. Sure enough, my backup pool is a 6-disk raidz2. When writing e.g. 8 MB of random data to the newly created zvol, only with -o volblocksize=16K, 32K, 64K, or 128K did the space requirements NOT double... OK, so I learned that I have to use a block size that can be striped across four data disks.
