
Ordinary dd from a disk device can kill the ARC even after it finishes #3680

Closed
siebenmann opened this issue Aug 11, 2015 · 12 comments

@siebenmann
Contributor

I wanted to see the raw speed of my L2ARC SSD, so I did the obvious thing:

dd if=/dev/sdd of=/dev/null bs=1024k

To my surprise, this wound up destroying my ZFS ARC; its size shrank to 32 MB and stayed there even after the dd finished. In the end I had to do 'echo 1 >/proc/sys/vm/drop_caches' and then reset zfs_arc_max in order to get the ARC to recover (although it looks like a slower recovery would have happened over time anyway).

What seems to be happening here is that the straightforward dd floods the Linux page cache. This drives free memory down to the point where arc_no_grow turns on and ZFS starts shrinking the ARC. Unfortunately the page cache is voracious and will eat all the memory that ZFS frees up, sometimes driving the ARC all the way down to its 32 MB minimum size on my machine (and when the ARC stays above this, it is non-data that remains; data size goes to zero). When the dd finishes, the machine is still low on free memory, so arc_no_grow doesn't turn off; ZFS never tries to grow the ARC and thus never puts pressure on the page cache to evict all of those uselessly cached pages from the dd. To unlock the ARC I can wind up needing to explicitly purge the page cache; in the meantime, ARC performance is terrible. Even once arc_no_grow goes back to zero, ZFS is very reluctant to grow the ARC for data, even in the face of demand for it.

(e.g. 'arcstat.py 1' shows terrible ARC hit rates, and data_size remains almost zero.)

All of this is with the latest git tip, 6bec435, as I file this report.

@dweeezil
Contributor

@siebenmann As of 11f552f, arc_available_memory() is using nr_free_pages(), which of course will drop as other parts of the system use the page cache, buffers, etc. There may be a better calculation we can use which takes evictable memory into account, but in any case, this will need to be rethought.
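
A minimal sketch of the shape of that check (this is not the actual SPL code, and the function name below is made up for illustration):

    /* Sketch only, not the actual SPL source; it just illustrates
     * the failure mode described above. */
    #include <linux/swap.h>    /* nr_free_pages() */

    static int64_t
    arc_available_memory_sketch(void)
    {
            /*
             * nr_free_pages() counts only genuinely free pages, so a
             * page cache filled by a large sequential read (the dd
             * above) pushes this toward zero and keeps arc_no_grow
             * set, even though those cached pages could be dropped
             * on demand.
             */
            return ((int64_t)nr_free_pages() * PAGE_SIZE);
    }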

@ronnyegner

@siebenmann You want to use Direct I/O with dd, which bypasses the page cache and prevents this kind of issue. Since it is the read from /dev/sdd that floods the cache, the flag belongs on the input side: specify "iflag=direct".
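
For the reproducer above that would be, e.g.:

dd if=/dev/sdd of=/dev/null bs=1024k iflag=direct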

@dweeezil
Contributor

@ronnyegner The problem is a bit deeper than that. In the current implementation, a simple md5sum /ext4mntpoint/bigfiles/*, for example (reading lots of large files on an ext4-mounted filesystem), will consume the page cache and cause the ARC to collapse. The problem @siebenmann noted could be worked around by using nr_free_pages() + nr_blockdev_pages() for freemem, but that wouldn't solve the bigger issue of other page cache consumers dominating the whole system.

The current scheme under Linux w.r.t. the page cache appears to be that filesystems are supposed to use it for their cache but are also supposed to yield to user programs if possible. However, the scheme has problems when there are multiple filesystems competing for the page cache (which one wins?). In the case of ZFS, at least we can set zfs_arc_min. Even when ABD is fully integrated, a system running multiple page cache-using filesystems may exhibit awful performance in tight memory situations as the various filesystems fight for the memory. Again, at least in ZFS we have some tunables to, say, prevent the ARC from consuming all of memory, whereas ext4 (and likely all the other native filesystems) have no such cap on the amount of page cache they might use.
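
As a concrete illustration of those tunables (the values below are arbitrary examples, not recommendations), they can be set as ZFS module options, e.g. in /etc/modprobe.d/zfs.conf:

    # Example only: pin the ARC between 1 GiB and 12 GiB.
    options zfs zfs_arc_min=1073741824 zfs_arc_max=12884901888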

EDIT: And, nr_blockdev_pages() isn't exported so there's no good way to get what we need (which would also include global_page_state(NR_INACTIVE_FILE)).

EDIT[2]: Actually, it seems NR_INACTIVE_FILE might be enough to hack around the problem.

@dweeezil
Contributor

A quick test shows that https://github.com/dweeezil/spl/tree/freemem works around most issues related to this.

EDIT: This may very well be incompatible with ABD.

@dweeezil
Contributor

I've added a couple more sources of freeable memory in dweeezil/spl@08807f8 (https://github.com/dweeezil/spl/tree/freemem branch).
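
The branch itself isn't quoted in this thread, but judging from the counters named here (NR_INACTIVE_FILE above, and NR_SLAB_RECLAIMABLE mentioned below), the calculation is presumably along these lines; this is a guess at the shape, not the actual commit:

    /* Sketch only; an approximation of the approach, not the commit. */
    #include <linux/swap.h>      /* nr_free_pages() */
    #include <linux/vmstat.h>    /* global_page_state() */

    static int64_t
    freemem_sketch(void)
    {
            int64_t pages = nr_free_pages();

            /* Clean file-backed pages (e.g. a dd-flooded page cache)
             * can be dropped on demand, so count them as available. */
            pages += global_page_state(NR_INACTIVE_FILE);

            /* Reclaimable slab can be shrunk too, though counting all
             * of it may be overly aggressive (see the discussion
             * below). */
            pages += global_page_state(NR_SLAB_RECLAIMABLE);

            return (pages * PAGE_SIZE);
    }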

@kernelOfTruth
Contributor

> EDIT: This may very well be incompatible with ABD.

@dweeezil What makes you say that? Please explain (I'm eager to learn about the internals :) )

If it compiles with ABD, I'll give it a try, thanks!

@dweeezil
Contributor

@kernelOfTruth Under ABD, we'd like to exclude ABD's own reclaimable pages from the freemem calculation if they are accounted for in global_page_state(NR_INACTIVE_FILE).

EDIT: Thinking about this a bit more, it might not matter. I've not looked at ABD in a while.

@siebenmann
Contributor Author

@dweeezil Also, including NR_SLAB_RECLAIMABLE only works (to the extent that it works) due to #3679. Including it may be overly aggressive in general. Sadly, all of this smells a little bit of heuristics.

@behlendorf added this to the 0.6.5 milestone Aug 18, 2015
@behlendorf
Contributor

At the moment ABD isn't integrated with the page cache (it's a first step), but once it is we'll need to revisit this. As for this being a heuristic, that's definitely the case. But FWIW, the entire Linux VM is just a collection of evolved heuristics that happen to work well for many workloads. @dweeezil's proposed patch looks like a very sensible way to handle this case for now. Let me run it through some additional testing and get it merged.

@behlendorf
Contributor

Thanks guys, I've merged the fix.

@kernelOfTruth
Contributor

@dweeezil I've been running your fix with ABD for some time now and have observed no issues.

Thanks!

@grizzlyfred

I have ZoL 0.6.5-3 on Mint from the PPA; I'm not sure whether that is too old to have the patch. I too have seen a shrinking ARC, in conjunction with creating a sparse 1 TB zvol and putting XFS over LUKS on it.

Now I am observing a gradual shrinking of the ARC while copying data from another pool to the zvol: from the initial 12 GB (of the machine's 32 GB) down towards the 1 GB limit I set. At some fraction of the read rate (around 50 MB per second) the ARC shrinks, then recovers, but never back to its maximum: 12, 11.9, 11.8, 11.7 ... 11 ... 11.5, 11.4, 11.3 ... 11.1, 10.9 ..., like a sine wave attenuating until it fades. I used to just send the zvol to the backup pool, but that resulted in non-sparseness, so I thought I'd give simply copying the files over a chance, in order to get the sparse zvol back down to a sane size.

As soon as I issue the "drop caches", the 12 GB are quickly reached again (at the source pool's read rate).
For now I have put drop_caches in cron.hourly for lack of a better solution (there is only a btrfs for the system OS on an SSD, so I don't much care about the block cache). I am "only" using 12 GB because I need space for VMs as well.

EDIT: I just discovered (closed) #548. Sure enough, my backup pool is a 6-disk raidz2. When writing e.g. 8 MB of random data to the newly created zvol, only with -o volblocksize=16K, 32K, 64K, or 128K did the space requirements NOT double... OK, so I learned that I have to use a block size that can be striped across four data disks.
