Support swap on zvol #342
If at all possible, could you include a stack trace from the deadlocked system? The reproducer is great, but the stack would be helpful to identify the exact issue you're hitting.
Managed to capture some stuff, though all in binary form, not text. Images here: http://www.dropbox.com/gallery/34890805/1/SwapZVOL?h=74755c Those show first an OOM kill that the kernel did itself, followed by what looks like the actual oops in ZFS. I've also got various output from magic SysRq showing blocked processes and full backtraces. That's in video format (easiest way was to screen-cap the VM). I've got to trim and transcode that and will post a URL shortly.
Dumps from magic SysRq here (13MB x264): …
Quick update: a suggestion was made on zfs-discuss to set the zvol block size to the same as the Linux page size. It might have made the situation slightly better, as I remember this being instant death on swap previously, whereas with …
@pendor would you retest this against the zvol patches that Brian merged into HEAD today?
Latest trunk (at 34037af) has different symptoms, but I don't think it's working yet. No more kernel oops, no output in dmesg that I can see anymore (had it on a 1-second loop to dump the last 20 lines), but as soon as it hits swap, the system pretty much locks, and any I/O to ZFS blocks indefinitely. I tried the same with swap on a separate dedicated drive (raw block, no ZFS involved), and the system worked as expected. Slow as death, but still functioning while thrashing the disk. With the ZVOL, there was no response to Ctrl-Alt-Del. It's a VM, and I can't get SysRq to it easily from the host unfortunately. Not sure if there's any other way to capture more useful info?
Bumping up the kernel printk level and then watching …
@pendor, I believe that you are using VMware Player. You can configure your system to use a serial console if you add a serial port to your VM (from server to application, using a socket), compile your kernel with serial support, uncomment the s0 line in /etc/inittab, and optionally modify your kernel command line to include 'console=tty0 console=ttyS0,115200'. ttyS0 is the one needed for kernel messages to reach the serial port, but having both statements permits both consoles to receive kernel messages during startup. Doing that should enable most of the stuff that Brian's comment likely assumes you can access. There are likely a few bits missing, specifically permitting kernel selection from GRUB and ensuring that kernel panics are written to the serial console; I haven't had a crash with which to verify the second part.
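The configuration described above can be sketched as follows. This is an illustrative summary under the assumptions in the comment (a socket-backed serial port on the VM, agetty available on the guest); the getty line and host socket path are examples, not taken from the thread.

```shell
# Sketch of the serial-console setup described above; device names,
# speeds and the host socket path are illustrative assumptions.

# 1. /etc/inittab -- uncomment (or add) a getty on the serial port:
#      s0:12345:respawn:/sbin/agetty -L 115200 ttyS0 vt100

# 2. Kernel command line (in your boot loader config) -- send console
#    output to both the VGA console and the serial port:
#      console=tty0 console=ttyS0,115200

# 3. On the host, attach to the socket-backed serial port, e.g.:
#      socat -,raw,echo=0 UNIX-CONNECT:/tmp/vm-serial.sock
```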
I revisited this issue today by filling a tmpfs on my desktop. The system freezes completely. The only useful information that I have is that I used a sparse zvol that was 27.7M in size after the crash according to …
Any luck getting a stack trace from the system before (or after) it crashes?
Unfortunately, no. My desktop lacks a serial port, so my ability to obtain stack traces is limited. I recall hearing that Linux has the ability to send stack traces from kernel panics over the network, although I am not familiar with any documentation on how that could be configured. Right now, the best I could do is write a script that will run this with X disabled and hope that something is printed to my pty so that I can take a picture, although I doubt that I would capture the entire trace, assuming that it is printed. I would appreciate any suggestions of methods that might improve my ability to obtain a stack trace.
I have asked in Gentoo development channels for advice on this topic. The main suggestions that I received are the following: http://www.kernel.org/doc/Documentation/networking/netconsole.txt I will try to apply these to the problem as I have time.
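For reference, a minimal netconsole setup along the lines of that kernel document might look like this; the IP addresses, interface name and MAC address are placeholders, not values from this thread.

```shell
# Hypothetical netconsole setup (all addresses are placeholders).
# On the machine that crashes: send kernel messages over UDP to a
# log collector at 192.168.1.10.
modprobe netconsole \
    netconsole=6665@192.168.1.5/eth0,6666@192.168.1.10/00:11:22:33:44:55
# Raise the console log level so oops traces are included:
dmesg -n 8
# On the collector, listen for the messages:
#   nc -u -l 6666
```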
I happen to use KVM and virsh for my VM needs, which allows me to use …
Does the kernel assume that a swap device can be accessed asynchronously? It may be a problem related to issue #223.
@pyavdr, if that were the case, I doubt that I could have gotten anything into my swap file. On a more general note, I found a comment regarding this on the FreeBSD forums:
http://forums.freebsd.org/showpost.php?p=155452&postcount=16
I did an experiment. I disabled the shrinker callback registration in ZFS, recompiled, rebuilt my initramfs and rebooted. Then I did … A plausible explanation would seem to be the comment on the FreeBSD forums. Specifically, the ARC is attempting to allocate memory when it is being asked to write to a zvol by kswapd, which fails. That renders swap useless, and anything else that requests memory will then fail.
I just duplicated this experiment with two changes.
That resulted in the following:
This implies that swap on a zvol works provided that min_free_kbytes is high enough. I also believe that it validates the explanation in my previous comment based on the statement on the FreeBSD forums. Note that I did these experiments with the shrinker logic disabled, so it is possible that bugs in it will prevent this from working. I need to do more tests with it enabled again.
I just repeated these tests with the shrinker logic enabled. I can confirm that swap on zvols works with it enabled as well. I also tried testing a few lower values of min_free_kbytes. 262144 appeared to almost work, but I had a lockup after about 160MB had been written. I did my testing on my desktop with Brian's VM patch. His patch had caused some rather prominent lags on my swapless system, but desktop performance was rather good during my tests with swap. When swapping, there was a small drop in interactive response, but performance remained good. This needs further testing, but the initial results are rather promising.
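For anyone wanting to repeat these experiments, the setup can be sketched as below. The pool and volume names are examples, and the min_free_kbytes value is simply chosen comfortably above the 262144 that almost worked in the tests; adjust for your system.

```shell
# Sketch of the experiment above: swap on a zvol whose block size
# matches the page size, plus a raised vm.min_free_kbytes.
# "rpool/swap" is an example name, not from the thread.
zfs create -V 4G -b $(getconf PAGESIZE) rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
# The default on the test system was 65536; 262144 almost worked,
# so pick something comfortably higher:
sysctl -w vm.min_free_kbytes=524288
```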
My comment in issue #618 seems relevant here:
I am not certain if that would solve this problem, but it should lessen it.
@gentoofan I also saw a hint on the BSD forums concerning swap on ZFS (there are several threads) which says that the problem may be solved by reserving some memory for ZFS. Solaris also reserves some memory for ZFS. Only changing parameters may shift the level at which the deadlock occurs rather than fixing the real cause of the problem. Is enough memory freed for ZFS that it can continue to allocate ARC when a page goes to swap? You are really searching around; is there no solution for this problem in the other ZFS ports?
Do you have links to those threads?
There is a "storage" section on the BSD forums with some ZFS-related threads in it; maybe http://forums.freebsd.org/showthread.php?t=27855, or this is the one I mentioned: http://forums.freebsd.org/showthread.php?t=30298
I did some more testing. It seems that this will only work when CONFIG_PREEMPT_NONE=y is set. Voluntary preemption causes deadlocks. Furthermore, it seems that it is still possible for deadlocks to occur when the system is being stressed by software compilation despite the … It looks like we need to set something like PF_MEMALLOC in zvol_write() in ./module/zfs/zvol.c when the device to which we are writing is a swap device, so that the kernel will make its best effort to fulfill our memory allocations.
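The idea can be sketched roughly as below. This is not the actual patch (see the pull requests later in the thread); the surrounding function body is elided, and only the PF_MEMALLOC bracketing is shown, following the usual kernel pattern of saving and restoring the flag.

```c
/*
 * Illustrative sketch only, not the patch that landed: mark the task
 * with PF_MEMALLOC while servicing a zvol write so that allocations
 * made on the swap-out path get the allocator's best effort.
 */
static void
zvol_write(void *arg)
{
	/* Remember whether the flag was already set by our caller. */
	unsigned long pflags = current->flags & PF_MEMALLOC;

	current->flags |= PF_MEMALLOC;

	/* ... perform the DMU write as before (elided) ... */

	/* Restore the flag to its previous state. */
	if (!pflags)
		current->flags &= ~PF_MEMALLOC;
}
```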
I have filed pull request #669, which partially addresses this. My system will fail to write to swap with the default vm.min_free_kbytes value, but it will no longer deadlock immediately. Increasing the value of vm.min_free_kbytes from 65536 to 131072 will enable swap on zvols to function normally on my system.
The latest revision of the patch in pull request #669 eliminates the need to change vm.min_free_kbytes. However, there is still some potential for a deadlock under heavy load.
Previously, it was possible for the direct reclaim path to be invoked when a write to a zvol was made. When a zvol is used as a swap device, this often causes swap requests to depend on additional swap requests, which deadlocks. We address this by disabling the direct reclaim path on zvols. This closes issue openzfs#342.
Alternative to commit cfc9a5c. Might fix bug openzfs#342.
As another data point (skip if you don't need more info), I tried to use cryptsetup (LUKS) on a zvol and use the encrypted device as swap. I'm using the HEAD sources from yesterday (29 May). The system locks up after writing a little less than 4 GB to swap. I have tried increasing vm.min_free_kbytes to 4 GB (!), but it still crashes, and I've played with the settings for the zvol. No stack trace is printed by the kernel. I copy large files to a tmpfs (this is a use case I actually need, and how I ran into the problem), and I monitor top and iostat: in top, 99.8% of the CPU is "wa" right before and when it crashes. In iostat, the "tps" of the encrypted volume is huge, 11833, while it's only 131 for the underlying zvol. The write rate is similarly about two orders of magnitude greater for "dm-0", the encrypted volume, than for "zd16", the zvol, in the entries I see before it locks up. I can run some commands if you would like, but I suspect this is easier for the developers to test themselves.
For those interested in using a zvol as a swap device: could you attempt to run with the following patches? They still need some polish, but they should enable the basic functionality.
The issues surrounding swapping to a ZVOL have been resolved in the latest master. Please give it a try and report any issues you encounter.
Using swap on ZFS seems very stable, but I do get some errors in dmesg when I tried to use a lot of memory, starting VMs and moving files to a tmpfs. There was still a lot of free swap at this point. The messages are at http://www.fa2k.net/misc/dmesg , but I don't know if they're related to ZFS or helpful at all. The patches are a great improvement either way.
On 09/24/2012 11:18 AM, fa2k wrote:
This is an external fragmentation issue in the Linux kernel that ARC can …
I'm still seeing this issue. Is this even supposed to be supported? The zfs(8) man page says no.
eatnumber1: Swap on a zvol should work; swap in a file should not. It is not the same thing.
It's supported and does work, but thanks for pointing out the inconsistency in the man page; we'll get that fixed.
Is it a good idea to set a refreservation on a swap volume? Such as:
This way the normal files don't try to take up your swap space.
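The command itself did not survive in the quoted comment; presumably it was something along these lines, with the reservation matching the volume size (names and sizes are examples):

```shell
# Hypothetical reconstruction of the elided command: reserve space
# equal to the volume size so other datasets cannot consume the
# space backing the swap zvol.
zfs set refreservation=4G rpool/swap
```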
@CMCDragonkai Unless the zvol is created sparse, it should already have a reservation.
OK, but leaving the command I ran shouldn't cause any issues, right?
@CMCDragonkai It is likely fine. Speaking of which, you might be interested in the patches in #2484. The patches there are very new, but I expect them to improve swap on zvols.
Has anybody tried using a sparse zvol as the swap space? There's a blog post talking about it: http://blog.thilelli.net/post/2011/08/03/Interesting-Use-Case-Of-Solaris-Swap-Space It seems like an interesting way to allow swap to dynamically grow and contract. What would the disadvantages be?
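For context, creating a sparse zvol for swap looks roughly like this (the -s flag skips the refreservation; pool and volume names and sizes are examples):

```shell
# Example only: a sparse (thin-provisioned) zvol used as swap.
# With -s, no refreservation is set, so blocks are allocated from the
# pool only as swap is actually written.
zfs create -s -V 16G rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
```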
@CMCDragonkai Further fragmentation, which would make swapping even slower. I just tried, and I see the same freeze as with an actual reservation.
@fejiso What do you mean by the "same freeze"? I'm not so concerned with swap space speed, as I just think of it as backup memory.
@CMCDragonkai I experience the same type of deadlocks as described in this issue, both with the reservation enabled and disabled. I haven't performed any performance tests, but dynamic allocation will certainly cause fragmentation and make swap access slower.
With the issue being closed, I thought the problem was solved. You should …
After looking all around the web, this is the only issue that looks remotely similar to what I experienced. In my case, overcommitting VMs did it: when memory pressure from the VMs started to compete with the ARC, the load skyrocketed and eventually the system went unresponsive. The ARC shrank as memory pressure increased, but that led to slower disk access, which kicked the load further up. Very little, if any, swap space from the zvol was used. I've tried pretty much every setting I could find: from running memory hogs to trigger swapping, to limiting the RES size of apps through cgroups, to disabling transparent (and regular) huge pages... The only thing that fixed the situation was to have a separate swap partition. Zram devices also worked as expected. All the sysctl vm values were default except for vm.swappiness, set to 100, and the inotify limit, raised to 1000000. The affected machine is now running Btrfs. As I couldn't easily shrink the mirrored zpool to accommodate a swap partition, I chose to switch, as I might need to change the array geometry relatively often. I can, though, try to reproduce the issue using the partitions I have now as swap space.
I just stress tested it as well with 32 GB zvol swap, 32 GB RAM and 32 GB …
Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> A standard practice in ZFS is to keep track of "per-txg" state. Any of the three active TXGs (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXGs. Closes openzfs#342 Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/8063 OpenZFS-commit: openzfs/openzfs@01acb46
Currently, placing swap inside a zvol leads to deadlocks and system death as soon as memory usage grows into swap. Running ZFS with a full-device root pool will require swap on a zvol for systems that require (or prefer to have) swap. Supporting this isn't a critical priority, as the system can always be set up with dedicated swap partitions and ZFS on a partition of its own, but running on a full device can give better performance and is more in line with "The ZFS Way".
Discussion on this is present here:
http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/thread/699693ebd2706b45#
Lock can be reproduced with:
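The reproducer itself was lost from the quoted issue body. Based on the comments above (filling a tmpfs until the system dips into swap), a plausible reconstruction would be something like the following; all names and sizes here are illustrative guesses, not the original commands:

```shell
# Hypothetical reproducer (the original command was not captured):
# put swap on a zvol, then force swapping by filling a tmpfs.
zfs create -V 2G rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
mount -t tmpfs -o size=8G tmpfs /mnt/tmp
dd if=/dev/zero of=/mnt/tmp/fill bs=1M   # runs until RAM + swap fill
```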