
Support swap on zvol #342

Closed · pendor opened this issue Jul 26, 2011 · 47 comments
Labels: Component: Memory Management · Component: ZVOL · Type: Feature

Comments

pendor (Contributor) commented Jul 26, 2011

Currently, placing swap inside a zvol leads to deadlocks and system death as soon as memory usage grows into swap. Running ZFS whole-device for a root pool will require swap on a zvol for systems that require (or prefer to have) swap. Supporting this isn't a critical priority, since the system can always be set up with dedicated swap partitions and ZFS on a partition of its own, but running on whole devices can give better performance and is more in line with "The ZFS Way".

Discussion of this issue can be found here:

http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/thread/699693ebd2706b45#

The lockup can be reproduced with:

zpool create rpool mirror sda sdb
zfs create -V 2G rpool/swap
mkswap /dev/rpool/swap
swapon /dev/rpool/swap
memtester 1800m # Some value near the max free in your system
cd /usr/src/linux && make clean && make -j20 # Do something that will hit memory a lot
behlendorf (Contributor) commented

If at all possible, could you include a stack trace from the deadlocked system? The reproducer is great, but the stack would be helpful to identify the exact issue you're hitting.

pendor (Contributor, Author) commented Jul 26, 2011

Managed to capture some stuff, though all in binary form, not text. Images here: http://www.dropbox.com/gallery/34890805/1/SwapZVOL?h=74755c

Those show first an OOM kill performed by the kernel itself, followed by what looks like the actual oops in ZFS.

I've also got assorted output from magic SysRq showing blocked processes and full backtraces. That's in video format (the easiest way was to screen-capture the VM). I've got to trim and transcode it and will post a URL shortly.

pendor (Contributor, Author) commented Jul 26, 2011

Dumps from magic SysRq here (13 MB, x264):
http://dl.dropbox.com/u/34890805/swapzvolsysrq.avi

pendor (Contributor, Author) commented Jan 25, 2012

Quick update: a suggestion was made on zfs-discuss to set the zvol block size to the same as the Linux page size. It might have made the situation slightly better: I remember this being instant death on swap previously, whereas with zfs create -b 4k -V 2G rpool/swap I was able to get a few MB into swap before it broke. Still the same page allocation failure, and all ZFS-based I/O is dead afterwards.

ryao (Contributor) commented Feb 8, 2012

@pendor would you retest this against the zvol patches that Brian merged into HEAD today?

pendor (Contributor, Author) commented Feb 9, 2012

Latest trunk (at 34037af) has different symptoms, but I don't think it's working yet. No more kernel oops, no output in dmesg that I can see anymore (had it on a 1-second loop to dump the last 20 lines), but as soon as it hits swap, the system pretty much locks, and any I/O to ZFS blocks indefinitely. I tried the same with swap on a separate dedicated drive (raw block, no ZFS involved), and the system worked as expected. Slow as death, but still functioning while thrashing the disk.

With the ZVOL, there was no response to Ctrl-Alt-Del. It's a VM, and I can't get SysRq to it easily from the host unfortunately. Not sure if there's any other way to capture more useful info?

behlendorf (Contributor) commented

Bumping up the kernel printk level and then watching virsh console for any stack trace dumps is probably the easiest. I may find a little time in the next week or two to take a look at this; thus far I haven't spent any time on it.
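
A minimal sketch of that, assuming a libvirt guest (the VM name is a placeholder):

dmesg -n 8         # raise the console loglevel so every kernel message reaches the console
virsh console myvm # run from the host and watch for stack traces when the guest hits swap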

ryao (Contributor) commented Feb 10, 2012

@pendor, I believe that you are using VMware Player. You can configure your system to use a serial console if you add a serial port to your VM (from server to application, using a socket), compile your kernel with serial support, uncomment the s0 line in /etc/inittab, and optionally modify your kernel command line to include 'console=tty0 console=ttyS0,115200'. The ttyS0 statement is the one needed for the serial port to receive kernel messages, but having both permits both consoles to receive kernel messages during startup.

Doing that should enable most of the stuff that Brian's comment likely assumes you can access. There are likely a few bits missing, specifically permitting kernel selection from GRUB over serial and ensuring that kernel panics are written to the serial console; I mention the second only because I haven't had a crash with which to verify it.
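
For reference, a minimal sketch of that setup (the runlevels, baud rate, and file locations are assumptions; adjust for your distribution):

# /etc/inittab -- uncomment (or add) the serial getty line:
s0:12345:respawn:/sbin/agetty -L 115200 ttyS0 vt100
# Kernel command line, appended to the kernel line in your GRUB config:
console=tty0 console=ttyS0,115200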

ryao (Contributor) commented Apr 9, 2012

I revisited this issue today by filling a tmpfs on my desktop. The system froze completely. The only useful information I have is that I used a sparse zvol, which was 27.7M in size after the crash according to zfs list.

behlendorf (Contributor) commented

Any luck getting a stack trace from the system before (or after) it crashes?

ryao (Contributor) commented Apr 10, 2012

Unfortunately, no. My desktop lacks a serial port, so my ability to obtain stack traces is limited.

I recall hearing that Linux has the ability to send stack traces from kernel panics over the network, although I am not familiar with any documentation on how that could be configured. Right now, the best I can do is write a script that runs this with X disabled and hope that something is printed to my pty so that I can take a picture, although I doubt that I would capture the entire trace, assuming that it is printed.

I would appreciate any suggestions of methods that might improve my ability to obtain a stack trace.

ryao (Contributor) commented Apr 13, 2012

I have asked in Gentoo development channels for advice on this topic. The main suggestions I received are the following:

http://www.kernel.org/doc/Documentation/networking/netconsole.txt
http://www.kernel.org/doc/Documentation/lockdep-design.txt
http://www.kernel.org/doc/Documentation/nmi_watchdog.txt

I will try to apply these to the problem as I have time.
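
For netconsole in particular, a minimal sketch (every address, port, interface, and MAC below is made up; substitute your own):

# On the machine under test: stream kernel messages over UDP to a log host.
modprobe netconsole netconsole=6665@192.168.1.5/eth0,6666@192.168.1.9/00:19:99:aa:bb:cc
# On the log host (netcat flag syntax varies by implementation):
nc -u -l -p 6666 | tee netconsole.log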

behlendorf (Contributor) commented

I happen to use KVM and virsh for my VM needs, which allows me to use virsh console <vm-name> to get the console output of a running VM. The above suggestions are good as well; frankly, I haven't had time to look at this issue at all yet.

pyavdr (Contributor) commented Apr 14, 2012

Does the kernel assume that a swap device can be accessed asynchronously? It may be a problem related to issue #223.

ryao (Contributor) commented Apr 15, 2012

@pyavdr, if that were the case, I doubt that I could have gotten anything into my swap file.

On a more general note, I found a comment regarding this on the FreeBSD forums:

Not sure if this has been fixed yet or not, but there's one little niggle to using swap on ZFS: you need memory to track disk usage and whatnot in the ARC. So, if you get into a situation where you need to swap something to disk, but don't have any non-wired memory to use to track the writes and whatnot ... you end up with a locked up system.

http://forums.freebsd.org/showpost.php?p=155452&postcount=16

ryao (Contributor) commented Apr 15, 2012

I did an experiment. I disabled the shrinker callback registration in ZFS, recompiled, rebuilt my initramfs, and rebooted. Then I did zfs create -o primarycache=metadata -V 8G rpool/swap && mkswap -f /dev/zvol/rpool/swap && swapon /dev/zvol/rpool/swap. I then proceeded to play music in my web browser while executing python -c "print 2**10**10". Watching free -m in Konsole, I saw that swap began to be used. X froze immediately afterward. However, the music continued to play until the end of the song, and the system still responded to pings.

A plausible explanation would seem to be the comment on the FreeBSD forums. Specifically, the ARC is attempting to allocate memory when kswapd asks it to write to a zvol, and the allocation fails. That renders swap useless, and anything else that requests memory will then fail.

ryao (Contributor) commented Apr 15, 2012

I just duplicated this experiment with two changes.

  1. I did sysctl vm.min_free_kbytes=524288
  2. I ran 3 instances of python in screen because of an odd issue where python's memory usage would plateau and drop after a few gigabytes of usage.

That resulted in the following:

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7980       7371        608          0          0         48
-/+ buffers/cache:        7323        657
Swap:         8191       1717       6474

This implies that swap on a zvol works provided that min_free_kbytes is high enough. I also believe that it validates the explanation in my previous comment, based on the statement on the FreeBSD forums.

Note that I did these experiments with the shrinker logic disabled, so it is possible that bugs in it will prevent this from working. I need to do more tests with it enabled again.
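
For anyone reproducing this, the workaround boils down to the following sketch (the value is from the experiment above; the /etc/sysctl.conf path is a distribution-dependent assumption):

sysctl vm.min_free_kbytes=524288                        # takes effect immediately
echo 'vm.min_free_kbytes = 524288' >> /etc/sysctl.conf  # persists across reboots on most distributions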

ryao (Contributor) commented Apr 15, 2012

I just repeated these tests with the shrinker logic enabled. I can confirm that swap on zvols works with it enabled as well. I also tried testing a few lower values of min_free_kbytes. 262144 appeared to almost work, but I had a lockup after about 160MB had been written.

I did my testing on my desktop with Brian's VM patch. His patch had caused some rather prominent lags on my swapless system, but desktop performance was rather good during my tests with swap. When swapping, there was a small drop in interactive response, but performance remained good. This needs further testing, but the initial results are rather promising.

ryao (Contributor) commented Apr 15, 2012

My comment in issue #618 seems relevant here:

@behlendorf I am not sure if increasing min_free_kbytes is the right way to address this. This issue looks like a mix of internal and external fragmentation in the SLUB allocator. It is internal because SLUB has wasted up to 50% of our allocated space, and some of that wasted space is likely adjacent to unallocated free space that might have been big enough to satisfy these allocation requests had the wasted space been available. It is also external because if the free space chunks were more contiguous, these allocations would have succeeded.

I imagine that we could patch the ZFS code to do allocations from a heap in a virtual address space, which should permit us to fight this kind of fragmentation. Another possibility is to patch the kernel to provide better guarantees on the buckets available when entering an ATOMIC section, but that does not seem to be as good of a solution to me.

What do you think?

I am not certain if that would solve this problem, but it should lessen it.

pyavdr (Contributor) commented Apr 15, 2012

@gentoofan

I also saw a hint in the BSD forums concerning swap on ZFS (there are several threads) which says that the problem may be solved by reserving some memory for ZFS; Solaris also reserves some memory for ZFS. Merely changing parameters may just shift the level at which the deadlock occurs rather than fixing the real cause. Is enough memory kept free for ZFS that it can continue to allocate ARC when a page goes to swap? You are really searching around; is there no solution for this problem in the other ZFS ports?

ryao (Contributor) commented Apr 15, 2012

Do you have links to those threads?

pyavdr (Contributor) commented Apr 15, 2012

There is a "storage" section on the BSD forums with some ZFS-related threads; maybe http://forums.freebsd.org/showthread.php?t=27855, or this is the one I mentioned: http://forums.freebsd.org/showthread.php?t=30298

ryao (Contributor) commented Apr 16, 2012

I did some more testing. It seems that this will only work when CONFIG_PREEMPT_NONE=y is set. Voluntary preemption causes deadlocks. Furthermore, it seems that it is still possible for deadlocks to occur when the system is being stressed by software compilation despite the sysctl vm.min_free_kbytes=524288 hack above.

It looks like we need to set something like PF_MEMALLOC in zvol_write() in ./module/zfs/zvol.c when the device to which we are writing is a swap device, so that the kernel will make its best effort to fulfill our memory allocations.

ryao (Contributor) commented Apr 16, 2012

I have filed pull request #669, which partially addresses this. My system will fail to write to swap with the default vm.min_free_kbytes value, but it will no longer deadlock immediately. Increasing the value of vm.min_free_kbytes from 65536 to 131072 will enable swap on zvols to function normally on my system.

ryao (Contributor) commented Apr 17, 2012

The latest revision of the patch in issue #669 eliminates the need to change vm.min_free_kbytes. However, there is still some potential for a deadlock under heavy load.

ryao added a commit to ryao/zfs that referenced this issue Apr 19, 2012
Previously, it was possible for the direct reclaim path to be invoked
when a write to a zvol was made. When a zvol is used as a swap device,
this often causes swap requests to depend on additional swap requests,
which deadlocks. We address this by disabling the direct reclaim path
on zvols.

This closes issue openzfs#342.
ryao added a commit to ryao/zfs that referenced this issue May 7, 2012
fa2k commented May 30, 2012

As another data point (skip if you don't need more info): I tried to use cryptsetup (LUKS) on a zvol and use the encrypted device as swap. I'm using the HEAD sources from yesterday (29 May). The system locks up after writing a little less than 4 GB to swap. I have tried increasing vm.min_free_kbytes to 4 GB (!), but it still crashes, and I've played with the settings for the zvol. No stack trace is printed by the kernel. I copy large files to a tmpfs (this is a use case I actually need, and how I ran into the problem) while monitoring top and iostat: in top, 99.8% of the CPU is "wa" right before and when it crashes. In iostat, the "tps" of the encrypted volume is huge, 11833, while it's only 131 for the underlying zvol. The write rate is similarly about two orders of magnitude greater for "dm-0", the encrypted volume, than for "zd16", the zvol, in the entries I see before it locks up. I can run some commands if you would like, but I suspect this is easier for the developers to test themselves.

behlendorf (Contributor) commented

For those interested in using a zvol as a swap device: could you attempt to run with the following patches? They still need some polish, but they should enable the basic functionality.

openzfs/spl#161
#883

behlendorf (Contributor) commented

The issues surrounding swapping to a ZVOL have been resolved in the latest master. Please give it a try and report any issues you encounter.
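
For anyone trying this, a sketch of a swap zvol setup along the lines discussed in this thread (the block-size-equals-page-size and primarycache=metadata settings come from earlier comments; the volume size and dataset name are assumptions, not an official recipe):

zfs create -V 4G -b $(getconf PAGESIZE) -o primarycache=metadata rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap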

fa2k commented Sep 24, 2012

Using swap on ZFS seems very stable, but I did get some errors in dmesg when I tried to use a lot of memory, starting VMs and moving files to a tmpfs. There was still a lot of free swap at that point. The messages are at http://www.fa2k.net/misc/dmesg, but I don't know whether they're related to ZFS or helpful at all. The patches are a great improvement either way.

ryao (Contributor) commented Sep 24, 2012

This is an external fragmentation issue in the Linux kernel that ARC can worsen, triggering these messages. These failures only occur under memory pressure, and they likely won't kill your system. The negative effect of ARC on external fragmentation should be mitigated once ARC is mapped into the Linux page cache.

eatnumber1 commented

I'm still seeing this issue.

Is this even supposed to be supported? The zfs(8) man page says no.

ZFS Volumes as Swap
    Do not swap to a file on a ZFS file system. A ZFS swap file configuration is not supported.

aikudinov commented

@eatnumber1: Swap on a zvol should work; swap in a file should not. They are not the same thing.
You can create a zvol like this: zfs create -V 2G testpool/swapvol

behlendorf (Contributor) commented

It's supported and does work, but thanks for pointing out the inconsistency in the man page; we'll get that fixed.

CMCDragonkai commented

Is it a good idea to set a refreservation on a swap volume? Such as:

zfs set refreservation=${SWAP}M pool/swap

This way, normal files won't take up your swap space.

ryao (Contributor) commented Jul 10, 2014

@CMCDragonkai Unless the zvol is created sparse, it should already have a reservation.
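
A quick way to check, using the dataset name from earlier in the thread:

zfs get volsize,refreservation rpool/swap   # a non-sparse zvol shows a refreservation close to volsize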

CMCDragonkai commented

OK, but leaving the command I ran in place shouldn't cause any issues, right?

ryao (Contributor) commented Jul 11, 2014

@CMCDragonkai It is likely fine. Speaking of which, you might be interested in the patches in #2484. The patches there are very new, but I expect them to improve swap on zvols.

CMCDragonkai commented

Has anybody tried using a sparse ZVOL as the swap space? There's a blog post talking about it: http://blog.thilelli.net/post/2011/08/03/Interesting-Use-Case-Of-Solaris-Swap-Space

It seems like an interesting way to let swap grow and contract dynamically.

What would the disadvantages be?
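
For reference, a sparse zvol is created with the -s flag, which skips the refreservation (dataset name assumed):

zfs create -s -V 8G rpool/swap   # thin-provisioned: space is allocated only as swap is written
# Trade-off: if the pool fills up, a swap write can fail for lack of space.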

fejiso commented Dec 8, 2015

@CMCDragonkai further fragmentation, which would make swapping even slower.

I just tried it, and I see the same freeze as with an actual reservation.

CMCDragonkai commented

@fejiso What do you mean by the "same freeze"?

I'm not so concerned with swap space speed, as I just think of it as backup memory.

fejiso commented Dec 9, 2015

@CMCDragonkai I experience the same type of deadlocks as described in this issue, both with the reservation enabled and disabled. I haven't performed any performance tests, but dynamic allocation will certainly cause fragmentation and make swap access slower.

CMCDragonkai commented

With the issue being closed, I thought the problem was solved. Could you post some diagnostics, in case we can fix it again?

fejiso commented Dec 9, 2015

After looking all around the web, this is the only issue that looks remotely similar to what I experienced. In my case, overcommitting VMs did it: when memory pressure from the VMs started to compete with the ARC, the load skyrocketed and eventually the system went unresponsive. The ARC downsized as memory pressure increased, but that meant slower disk access, which kicked the load further up. Very little, if any, swap space from the zvol was used.

I've tried pretty much every setting I could find: running memory hogs to trigger swapping, limiting the RES size of apps through cgroups, disabling transparent (and regular) huge pages...

The only thing that fixed the situation was a separate swap partition. Zram devices were also used as expected.

All the sysctl vm values were at their defaults, except for vm.swappiness, which was set to 100, and the inotify limit, set to 1000000.

The affected machine is now running BTRFS. As I couldn't easily shrink the mirrored zpool to accommodate a swap partition, I chose to switch as I might need to change the array geometry relatively often. I can, though, try to reproduce the issue using the partitions I have now as swap space.

CMCDragonkai commented

I just stress-tested it as well: 32 GB zvol swap, 32 GB RAM, and a 32 GB tmpfs, plus 330 MB of zram. I tried to put a 33 GB zeroed file into the tmpfs and everything locked up. I saw my hard drive LED blink a few times and then nothing. I had to power down using SysRq.

gmelikov pushed a commit to gmelikov/zfs that referenced this issue May 8, 2017
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.
Closes openzfs#342
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/8063
OpenZFS-commit: openzfs/openzfs@01acb46