Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlocks in DMU #726

Closed
wants to merge 4 commits into from
Closed

Fix deadlocks in DMU #726

wants to merge 4 commits into from

Conversation

ryao
Copy link
Contributor

@ryao ryao commented May 7, 2012

DMU operations perform allocations using KM_SLEEP, which is mapped to GFP_KERNEL in the SPL. GFP_KERNEL permits paging to occur in low memory situations. When swap operations are done in zvols, memory allocations in the DMU will often trigger additional swap operations, which deadlocks.

I observed a nonsensical behavior where swap worked with the loop device, but not zvols. Commit cfc9a5c appears to have disabled the direct reclaim path on the filesystem operations that the loop device exercises, which explains the cause of that behavior. This lead me to replace all instances of KM_SLEEP with KM_PUSHPAGE in files containing functions listed in the backtrace in the commit message. That enabled swap on zvols to work. I tested swap on a zvol with KM_SLEEP redefined to be GFP_NOIO in the SPL code, but swap operations failed to function. As such, it seems that the use of KM_PUSHPAGE is necessary to address this issue.

In theory, this change should also prevent the original deadlock, but I have not yet tested it to verify that this prevents it. I have also not verified that all allocations that swap can trigger have been converted to KM_PUSHPAGE.

I am opening this pull request so that others can provide their thoughts while I continue to work on this. This will close #342 when it is merged.

@behlendorf
Copy link
Contributor

My initial testing of these changes look promising. I pulled the change in to a testing branch and ran it through the test suite for 12 different Linux distributions. The test suite includes xfstests and I was unable to hit the original issue on any of those 12 runs. I'll run it through a few more times for better coverage since I remember this issue not always reproducing. Also IIRC rtorrent which does a lot of mmap'ed I/O was particularly good to causing this deadlock.

@ryao
Copy link
Contributor Author

ryao commented May 14, 2012

I have pushed an additional patch that resolves a few theroetical deadlocks that were uncaught by my previous patch set, including one that could trigger the deadlock commit cfc9a5c was meant to address.

There is still more code that I need to review, but I am making steady progress. I hope to have another update in a few days.

@behlendorf
Copy link
Contributor

Great. Yes, the trick as I'm sure your aware here will be find all the potential call point where KM_PUSHPAGE needs to be used.

@ryao
Copy link
Contributor Author

ryao commented May 17, 2012

@behlendorf The last patch I pushed addressed all of the page allocation failures that were being output to dmesg during my stress tests as well as the consumption of pages from ZONE_DMA. The only known issue at this point are the following messages in dmesg, which are fairly harmless:

[ 239.786572] hrtimer: interrupt took 11339 ns
[ 1027.551953] CE: hpet increased min_delta_ns to 20113 nsec
[ 8825.579004] CE: hpet increased min_delta_ns to 30169 nsec

I have also finished reviewing code. It should be safe to merge these. They should close issue #342, issue #540, issue #746 and possibly a few that I missed.

@behlendorf
Copy link
Contributor

Excellent, I'll review the patch today and run it through my test suites. If everything looks good I'll get it merged. Thanks for working on this!

behlendorf pushed a commit that referenced this pull request May 21, 2012
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.

  * The z_wr_* threads
  * The txg_sync thread
  * The zvol write/discard threads
  * The zpl_putpage() VFS callback

This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages.  If a lock is held over this operation the potential exists to
deadlock the system.  To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.

Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread.  However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA.  This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.

This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:

  21ade34 Disable direct reclaim for z_wr_* threads
  cfc9a5c Fix zpl_writepage() deadlock
  eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
behlendorf pushed a commit that referenced this pull request May 21, 2012
Commit eec8164 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC.   Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
behlendorf pushed a commit that referenced this pull request May 21, 2012
The commit, cfc9a5c, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC.   That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks.  This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.

The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it.  This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.

As such, we revert this patch in favor of a proper fix for both issues.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
behlendorf pushed a commit that referenced this pull request May 21, 2012
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0cc eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated.  The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
@behlendorf
Copy link
Contributor

@ryao I've taken your work and massaged it a little bit on the following branch. Basically just squashing a few of the patches together and reworking the commit messages. So far it's been testing well for me in the VM and if it holds up for another day or so on some real test systems I'll merge it in to master. I just wanted to make sure your OK with having this branch merged as is:

https://github.com/zfsonlinux/zfs/commits/swap

@ryao
Copy link
Contributor Author

ryao commented May 21, 2012

@behlendorf I am fine with that.

@pyavdr
Copy link
Contributor

pyavdr commented May 22, 2012

I tried to use swap on zfs with the latest commits/swap branch. Stressing suse 12.1 kernel 3.1.9 (15GB RAM, 50 GB SWAP on zvol) with pythons gives a deadlock as soon as the system starts to use the zvol/swap.

@behlendorf
Copy link
Contributor

Can you provide a stack?

@ryao
Copy link
Contributor Author

ryao commented May 22, 2012

@pyavdr Did you use primarycache=all on your zvol? I use primarycache=metadata secondarycache=none on my system's swap zvol.

@ryao
Copy link
Contributor Author

ryao commented May 22, 2012

@behlendorf I had a deadlock last night. A 5GB memory leak occurred in knotify4 while I was recompiling my desktop. My desktop has 8GB of RAM with 8GB of swap. Nearly all RAM was in use and 5 to 6GB of swap were in use. I observed periods where X would freeze, the system would appear to do nothing and then disk activity would occur, restoring X. These periods increased in length as the system became more overloaded, reaching approximately 15 seconds, at which point, the system appeared to deadlock.

When I was reviewing code, I spotted the potential for direct reclaim to cause deadlocks in the tx_commit threads. At the time, I was unable to prove that deadlocks were either possible or impossible. Since swap did not appear to deadlock under my load tests, I left it alone.

Having thought about it in the context of last night's deadlock, I believe that the potential for deadlocks that I spotted in the tx_commit threads is the cause. Direct reclaim in a tx_commit thread will cause swap operations, which will be dispatched to other tx_commit threads. In addition, direct reclaim should not permit the original tx_commit thread to continue until the other threads have finished. Those threads can also enter direct reclaim and the tx_commit threads form a dependency tree. During that time, Linux will try to free memory and if enough is purged, then subsequent transaction group commits by other tx_commit threads will suceed, pruning the tree. If we should be so unlucky that all threads enter direct reclaim, we deadlock. The ability for us to form a dependency tree between the tx_commit threads explains why the system would hang and then suddenly become responsive upon IO activity. The ability for increasingly elaborate dependency trees could be formed also explains why the duration of the hangs become increasingly worse as my system's swap approached capacity.

I will push a patch for this issue either today or tomorrow.

@pyavdr
Copy link
Contributor

pyavdr commented May 23, 2012

@behlendorf
3 weeks ago i tried to catch a stack trace, as ryao provided the first patches for zfs on zvol, but never succeeded, even compiling ZFS source with DEBUG. If there is an easy "how to" to catch stacktraces when a system deadlocks and we need to hard reset, where nothing is written into the system log files, i will follow that.

@pyavdr
Copy link
Contributor

pyavdr commented May 23, 2012

@ryao
First time i used primarycache=all, now i set primarycache=metadata and secondarycache=none, but there is no change, fast deadlock when running 10 times python -c "print 2**10**10" , needed a hard reset.

@behlendorf
Copy link
Contributor

@ryao I think your observations are correct, and they pretty well sum up my major fear about these patches. While I agree that we need to stop using PF_MEMALLOC and shift to KM_PUSHPAGE we absolutely need to make sure we get all instances of memory allocations under these four critical contexts before these changes are merged.

  • The z_wr_* threads
  • The txg_sync thread
  • The zvol write/discard threads
  • The zpl_putpage() VFS callback

I was thinking about how to ensure we get all the allocations in these contexts, and how to make sure we don't accidentally introduce new ones

The best way I can think of is to take the current places in the code where PF_MEMALLOC is set/cleared, and change it to a harmless no-op PF_* flag. Then we can add a patch to the spl kmem/vmem_alloc()'s to check for the flags and issue a warning to the console when the flag is set and KM_PUSHPAGE is not used. For debug builds we could make this fatal to ensure we halt the thread and can get a back trace. We'd of course need to be very careful about which PF_ bit was selected, there are some unused areas but we really can't safely count on that going forward without an autoconf check.

This should allow us to pretty easily catch all the allocations. We would no longer need to trigger the unlikely deadlock, or find it through code inspection, we can just run the code and see what gets flags. This should certainly find all the common places in the major code paths. We just wouldn't get coverage for the more unlikely error paths.

@ryao
Copy link
Contributor Author

ryao commented May 30, 2012

@behlendorf I like your idea to use a no-op PF_* flag so that we can track these issues. My time to work on ZFS things is currently very thin, but I will try to do this when I get a chance.

With that said, I have an additional patch that will switch to KM_PUSHPAGE in some low level areas that I had previously overlooked because I could not find clear evidence that KM_SLEEP would cause problems. I will push it to this pull request soon.

@ryao
Copy link
Contributor Author

ryao commented Aug 24, 2012

@behlendorf has made dramatic improvements upon my previous work in pull request #883. This pull request is now obsolete.

@ryao ryao closed this Aug 24, 2012
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 27, 2012
Commit eec8164 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC.   Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#726
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 27, 2012
The commit, cfc9a5c, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC.   That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks.  This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.

The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it.  This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.

As such, we revert this patch in favor of a proper fix for both issues.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#726
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 27, 2012
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0cc eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated.  The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#726
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Aug 27, 2012
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.

  * The txg_sync thread
  * The zvol write/discard threads
  * The zpl_putpage() VFS callback

This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages.  If a lock is held over this operation the potential exists to
deadlock the system.  To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.

Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread.  However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA.  This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.

This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:

  21ade34 Disable direct reclaim for z_wr_* threads
  cfc9a5c Fix zpl_writepage() deadlock
  eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#726
chrisrd added a commit to chrisrd/zfs that referenced this pull request Oct 4, 2012
Commit eec8164 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC.   Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#726
chrisrd added a commit to chrisrd/zfs that referenced this pull request Oct 4, 2012
The commit, cfc9a5c, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC.   That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks.  This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.

The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it.  This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.

As such, we revert this patch in favor of a proper fix for both issues.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#726
chrisrd added a commit to chrisrd/zfs that referenced this pull request Oct 4, 2012
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0cc eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated.  The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#726
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Memory Management kernel memory management
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support swap on zvol
3 participants