zfs hangs: "zvol blocked for more than 120 secs". #21
Thanks for the bug report. The first process is in the middle of memory reclaim, which should be fine, yet it is hung. At the moment a ZVOL uses twice as much memory as it should, because a copy of the data ends up in the ARC as well as in the Linux buffer cache. With the stock distro kernels there's no easy way to disable the Linux buffer cache, but you can disable data caching in the ARC, which helps. Can you try this and let me know: `zfs set primarycache=metadata`
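For reference, a minimal sketch of applying and verifying that setting; the pool name `tank` is a placeholder, not from the report:

```sh
# Cache only metadata (not file data) in the ARC for every dataset in
# the pool; data blocks will then live only in the Linux buffer cache.
# "tank" is a hypothetical pool name.
sudo zfs set primarycache=metadata tank

# Confirm the property took effect (child datasets inherit it).
sudo zfs get primarycache tank
```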
Hi Brian, I am sorry, I can't; the module fails to load:

Jun 10 09:45:23 localhost kernel: Error: Unknown symbol get_zone_counts
I would suggest running `make distclean` in both the spl and zfs trees, re-running configure, rebuilding both projects, and seeing whether you get the same result.
No, it's the same:

```
$ make distclean && sh autogen.sh && ./configure && make && sudo make check
...
make[1]: Entering directory `/home/seriv/work/seriv/spl/scripts'
./check.sh
Loading ../module/spl/spl.ko
insmod: error inserting '../module/spl/spl.ko': -1 Bad address
check.sh: Failed to load ../module/spl/spl.ko
make[1]: *** [check] Error 1
make[1]: Leaving directory `/home/seriv/work/seriv/spl/scripts'
make: *** [check-recursive] Error 1

$ sudo tail -2 /var/log/messages
Jun 10 17:13:49 gauntlet kernel: Error: Unknown symbol get_zone_counts
Jun 10 17:13:49 gauntlet kernel: SPL: Failed to Load Solaris Porting Layer v0.4.9, rc = -14
```
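As a debugging aid (not from the original thread), one way to check whether the running kernel still provides and exports `get_zone_counts`, which later 2.6 kernels dropped; the paths assume Fedora's kernel-devel layout:

```sh
# Is the symbol present in the running kernel at all?
grep -w get_zone_counts /proc/kallsyms

# Is it exported to modules? Module.symvers lists exported symbols.
grep -w get_zone_counts /usr/src/kernels/$(uname -r)/Module.symvers
```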
Hi Brian, I've done it: on FC12, with `zfs set primarycache=metadata`, on the same hardware and with the zpool imported, the same tasks are running fine. Thank you for the advice.
Seriv, thanks for the update and for sticking with it. This is something I certainly hope to resolve for the next tag.
Hi Brian, was it a mistake in config/spl-build.m4 of spl, or did I accidentally get it compiled just for one particular kernel version? See http://github.com/seriv/spl/commit/3841d8eb6dd1863c2f7bbf51c945b7b06d26fe2e
I believe the problem here is that one (or more) of the ZFS threads is getting deadlocked. This is likely happening because one of the ZFS threads which does I/O is trying to allocate memory, but since we're under memory pressure, the allocation causes dirty data to be synced to the ZVol. However, this dirty data can't be synced because that needs the original thread to proceed with its I/O, and that thread can't proceed because it is waiting for the data to be synced.

I have observed a very similar issue while testing Lustre on top of the DMU. The proper solution in our case was to change this line in spl/include/sys/kmem.h:

```c
#define KM_SLEEP GFP_KERNEL
```

into this:

```c
#define KM_SLEEP GFP_NOFS
```

This keeps the ZFS threads from calling back into filesystem code, which in the Lustre case was causing the deadlock. However, given that I don't see filesystem code in the stack trace, I think that for the ZVol case to work correctly the proper solution may be even more extreme, and the line above may have to be changed into this:

```c
#define KM_SLEEP GFP_NOIO
```

I think that will fix these hangs.
A colleague just pointed out to me that shrink_page_list() only calls pageout() if this condition is true:
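For context, the check in question, in mm/vmscan.c of kernels from that era, looked roughly like the following; this is reconstructed from memory, so treat it as a sketch rather than an exact quote:

```c
/* mm/vmscan.c, shrink_page_list(), approximate 2.6.3x form:
 * a dirty page is only written back via pageout() when may_enter_fs
 * is set, and for a non-swap-cache page that requires __GFP_FS. */
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
               (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

if (PageDirty(page)) {
        if (!may_enter_fs)
                goto keep_locked;  /* GFP_NOFS allocations never reach pageout() */
        /* ... pageout(page, mapping, ...) ... */
}
```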
Which means that it should be OK to use GFP_NOFS unless the ZVol is a swap device. However, you shouldn't even be using ZVols as swap devices anyway, at least not without a proper implementation, because the ZFS threads are very allocation-intensive, which doesn't play well when you're swapping out. I think even Solaris doesn't rely on the ZFS threads for swapping out to a ZVol (which is probably why it requires compression and checksumming to be disabled, if I'm not mistaken).
Thanks wizeman, I think you're right; that would nicely explain the problem. That's a pretty heavy-handed fix though; it would be ideal if we could just identify the offending allocations, add a KM_NOFS flag, and use it as appropriate. It looks like the offending allocation was the kmem_cache_alloc() in dbuf_create() this time, but I'm sure there are quite a few other places where the deadlock is possible.
The problem is that almost any allocation might trigger the deadlock. For example, any allocation in the ZIO path (including the compression and checksum phases, etc.) has to be GFP_NOFS, because otherwise the ZIOs will deadlock waiting for data to be synced, which in turn needs those ZIOs to complete.

Similarly, any allocation during any phase of the txg train (in the open txg, the quiescing txg, or the syncing txg) cannot block waiting for new data to sync: the data that needs to sync might not fit in the currently open txg, but at the same time you may not be able to open a new txg, because that would need the txg train to advance, which cannot happen while you're blocked trying to allocate memory. This one alone probably rules out more than 90% of the code, and 99% of the allocations :-) It includes practically all allocations in the DMU code, dbufs, nvpairs, DSL code, spa, metaslabs, dnodes, space maps, ZAPs, raidz, ZIL, zvol, objset, etc. All of that is needed to proceed with syncing a txg.

Also, probably all the code in the ARC, including the thread that reclaims memory from the ARC, has to use GFP_NOFS, since otherwise the code could block while holding ARC locks, which could prevent I/Os from completing. To be honest, I have trouble thinking of any allocations which should not use GFP_NOFS; even the ZPL code will probably need it in most cases, if blocking on the allocation could prevent other data from being synced.
Yes, I agree, but I feel compelled to point out that we'll basically be removing almost the entire synchronous memory reclaim path for ZFS. The only way we'll end up performing reclaim will be through the async reclaim threads, or through another non-ZFS-related process in the kernel. The consequence of this will likely be that memory allocations take longer when the system is low on memory, since our allocation may need to block until the async thread performs the reclaim. Anyway, this is still much, much better than deadlocking, so let's give it a try and see what the actual consequences are.
OK, wizeman's proposed fix has been committed to the SPL as behlendorf/spl@82b8c8f. Hopefully, that will resolve this issue. I'm closing the bug until we see it again. |
I am getting this error reproducibly when sending many files via rsync over ssh to a Xen virtual machine with a large sparse ZVOL. Debian Wheezy/unstable amd64, zfsonlinux 2~wheezy.
Using ZFS with Lustre, an arc_read can trigger kernel memory allocation that in turn leads to a memory reclaim callback and a deadlock within a single ZFS process. This change uses spl_fstrans_mark and spl_fstrans_unmark to prevent the reclaim attempt and the deadlock (https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba). The stack trace observed is:

```
 #0 [ffffc9002b98adc8] __schedule at ffffffff81610f2e
 #1 [ffffc9002b98ae68] schedule at ffffffff81611558
 #2 [ffffc9002b98ae70] schedule_preempt_disabled at ffffffff8161184a
 #3 [ffffc9002b98ae78] __mutex_lock at ffffffff816131e8
 #4 [ffffc9002b98af18] arc_buf_destroy at ffffffffa0bf37d7 [zfs]
 #5 [ffffc9002b98af48] dbuf_destroy at ffffffffa0bfa6fe [zfs]
 #6 [ffffc9002b98af88] dbuf_evict_one at ffffffffa0bfaa96 [zfs]
 #7 [ffffc9002b98afa0] dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
 #8 [ffffc9002b98b050] dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
 #9 [ffffc9002b98b100] osd_object_delete at ffffffffa0b64ecc [osd_zfs]
#10 [ffffc9002b98b118] lu_object_free at ffffffffa06d6a74 [obdclass]
#11 [ffffc9002b98b178] lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
#12 [ffffc9002b98b220] lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
#13 [ffffc9002b98b278] shrink_slab at ffffffff811ca9d8
#14 [ffffc9002b98b338] shrink_node at ffffffff811cfd94
#15 [ffffc9002b98b3b8] do_try_to_free_pages at ffffffff811cfe63
#16 [ffffc9002b98b408] try_to_free_pages at ffffffff811d01c4
#17 [ffffc9002b98b488] __alloc_pages_slowpath at ffffffff811be7f2
#18 [ffffc9002b98b580] __alloc_pages_nodemask at ffffffff811bf3ed
#19 [ffffc9002b98b5e0] new_slab at ffffffff81226304
#20 [ffffc9002b98b638] ___slab_alloc at ffffffff812272ab
#21 [ffffc9002b98b6f8] __slab_alloc at ffffffff8122740c
#22 [ffffc9002b98b708] kmem_cache_alloc at ffffffff81227578
#23 [ffffc9002b98b740] spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
#24 [ffffc9002b98b780] arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
#25 [ffffc9002b98b7b0] arc_read at ffffffffa0bf0924 [zfs]
#26 [ffffc9002b98b858] dbuf_read at ffffffffa0bf9083 [zfs]
#27 [ffffc9002b98b900] dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]
```

Signed-off-by: Mark Roper <markroper@gmail.com>
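A minimal sketch of the pattern that commit describes; the mark/unmark calls are the real SPL API, while the surrounding function is hypothetical, for illustration only:

```c
#include <sys/kmem.h>   /* SPL: spl_fstrans_mark(), spl_fstrans_unmark() */

static void
example_read_path(void)   /* hypothetical caller */
{
        /* Mark this thread so that memory allocations it performs will
         * not recurse into filesystem reclaim, and thus cannot re-enter
         * ZFS and deadlock on locks the thread already holds. */
        fstrans_cookie_t cookie = spl_fstrans_mark();

        /* ... allocation-heavy work, e.g. the arc_read() path ... */

        spl_fstrans_unmark(cookie);
}
```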
I have compiled and installed the spl and zfs RPMs for the standard FC13 kernel:
2.6.33.5-112.fc13.x86_64 (mockbuild@x86-09.phx2.fedoraproject.org) (gcc version 4.4.4 20100503...
I decided to test how it handles multiple small files with random reads and writes. I formatted a zvol as ext2, mounted it, and used it for Zimbra mail storage. After about an hour, when Zimbra Desktop had pulled about a gigabyte of mail messages from a couple of my accounts, I tried some Zimbra searches and reads of those messages, and all zpool I/O hung.
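A sketch of the setup described above; the pool name, volume name, size, and mount point are assumptions, since the report doesn't give them:

```sh
# Create a zvol, format it as ext2, and mount it for the mail store.
# The /dev/zvol/<pool>/<volume> path is the modern ZFS on Linux layout;
# early releases may have exposed the device under a different name.
sudo zfs create -V 20G tank/zimbra
sudo mkfs.ext2 /dev/zvol/tank/zimbra
sudo mkdir -p /srv/zimbra-store
sudo mount /dev/zvol/tank/zimbra /srv/zimbra-store
```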
In /var/log/messages there were repeated "zvol blocked for more than 120 secs" warnings, as quoted in the issue title.