
ZVOL Performance #9

Closed
behlendorf opened this issue May 14, 2010 · 22 comments
Labels
Component: ZVOL (ZFS Volumes), Type: Feature (Feature request or new feature)

Comments

@behlendorf
Contributor

While there is a full native ZVOL implementation in this code base, performance is likely not what it could be. The main reason is that the Linux VFS and the ZFS ARC both want to fully manage the cache, so we unfortunately end up with two caches. This means our memory footprint is larger than it needs to be, and it means we have an extra copy between the caches, but it does not impact correctness. All syncs and barrier requests are, I believe, handled correctly. Longer term there is lots of room for improvement here, but it will require fairly extensive changes to either the Linux VFS and VM layer, or additional DMU interfaces to handle managing buffers not directly allocated by the ARC.

One tuning which can be made in the short term is to configure the ARC to cache only metadata. This reduces the memory footprint substantially, but unfortunately it does not eliminate that extra copy.
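
For reference, this tuning is a one-line property change per dataset (the pool/zvol name here is only a placeholder):

    zfs set primarycache=metadata pool/zvol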

@sehe

sehe commented Jul 6, 2010

Hi,

Kernel 2.6.35-rc3 came with new splice support (used in the fuse module to mitigate exactly this redundant copying). Miklos Szeredi specifically refers to it as the zero-copy path.
You can have a look at merge commit 003386fff3e02e51 in Linus's tree: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=003386fff3e02e51cea882e60f7d28290113964c

I've no conclusive benchmark on how this pans out for zfs-fuse, but here are the figures so far: http://downloads.sehe.nl/zfs-fuse/splice-support/splice_benchmarks.html

@behlendorf
Contributor Author

Thanks for the tip, I'll certainly check it out!

@behlendorf
Contributor Author

It looks like we can eliminate the extra copy by replacing the .write_begin and .write_end handlers in the address_space_operations structure with our own. These hooks are responsible for finding or allocating pages from the page cache based on a given index into the address space. These pages are then used as buffer heads and sent down the stack.

Our custom versions would need to look up a page in the ARC by index, lock it, and return it for use, or, in the case of a write, allocate a new page directly from the ARC using the zero-copy API. The Linux VFS would then be able to write directly into this page, bypassing the normal Linux page cache.
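
Roughly, such hooks might look like the following sketch (the function and helper names are hypothetical, not code from this repository; the prototypes match the 2.6.3x-era kernels being discussed here):

    static int zvol_write_begin(struct file *file, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata)
    {
            /* Hypothetical helper: find or allocate an ARC-backed page for
             * this index via the zero-copy DMU API and return it locked. */
            *pagep = zvol_arc_get_locked_page(mapping->host, pos >> PAGE_SHIFT);
            return (*pagep == NULL) ? -ENOMEM : 0;
    }

    static int zvol_write_end(struct file *file, struct address_space *mapping,
                              loff_t pos, unsigned len, unsigned copied,
                              struct page *page, void *fsdata)
    {
            /* Hypothetical helper: mark the ARC buffer dirty so the DMU
             * writes it out, then drop the page lock. */
            zvol_arc_page_dirty(page, pos, copied);
            unlock_page(page);
            return copied;
    }

    static const struct address_space_operations zvol_aops = {
            .write_begin = zvol_write_begin,
            .write_end   = zvol_write_end,
    };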

The devil's going to be in the details here though, and I suspect there will be a fair bit of work to get this right. It may end up being rather kernel specific as well, which would be a bad thing, but I'm not sure it can be avoided. We can always fall back to the current implementation if need be.

We will also likely want to start using blk_queue_make_request() at some point to avoid the full elevator. If we can tie directly into the ARC cache, the elevator will just add overhead we don't need. We should simply need to keep the ZIL up to date and trigger zil_commit()s at the right time to keep everything correct.
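
Something along these lines (again with hypothetical helper names; the int-returning make_request prototype is the 2.6.3x one and changes in later kernels):

    static int zvol_make_request(struct request_queue *q, struct bio *bio)
    {
            zvol_state_t *zv = q->queuedata;

            /* Hypothetical handler: map the bio straight onto DMU reads and
             * writes, log writes in the ZIL, and call zil_commit() for
             * flush/barrier requests before completing the bio.  No struct
             * request is ever queued, so the elevator never gets involved. */
            zvol_dispatch_bio(zv, bio);
            return (0);
    }

    static struct request_queue *zvol_alloc_queue(zvol_state_t *zv)
    {
            struct request_queue *q;

            q = blk_alloc_queue(GFP_KERNEL);
            if (q == NULL)
                    return (NULL);
            q->queuedata = zv;
            blk_queue_make_request(q, zvol_make_request);
            return (q);
    }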

@sehe

sehe commented Aug 10, 2010

Two thoughts:
(a) Would this (bypassing kernel caching in favour of an ARC-based cache) still allow memory-mapped files to 'just work'? In my understanding, memory-mapped files may be closely tied to kernel buffering/caching (not sure). I'm also not sure whether this is even applicable to ZVOLs.

(b) It would be best to verify whole-disk ZFS usage before avoiding the elevator. In the end this could of course be a mere recommendation to end users.

@behlendorf
Contributor Author

(a) would this (bypassing kernel caching in favour of ARC-based cache) still allow
memory-mapped files to 'just work'?

Good question. This is going to need to be one of those details that gets worked out. I'm not sure if it will just work, but it may as long as we're just using our own pages from the ARC to back the original mapping. If not, I'm fairly sure custom hooks could be added to do the job.

(b) it would be best to verify whole-disk ZFS usage before avoiding the elevator.

Bypassing the elevator isn't all that unorthodox; I actually almost did it in the original implementation. The trouble was getting the barrier support implemented correctly, but I think I see how to do that now. The existing Linux md and loop devices bypass the elevator as well and register their own custom make_request functions.

All that said, I doubt I'll get to all of this any time soon.

@mauricev

I am testing this using a pool on a VMware VM. It's a RAIDZ2 with four 20 GB virtual drives. It was crazy slow, so I attached a ramdisk, /dev/ram0, which is 128 MB. I set this zvol up via iSCSI so that it initiates on the same host. The performance I get is better, but still horrendous: about 512 KB per second. Is this all because of the double-caching problem described here, or perhaps related to something in the Linux iSCSI stack?

@behlendorf
Contributor Author

There certainly are performance issues with the current zvol implementation. One such issue is the double caching observed above. The performance numbers delivered to zpios or Lustre, both of which bypass the Linux caching, are very good. So that may be the big issue here, although personally I suspect there is something else going on with the zvol that hurts performance as well. I'm sure it's all fixable; it just isn't a high priority yet with other big features such as the ZPL still missing. For the moment I'd hesitate to blame the iSCSI stack.

@mauricev

I see similar numbers with a plain zvol. I am going to continue testing to see whether it's my configuration.

@behlendorf
Contributor Author

I doubt it's your configuration since I've seen similar issues myself, but it's still good to double-check. Could you try limiting the ARC to caching only metadata and see if that improves things?

zfs set primarycache=metadata pool/zvol

@mauricev

I have real test results.
Initially, retesting with 5 GB RAM via iSCSI, I get about 2.28 MB per second. As a plain block device, I get about 2.59 MB per second, so iSCSI is not slowing it down much at all.

I retested again with 6 GB RAM (since it looked like it had been running out with 5 GB) and with the ARC limited to metadata. Here I get 5.38 MB per second, so clearly your explanation and remedy do make a difference.

Then I separately built four disks into an md RAID 6 and formatted it with ext3 (the disks above were also formatted ext3 at the top level). Here I average 8.46 MB/second, which is clearly faster than a ZVOL even with its ARC partially turned off. This further suggests, I think, that if the double caching were completely eliminated, it would be on par with regular Linux.

However, it's not clear to me how you use ZFS if you are not using ZVOLs.

Just for fun, I tried yet another set of disks, this time using zfs-fuse with no ZIL disk. This averaged 8.23 MB/second, which is virtually identical to the md result. So the apparently widespread belief that a FUSE-based ZFS is way slower is overblown.

However, as I mentioned, I now believe my configuration has to be killing these performance numbers across the board. I'm running all these tests in VMware Fusion on my Mac, which itself has a hardware RAID 5, so all the Linux disks, including the source disk (from which I copied the lib64 directory for the test cp timings), are virtual disks on this single underlying RAID 5. I'm going to see if I can do some testing later this week giving ZFS access to real hard drives.
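
A simple way to get comparable raw-throughput numbers, bypassing the page cache in the guest, is a direct-I/O dd run against the zvol device (the device path below is just an example):

    dd if=/dev/zero of=/dev/zd0 bs=1M count=512 oflag=direct
    dd if=/dev/zd0 of=/dev/null bs=1M count=512 iflag=direct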

@Rudd-O
Contributor

Rudd-O commented Mar 21, 2011

Will the suggestion of primarycache=metadata increase performance in the case of ZFS filesystems too?

@behlendorf
Contributor Author

No, ZFS filesystems already entirely bypass the page cache (except for mmap), so this shouldn't help. The work going on in the GFP_NOFS branch is targeted toward getting a handle on the memory usage and performance issues.

@Rudd-O
Contributor

Rudd-O commented Mar 22, 2011

But binaries and libraries are mmap'ed, so if you have /usr/lib or /lib on ZFS, they would still be duplicated, right?

With that in mind, I repeat the question: will the suggestion of primarycache=metadata increase performance (by saving memory) in the case of ZFS filesystems too?

Another question: why can't we just rely on the page cache instead of the ARC on Linux?

@behlendorf
Contributor Author

Right, if /usr/lib or /lib or anything like that is on ZFS, they will be duplicated in the page cache and the ARC. This is actually true on OpenSolaris systems too.

Setting primarycache=metadata will certainly save memory... what exactly that does to performance is a harder question to answer. For certain workloads it will probably help, and for others it will certainly hurt. There really is no one right answer.

As for using the page cache instead of the ARC on Linux, we have a few options that need to be explored. Unfortunately, all of them are a considerable amount of work and require a pretty detailed understanding of ZFS internals and Linux kernel memory management. As I see it there are really three options:

  • Use the ARC cache, avoid the page cache. This is what was done on OpenSolaris and required the minimum number of code changes to get more or less working in the Linux port. This is what works today, although I admit significant tuning is still needed to get it well behaved.
  • Never use the ARC cache, just use the Linux page cache. This should be entirely possible, but it will be considerable work and require extensive/complicated/dangerous changes to the ZFS internals. It's not clear it's 100% desirable either, because in many ways the ARC is much smarter than the Linux page cache.
  • Use the ARC cache and the Linux page cache. The ARC could be backed by Linux page cache pages instead of vmalloc()'ed buffers from the slab. This should give us the best of both worlds, but it will require ZFS changes and probably Linux kernel patches as well. This is probably the right long-term solution, with a fallback mode for unpatched kernels.

@byteharmony

I see this isn't scheduled to be addressed for some time; I was wondering if it is still where it was 2 years ago? I see it was a hot topic, and it's one that is important to me right now. I found this issue when it hit me tonight that I may be double caching between the kernel and the ARC; Google pulled it up.

Caching only metadata sounds like a good solution; the property is inherited, so it shouldn't get misconfigured much. Another idea I had, in case someone did misconfigure it, was to reduce zfs_arc_max in zfs.conf. I went from 4 GB down to 1 GB on a test system with no noticeable speed decrease for the guest running on the ZVOL. This should at least reduce the overhead of any duplication, and the ARC would have less to look through.
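
For anyone trying the same thing, the cap can be set persistently with a module option along these lines (the file path and value are just an example):

    # /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_max=1073741824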

I think I already did some testing with reasonably recent code comparing metadata vs. all, and got similar results. I can't remember for sure, but if that's the case, then at least for those of us who are running VMs, the more RAM available the BETTER!

BK

@behlendorf
Contributor Author

@byteharmony There's hope that this will be addressed for 0.7.0. This is one of the fundamental issues we really do need to fix sooner rather than later; however, it will be destabilizing in the short term.

@byteharmony

OK, good to know. Almost everything we do is on ZVOLs, and since RC8 we've seen GREAT speed improvements. So while you're not actively working on this yet, you've made progress ;).

BK

@ryao
Contributor

ryao commented Feb 25, 2013

Why was this closed?

@behlendorf
Contributor Author

@ryao It was closed because it was overly broad and not particularly useful. I'm all for improving ZVOL performance but let's open new issues for specific problems.

@yuriccp

yuriccp commented Mar 28, 2014

Sorry to resurrect this topic, but I have some questions.

How do the Linux cache and the ARC handle metadata?
Aren't dentries and inodes duplicated in both of them?
What is the performance impact of that?
And how does Linux handle these caches?
Does it recognize them as cache memory?
Does it shrink the ARC in low-memory situations?

akatrevorjay added a commit to akatrevorjay/zfs that referenced this issue Dec 16, 2017
jgallag88 pushed a commit to jgallag88/zfs that referenced this issue Jun 8, 2018
ahrens referenced this issue in ahrens/zfs Dec 9, 2019
ahrens referenced this issue in ahrens/zfs Dec 10, 2019
behlendorf pushed a commit that referenced this issue Dec 11, 2019
After spa_vdev_remove_aux() is called, the config nvlist is no longer
valid, as it's been replaced by the new one (with the specified device
removed).  Therefore any pointers into the nvlist are no longer valid.
So we can't save the result of
`fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the
call to spa_vdev_remove_aux().

Instead, use spa_strdup() to save a copy of the string before calling
spa_vdev_remove_aux.

Found by AddressSanitizer:

ERROR: AddressSanitizer: heap-use-after-free on address ...
READ of size 34 at 0x608000a1fcd0 thread T686
    #0 0x7fe88b0c166d  (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d)
    #1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447
    #2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259
    #3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    #4 0x55ffbc769fba in ztest_execute ztest.c:6714
    #5 0x55ffbc779a90 in ztest_thread ztest.c:6761
    #6 0x7fe889cbc6da in start_thread
    #7 0x7fe8899e588e in __clone

0x608000a1fcd0 is located 48 bytes inside of 88-byte region
freed by thread T686 here:
    #0 0x7fe88b14e7b8 in __interceptor_free
    #1 0x7fe88ae541c5 in nvlist_free nvpair.c:874
    #2 0x7fe88ae543ba in nvpair_free nvpair.c:844
    #3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978
    #4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185
    #5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221
    #6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    #7 0x55ffbc769fba in ztest_execute ztest.c:6714
    #8 0x55ffbc779a90 in ztest_thread ztest.c:6761
    #9 0x7fe889cbc6da in start_thread

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9706
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Dec 26, 2019
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Dec 27, 2019
tonyhutter pushed a commit that referenced this issue Jan 23, 2020
markroper added a commit to markroper/zfs that referenced this issue Feb 12, 2020
Using zfs with Lustre, an arc_read can trigger kernel memory allocation
that in turn leads to a memory reclaim callback and a deadlock within a
single zfs process. This change uses spl_fstrans_mark and
spl_fstrans_unmark to prevent the reclaim attempt and the deadlock
(https://zfsonlinux.topicbox.com/groups/zfs-devel/T4db2c705ec1804ba).
The stack trace observed is:

     #0 [ffffc9002b98adc8] __schedule at ffffffff81610f2e
     #1 [ffffc9002b98ae68] schedule at ffffffff81611558
     #2 [ffffc9002b98ae70] schedule_preempt_disabled at ffffffff8161184a
     #3 [ffffc9002b98ae78] __mutex_lock at ffffffff816131e8
     #4 [ffffc9002b98af18] arc_buf_destroy at ffffffffa0bf37d7 [zfs]
     #5 [ffffc9002b98af48] dbuf_destroy at ffffffffa0bfa6fe [zfs]
     #6 [ffffc9002b98af88] dbuf_evict_one at ffffffffa0bfaa96 [zfs]
     #7 [ffffc9002b98afa0] dbuf_rele_and_unlock at ffffffffa0bfa561 [zfs]
     #8 [ffffc9002b98b050] dbuf_rele_and_unlock at ffffffffa0bfa32b [zfs]
     #9 [ffffc9002b98b100] osd_object_delete at ffffffffa0b64ecc [osd_zfs]
    #10 [ffffc9002b98b118] lu_object_free at ffffffffa06d6a74 [obdclass]
    #11 [ffffc9002b98b178] lu_site_purge_objects at ffffffffa06d7fc1 [obdclass]
    #12 [ffffc9002b98b220] lu_cache_shrink_scan at ffffffffa06d81b8 [obdclass]
    #13 [ffffc9002b98b278] shrink_slab at ffffffff811ca9d8
    #14 [ffffc9002b98b338] shrink_node at ffffffff811cfd94
    #15 [ffffc9002b98b3b8] do_try_to_free_pages at ffffffff811cfe63
    #16 [ffffc9002b98b408] try_to_free_pages at ffffffff811d01c4
    #17 [ffffc9002b98b488] __alloc_pages_slowpath at ffffffff811be7f2
    #18 [ffffc9002b98b580] __alloc_pages_nodemask at ffffffff811bf3ed
    #19 [ffffc9002b98b5e0] new_slab at ffffffff81226304
    #20 [ffffc9002b98b638] ___slab_alloc at ffffffff812272ab
    #21 [ffffc9002b98b6f8] __slab_alloc at ffffffff8122740c
    #22 [ffffc9002b98b708] kmem_cache_alloc at ffffffff81227578
    #23 [ffffc9002b98b740] spl_kmem_cache_alloc at ffffffffa048a1fd [spl]
    #24 [ffffc9002b98b780] arc_buf_alloc_impl at ffffffffa0befba2 [zfs]
    #25 [ffffc9002b98b7b0] arc_read at ffffffffa0bf0924 [zfs]
    #26 [ffffc9002b98b858] dbuf_read at ffffffffa0bf9083 [zfs]
    #27 [ffffc9002b98b900] dmu_buf_hold_by_dnode at ffffffffa0c04869 [zfs]

Signed-off-by: Mark Roper <markroper@gmail.com>
allanjude pushed a commit to KlaraSystems/zfs that referenced this issue Apr 28, 2020
PrivatePuffin referenced this issue in PrivatePuffin/zfs Jun 18, 2020
Copyright header and readme cleanup
problame added a commit to problame/zfs that referenced this issue Oct 10, 2020
This is a fixup of commit 0fdd610

See added test case for a reproducer.

Stack trace:

    panic: VERIFY3(nvlist_next_nvpair(redactnvl, pair) == NULL) failed (0xfffff80003ce5d18x == 0x)

    cpuid = 7
    time = 1602212370
    KDB: stack backtrace:
    #0 0xffffffff80c1d297 at kdb_backtrace+0x67
    #1 0xffffffff80bd05cd at vpanic+0x19d
    #2 0xffffffff828446fa at spl_panic+0x3a
    #3 0xffffffff828af85d at dmu_redact_snap+0x39d
    #4 0xffffffff829c0370 at zfs_ioc_redact+0xa0
    #5 0xffffffff829bba44 at zfsdev_ioctl_common+0x4a4
    #6 0xffffffff8284c3ed at zfsdev_ioctl+0x14d
    #7 0xffffffff80a85ead at devfs_ioctl+0xad
    #8 0xffffffff8122a46c at VOP_IOCTL_APV+0x7c
    #9 0xffffffff80cb0a3a at vn_ioctl+0x16a
    #10 0xffffffff80a8649f at devfs_ioctl_f+0x1f
    #11 0xffffffff80c3b55e at kern_ioctl+0x2be
    #12 0xffffffff80c3b22d at sys_ioctl+0x15d
    #13 0xffffffff810a88e4 at amd64_syscall+0x364
    #14 0xffffffff81082330 at fast_syscall_common+0x101

Signed-off-by: Christian Schwarz <me@cschwarz.com>
sdimitro pushed a commit to sdimitro/zfs that referenced this issue Dec 7, 2021
Add a new hit-by-size histogram to zettacache that is updated whenever
there is a successful lookup (for read). This captures the minimum cache
size that would contain this block.

Add a new command (zcache) that has subcommands to display
and clear the new histogram.
rob-wing pushed a commit to KlaraSystems/zfs that referenced this issue Feb 17, 2023
Under certain loads, the following panic is hit:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

A race condition can occur when allocating a new vnode and adding that
vnode to the vfs hash. If the newly created vnode loses the race when
being inserted into the vfs hash, it will not be recycled as its
usecount is greater than zero, hitting the above assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700

Signed-off-by:  Rob Wing <rob.wing@klarasystems.com>
Sponsored-by:   rsync.net
Sponsored-by:   Klara, Inc.
rob-wing pushed a commit to KlaraSystems/zfs that referenced this issue Feb 17, 2023
behlendorf pushed a commit that referenced this issue Feb 22, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    #7 0xffffffff80669841 at vgone+0x31
    #8 0xffffffff8065806d at vfs_hash_insert+0x26d
    #9 0xffffffff81a39069 at sfs_vgetx+0x149
    #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #11 0xffffffff8065a28c at lookup+0x45c
    #12 0xffffffff806594b9 at namei+0x259
    #13 0xffffffff80676a33 at kern_statat+0xf3
    #14 0xffffffff8067712f at sys_fstatat+0x2f
    #15 0xffffffff808ae50c at amd64_syscall+0x10c
    #16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    #8 0xffffffff81a39069 at sfs_vgetx+0x149
    #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    #10 0xffffffff80661c2c at lookup+0x45c
    #11 0xffffffff80660e59 at namei+0x259
    #12 0xffffffff8067e3d3 at kern_statat+0xf3
    #13 0xffffffff8067eacf at sys_fstatat+0x2f
    #14 0xffffffff808b5ecc at amd64_syscall+0x10c
    #15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes #14501
behlendorf pushed a commit to behlendorf/zfs that referenced this issue May 28, 2023
behlendorf pushed a commit that referenced this issue May 30, 2023
EchterAgo pushed a commit to EchterAgo/zfs that referenced this issue Aug 4, 2023