Kmem rework #369

ryao · 2014-06-20T15:26:38Z

This refactoring is intended to simplify the code while making it easier to share code between ZoL and other Open ZFS platforms.

ryao · 2014-07-20T16:45:49Z

@behlendorf I have revised this pull request to place "kmem_cache: Call constructor/destructor on each alloc/free" at the top of the stack in response to #371. The rest of this pull request and its accessory pull request to the zfs repository should be safe to merge after review.

DeHackEd · 2014-08-11T12:45:09Z

I'm not 100% sure this is the cause, but it's the most likely issue.

# zpool export $activepool
# rmmod zfs
SysRq : Show Blocked State
 task                        PC stack   pid father
rmmod           D ffff880025165e18     0 25402  24110 0x00000000
ffff880025165de8 0000000000000082 ffff880025165d78 ffff8801b69768d0
ffff8801af7ca840 0000000100200009 ffff880025164010 0000000000011780
ffff8801b6758810 0000000000011780 0000000000011780 ffff8801b6758810
Call Trace:
[<ffffffff8e3d4c6d>] schedule+0x6e/0x70
[<ffffffffc00be16c>] spl_kmem_cache_destroy+0x14e/0x2f9 [spl]
[<ffffffffc00bc9f1>] ? spl_kmem_free+0x2d/0x2f [spl]
[<ffffffff8e062ab3>] ? bit_waitqueue+0xb0/0xb0
[<ffffffffc047ae48>] zil_fini+0x10/0x2e [zfs]
[<ffffffffc043f81d>] spa_fini+0x1d/0x112 [zfs]
[<ffffffffc0460eb6>] _fini+0xa3/0xf0 [zfs]
[<ffffffffc0460f11>] spl__fini+0xe/0x2d [zfs]
[<ffffffff8e08004a>] SyS_delete_module+0x122/0x19a
[<ffffffff8e02a347>] ? do_page_fault+0x3d/0x54
[<ffffffff8e3d7b22>] system_call_fastpath+0x16/0x1b

My tree: https://github.com/DeHackEd/spl/commits/dehacked

behlendorf · 2014-08-12T22:47:29Z

@DeHackEd Yes, the stack you reported was caused by this patch stack. It's easily reproducible with the kmem splat tests. In particular I hit it with the kmem:slab_small test. My guess is it's due to the various *_bit operations which don't work on virtual addresses.

behlendorf · 2014-08-13T00:01:41Z

@ryao I'm still working my way through all the patches so expect more comments but I wanted to quickly jot down some of my initial thoughts and concerns.

In general I think this is a nice bit of cleanup. In particular, I like the KM_* conversion function. It's clear to me that's what should have been done from day one. Updating the code to depend on the existing PF_FSTRAN flags is a nice clean solution as well. It's a shame we can't share more of the infrastructure xfs built up.

That said, my major concern thus far is that with the new interfaces it's possible for a kmem_alloc() to return a virtual address. As you point out under Illumos there's no real analog to this problem. But under Linux it matters a great deal exactly what kind of memory you've allocated. For example, it can result in seemingly bizarre problems like the *_bit operations not working. I think reworking to patch to ensure this sort of thing never happens will save us a huge amount of troubleshooting in the future.

The converse case of a vmem_alloc() being mapped to a kmalloc() is less concerning to me. I'm not aware of any problems this would cause. In practice it also should never happen, if we're used vmem_alloc() for a small allocation the call site should be fixed.

ryao · 2014-08-13T00:18:34Z

@behlendorf Only Linux-specific aspects of the code base need *_bit operations, so my present solution is to convert the routines that allocated their memory to use Linux's native memory allocation routines. Only a few routines require this. Anyway, I will think about how I could rework the patch stack to avoid the need.

As for @DeHackEd's issue, this patch stack includes a patch to convert the slab cache function to use kzalloc(), which is provided by the Linux kernel itself. *_bit operations should be fine there, so the problem is something else. My guess is that the wait_event(wq, atomic_read(&skc->skc_ref) == 0); line is involved.

behlendorf · 2014-08-13T18:27:39Z

my present solution is to convert the allocations that made them to use Linux's native memory allocation routines.

We should avoid this, even if all the needed changes end up in Linux specific code. Doing this goes against one of the design goals of this patch. We should be minimizing the delta between OpenZFS implementations. Simply ensuring that kmem_alloc() is never backed by a virtual address prevent this problem. Also as a Linux kernel developer I'd also be astonished to see this function behave this way.

ryao · 2014-08-13T18:32:17Z

@behlendorf In that case, I will try mapping kmem_alloc(), kmem_zalloc() and kmem_free() to kmalloc(), kzalloc() and kfree() respectively. That requires giving up on having those routines do allocations with kernel virtual memory, but it seems like that seems like that would be for the best.

behlendorf · 2014-08-13T18:51:41Z

One other general comment. Since eventually we'd like to move the spl code in to the zfs tree we should be careful to follow same style guidelines used for zfs. Currently the spl code is riddled with violations but we should be careful to avoid adding more when writing new code.

ryao · 2014-10-04T15:43:58Z

@behlendorf I think there is some misunderstanding about the distinction between vmem_alloc() and kmem_alloc(). vmem_alloc() is a low level virtual memory allocator on Solaris. It is used by the kernel to do slab allocations for kmem_cache_alloc(), which is used as the basis for kmem_alloc() via HOARD-style allocation. Consequently, all kmem_alloc() allocations on Solaris are portions of vmem_alloc() allocations. I found this confusing until I came to the conclusion that kmalloc() has no analog on Solaris.

The idea that mapping the Solaris analog of vmalloc() to kmalloc() is safe from the perspective of wait_on_bit() while the converse is unsafe is correct. However, all Solaris allocations can be safely mapped to kmalloc() or vmalloc() from the perspective of wait_on_bit() because Solaris has no analog to either kmalloc() or wait_on_bit(). The only places where such analogs exist are in Linux structures. I believe that it would be best to maintain a distinction between Linux allocations and Solaris allocations by using kmalloc() in sections of code that are Linux-specific and our emulated Solaris interfaces everywhere else. This might seem fragile at first glance, but I believe that is not the case for three reasons:

Linux's locking structures use atomic instructions and churn in them is rare. When churn does occur, there is an incredible amount of pressure to maintain both the size of the structure and remain backward compatible precisely because changes to locking structures can cause unanticipated issues that are hard to debug.
wait_on_bit() has an issue with vmalloc() based memory because it does a hashtable lookup that assumes that wait_on_bit() is never called on virtual memory. This hashtable lookup can be more expensive than conventional locking (as it involves more memory accesses than an atomic instruction for taking a mutex), so it will never be used inside one of the locking primitives that we map to the Solaris ones.
The kernel will print a backtrace to dmesg whenever wait_on_bit() is used inside memory from vmalloc(), so any problems that arise would appear very quickly in buildbots.

ryao · 2014-10-08T17:52:02Z

@behlendorf Here is a description of this pull request that I wrote in an email recently.

The kmem-rework originally intended to fix a rare deadlock that
SoftNAS managed to reliably trigger by running the VMWare Hypervisor's
NFS benchmark on a NFS export of a ZoL dataset from a VM with 8GB of RAM
and a 200GB SLOG. The deadlock was a recursive DMU transaction caused by
a combination of ZFS's atime updates doing DMU transactions and Linux's
direct reclaim doing atime updates inside critical sections formed by
DMU transactions. Systems with more RAM did not trigger it and there
were no reports that look similar (even now in hindsight).

I took the opportunity to fix additional potential problems that I
spotted (e.g. a race making KM_PUSHPAGE into KM_SLEEP) when writing the
fix for that issue. I heard the day after I sent the changes to SoftNAS
for evaluation that my nearly doubled write performance in VMWare's
benchmark in addition to fixing the deadlock. Here are the original
numbers, which SoftNAS has told me that I am free to share.

ZFSOnLinux 0.6.3 (with atime=off because atime=on deadlocks)

Storage Performance Results        Actual    Expected
Minimal write throughput(MB/s)     128.99130249023438       40.0
Minimal read throughput(MB/s)      64.23629951477051       80.0
Minimal seek throughput(seeks/s)   2481.5099487304688       400.0
Total write throughput(MB/s)       130.93539810180664       40.0
Total read throughput(MB/s)        64.28860092163086       80.0
Total seek throughput(seeks/s)     2501.2999877929688       500.0

ZFSOnLinux 0.6.3 (the default atime=on plus my patches)

Storage Performance Results        Actual    Expected
Minimal write throughput(MB/s)     157.76309967041016       40.0
Minimal read throughput(MB/s)      112.0521011352539       80.0
Minimal seek throughput(seeks/s)   2773.9099731445312       400.0
Total write throughput(MB/s)       160.7144012451172       40.0
Total read throughput(MB/s)        112.53049850463867       80.0
Total seek throughput(seeks/s)     2792.1400146484375       500.0

My interpretation of this is that the previous internal locking
introduced micro lags that hurt throughput and my changes
reduced/eliminated them

This predated openzfs/zfs@cd3939c, which SoftNAS told me further improved performance in that benchmark.

rbraddy · 2014-10-08T23:30:37Z

I can confirm the above benchmarks and findings. In fact, we have observed remarkable ZoL performance improvements that now make ZoL much more usable on VMware ESXi, especially VM cloning and migrations. Great job!

kernelOfTruth · 2014-10-09T18:18:11Z

@ryao from reading through this pull request I got the impression that only pools using atime would benefit from this - is this true ?

thanks

ryao · 2014-10-09T18:25:36Z

@kernelOfTruth All pools benefit from it. However, pools using atime have the additional benefit of avoiding the deadlock that I mentioned in the following blog post:

https://clusterhq.com/blog/zfs-on-linux-runtime-stability/

sempervictus · 2014-10-12T16:39:56Z

Seems SPL doesnt want to build for me - DKMS package builds, actual kernel module fails - http://dpaste.com/067615E.

kernelOfTruth · 2014-10-12T22:47:17Z

@ryao thank you =)

@sempervictus confirmed - got the same error message

ryao · 2014-10-13T11:01:14Z

@sempervictus @kernelOfTruth ryao/spl@634047f had a mistake. I had changed an ASSERT to a VERIFY as per @behlendorf's request, but neglected to remove the #ifdef DEBUG ... #endif removing that struct member from non-debug builds. The correction is quite literally:

diff --git a/include/sys/kmem.h b/include/sys/kmem.h                                                                                                                                                                                                                           
index 1f38405..196542c 100644                                                                                                                                                                                                                                                  
--- a/include/sys/kmem.h                                                                                                                                                                                                                                                       
+++ b/include/sys/kmem.h                                                                                                                                                                                                                                                       
@@ -97,9 +97,7 @@ kmem_flags_convert(int flags)                                                                                                                                                                                                                                
 }                                                                                                                                                                                                                                                                             

 typedef struct {                                                                                                                                                                                                                                                              
-#ifdef DEBUG                                                                                                                                                                                                                                                                  
        struct task_struct *fstrans_thread;                                                                                                                                                                                                                                    
-#endif                                                                                                                                                                                                                                                                        
        unsigned int saved_flags;                                                                                                                                                                                                                                              
 } fstrans_cookie_t;

I have pushed a revised commit.

ryao · 2014-10-13T11:30:26Z

I have refreshed the patches to the zfs repository via openzfs/zfs#2796. Until today, this patch had anenabling patch for supporting CONFIG_PAX_USERCOPY_SLABS on PaX/GrSecurity patched Linux kernels versioned 3.14.y and later. It did nothing without the corresponding ZFS patch and upon writing it, I realize that the enabling patch needs some additional work. I have pulled that patch from this pull request until I have had time to address the issue in it properly. While fixing the GrSecurity issue is important, the purpose of this pull request is to make our kmem emulation better, so there is no reason for it to block merging the other changes here.

behlendorf · 2014-10-14T00:33:40Z

@ryao I'll try and give the updated patch a careful review tomorrow. I would like to get this merged once we're sure there aren't any subtle issues lurking. However, I was able to hit another issue in the splat kmem:vmem_size test which caused a hang so there are still some issues lurking.

ryao · 2014-10-16T18:36:59Z

I had not run splat to test the more recent iterations of these changes and unfortunately, that lead to bugs going uncaught. These are my present mistakes:

diff --git a/module/spl/spl-kmem.c b/module/spl/spl-kmem.c
index 6dc2a6e..fcdcbb8 100644
--- a/module/spl/spl-kmem.c
+++ b/module/spl/spl-kmem.c
@@ -342,7 +342,7 @@ start:
         * __vmalloc_node() is not exported for our use.
         */
        if (unlikely(size > PAGE_SIZE*2)
-           && (flags & (KM_PUSHPAGE | KM_NOSLEEP)))
+           && !(flags & (KM_PUSHPAGE | KM_NOSLEEP)))
                ptr =  spl_vmalloc(size, lflags, PAGE_KERNEL);
        else
                ptr = kmalloc_node(size, lflags, node);
@@ -355,7 +355,7 @@ start:
                spl_debug_dumpstack(NULL);
        }

-       congestion_wait(BLK_RW_ASYNC, HZ/50);
+       cond_resched();
        goto start;

        /* Silence compiler warning */

The first is quite obvious in that I am sending a request for a large amount of memory to kmalloc() that it cannot satisfy. The stack trace for it looks like this:

[<ffffffffa0002915>] kmem_alloc_debug+0x15/0x50 [spl]
[<ffffffffa0029ad7>] splat_kmem_test12+0xe7/0xa20 [splat]
[<ffffffffa00282e8>] splat_unlocked_ioctl+0x228/0x9d0 [splat]
[<ffffffff81187d9d>] do_vfs_ioctl+0x2dd/0x4b0
[<ffffffff81187ff1>] SyS_ioctl+0x81/0xa0
[<ffffffff8155c429>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

The second is less obvious, but it has a similar backtrace in congestion_wait(). That looked right, although it turned out that it could block indefinitely because it will wait for a write on a backing device and we can have configurations where there is no backing device.

sempervictus · 2014-10-18T17:02:40Z

@behlendorf and/or @ryao: with the patch above applied to the PR, where do we stand on suspicions about corrupted pools?

On a related but separate note, is there a writeup i can reference for getting a comprehensive testing environment configured? We have datacenter environments with metal and virtual resources... i'm sure they wouldn't mind spinning a few more cycles for "the greater good."

behlendorf · 2014-10-19T21:32:17Z

@sempervictus @ryao's fixes addressed the issue I caught about potentially not zeroing memory. This could have resulted in problems for all callers which depended on that behavior. I'm still working through a careful review of the rest of the patch. Any additional testing you can offer in a test environment would be welcome.

behlendorf · 2014-10-20T18:26:22Z

@ryao can you rebase this against master. I've merged the pre-2.6.32 compatibility cleanup which touched/removed some of the same code.

ryao · 2014-10-24T16:36:33Z

@behlendorf It is rebased and should be ready to merge. I had mentioned in IRC that I had identified a problem with the linux:shrinker SPLAT test on Linux 3.14.14-gentoo, but it turns out that this is a pre-existing problem. I have included the patch from #403 in the pull request.

behlendorf · 2014-10-24T17:06:13Z

@ryao Thanks, I'll work through another round and feedback and testing today.

behlendorf · 2014-10-24T23:02:45Z

@ryao This is shaping up very nicely! I know there's a lot of feedback here but this is definitely getting close to something we can merge.

This has a few benefits. First, it fixes a regression that "Rework generic memory allocation interfaces" appears to have triggered in splat's slab_reap and slab_age tests. Second, it makes porting code from Illumos to ZFSOnLinux easier. Third, it has the side effect of making reclaim from slab caches that specify reclaim functions an order of magnitude faster. The splat slab_reap test usually took 30 to 40 seconds. With this change, it takes 3 to 4. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #369

behlendorf · 2014-10-28T20:18:31Z

The following patches have been merged. The remaining patches in the stack we still need to reach agreement on.

ad9863e kmem_cache: Call constructor/destructor on each alloc/free

wake_up_bit() is called on a word inside kmem_cache_t objects. This calls virt_to_page() on the address of the memory. That is incompatible with virtual memory, so we must switch to Linux's memory allocator. Signed-off-by: Richard Yao <ryao@gentoo.org>

This reverts commit eb0f407.

This patch achieves the following goals: 1. It replaces the preprocessor kmem flag to gfp flag mapping with proper translation logic. This eliminates the potential for surprises that were previously possible where kmem flags were mapped to gfp flags. 2. It maps `kmem_{,z}alloc()` KM_SLEEP allocations that were previously mapped to `kmalloc()` to `vmalloc()` to reduce internal memory fragmentation. 3. It discards the distinction between vmem_* and kmem_* that was previously made by mapping them to vmalloc() and kmalloc() respectively. This achieve better compatibility because kmem_* allocates are done from slabs allocated from vmem_*. Both are therefore virtual memory allocators and it makes no sense to implement them differently than one another. The detailed reasons for each are as follows: 1. Solaris derivatives have different allocation flag semantics than Linux. This was originally handled by trying to map Solaris flags to Linux flags, but the semantics are different enough that this approach does not correctly handle all cases. For example, 0 is KM_SLEEP on Solaris derivatives while 0 is illegal on Linux. This means that things like assertions that make assumptions about the flag semantics are no longer portable because reasonable assertions such as `ASSERT0(flags)` on Solaris derivatives are illegal on Linux. In addition, a trivial mapping allows us to freely mix and match flags. This is bad for portability and it can lead to unexpected consequences when clashes between semantics means that one expects one system's semantics, and receives another. 3. The SPL originally mapped kmem_alloc() to kmalloc() and vmem_alloc() to vmalloc(). One would be inclined to think this is correct by applying the reasonable expectation that things with similar names on each platform are similar things. However, this is not the case here. On Solaris, vmem_* is a general purpose arena allocator that does kernel virtual memory allocations. The Solaris SLAB allocator `kmem_cache_alloc()` operates by allocating slabs from vmem and returns objects. Allocations from kmem_alloc() work by performing HOARD-style allocations on pre-existing power of 2 SLAB caches. When mapping uses of these allocators to Linux equivalents, we must consider 4 allocators on Linux and how they interact: 1. The buddy allocator 2. The slab allocator 3. The vmap allocator 4. The kernel virtual memory allocator The buddy allocator is used for allocating both pages and slabs for Linux's kmem_slab_alloc. These are then used to provide generic power of 2 caches to which kmalloc() is mapped. Allocations that are larger than the largest power of 2 are sent directly to the buddy allocator. This is analogous to kmem_cache_alloc() and kmem_alloc() on Solaris. The third allocator is the vmap allocator, which concerns itself with allocating address space. The four allocator is the kernel virtual memory allocator and is invoked via `vmalloc()`. This uses pages from the buddy allocator and address space from the vmap allocator to perform virtual memory allocations. 3. Switching the KM_SLEEP allocations to `vmalloc()` provides some protection from deadlocks caused by internal memory fragmentation. It would have been ideal to make all allocations virtual like they are on Illumos. However, virtual memory allocations allocations that require KM_PUSHPAGE or KM_NOSLEEP semantics will receive KM_SLEEP semantics on Linux whenever a page directory table entry must be allocated, which is unsafe. We are therefore forced to use physical memory for KM_PUSHPAGE and KM_NOSLEEP allocations. That is suboptimal from the perspective of reducing internal memory fragmentation, but we still partially benefit by mapping KM_SLEEP allocations to `vmalloc()`. A caveat that aries from replacing `kmalloc()` with `vmalloc()` is that code using Linux's wake_up_bit should use the native Linux allocators. This has no equivalent on Solaris. While it might seem fragile at first glance, that is not the case for three reasons: 1. Linux's locking structures use atomic instructions and churn in them is rare. When churn does occur, there is an incredible amount of pressure to maintain both the size of the structure and remain backward compatible precisely because changes to locking structures can cause unanticipated issues that are hard to debug. 2. The incompatibility arises because `wait_on_bit()` does a hashtable lookup that assumes that `wait_on_bit()` is never called on virtual memory. This hashtable lookup should be more expensive than conventional locking because it involves more memory accesses than an atomic instruction for taking a mutex, so it will never be used inside one of the locking primitives that we map to the Solaris ones. 3. The kernel will print a backtrace to dmesg whenever wait_on_bit() is used inside memory from vmalloc(), so any problems that arise would appear very quickly in buildbots. Consequently, it is reasonable to expect allocations intended for structures that use `wake_up_bit()` to be done using the Linux allocator. At present, the only allocations for such structures are done inside the SPL SLAB allocator and for super blocks. No other code uses it or is likely to use it. These changes appear to create the most semantically equivalent mapping possible on Linux. The result is the elimination of concerns regarding proper use of generic interfaces when writing portable code, which posed problems for the development of things like sgbuf. A couple of additional changes worth noting are: 1. The kmem_alloc_node interface has been removed(). It has no external consumers and does not exist on Solaris. 2. sys/vmem.h has been introduced as an alias of sys/kmem.h for Illumos compatibility. Signed-off-by: Richard Yao <ryao@gentoo.org>

If a SLAB cache is full and two allocations occur from the same SLAB cache nearly simultaneously where one is KM_SLEEP and another is either KM_PUSHPAGE or KM_NOSLEEP, the one that occurs first will dictate the KM_FLAGS used for SLAB growth for both of them. This is a race condition that at best hurts performance and at worse, causes deadlocks. We address this by modifying `spl_cache_grow()` to only provide the emergency allocation semantics to KM_PUSHPAGE allocations, with KM_SLEEP allocations being coalesced and KM_NOSLEEP allocations failing immediately. Signed-off-by: Richard Yao <ryao@gentoo.org>

The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit that is used to mark transactions to that the translation of the IRIX kmem flags into Linux gfp flags for allocations inside of transactions will dip into kernel memory reserves to avoid deadlocks during writeback. Linux 3.9 provided the additional PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS began using in 3.15. This patch implements hooks for marking transactions via PF_FSTRANS. When an allocation is performed with in the context of PF_FSTRANS, any KM_SLEEP allocation is transparently converted into a KM_PUSHPAGE allocation. It will also set PF_MEMALLOC_NOIO to prevent direct reclaim from entering `pageout()` on any KM_PUSHPAGE or KM_NOSLEEP allocation on Linux 3.9 or later. Signed-off-by: Richard Yao <ryao@gentoo.org>

This reverts commit e302072.

@nr

The comment above the Linux 3.17 kernel's clear_bit() states: /** * clear_bit - Clears a bit in memory * @nr: Bit to clear * @addr: Address to start counting from * * clear_bit() is atomic and may not be reordered. However, it does * not contain a memory barrier, so if it is used for locking purposes, * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic() * in order to ensure changes are visible on other processors. */ This comment does not make sense in the context of x86 because x86 maps these operations to barrier(), which is a compiler barrier. However, it does make sense to me when I consider architectures that reorder around atomic instructions. In such situations, a processor is allowed to execute the wake_up_bit() before clear_bit() and we have a race. There are a few architectures that suffer from this issue: http://lxr.free-electrons.com/source/arch/arm/include/asm/barrier.h?v=3.16#L83 http://lxr.free-electrons.com/source/arch/arm64/include/asm/barrier.h?v=3.16#L102 http://lxr.free-electrons.com/source/arch/mips/include/asm/barrier.h?v=3.16#L199 http://lxr.free-electrons.com/source/arch/powerpc/include/asm/barrier.h?v=3.16#L88 http://lxr.free-electrons.com/source/arch/s390/include/asm/barrier.h?v=3.16#L32 http://lxr.free-electrons.com/source/arch/tile/include/asm/barrier.h?v=3.16#L83 https://en.wikipedia.org/wiki/Memory_ordering In such situations, the other processor would wake-up, see the bit is still taken and go to sleep, while the one responsible for waking it up will assume that it did its job and continue. https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h#L100 It is important to note that smp_mb__{before,after}_{atomic,clear}_{dec,inc,bit}(), were replaced by smp_mb__{before,after}_atomic() in recent kernels: torvalds/linux@febdbfe Some compatibility code was added to replace it in the time being, although it does not interact well with -Werror: http://www.spinics.net/lists/backports/msg02669.html http://lxr.free-electrons.com/source/include/linux/bitops.h?v=3.16#L48 In addition, the kernel's code paths are using clear_bit_unlock() in situations where clear_bit is used for unlocking. This adds smp_mb__before_atomic(), which I assume is for Alpha. This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to smp_mb__{before,after}_clear_bit() on older kernels and changes our code to leverage it in a manner consistent with the mainine kernel. Signed-off-by: Richard Yao <ryao@gentoo.org>

The initial port of ZFS to Linux required a way to identify virtual memory to make IO to virtual memory backed slabs work, so kmem_virt() was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is logically equivalent to kmem_virt(). Support for kernels before 2.6.26 was later dropped and more recently, support for kernels before Linux 2.6.32 has been dropped. We retire kmem_virt() in favor of is_vmalloc_addr() to cleanup the code. Signed-off-by: Richard Yao <ryao@gentoo.org>

ryao · 2014-11-03T19:04:16Z

I have pushed a refresh. There are two known issues at this time:

The issue with kmem leak accounting being off has not been addressed.
splat kmem test 13 causes the OOM killer to trigger.

I suspect that the kmem chnages exposed some subtle issues that were hidden by the earlier kmem code. I am going to work some more on these issues more today, but tomorrow onward, I need to switch to working on the libzfs_core extensions for work.

behlendorf · 2014-11-06T20:12:46Z

@ryao regarding the OOM killer we should avoid disabling it in such a common code path. We want it to be able to always kill runaway user processes and disabling it should not be necessary.

This has a few benefits. First, it fixes a regression that "Rework generic memory allocation interfaces" appears to have triggered in splat's slab_reap and slab_age tests. Second, it makes porting code from Illumos to ZFSOnLinux easier. Third, it has the side effect of making reclaim from slab caches that specify reclaim functions an order of magnitude faster. The splat slab_reap test usually took 30 to 40 seconds. With this change, it takes 3 to 4. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#369

behlendorf · 2015-02-02T18:41:39Z

Closing. The kmem rework was merged some time ago.

ryao mentioned this pull request Jun 20, 2014

Kmem rework openzfs/zfs#2411

Closed

ryao mentioned this pull request Jul 15, 2014

Swap on zvol still hangs machine with all CPUs active openzfs/zfs#2390

Closed

behlendorf added the Bug label Jul 15, 2014

behlendorf added this to the 0.6.4 milestone Jul 15, 2014

ryao mentioned this pull request Jul 30, 2014

Heavy IO lockup from Linux 3.5->3.6 change openzfs/zfs#1342

Closed

behlendorf mentioned this pull request Aug 13, 2014

Cleanup Memory Allocation Routines #267

Closed

ryao mentioned this pull request Sep 15, 2014

Hang under memory pressure openzfs/zfs#2308

Closed

ryao force-pushed the kmem-rework branch from 811614e to 83f77d8 Compare October 8, 2014 17:01

ryao mentioned this pull request Oct 9, 2014

Suppress AIO kmem warnings openzfs/zfs#2774

Closed

ryao force-pushed the kmem-rework branch from 83f77d8 to d7db449 Compare October 13, 2014 11:00

ryao force-pushed the kmem-rework branch from d7db449 to 55c5aa4 Compare October 13, 2014 11:13

ryao mentioned this pull request Oct 13, 2014

Kmem rework openzfs/zfs#2796

Closed

ryao force-pushed the kmem-rework branch from 55c5aa4 to f61fcf5 Compare October 13, 2014 11:21

ryao force-pushed the kmem-rework branch 2 times, most recently from a46707e to adf1e3c Compare October 23, 2014 23:51

ryao mentioned this pull request Oct 24, 2014

spl_kmem_cache_reclaim=0 causes splat to hang in linux:shrinker on 3.14.14-gentoo #404

Closed

ryao mentioned this pull request Nov 2, 2014

Severe contention in get_vmalloc_info() openzfs/zfs#2705

Closed

ryao added 7 commits November 3, 2014 09:18

Revert "Add PF_NOFS debugging flag"

7de7607

This reverts commit eb0f407.

Revert "Linux 3.16 compat: smp_mb__after_clear_bit()"

291a4b8

This reverts commit e302072.

ryao force-pushed the kmem-rework branch from adf1e3c to 01a9891 Compare November 3, 2014 19:00

ryao force-pushed the kmem-rework branch from 01a9891 to 723c8bb Compare November 3, 2014 19:01

behlendorf mentioned this pull request Nov 21, 2014

Kmem rework (WIP) #414

Closed

behlendorf closed this Feb 2, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kmem rework #369

Kmem rework #369

ryao commented Jun 20, 2014

ryao commented Jul 20, 2014

DeHackEd commented Aug 11, 2014

behlendorf commented Aug 12, 2014

behlendorf commented Aug 13, 2014

ryao commented Aug 13, 2014

behlendorf commented Aug 13, 2014

ryao commented Aug 13, 2014

behlendorf commented Aug 13, 2014

ryao commented Oct 4, 2014

ryao commented Oct 8, 2014

rbraddy commented Oct 8, 2014

kernelOfTruth commented Oct 9, 2014

ryao commented Oct 9, 2014

sempervictus commented Oct 12, 2014

kernelOfTruth commented Oct 12, 2014

ryao commented Oct 13, 2014

ryao commented Oct 13, 2014

behlendorf commented Oct 14, 2014

ryao commented Oct 16, 2014

sempervictus commented Oct 18, 2014

behlendorf commented Oct 19, 2014

behlendorf commented Oct 20, 2014

ryao commented Oct 24, 2014

behlendorf commented Oct 24, 2014

behlendorf commented Oct 24, 2014

behlendorf commented Oct 28, 2014

ryao commented Nov 3, 2014

behlendorf commented Nov 6, 2014

behlendorf commented Feb 2, 2015

Kmem rework #369

Kmem rework #369

Conversation

ryao commented Jun 20, 2014

ryao commented Jul 20, 2014

DeHackEd commented Aug 11, 2014

behlendorf commented Aug 12, 2014

behlendorf commented Aug 13, 2014

ryao commented Aug 13, 2014

behlendorf commented Aug 13, 2014

ryao commented Aug 13, 2014

behlendorf commented Aug 13, 2014

ryao commented Oct 4, 2014

ryao commented Oct 8, 2014

rbraddy commented Oct 8, 2014

kernelOfTruth commented Oct 9, 2014

ryao commented Oct 9, 2014

sempervictus commented Oct 12, 2014

kernelOfTruth commented Oct 12, 2014

ryao commented Oct 13, 2014

ryao commented Oct 13, 2014

behlendorf commented Oct 14, 2014

ryao commented Oct 16, 2014

sempervictus commented Oct 18, 2014

behlendorf commented Oct 19, 2014

behlendorf commented Oct 20, 2014

ryao commented Oct 24, 2014

behlendorf commented Oct 24, 2014

behlendorf commented Oct 24, 2014

behlendorf commented Oct 28, 2014

ryao commented Nov 3, 2014

behlendorf commented Nov 6, 2014

behlendorf commented Feb 2, 2015