This repository has been archived by the owner on Feb 26, 2020. It is now read-only.

Kmem rework #369

Closed
wants to merge 8 commits into from

Conversation

ryao
Contributor

@ryao ryao commented Jun 20, 2014

This refactoring is intended to simplify the code while making it easier to share code between ZoL and other Open ZFS platforms.

@ryao
Contributor Author

ryao commented Jul 20, 2014

@behlendorf I have revised this pull request to place "kmem_cache: Call constructor/destructor on each alloc/free" at the top of the stack in response to #371. The rest of this pull request and its accessory pull request to the zfs repository should be safe to merge after review.

@DeHackEd
Contributor

I'm not 100% sure this is the cause, but it's the most likely issue.

# zpool export $activepool
# rmmod zfs
SysRq : Show Blocked State
 task                        PC stack   pid father
rmmod           D ffff880025165e18     0 25402  24110 0x00000000
ffff880025165de8 0000000000000082 ffff880025165d78 ffff8801b69768d0
ffff8801af7ca840 0000000100200009 ffff880025164010 0000000000011780
ffff8801b6758810 0000000000011780 0000000000011780 ffff8801b6758810
Call Trace:
[<ffffffff8e3d4c6d>] schedule+0x6e/0x70
[<ffffffffc00be16c>] spl_kmem_cache_destroy+0x14e/0x2f9 [spl]
[<ffffffffc00bc9f1>] ? spl_kmem_free+0x2d/0x2f [spl]
[<ffffffff8e062ab3>] ? bit_waitqueue+0xb0/0xb0
[<ffffffffc047ae48>] zil_fini+0x10/0x2e [zfs]
[<ffffffffc043f81d>] spa_fini+0x1d/0x112 [zfs]
[<ffffffffc0460eb6>] _fini+0xa3/0xf0 [zfs]
[<ffffffffc0460f11>] spl__fini+0xe/0x2d [zfs]
[<ffffffff8e08004a>] SyS_delete_module+0x122/0x19a
[<ffffffff8e02a347>] ? do_page_fault+0x3d/0x54
[<ffffffff8e3d7b22>] system_call_fastpath+0x16/0x1b

My tree: https://github.com/DeHackEd/spl/commits/dehacked

@behlendorf
Contributor

@DeHackEd Yes, the stack you reported was caused by this patch stack. It's easily reproducible with the kmem splat tests. In particular I hit it with the kmem:slab_small test. My guess is it's due to the various *_bit operations which don't work on virtual addresses.

@behlendorf
Contributor

@ryao I'm still working my way through all the patches so expect more comments but I wanted to quickly jot down some of my initial thoughts and concerns.

In general I think this is a nice bit of cleanup. In particular, I like the KM_* conversion function. It's clear to me that's what should have been done from day one. Updating the code to depend on the existing PF_FSTRANS flag is a nice clean solution as well. It's a shame we can't share more of the infrastructure xfs built up.

That said, my major concern thus far is that with the new interfaces it's possible for a kmem_alloc() to return a virtual address. As you point out, under Illumos there's no real analog to this problem. But under Linux it matters a great deal exactly what kind of memory you've allocated. For example, it can result in seemingly bizarre problems like the *_bit operations not working. I think reworking the patch to ensure this sort of thing never happens will save us a huge amount of troubleshooting in the future.

The converse case of a vmem_alloc() being mapped to a kmalloc() is less concerning to me. I'm not aware of any problems this would cause. In practice it also should never happen; if we've used vmem_alloc() for a small allocation, the call site should be fixed.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@behlendorf Only Linux-specific aspects of the code base need *_bit operations, so my present solution is to convert the routines that allocated their memory to use Linux's native memory allocation routines. Only a few routines require this. Anyway, I will think about how I could rework the patch stack to avoid the need.

As for @DeHackEd's issue, this patch stack includes a patch to convert the slab cache function to use kzalloc(), which is provided by the Linux kernel itself. *_bit operations should be fine there, so the problem is something else. My guess is that the wait_event(wq, atomic_read(&skc->skc_ref) == 0); line is involved.

@behlendorf
Contributor

my present solution is to convert the allocations that made them to use Linux's native memory allocation routines.

We should avoid this, even if all the needed changes end up in Linux-specific code. Doing this goes against one of the design goals of this patch: we should be minimizing the delta between OpenZFS implementations. Simply ensuring that kmem_alloc() is never backed by a virtual address prevents this problem. As a Linux kernel developer I'd also be astonished to see this function behave this way.

@ryao
Contributor Author

ryao commented Aug 13, 2014

@behlendorf In that case, I will try mapping kmem_alloc(), kmem_zalloc() and kmem_free() to kmalloc(), kzalloc() and kfree() respectively. That requires giving up on having those routines do allocations with kernel virtual memory, but it seems like that would be for the best.

@behlendorf
Contributor

One other general comment. Since eventually we'd like to move the spl code into the zfs tree, we should be careful to follow the same style guidelines used for zfs. Currently the spl code is riddled with violations, but we should be careful to avoid adding more when writing new code.

@ryao
Contributor Author

ryao commented Oct 4, 2014

@behlendorf I think there is some misunderstanding about the distinction between vmem_alloc() and kmem_alloc(). vmem_alloc() is a low level virtual memory allocator on Solaris. It is used by the kernel to do slab allocations for kmem_cache_alloc(), which is used as the basis for kmem_alloc() via HOARD-style allocation. Consequently, all kmem_alloc() allocations on Solaris are portions of vmem_alloc() allocations. I found this confusing until I came to the conclusion that kmalloc() has no analog on Solaris.

The idea that mapping the Solaris analog of vmalloc() to kmalloc() is safe from the perspective of wait_on_bit() while the converse is unsafe is correct. However, all Solaris allocations can be safely mapped to kmalloc() or vmalloc() from the perspective of wait_on_bit() because Solaris has no analog to either kmalloc() or wait_on_bit(). The only places where such analogs exist are in Linux structures. I believe that it would be best to maintain a distinction between Linux allocations and Solaris allocations by using kmalloc() in sections of code that are Linux-specific and our emulated Solaris interfaces everywhere else. This might seem fragile at first glance, but I believe that is not the case for three reasons:

  1. Linux's locking structures use atomic instructions and churn in them is rare. When churn does occur, there is an incredible amount of pressure to maintain both the size of the structure and remain backward compatible precisely because changes to locking structures can cause unanticipated issues that are hard to debug.
  2. wait_on_bit() has an issue with vmalloc() based memory because it does a hashtable lookup that assumes that wait_on_bit() is never called on virtual memory. This hashtable lookup can be more expensive than conventional locking (as it involves more memory accesses than an atomic instruction for taking a mutex), so it will never be used inside one of the locking primitives that we map to the Solaris ones.
  3. The kernel will print a backtrace to dmesg whenever wait_on_bit() is used inside memory from vmalloc(), so any problems that arise would appear very quickly in buildbots.

@ryao
Contributor Author

ryao commented Oct 8, 2014

@behlendorf Here is a description of this pull request that I wrote in an email recently.

The kmem rework was originally intended to fix a rare deadlock that
SoftNAS managed to reliably trigger by running the VMWare Hypervisor's
NFS benchmark on a NFS export of a ZoL dataset from a VM with 8GB of RAM
and a 200GB SLOG. The deadlock was a recursive DMU transaction caused by
a combination of ZFS's atime updates doing DMU transactions and Linux's
direct reclaim doing atime updates inside critical sections formed by
DMU transactions. Systems with more RAM did not trigger it and there
were no reports that look similar (even now in hindsight).

I took the opportunity to fix additional potential problems that I
spotted (e.g. a race making KM_PUSHPAGE into KM_SLEEP) when writing the
fix for that issue. The day after I sent the changes to SoftNAS for
evaluation, I heard that they nearly doubled write performance in
VMWare's benchmark in addition to fixing the deadlock. Here are the original
numbers, which SoftNAS has told me that I am free to share.

ZFSOnLinux 0.6.3 (with atime=off because atime=on deadlocks)

Storage Performance Results        Actual    Expected
Minimal write throughput(MB/s)     128.99130249023438       40.0
Minimal read throughput(MB/s)      64.23629951477051       80.0
Minimal seek throughput(seeks/s)   2481.5099487304688       400.0
Total write throughput(MB/s)       130.93539810180664       40.0
Total read throughput(MB/s)        64.28860092163086       80.0
Total seek throughput(seeks/s)     2501.2999877929688       500.0

ZFSOnLinux 0.6.3 (the default atime=on plus my patches)

Storage Performance Results        Actual    Expected
Minimal write throughput(MB/s)     157.76309967041016       40.0
Minimal read throughput(MB/s)      112.0521011352539       80.0
Minimal seek throughput(seeks/s)   2773.9099731445312       400.0
Total write throughput(MB/s)       160.7144012451172       40.0
Total read throughput(MB/s)        112.53049850463867       80.0
Total seek throughput(seeks/s)     2792.1400146484375       500.0

My interpretation of this is that the previous internal locking
introduced micro lags that hurt throughput and my changes
reduced or eliminated them.

This predated openzfs/zfs@cd3939c, which SoftNAS told me further improved performance in that benchmark.

@rbraddy

rbraddy commented Oct 8, 2014

I can confirm the above benchmarks and findings. In fact, we have observed remarkable ZoL performance improvements that now make ZoL much more usable on VMware ESXi, especially VM cloning and migrations. Great job!

@kernelOfTruth

@ryao from reading through this pull request I got the impression that only pools using atime would benefit from this - is this true?

thanks

@ryao
Contributor Author

ryao commented Oct 9, 2014

@kernelOfTruth All pools benefit from it. However, pools using atime have the additional benefit of avoiding the deadlock that I mentioned in the following blog post:

https://clusterhq.com/blog/zfs-on-linux-runtime-stability/

@sempervictus
Contributor

Seems SPL doesn't want to build for me - the DKMS package builds, but the actual kernel module fails - http://dpaste.com/067615E.

@kernelOfTruth

@ryao thank you =)

@sempervictus confirmed - got the same error message

@ryao
Contributor Author

ryao commented Oct 13, 2014

@sempervictus @kernelOfTruth ryao/spl@634047f had a mistake. I had changed an ASSERT to a VERIFY as per @behlendorf's request, but neglected to remove the #ifdef DEBUG ... #endif that excluded that struct member from non-debug builds. The correction is quite literally:

diff --git a/include/sys/kmem.h b/include/sys/kmem.h
index 1f38405..196542c 100644
--- a/include/sys/kmem.h
+++ b/include/sys/kmem.h
@@ -97,9 +97,7 @@ kmem_flags_convert(int flags)
 }

 typedef struct {
-#ifdef DEBUG
        struct task_struct *fstrans_thread;
-#endif
        unsigned int saved_flags;
 } fstrans_cookie_t;

I have pushed a revised commit.

@ryao
Contributor Author

ryao commented Oct 13, 2014

I have refreshed the patches to the zfs repository via openzfs/zfs#2796. Until today, this patch stack had an enabling patch for supporting CONFIG_PAX_USERCOPY_SLABS on PaX/GrSecurity-patched Linux kernels versioned 3.14.y and later. It did nothing without the corresponding ZFS patch, and upon writing that, I realized that the enabling patch needs some additional work. I have pulled it from this pull request until I have had time to address the issue properly. While fixing the GrSecurity issue is important, the purpose of this pull request is to make our kmem emulation better, so there is no reason for it to block merging the other changes here.

@behlendorf
Contributor

@ryao I'll try and give the updated patch a careful review tomorrow. I would like to get this merged once we're sure there aren't any subtle issues. However, I was able to hit another issue in the splat kmem:vmem_size test which caused a hang, so there are still some problems lurking.

@ryao
Contributor Author

ryao commented Oct 16, 2014

I had not run splat to test the more recent iterations of these changes, and unfortunately that led to bugs going uncaught. These are my present mistakes:

diff --git a/module/spl/spl-kmem.c b/module/spl/spl-kmem.c
index 6dc2a6e..fcdcbb8 100644
--- a/module/spl/spl-kmem.c
+++ b/module/spl/spl-kmem.c
@@ -342,7 +342,7 @@ start:
         * __vmalloc_node() is not exported for our use.
         */
        if (unlikely(size > PAGE_SIZE*2)
-           && (flags & (KM_PUSHPAGE | KM_NOSLEEP)))
+           && !(flags & (KM_PUSHPAGE | KM_NOSLEEP)))
                ptr =  spl_vmalloc(size, lflags, PAGE_KERNEL);
        else
                ptr = kmalloc_node(size, lflags, node);
@@ -355,7 +355,7 @@ start:
                spl_debug_dumpstack(NULL);
        }

-       congestion_wait(BLK_RW_ASYNC, HZ/50);
+       cond_resched();
        goto start;

        /* Silence compiler warning */

The first is quite obvious: I was sending kmalloc() a request for a large amount of memory that it cannot satisfy. The stack trace for it looks like this:

[<ffffffffa0002915>] kmem_alloc_debug+0x15/0x50 [spl]
[<ffffffffa0029ad7>] splat_kmem_test12+0xe7/0xa20 [splat]
[<ffffffffa00282e8>] splat_unlocked_ioctl+0x228/0x9d0 [splat]
[<ffffffff81187d9d>] do_vfs_ioctl+0x2dd/0x4b0
[<ffffffff81187ff1>] SyS_ioctl+0x81/0xa0
[<ffffffff8155c429>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

The second is less obvious, but it has a similar backtrace ending in congestion_wait(). That looked right, although it turned out that it could block indefinitely, because congestion_wait() waits for writeback to a backing device and some configurations have no backing device.

@sempervictus
Contributor

@behlendorf and/or @ryao: with the patch above applied to the PR, where do we stand on suspicions about corrupted pools?

On a related but separate note, is there a writeup I can reference for getting a comprehensive testing environment configured? We have datacenter environments with metal and virtual resources... I'm sure they wouldn't mind spinning a few more cycles for "the greater good."

@behlendorf
Contributor

@sempervictus @ryao's fixes addressed the issue I caught about potentially not zeroing memory. This could have resulted in problems for all callers which depended on that behavior. I'm still working through a careful review of the rest of the patch. Any additional testing you can offer in a test environment would be welcome.

@behlendorf
Contributor

@ryao can you rebase this against master? I've merged the pre-2.6.32 compatibility cleanup which touched/removed some of the same code.

@ryao ryao force-pushed the kmem-rework branch 2 times, most recently from a46707e to adf1e3c on October 23, 2014 23:51
@ryao
Contributor Author

ryao commented Oct 24, 2014

@behlendorf It is rebased and should be ready to merge. I had mentioned in IRC that I had identified a problem with the linux:shrinker SPLAT test on Linux 3.14.14-gentoo, but it turns out that this is a pre-existing problem. I have included the patch from #403 in the pull request.

@behlendorf
Contributor

@ryao Thanks, I'll work through another round of feedback and testing today.

@behlendorf
Contributor

@ryao This is shaping up very nicely! I know there's a lot of feedback here but this is definitely getting close to something we can merge.

behlendorf pushed a commit that referenced this pull request Oct 28, 2014
This has a few benefits. First, it fixes a regression that "Rework
generic memory allocation interfaces" appears to have triggered in
splat's slab_reap and slab_age tests. Second, it makes porting code from
Illumos to ZFSOnLinux easier. Third, it has the side effect of making
reclaim from slab caches that specify reclaim functions an order of
magnitude faster. The splat slab_reap test usually took 30 to 40
seconds. With this change, it takes 3 to 4.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #369
@behlendorf
Contributor

The following patches have been merged. We still need to reach agreement on the remaining patches in the stack.

ad9863e kmem_cache: Call constructor/destructor on each alloc/free

wake_up_bit() is called on a word inside kmem_cache_t objects.  This
calls virt_to_page() on the address of the memory. That is incompatible
with virtual memory, so we must switch to Linux's memory allocator.

Signed-off-by: Richard Yao <ryao@gentoo.org>
This patch achieves the following goals:

1. It replaces the preprocessor kmem-flag-to-gfp-flag mapping with
proper translation logic. This eliminates the potential for surprises
that the previous direct mapping made possible.

2. It maps `kmem_{,z}alloc()` KM_SLEEP allocations
that were previously mapped to `kmalloc()` to `vmalloc()` to reduce
internal memory fragmentation.

3. It discards the distinction between vmem_* and kmem_* that was
previously made by mapping them to vmalloc() and kmalloc() respectively.
This achieves better compatibility because kmem_* allocations are done from
slabs allocated from vmem_*. Both are therefore virtual memory
allocators and it makes no sense to implement them differently than one
another.

The detailed reasons for each are as follows:

1. Solaris derivatives have different allocation flag semantics than Linux.
This was originally handled by trying to map Solaris flags to Linux
flags, but the semantics are different enough that this approach does
not correctly handle all cases. For example, 0 is KM_SLEEP on Solaris
derivatives while 0 is illegal on Linux. This means that things like
assertions that make assumptions about the flag semantics are no longer
portable because reasonable assertions such as `ASSERT0(flags)` on
Solaris derivatives are illegal on Linux. In addition, a trivial mapping
allows us to freely mix and match flags.  This is bad for portability
and it can lead to unexpected consequences when clashes between
semantics mean that one expects one system's semantics, and receives
another.

3. The SPL originally mapped kmem_alloc() to kmalloc() and
vmem_alloc() to vmalloc(). One would be inclined to think this is
correct by applying the reasonable expectation that things with similar
names on each platform are similar things. However, this is not the case
here. On Solaris, vmem_* is a general purpose arena allocator that does
kernel virtual memory allocations. The Solaris SLAB allocator
`kmem_cache_alloc()` operates by allocating slabs from vmem and returns
objects. Allocations from kmem_alloc() work by performing HOARD-style
allocations on pre-existing power of 2 SLAB caches. When mapping uses of
these allocators to Linux equivalents, we must consider 4 allocators on
Linux and how they interact:

1. The buddy allocator
2. The slab allocator
3. The vmap allocator
4. The kernel virtual memory allocator

The buddy allocator is used for allocating both pages and the slabs
used by Linux's slab allocator. These are then used to provide generic
power of 2 caches to which kmalloc() is mapped. Allocations that are
larger than the largest power of 2 are sent directly to the buddy
allocator.  This is analogous to kmem_cache_alloc() and kmem_alloc() on
Solaris. The third allocator is the vmap allocator, which concerns
itself with allocating address space. The fourth allocator is the
kernel virtual memory allocator and is invoked via `vmalloc()`. This
uses pages from the buddy allocator and address space from the vmap
allocator to perform virtual memory allocations.

2. Switching the KM_SLEEP allocations to `vmalloc()` provides some
protection from deadlocks caused by internal memory fragmentation. It
would have been ideal to make all allocations virtual like they are on
Illumos. However, virtual memory allocations that require
KM_PUSHPAGE or KM_NOSLEEP semantics will receive KM_SLEEP semantics on
Linux whenever a page directory table entry must be allocated, which is
unsafe. We are therefore forced to use physical memory for KM_PUSHPAGE
and KM_NOSLEEP allocations. That is suboptimal from the perspective of
reducing internal memory fragmentation, but we still partially benefit
by mapping KM_SLEEP allocations to `vmalloc()`.

A caveat that arises from replacing `kmalloc()` with `vmalloc()` is that
code using Linux's wake_up_bit() should use the native Linux allocators.
This has no equivalent on Solaris.  While it might seem fragile at first
glance, that is not the case for three reasons:

1. Linux's locking structures use atomic instructions and churn in them
is rare. When churn does occur, there is an incredible amount of
pressure to maintain both the size of the structure and remain backward
compatible precisely because changes to locking structures can cause
unanticipated issues that are hard to debug.

2. The incompatibility arises because `wait_on_bit()` does a hashtable
lookup that assumes that `wait_on_bit()` is never called on virtual
memory. This hashtable lookup should be more expensive than conventional
locking because it involves more memory accesses than an atomic
instruction for taking a mutex, so it will never be used inside one of
the locking primitives that we map to the Solaris ones.

3. The kernel will print a backtrace to dmesg whenever wait_on_bit() is
used inside memory from vmalloc(), so any problems that arise would
appear very quickly in buildbots.

Consequently, it is reasonable to expect allocations intended for
structures that use `wake_up_bit()` to be done using the Linux
allocator. At present, the only allocations for such structures are done
inside the SPL SLAB allocator and for super blocks. No other code uses
it or is likely to use it.

These changes appear to create the most semantically equivalent mapping
possible on Linux. The result is the elimination of concerns regarding
proper use of generic interfaces when writing portable code, which posed
problems for the development of things like sgbuf.

A couple of additional changes worth noting are:

1. The kmem_alloc_node() interface has been removed. It has no external
consumers and does not exist on Solaris.

2. sys/vmem.h has been introduced as an alias of sys/kmem.h for Illumos
compatibility.

Signed-off-by: Richard Yao <ryao@gentoo.org>
If a SLAB cache is full and two allocations occur from the same SLAB
cache nearly simultaneously where one is KM_SLEEP and another is either
KM_PUSHPAGE or KM_NOSLEEP, the one that occurs first will dictate the
KM_FLAGS used for SLAB growth for both of them. This is a race condition
that at best hurts performance and at worse, causes deadlocks.

We address this by modifying `spl_cache_grow()` to only provide the
emergency allocation semantics to KM_PUSHPAGE allocations, with KM_SLEEP
allocations being coalesced and KM_NOSLEEP allocations failing
immediately.

Signed-off-by: Richard Yao <ryao@gentoo.org>
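
The per-caller flag isolation described in the commit message above can be modeled as a small policy function. `grow_policy()` and its return values are hypothetical illustrations of the described behavior, not the actual `spl_cache_grow()` code:

```c
#include <assert.h>

/* Stand-in flag values for illustration. */
#define KM_NOSLEEP   0x0001
#define KM_PUSHPAGE  0x0004

enum grow_action { GROW_START, GROW_WAIT, GROW_FAIL, GROW_EMERGENCY };

/* Decide per caller, instead of letting whichever allocation raced in
 * first dictate KM_* semantics for everyone: KM_PUSHPAGE gets the
 * emergency path, KM_NOSLEEP fails fast, and KM_SLEEP callers coalesce
 * behind an in-flight grow. */
static enum grow_action
grow_policy(int flags, int grow_in_progress)
{
	if (flags & KM_PUSHPAGE)
		return (GROW_EMERGENCY);
	if (flags & KM_NOSLEEP)
		return (GROW_FAIL);
	return (grow_in_progress ? GROW_WAIT : GROW_START);
}

int
main(void)
{
	/* A KM_SLEEP caller coalesces behind an in-flight grow... */
	assert(grow_policy(0, 1) == GROW_WAIT);
	assert(grow_policy(0, 0) == GROW_START);
	/* ...while KM_PUSHPAGE and KM_NOSLEEP keep their own semantics. */
	assert(grow_policy(KM_PUSHPAGE, 1) == GROW_EMERGENCY);
	assert(grow_policy(KM_NOSLEEP, 1) == GROW_FAIL);
	return (0);
}
```
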
The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
that is used to mark transactions so that the translation of the IRIX
kmem flags into Linux gfp flags for allocations inside of transactions
will dip into kernel memory reserves to avoid deadlocks during
writeback. Linux 3.9 provided the additional PF_MEMALLOC_NOIO for
disabling __GFP_IO in page allocations, which XFS began using in 3.15.

This patch implements hooks for marking transactions via PF_FSTRANS.
When an allocation is performed within the context of PF_FSTRANS, any
KM_SLEEP allocation is transparently converted into a KM_PUSHPAGE
allocation. It will also set PF_MEMALLOC_NOIO to prevent direct reclaim
from entering `pageout()` on any KM_PUSHPAGE or KM_NOSLEEP allocation on
Linux 3.9 or later.

Signed-off-by: Richard Yao <ryao@gentoo.org>
The comment above the Linux 3.17 kernel's clear_bit() states:

/**
 * clear_bit - Clears a bit in memory
 * @nr: Bit to clear
 * @addr: Address to start counting from
 *
 * clear_bit() is atomic and may not be reordered.  However, it does
 * not contain a memory barrier, so if it is used for locking purposes,
 * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
 * in order to ensure changes are visible on other processors.
 */

This comment does not make sense in the context of x86 because x86 maps these
operations to barrier(), which is a compiler barrier. However, it does make
sense to me when I consider architectures that reorder around atomic
instructions. In such situations, a processor is allowed to execute the
wake_up_bit() before clear_bit() and we have a race. There are a few
architectures that suffer from this issue:

http://lxr.free-electrons.com/source/arch/arm/include/asm/barrier.h?v=3.16#L83
http://lxr.free-electrons.com/source/arch/arm64/include/asm/barrier.h?v=3.16#L102
http://lxr.free-electrons.com/source/arch/mips/include/asm/barrier.h?v=3.16#L199
http://lxr.free-electrons.com/source/arch/powerpc/include/asm/barrier.h?v=3.16#L88
http://lxr.free-electrons.com/source/arch/s390/include/asm/barrier.h?v=3.16#L32
http://lxr.free-electrons.com/source/arch/tile/include/asm/barrier.h?v=3.16#L83
https://en.wikipedia.org/wiki/Memory_ordering

In such situations, the other processor would wake-up, see the bit is still
taken and go to sleep, while the one responsible for waking it up will assume
that it did its job and continue.

https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h#L100

It is important to note that
smp_mb__{before,after}_{atomic,clear}_{dec,inc,bit}(), were replaced by
smp_mb__{before,after}_atomic() in recent kernels:

torvalds/linux@febdbfe

Some compatibility code was added to replace it for the time being, although it
does not interact well with -Werror:

http://www.spinics.net/lists/backports/msg02669.html
http://lxr.free-electrons.com/source/include/linux/bitops.h?v=3.16#L48

In addition, the kernel's code paths use clear_bit_unlock() in situations
where clear_bit() is used for unlocking. This adds smp_mb__before_atomic(),
which I assume is for Alpha.

This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to
smp_mb__{before,after}_clear_bit() on older kernels and changes our code to
leverage it in a manner consistent with the mainline kernel.

Signed-off-by: Richard Yao <ryao@gentoo.org>
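
The wrapper described above can be sketched as a header shim; note that the `#ifndef` guard is only an illustration, since in practice the new barrier API may be a static inline rather than a macro and must be detected with a configure check instead of the preprocessor:

```c
/* If the kernel does not provide smp_mb__{before,after}_atomic()
 * (it appeared around Linux 3.16), fall back to the older
 * bit-operation-specific barriers it replaced. */
#ifndef smp_mb__before_atomic
#define	smp_mb__before_atomic()	smp_mb__before_clear_bit()
#define	smp_mb__after_atomic()	smp_mb__after_clear_bit()
#endif
```
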
The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to cleanup the code.

Signed-off-by: Richard Yao <ryao@gentoo.org>
@ryao
Contributor Author

ryao commented Nov 3, 2014

I have pushed a refresh. There are two known issues at this time:

  1. The issue with kmem leak accounting being off has not been addressed.
  2. splat kmem test 13 causes the OOM killer to trigger.

I suspect that the kmem changes exposed some subtle issues that were hidden by the earlier kmem code. I am going to work on these issues some more today, but from tomorrow onward, I need to switch to working on the libzfs_core extensions for work.

@behlendorf
Contributor

@ryao regarding the OOM killer we should avoid disabling it in such a common code path. We want it to be able to always kill runaway user processes and disabling it should not be necessary.

@behlendorf behlendorf mentioned this pull request Nov 21, 2014
ryao added a commit to ryao/spl that referenced this pull request Nov 29, 2014
@behlendorf
Contributor

Closing. The kmem rework was merged some time ago.

@behlendorf behlendorf closed this Feb 2, 2015