This repository has been archived by the owner on Feb 26, 2020. It is now read-only.

Kmem rework #369

Closed
wants to merge 8 commits into from

Commits on Nov 3, 2014

  1. Use Linux kzalloc to allocate kmem_cache_t objects

    wake_up_bit() is called on a word inside kmem_cache_t objects.
    Internally, that performs virt_to_page() on the word's address, which is
    incompatible with virtual memory, so we must switch to Linux's physical
    memory allocator for these objects.
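    The change can be illustrated with a small userspace sketch. The structure layout and `_stub` helpers below are hypothetical, not the SPL's actual code; only the idea is from the commit: allocate kmem_cache_t objects from zeroed, physically contiguous memory (kzalloc() in the kernel, mimicked here with calloc()) so that wake_up_bit()'s virt_to_page() works on the embedded flags word.

```c
#include <stdlib.h>

/* Userspace stand-in for the kernel's kzalloc(): zeroed, physically
 * contiguous memory.  The real call is kzalloc(size, GFP_KERNEL);
 * wake_up_bit()'s virt_to_page() only works on memory like this,
 * not on vmalloc()ed memory. */
static void *kzalloc_stub(size_t size)
{
    return calloc(1, size);
}

/* Simplified stand-in for kmem_cache_t: skc_flags is the word that
 * wait_on_bit()/wake_up_bit() would operate on in the real SPL. */
typedef struct kmem_cache {
    unsigned long skc_flags;    /* bit waited on by wake_up_bit() */
    size_t        skc_obj_size;
} kmem_cache_t;

static kmem_cache_t *kmem_cache_create_stub(size_t obj_size)
{
    kmem_cache_t *skc = kzalloc_stub(sizeof (*skc));
    if (skc != NULL)
        skc->skc_obj_size = obj_size;
    return (skc);
}
```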
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 3743a7d
  2. Revert "Add PF_NOFS debugging flag"

    This reverts commit eb0f407.
    ryao committed Nov 3, 2014
    Commit 7de7607
  3. Refactor generic memory allocation interfaces

    This patch achieves the following goals:
    
    1. It replaces the preprocessor kmem-flag-to-gfp-flag mapping with
    proper translation logic. This eliminates the surprises that were
    previously possible when kmem flags were blindly mapped to gfp flags.
    
    2. It redirects `kmem_{,z}alloc()` KM_SLEEP allocations, previously
    served by `kmalloc()`, to `vmalloc()` to reduce internal memory
    fragmentation.
    
    3. It discards the distinction between vmem_* and kmem_* that was
    previously made by mapping them to vmalloc() and kmalloc() respectively.
    This achieves better compatibility because, on Solaris, kmem_*
    allocations are done from slabs that are themselves allocated from
    vmem_*. Both are therefore virtual memory allocators, and it makes no
    sense to implement them differently from one another.
    
    The detailed reasons for each are as follows:
    
    1. Solaris derivatives have different allocation flag semantics than Linux.
    This was originally handled by trying to map Solaris flags to Linux
    flags, but the semantics are different enough that this approach does
    not correctly handle all cases. For example, 0 is KM_SLEEP on Solaris
    derivatives while 0 is illegal on Linux. This means that assertions that
    make assumptions about the flag semantics are not portable: a reasonable
    assertion such as `ASSERT0(flags)` on Solaris derivatives is illegal on
    Linux. In addition, a trivial mapping allows us to freely mix and match
    flags from the two systems. This is bad for portability, and it can lead
    to unexpected consequences when a clash between semantics means that one
    expects one system's semantics and receives another's.
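    The difference can be made concrete with a sketch of explicit translation logic. This is an illustrative userspace fragment, not the SPL's actual code: the flag values and the `_X`-suffixed gfp names are invented for the example; only the decision structure reflects the approach described above, where every kmem flag is decided explicitly instead of trusting that bit patterns line up.

```c
/* Hypothetical Solaris-style kmem flags.  Note KM_SLEEP is 0, which
 * is exactly why a naive 1:1 preprocessor mapping to gfp flags breaks. */
#define KM_SLEEP    0x0000
#define KM_NOSLEEP  0x0001
#define KM_PUSHPAGE 0x0004

/* Hypothetical gfp-style flags, for illustration only. */
#define GFP_NOWAIT_X 0x01
#define GFP_KERNEL_X 0x02
#define GFP_NOIO_X   0x04
#define GFP_HIGH_X   0x08

/* Explicit translation: each Solaris semantic is mapped deliberately,
 * so KM_SLEEP == 0 never silently becomes an illegal gfp value. */
static int kmem_flags_convert(int kmflags)
{
    if (kmflags & KM_NOSLEEP)
        return (GFP_NOWAIT_X);            /* may fail, never blocks */
    if (kmflags & KM_PUSHPAGE)
        return (GFP_NOIO_X | GFP_HIGH_X); /* must not enter writeback */
    return (GFP_KERNEL_X);                /* KM_SLEEP == 0: may block */
}
```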
    
    2. The SPL originally mapped kmem_alloc() to kmalloc() and
    vmem_alloc() to vmalloc(). One would be inclined to think this is
    correct by applying the reasonable expectation that similarly named
    things on each platform are similar things. However, that is not the
    case here. On Solaris, vmem_* is a general purpose arena allocator that
    performs kernel virtual memory allocations. The Solaris SLAB allocator
    `kmem_cache_alloc()` operates by allocating slabs from vmem and
    returning objects. Allocations from kmem_alloc() work by performing
    HOARD-style allocations from pre-existing power-of-2 SLAB caches. When
    mapping uses of these allocators to Linux equivalents, we must consider
    4 allocators on Linux and how they interact:
    
    1. The buddy allocator
    2. The slab allocator
    3. The vmap allocator
    4. The kernel virtual memory allocator
    
    The buddy allocator is used for allocating both pages and the slabs
    backing Linux's slab allocator. Those slabs provide the generic
    power-of-2 caches to which kmalloc() is mapped; allocations larger than
    the largest power of 2 are sent directly to the buddy allocator. This
    is analogous to kmem_cache_alloc() and kmem_alloc() on Solaris. The
    third allocator is the vmap allocator, which concerns itself with
    allocating address space. The fourth allocator is the kernel virtual
    memory allocator, invoked via `vmalloc()`, which uses pages from the
    buddy allocator and address space from the vmap allocator to perform
    virtual memory allocations.
    
    3. Switching the KM_SLEEP allocations to `vmalloc()` provides some
    protection from deadlocks caused by internal memory fragmentation. It
    would have been ideal to make all allocations virtual, as they are on
    Illumos. However, virtual memory allocations that require KM_PUSHPAGE
    or KM_NOSLEEP semantics will receive KM_SLEEP semantics on Linux
    whenever a page table entry must be allocated, which is unsafe. We are
    therefore forced to use physical memory for KM_PUSHPAGE and KM_NOSLEEP
    allocations. That is suboptimal from the perspective of reducing
    internal memory fragmentation, but we still partially benefit by
    mapping KM_SLEEP allocations to `vmalloc()`.
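    The resulting dispatch rule is small enough to sketch. This is an illustrative decision function, not the SPL's actual code; the flag values are hypothetical, and only the rule from the paragraph above is encoded: KM_SLEEP may go through vmalloc(), while KM_PUSHPAGE and KM_NOSLEEP must stay on physical memory because vmalloc() can sleep when it has to allocate page tables.

```c
#include <stdbool.h>

/* Hypothetical Solaris-style kmem flags (KM_SLEEP is 0). */
#define KM_SLEEP    0x0000
#define KM_NOSLEEP  0x0001
#define KM_PUSHPAGE 0x0004

/* Decision logic only: may this allocation be served by vmalloc()?
 * Only KM_SLEEP callers can tolerate vmalloc() sleeping on page
 * table allocation, so only they are routed to virtual memory. */
static bool use_vmalloc(int kmflags)
{
    return ((kmflags & (KM_NOSLEEP | KM_PUSHPAGE)) == 0);
}
```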
    
    A caveat that arises from replacing `kmalloc()` with `vmalloc()` is
    that code using Linux's wake_up_bit() must be allocated with the native
    Linux allocators; wake_up_bit() has no equivalent on Solaris. While
    this might seem fragile at first glance, it is not, for three reasons:
    
    1. Linux's locking structures use atomic instructions, and churn in
    them is rare. When churn does occur, there is enormous pressure to keep
    both the size of the structures and backward compatibility, precisely
    because changes to locking structures can cause unanticipated issues
    that are hard to debug.
    
    2. The incompatibility arises because `wait_on_bit()` does a hashtable
    lookup that assumes it is never called on virtual memory. This
    hashtable lookup should be more expensive than conventional locking
    because it involves more memory accesses than an atomic instruction for
    taking a mutex, so it will never be used inside one of the locking
    primitives that we map to the Solaris ones.
    
    3. The kernel will print a backtrace to dmesg whenever wait_on_bit() is
    used on memory from vmalloc(), so any problems that arise would
    appear very quickly in buildbots.
    
    Consequently, it is reasonable to expect allocations intended for
    structures that use `wake_up_bit()` to be done using the Linux
    allocator. At present, the only allocations for such structures are done
    inside the SPL SLAB allocator and for super blocks. No other code uses
    it or is likely to use it.
    
    These changes appear to create the most semantically equivalent mapping
    possible on Linux. The result is the elimination of concerns regarding
    proper use of generic interfaces when writing portable code, which posed
    problems for the development of things like sgbuf.
    
    A couple of additional changes worth noting are:
    
    1. The kmem_alloc_node() interface has been removed. It has no external
    consumers and does not exist on Solaris.
    
    2. sys/vmem.h has been introduced as an alias of sys/kmem.h for Illumos
    compatibility.
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 05fb716
  4. KM_SLEEP SLAB growth should never affect high priority allocations

    If a SLAB cache is full and two allocations occur from the same SLAB
    cache nearly simultaneously, where one is KM_SLEEP and the other is
    either KM_PUSHPAGE or KM_NOSLEEP, the one that occurs first will
    dictate the KM_FLAGS used for SLAB growth for both of them. This is a
    race condition that at best hurts performance and at worst causes
    deadlocks.
    
    We address this by modifying `spl_cache_grow()` to only provide the
    emergency allocation semantics to KM_PUSHPAGE allocations, with KM_SLEEP
    allocations being coalesced and KM_NOSLEEP allocations failing
    immediately.
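    The per-caller semantics can be sketched as a decision function. This is illustrative only; the enum, flag values, and function name are invented for the example. The point it encodes is the fix described above: each caller's own flags decide its growth behavior, so the first caller's flags no longer leak into concurrent growers.

```c
/* Hypothetical Solaris-style kmem flags (KM_SLEEP is 0). */
#define KM_SLEEP    0x0000
#define KM_NOSLEEP  0x0001
#define KM_PUSHPAGE 0x0004

/* Hypothetical outcome of a grow request on a full cache. */
typedef enum {
    GROW_EMERGENCY, /* KM_PUSHPAGE: allocate an emergency object now */
    GROW_WAIT,      /* KM_SLEEP: coalesce with an in-flight grow */
    GROW_FAIL       /* KM_NOSLEEP: fail immediately */
} grow_action_t;

/* Each caller is classified by its own flags, independently of any
 * concurrent caller, eliminating the race described above. */
static grow_action_t cache_grow_action(int kmflags)
{
    if (kmflags & KM_PUSHPAGE)
        return (GROW_EMERGENCY);
    if (kmflags & KM_NOSLEEP)
        return (GROW_FAIL);
    return (GROW_WAIT); /* KM_SLEEP */
}
```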
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 97927f5
  5. Add hooks for disabling direct reclaim inside DMU transactions

    The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
    that is used to mark transactions, so that the translation of the IRIX
    kmem flags into Linux gfp flags for allocations inside of transactions
    will dip into kernel memory reserves to avoid deadlocks during
    writeback. Linux 3.9 provided the additional PF_MEMALLOC_NOIO for
    disabling __GFP_IO in page allocations, which XFS began using in 3.15.
    
    This patch implements hooks for marking transactions via PF_FSTRANS.
    When an allocation is performed within the context of PF_FSTRANS, any
    KM_SLEEP allocation is transparently converted into a KM_PUSHPAGE
    allocation. It will also set PF_MEMALLOC_NOIO to prevent direct reclaim
    from entering `pageout()` on any KM_PUSHPAGE or KM_NOSLEEP allocation on
    Linux 3.9 or later.
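    The flag upgrade can be sketched as a pure function. This is an illustrative fragment with invented flag values (including the `PF_FSTRANS_X` task-flag bit); only the rule from the commit message is encoded: inside a marked transaction, KM_SLEEP is silently upgraded to KM_PUSHPAGE so the allocation dips into reserves instead of recursing into writeback and deadlocking.

```c
/* Hypothetical Solaris-style kmem flags (KM_SLEEP is 0). */
#define KM_SLEEP     0x0000
#define KM_NOSLEEP   0x0001
#define KM_PUSHPAGE  0x0004

/* Hypothetical task-flag bit standing in for PF_FSTRANS. */
#define PF_FSTRANS_X 0x0100

/* Inside a PF_FSTRANS-marked transaction, upgrade KM_SLEEP to
 * KM_PUSHPAGE; KM_NOSLEEP and KM_PUSHPAGE pass through unchanged. */
static int kmem_flags_in_context(int task_flags, int kmflags)
{
    if ((task_flags & PF_FSTRANS_X) &&
        (kmflags & (KM_NOSLEEP | KM_PUSHPAGE)) == 0)
        return (KM_PUSHPAGE);
    return (kmflags);
}
```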
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 0154bbe
  6. Revert "Linux 3.16 compat: smp_mb__after_clear_bit()"

    This reverts commit e302072.
    ryao committed Nov 3, 2014
    Commit 291a4b8
  7. spl-kmem: Enforce architecture-specific barriers around clear_bit()

    The comment above the Linux 3.17 kernel's clear_bit() states:
    
    /**
     * clear_bit - Clears a bit in memory
     * @nr: Bit to clear
     * @addr: Address to start counting from
     *
     * clear_bit() is atomic and may not be reordered.  However, it does
     * not contain a memory barrier, so if it is used for locking purposes,
     * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
     * in order to ensure changes are visible on other processors.
     */
    
    This comment does not make sense in the context of x86, because x86 maps
    these operations to barrier(), which is a compiler barrier. However, it
    does make sense when one considers architectures that reorder around
    atomic instructions. On such architectures, a processor is allowed to
    execute the wake_up_bit() before the clear_bit(), and we have a race.
    A few architectures suffer from this issue:
    
    http://lxr.free-electrons.com/source/arch/arm/include/asm/barrier.h?v=3.16#L83
    http://lxr.free-electrons.com/source/arch/arm64/include/asm/barrier.h?v=3.16#L102
    http://lxr.free-electrons.com/source/arch/mips/include/asm/barrier.h?v=3.16#L199
    http://lxr.free-electrons.com/source/arch/powerpc/include/asm/barrier.h?v=3.16#L88
    http://lxr.free-electrons.com/source/arch/s390/include/asm/barrier.h?v=3.16#L32
    http://lxr.free-electrons.com/source/arch/tile/include/asm/barrier.h?v=3.16#L83
    https://en.wikipedia.org/wiki/Memory_ordering
    
    In such situations, the other processor would wake up, see the bit is
    still set and go back to sleep, while the one responsible for waking it
    up assumes that it did its job and continues.
    
    https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h#L100
    
    It is important to note that
    smp_mb__{before,after}_{atomic,clear}_{dec,inc,bit}() were replaced by
    smp_mb__{before,after}_atomic() in recent kernels:
    
    torvalds/linux@febdbfe
    
    Some compatibility code was added to replace them in the interim,
    although it does not interact well with -Werror:
    
    http://www.spinics.net/lists/backports/msg02669.html
    http://lxr.free-electrons.com/source/include/linux/bitops.h?v=3.16#L48
    
    In addition, the kernel's code paths use clear_bit_unlock() in
    situations where clear_bit() is used for unlocking. This adds
    smp_mb__before_atomic(), which I assume is for Alpha.
    
    This patch implements a wrapper that maps smp_mb__{before,after}_atomic()
    to smp_mb__{before,after}_clear_bit() on older kernels and changes our
    code to leverage it in a manner consistent with the mainline kernel.
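    The shape of the shim can be sketched in compilable userspace C. The `_stub` helpers and the mapping of the barriers to `__sync_synchronize()` are stand-ins so the fragment builds outside the kernel; only the `#ifndef` compat pattern and the barrier placement around the unlocking clear_bit() reflect what the patch does.

```c
/* Userspace stand-ins for the old kernel barrier names. */
#define smp_mb__before_clear_bit() __sync_synchronize()
#define smp_mb__after_clear_bit()  __sync_synchronize()

/* Compat shim: on kernels that predate the rename, provide the new
 * names in terms of the old ones. */
#ifndef smp_mb__before_atomic
#define smp_mb__before_atomic() smp_mb__before_clear_bit()
#endif
#ifndef smp_mb__after_atomic
#define smp_mb__after_atomic()  smp_mb__after_clear_bit()
#endif

/* Atomic clear_bit() stand-in built on a GCC/Clang builtin. */
static void clear_bit_stub(int nr, unsigned long *addr)
{
    __sync_fetch_and_and(addr, ~(1UL << nr));
}

/* Unlock-style use: the barriers order the bit clear with respect to
 * the wakeup that follows, which matters on weakly ordered CPUs. */
static void unlock_bit(int nr, unsigned long *addr)
{
    smp_mb__before_atomic();
    clear_bit_stub(nr, addr);
    smp_mb__after_atomic();
    /* wake_up_bit(addr, nr) would follow here in the kernel. */
}
```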
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 340b095
  8. Retire kmem_virt()

    The initial port of ZFS to Linux required a way to identify virtual
    memory to make IO to virtual memory backed slabs work, so kmem_virt()
    was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
    logically equivalent to kmem_virt(). Support for kernels before 2.6.26
    was later dropped, and more recently, support for kernels before Linux
    2.6.32 has been dropped. We retire kmem_virt() in favor of
    is_vmalloc_addr() to clean up the code.
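    What both helpers boil down to is a range check against the vmalloc address window. The sketch below is illustrative only: the `_X` range constants are invented (the real VMALLOC_START/VMALLOC_END are architecture specific), and addresses are modeled as 64-bit integers rather than pointers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical vmalloc address range, for illustration; the real
 * VMALLOC_START/VMALLOC_END are architecture specific. */
#define VMALLOC_START_X UINT64_C(0xffffc90000000000)
#define VMALLOC_END_X   UINT64_C(0xffffe8ffffffffff)

/* What kmem_virt()/is_vmalloc_addr() amount to: does this address
 * fall inside the virtually mapped region?  IO paths use this to
 * decide how to map memory backed by virtual-memory slabs. */
static bool is_vmalloc_addr_stub(uint64_t addr)
{
    return (addr >= VMALLOC_START_X && addr < VMALLOC_END_X);
}
```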
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 723c8bb