This repository has been archived by the owner on Feb 26, 2020. It is now read-only.

Kmem rework #369

Closed
wants to merge 8 commits into from

Commits on Nov 3, 2014

  1. Use Linux kzalloc to allocate kmem_cache_t objects

    wake_up_bit() is called on a word inside kmem_cache_t objects.
    Internally, that performs virt_to_page() on the word's address, which is
    incompatible with virtual memory, so we must switch to Linux's physical
    memory allocator for these objects.
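    The change can be illustrated with a small userspace sketch. The structure layout and `_stub` helpers below are hypothetical, not the SPL's actual code; only the idea is from the commit: allocate kmem_cache_t objects from zeroed, physically contiguous memory (kzalloc() in the kernel, mimicked here with calloc()) so that wake_up_bit()'s virt_to_page() works on the embedded flags word.

```c
#include <stdlib.h>

/* Userspace stand-in for the kernel's kzalloc(): zeroed, physically
 * contiguous memory.  The real call is kzalloc(size, GFP_KERNEL);
 * wake_up_bit()'s virt_to_page() only works on memory like this,
 * not on vmalloc()ed memory. */
static void *kzalloc_stub(size_t size)
{
    return calloc(1, size);
}

/* Simplified stand-in for kmem_cache_t: skc_flags is the word that
 * wait_on_bit()/wake_up_bit() would operate on in the real SPL. */
typedef struct kmem_cache {
    unsigned long skc_flags;    /* bit waited on by wake_up_bit() */
    size_t        skc_obj_size;
} kmem_cache_t;

static kmem_cache_t *kmem_cache_create_stub(size_t obj_size)
{
    kmem_cache_t *skc = kzalloc_stub(sizeof (*skc));
    if (skc != NULL)
        skc->skc_obj_size = obj_size;
    return (skc);
}
```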
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 3743a7d
  2. Revert "Add PF_NOFS debugging flag"

    This reverts commit eb0f407.
    ryao committed Nov 3, 2014
    Commit 7de7607
  3. Refactor generic memory allocation interfaces

    This patch achieves the following goals:
    
    1. It replaces the preprocessor kmem-flag-to-gfp-flag mapping with
    proper translation logic. This eliminates the surprises that were
    previously possible when kmem flags were blindly mapped to gfp flags.
    
    2. It redirects `kmem_{,z}alloc()` KM_SLEEP allocations, previously
    served by `kmalloc()`, to `vmalloc()` to reduce internal memory
    fragmentation.
    
    3. It discards the distinction between vmem_* and kmem_* that was
    previously made by mapping them to vmalloc() and kmalloc() respectively.
    This achieves better compatibility because, on Solaris, kmem_*
    allocations are done from slabs that are themselves allocated from
    vmem_*. Both are therefore virtual memory allocators, and it makes no
    sense to implement them differently from one another.
    
    The detailed reasons for each are as follows:
    
    1. Solaris derivatives have different allocation flag semantics than Linux.
    This was originally handled by trying to map Solaris flags to Linux
    flags, but the semantics are different enough that this approach does
    not correctly handle all cases. For example, 0 is KM_SLEEP on Solaris
    derivatives while 0 is illegal on Linux. This means that assertions that
    make assumptions about the flag semantics are not portable: a reasonable
    assertion such as `ASSERT0(flags)` on Solaris derivatives is illegal on
    Linux. In addition, a trivial mapping allows us to freely mix and match
    flags from the two systems. This is bad for portability, and it can lead
    to unexpected consequences when a clash between semantics means that one
    expects one system's semantics and receives another's.
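    The difference can be made concrete with a sketch of explicit translation logic. This is an illustrative userspace fragment, not the SPL's actual code: the flag values and the `_X`-suffixed gfp names are invented for the example; only the decision structure reflects the approach described above, where every kmem flag is decided explicitly instead of trusting that bit patterns line up.

```c
/* Hypothetical Solaris-style kmem flags.  Note KM_SLEEP is 0, which
 * is exactly why a naive 1:1 preprocessor mapping to gfp flags breaks. */
#define KM_SLEEP    0x0000
#define KM_NOSLEEP  0x0001
#define KM_PUSHPAGE 0x0004

/* Hypothetical gfp-style flags, for illustration only. */
#define GFP_NOWAIT_X 0x01
#define GFP_KERNEL_X 0x02
#define GFP_NOIO_X   0x04
#define GFP_HIGH_X   0x08

/* Explicit translation: each Solaris semantic is mapped deliberately,
 * so KM_SLEEP == 0 never silently becomes an illegal gfp value. */
static int kmem_flags_convert(int kmflags)
{
    if (kmflags & KM_NOSLEEP)
        return (GFP_NOWAIT_X);            /* may fail, never blocks */
    if (kmflags & KM_PUSHPAGE)
        return (GFP_NOIO_X | GFP_HIGH_X); /* must not enter writeback */
    return (GFP_KERNEL_X);                /* KM_SLEEP == 0: may block */
}
```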
    
    2. The SPL originally mapped kmem_alloc() to kmalloc() and
    vmem_alloc() to vmalloc(). One would be inclined to think this is
    correct by applying the reasonable expectation that similarly named
    things on each platform are similar things. However, that is not the
    case here. On Solaris, vmem_* is a general purpose arena allocator that
    performs kernel virtual memory allocations. The Solaris SLAB allocator
    `kmem_cache_alloc()` operates by allocating slabs from vmem and
    returning objects. Allocations from kmem_alloc() work by performing
    HOARD-style allocations from pre-existing power-of-2 SLAB caches. When
    mapping uses of these allocators to Linux equivalents, we must consider
    4 allocators on Linux and how they interact:
    
    1. The buddy allocator
    2. The slab allocator
    3. The vmap allocator
    4. The kernel virtual memory allocator
    
    The buddy allocator is used for allocating both pages and the slabs
    backing Linux's slab allocator. Those slabs provide the generic
    power-of-2 caches to which kmalloc() is mapped; allocations larger than
    the largest power of 2 are sent directly to the buddy allocator. This
    is analogous to kmem_cache_alloc() and kmem_alloc() on Solaris. The
    third allocator is the vmap allocator, which concerns itself with
    allocating address space. The fourth allocator is the kernel virtual
    memory allocator, invoked via `vmalloc()`, which uses pages from the
    buddy allocator and address space from the vmap allocator to perform
    virtual memory allocations.
    
    3. Switching the KM_SLEEP allocations to `vmalloc()` provides some
    protection from deadlocks caused by internal memory fragmentation. It
    would have been ideal to make all allocations virtual, as they are on
    Illumos. However, virtual memory allocations that require KM_PUSHPAGE
    or KM_NOSLEEP semantics will receive KM_SLEEP semantics on Linux
    whenever a page table entry must be allocated, which is unsafe. We are
    therefore forced to use physical memory for KM_PUSHPAGE and KM_NOSLEEP
    allocations. That is suboptimal from the perspective of reducing
    internal memory fragmentation, but we still partially benefit by
    mapping KM_SLEEP allocations to `vmalloc()`.
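    The resulting dispatch rule is small enough to sketch. This is an illustrative decision function, not the SPL's actual code; the flag values are hypothetical, and only the rule from the paragraph above is encoded: KM_SLEEP may go through vmalloc(), while KM_PUSHPAGE and KM_NOSLEEP must stay on physical memory because vmalloc() can sleep when it has to allocate page tables.

```c
#include <stdbool.h>

/* Hypothetical Solaris-style kmem flags (KM_SLEEP is 0). */
#define KM_SLEEP    0x0000
#define KM_NOSLEEP  0x0001
#define KM_PUSHPAGE 0x0004

/* Decision logic only: may this allocation be served by vmalloc()?
 * Only KM_SLEEP callers can tolerate vmalloc() sleeping on page
 * table allocation, so only they are routed to virtual memory. */
static bool use_vmalloc(int kmflags)
{
    return ((kmflags & (KM_NOSLEEP | KM_PUSHPAGE)) == 0);
}
```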
    
    A caveat that arises from replacing `kmalloc()` with `vmalloc()` is
    that code using Linux's wake_up_bit() must be allocated with the native
    Linux allocators; wake_up_bit() has no equivalent on Solaris. While
    this might seem fragile at first glance, it is not, for three reasons:
    
    1. Linux's locking structures use atomic instructions, and churn in
    them is rare. When churn does occur, there is enormous pressure to keep
    both the size of the structures and backward compatibility, precisely
    because changes to locking structures can cause unanticipated issues
    that are hard to debug.
    
    2. The incompatibility arises because `wait_on_bit()` does a hashtable
    lookup that assumes it is never called on virtual memory. This
    hashtable lookup should be more expensive than conventional locking
    because it involves more memory accesses than an atomic instruction for
    taking a mutex, so it will never be used inside one of the locking
    primitives that we map to the Solaris ones.
    
    3. The kernel will print a backtrace to dmesg whenever wait_on_bit() is
    used on memory from vmalloc(), so any problems that arise would
    appear very quickly in buildbots.
    
    Consequently, it is reasonable to expect allocations intended for
    structures that use `wake_up_bit()` to be done using the Linux
    allocator. At present, the only allocations for such structures are done
    inside the SPL SLAB allocator and for super blocks. No other code uses
    it or is likely to use it.
    
    These changes appear to create the most semantically equivalent mapping
    possible on Linux. The result is the elimination of concerns regarding
    proper use of generic interfaces when writing portable code, which posed
    problems for the development of things like sgbuf.
    
    A couple of additional changes worth noting are:
    
    1. The kmem_alloc_node() interface has been removed. It has no external
    consumers and does not exist on Solaris.
    
    2. sys/vmem.h has been introduced as an alias of sys/kmem.h for Illumos
    compatibility.
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 05fb716
  4. KM_SLEEP SLAB growth should never affect high priority allocations

    If a SLAB cache is full and two allocations occur from the same SLAB
    cache nearly simultaneously, where one is KM_SLEEP and the other is
    either KM_PUSHPAGE or KM_NOSLEEP, the one that occurs first will
    dictate the KM_FLAGS used for SLAB growth for both of them. This is a
    race condition that at best hurts performance and at worst causes
    deadlocks.
    
    We address this by modifying `spl_cache_grow()` to only provide the
    emergency allocation semantics to KM_PUSHPAGE allocations, with KM_SLEEP
    allocations being coalesced and KM_NOSLEEP allocations failing
    immediately.
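    The per-caller semantics can be sketched as a decision function. This is illustrative only; the enum, flag values, and function name are invented for the example. The point it encodes is the fix described above: each caller's own flags decide its growth behavior, so the first caller's flags no longer leak into concurrent growers.

```c
/* Hypothetical Solaris-style kmem flags (KM_SLEEP is 0). */
#define KM_SLEEP    0x0000
#define KM_NOSLEEP  0x0001
#define KM_PUSHPAGE 0x0004

/* Hypothetical outcome of a grow request on a full cache. */
typedef enum {
    GROW_EMERGENCY, /* KM_PUSHPAGE: allocate an emergency object now */
    GROW_WAIT,      /* KM_SLEEP: coalesce with an in-flight grow */
    GROW_FAIL       /* KM_NOSLEEP: fail immediately */
} grow_action_t;

/* Each caller is classified by its own flags, independently of any
 * concurrent caller, eliminating the race described above. */
static grow_action_t cache_grow_action(int kmflags)
{
    if (kmflags & KM_PUSHPAGE)
        return (GROW_EMERGENCY);
    if (kmflags & KM_NOSLEEP)
        return (GROW_FAIL);
    return (GROW_WAIT); /* KM_SLEEP */
}
```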
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 97927f5
  5. Add hooks for disabling direct reclaim inside DMU transactions

    The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
    that is used to mark transactions, so that the translation of the IRIX
    kmem flags into Linux gfp flags for allocations inside of transactions
    will dip into kernel memory reserves to avoid deadlocks during
    writeback. Linux 3.9 provided the additional PF_MEMALLOC_NOIO for
    disabling __GFP_IO in page allocations, which XFS began using in 3.15.
    
    This patch implements hooks for marking transactions via PF_FSTRANS.
    When an allocation is performed within the context of PF_FSTRANS, any
    KM_SLEEP allocation is transparently converted into a KM_PUSHPAGE
    allocation. It will also set PF_MEMALLOC_NOIO to prevent direct reclaim
    from entering `pageout()` on any KM_PUSHPAGE or KM_NOSLEEP allocation on
    Linux 3.9 or later.
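    The flag upgrade can be sketched as a pure function. This is an illustrative fragment with invented flag values (including the `PF_FSTRANS_X` task-flag bit); only the rule from the commit message is encoded: inside a marked transaction, KM_SLEEP is silently upgraded to KM_PUSHPAGE so the allocation dips into reserves instead of recursing into writeback and deadlocking.

```c
/* Hypothetical Solaris-style kmem flags (KM_SLEEP is 0). */
#define KM_SLEEP     0x0000
#define KM_NOSLEEP   0x0001
#define KM_PUSHPAGE  0x0004

/* Hypothetical task-flag bit standing in for PF_FSTRANS. */
#define PF_FSTRANS_X 0x0100

/* Inside a PF_FSTRANS-marked transaction, upgrade KM_SLEEP to
 * KM_PUSHPAGE; KM_NOSLEEP and KM_PUSHPAGE pass through unchanged. */
static int kmem_flags_in_context(int task_flags, int kmflags)
{
    if ((task_flags & PF_FSTRANS_X) &&
        (kmflags & (KM_NOSLEEP | KM_PUSHPAGE)) == 0)
        return (KM_PUSHPAGE);
    return (kmflags);
}
```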
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 0154bbe
  6. Revert "Linux 3.16 compat: smp_mb__after_clear_bit()"

    This reverts commit e302072.
    ryao committed Nov 3, 2014
    Commit 291a4b8
  7. spl-kmem: Enforce architecture-specific barriers around clear_bit()

    The comment above the Linux 3.17 kernel's clear_bit() states:
    
    /**
     * clear_bit - Clears a bit in memory
     * @nr: Bit to clear
     * @addr: Address to start counting from
     *
     * clear_bit() is atomic and may not be reordered.  However, it does
     * not contain a memory barrier, so if it is used for locking purposes,
     * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
     * in order to ensure changes are visible on other processors.
     */
    
    This comment does not make sense in the context of x86, because x86 maps
    these operations to barrier(), which is a compiler barrier. However, it
    does make sense when one considers architectures that reorder around
    atomic instructions. On such architectures, a processor is allowed to
    execute the wake_up_bit() before the clear_bit(), and we have a race.
    A few architectures suffer from this issue:
    
    http://lxr.free-electrons.com/source/arch/arm/include/asm/barrier.h?v=3.16#L83
    http://lxr.free-electrons.com/source/arch/arm64/include/asm/barrier.h?v=3.16#L102
    http://lxr.free-electrons.com/source/arch/mips/include/asm/barrier.h?v=3.16#L199
    http://lxr.free-electrons.com/source/arch/powerpc/include/asm/barrier.h?v=3.16#L88
    http://lxr.free-electrons.com/source/arch/s390/include/asm/barrier.h?v=3.16#L32
    http://lxr.free-electrons.com/source/arch/tile/include/asm/barrier.h?v=3.16#L83
    https://en.wikipedia.org/wiki/Memory_ordering
    
    In such situations, the other processor would wake up, see the bit is
    still set and go back to sleep, while the one responsible for waking it
    up assumes that it did its job and continues.
    
    https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h#L100
    
    It is important to note that
    smp_mb__{before,after}_{atomic,clear}_{dec,inc,bit}() were replaced by
    smp_mb__{before,after}_atomic() in recent kernels:
    
    torvalds/linux@febdbfe
    
    Some compatibility code was added to replace them in the interim,
    although it does not interact well with -Werror:
    
    http://www.spinics.net/lists/backports/msg02669.html
    http://lxr.free-electrons.com/source/include/linux/bitops.h?v=3.16#L48
    
    In addition, the kernel's code paths use clear_bit_unlock() in
    situations where clear_bit() is used for unlocking. This adds
    smp_mb__before_atomic(), which I assume is for Alpha.
    
    This patch implements a wrapper that maps smp_mb__{before,after}_atomic()
    to smp_mb__{before,after}_clear_bit() on older kernels and changes our
    code to leverage it in a manner consistent with the mainline kernel.
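    The shape of the shim can be sketched in compilable userspace C. The `_stub` helpers and the mapping of the barriers to `__sync_synchronize()` are stand-ins so the fragment builds outside the kernel; only the `#ifndef` compat pattern and the barrier placement around the unlocking clear_bit() reflect what the patch does.

```c
/* Userspace stand-ins for the old kernel barrier names. */
#define smp_mb__before_clear_bit() __sync_synchronize()
#define smp_mb__after_clear_bit()  __sync_synchronize()

/* Compat shim: on kernels that predate the rename, provide the new
 * names in terms of the old ones. */
#ifndef smp_mb__before_atomic
#define smp_mb__before_atomic() smp_mb__before_clear_bit()
#endif
#ifndef smp_mb__after_atomic
#define smp_mb__after_atomic()  smp_mb__after_clear_bit()
#endif

/* Atomic clear_bit() stand-in built on a GCC/Clang builtin. */
static void clear_bit_stub(int nr, unsigned long *addr)
{
    __sync_fetch_and_and(addr, ~(1UL << nr));
}

/* Unlock-style use: the barriers order the bit clear with respect to
 * the wakeup that follows, which matters on weakly ordered CPUs. */
static void unlock_bit(int nr, unsigned long *addr)
{
    smp_mb__before_atomic();
    clear_bit_stub(nr, addr);
    smp_mb__after_atomic();
    /* wake_up_bit(addr, nr) would follow here in the kernel. */
}
```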
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 340b095
  8. Retire kmem_virt()

    The initial port of ZFS to Linux required a way to identify virtual
    memory to make IO to virtual memory backed slabs work, so kmem_virt()
    was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
    logically equivalent to kmem_virt(). Support for kernels before 2.6.26
    was later dropped, and more recently, support for kernels before Linux
    2.6.32 has been dropped. We retire kmem_virt() in favor of
    is_vmalloc_addr() to clean up the code.
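    What both helpers boil down to is a range check against the vmalloc address window. The sketch below is illustrative only: the `_X` range constants are invented (the real VMALLOC_START/VMALLOC_END are architecture specific), and addresses are modeled as 64-bit integers rather than pointers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical vmalloc address range, for illustration; the real
 * VMALLOC_START/VMALLOC_END are architecture specific. */
#define VMALLOC_START_X UINT64_C(0xffffc90000000000)
#define VMALLOC_END_X   UINT64_C(0xffffe8ffffffffff)

/* What kmem_virt()/is_vmalloc_addr() amount to: does this address
 * fall inside the virtually mapped region?  IO paths use this to
 * decide how to map memory backed by virtual-memory slabs. */
static bool is_vmalloc_addr_stub(uint64_t addr)
{
    return (addr >= VMALLOC_START_X && addr < VMALLOC_END_X);
}
```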
    
    Signed-off-by: Richard Yao <ryao@gentoo.org>
    ryao committed Nov 3, 2014
    Commit 723c8bb