
Kmem rework (WIP) #414

Closed
wants to merge 11 commits into from

Conversation

behlendorf
Contributor

This pull request is not complete and should ONLY be used on a test system. I'm opening this pull request to get feedback on the general approach. It builds on the work @ryao has done in #369. It has only been lightly tested under CentOS 7 and I expect there may still be build failures for other distributions.

@ryao I'd appreciate your feedback on this, particularly on af7f2ab which significantly restructures the code. This patch stack still needs significant polish but the core ideas are all there. My commit comments contain notes for some of the areas which I still think need work.

  • All the previous patches in Kmem rework #369 have been refreshed so they apply cleanly.
  • The kmem_alloc/vmem_alloc interfaces have been unified.
  • The kmem_alloc/vmem_alloc implementation has been consolidated.
  • Split spl-kmem.c into spl-kmem.c, spl-vmem.c, spl-kmem-cache.c with respective headers.
  • Cstyle issues in the related source files have been fixed.

@dweeezil
Contributor

You're welcome. I'm going to try to work my way through the rest of this patch stack over the next couple of days.

@dweeezil
Contributor

I just finished reviewing all the commits in this patch stack and, save for my mostly documentation-related comments, everything LGTM.

@chrisrd
Contributor

chrisrd commented Dec 29, 2014

b3db9cb commit comment:
s/by there are a few caveats/but there are a few caveats/
s/it extraneous includes removed callers/with extraneous #includes removed, callers/
s/explicitly includes to access/explicitly included to access/
s/The kmem_virt() functions has/The kmem_virt() function has/

@behlendorf
Contributor Author

Thanks for all the feedback and testing, I'll be getting this refreshed right away. More comments are likely to follow as I work my way through the review comments.

ryao and others added 10 commits January 8, 2015 10:03
This reverts commit eb0f407 in
preparation for updating the kmem/vmem infrastructure to use the
PF_FSTRANS flag.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change introduces no functional changes to the memory management
interfaces.  It only restructures the existing code by separating the
kmem, vmem, and kmem cache implementations into separate source and
header files.

Splitting this functionality into separate files required the addition
of spl_vmem_{init,fini}() and spl_kmem_cache_{init,fini}() functions.
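
As a rough sketch of how the new per-subsystem hooks slot into module
init (ordering, error handling, and the spl_kmem_{init,fini}() names are
illustrative here, not necessarily what the patch does):

#include <linux/init.h>

extern int spl_kmem_init(void);         /* assumed counterpart for kmem */
extern void spl_kmem_fini(void);
extern int spl_vmem_init(void);
extern void spl_vmem_fini(void);
extern int spl_kmem_cache_init(void);
extern void spl_kmem_cache_fini(void);

static int __init
spl_init_sketch(void)
{
        int rc;

        /* Bring the split subsystems up in dependency order. */
        if ((rc = spl_kmem_init()) != 0)
                return (rc);
        if ((rc = spl_vmem_init()) != 0)
                goto out_kmem;
        if ((rc = spl_kmem_cache_init()) != 0)
                goto out_vmem;

        return (0);

out_vmem:
        spl_vmem_fini();
out_kmem:
        spl_kmem_fini();
        return (rc);
}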

Additionally, several minor changes to the #include's were required to
accommodate the removal of extraneous headers from kmem.h.

But again, while large, this patch introduces no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Address all cstyle issues in the kmem, vmem, and kmem_cache source
and headers.  This was done to make it easier to review subsequent
changes which will rework the kmem/vmem implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch achieves the following goals:

1. It replaces the preprocessor kmem flag to gfp flag mapping with
   proper translation logic (a rough sketch follows this list).  This
   eliminates the potential for surprises that were previously possible
   when kmem flags were mapped directly to gfp flags.

2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
   sized less than or equal to the newly-added spl_kmem_alloc_max
   parameter.  This ensures that small allocations will not contend
   on a single global lock, large allocations can still be handled,
   and potentially limited virtual address space will not be squandered.
   This behavior is entirely different from that under Illumos due to
   the different memory management strategies employed by the respective
   kernels.  However, it functionally provides the required semantics.

3. The --disable-debug-kmem, --enable-debug-kmem (default), and
   --enable-debug-kmem-tracking allocators have been unified into
   a single spl_kmem_alloc_impl() allocation function.  This was
   done to simplify the code and make it more maintainable.

4. Improve portability by exposing an implementation of the memory
   allocation functions that can be safely used in the same way
   they are used on Illumos.   Specifically, callers may safely
   use KM_SLEEP in contexts which perform filesystem IO.  This
   allows us to eliminate an entire class of Linux-specific changes
   which were previously required to avoid deadlocking the system.
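
To make points 1 and 2 concrete, here is a rough sketch of the
translation and routing.  The helper names, the threshold default, and
the exact flag handling are illustrative rather than the precise code
in the patch:

#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/gfp.h>
#include <sys/kmem.h>                   /* KM_SLEEP, KM_NOSLEEP, KM_PUSHPAGE */

/* Illustrative default; the patch exposes this as a module parameter. */
static size_t spl_kmem_alloc_max = 64 * 1024;

/* Point 1: translate kmem flags to gfp flags at runtime. */
static gfp_t
kmem_flags_to_gfp_sketch(int kmflags)
{
        gfp_t gfp;

        if (kmflags & KM_NOSLEEP)
                gfp = GFP_ATOMIC | __GFP_NORETRY;
        else
                gfp = GFP_KERNEL;

        if (kmflags & KM_PUSHPAGE)
                gfp |= __GFP_HIGH;      /* allowed to dip into reserves */

        return (gfp);
}

/* Point 2: small requests go to kmalloc(), large ones to vmalloc(). */
static void *
spl_alloc_sketch(size_t size, int kmflags)
{
        gfp_t gfp = kmem_flags_to_gfp_sketch(kmflags);

        if (size <= spl_kmem_alloc_max)
                return (kmalloc(size, gfp));

        /* vmalloc() may sleep, so this path is for KM_SLEEP callers. */
        return (__vmalloc(size, gfp | __GFP_HIGHMEM, PAGE_KERNEL));
}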

This change will be largely transparent to existing callers but there
are a few caveats:

1. Because the headers were refactored with extraneous #includes
   removed, callers may find they need to explicitly add additional
   #includes.  In particular, kmem_cache.h must now be explicitly
   included to access the SPL's kmem cache implementation.  This
   behavior is different from Illumos but it was done to avoid always
   masking the Linux slab functions when kmem.h is included.

2. Callers, like Lustre, which made assumptions about the definitions
   of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
   Other callers, such as ZFS, which made no such assumptions will
   not require changes.

3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO.  It retains
   its original meaning of allowing allocations to access reserved
   memory.  KM_PUSHPAGE callers can be converted back to KM_SLEEP.

4. The KM_NODEBUG flag has been retired and the default warning
   threshold increased to 32k.

5. The kmem_virt() function has been removed.  Callers which need to
   distinguish between a physical and virtual address should use
   is_vmalloc_addr().
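
For example, a caller that previously relied on kmem_virt() to decide
how a buffer was allocated could be updated along these lines (a
minimal sketch using the raw Linux interfaces):

#include <linux/mm.h>           /* is_vmalloc_addr() */
#include <linux/vmalloc.h>
#include <linux/slab.h>

static void
buf_free_sketch(void *buf)
{
        /* Replaces the retired kmem_virt() check. */
        if (is_vmalloc_addr(buf))
                vfree(buf);
        else
                kfree(buf);
}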

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
that is used to mark contexts which are processing transactions.  When
set, allocations in this context can dip into kernel memory reserves
to avoid deadlocks during writeback.  Linux 3.9 provided the additional
PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS
began using in 3.15.

This patch implements hooks for marking transactions via PF_FSTRANS.
When an allocation is performed in the context of PF_FSTRANS, any
KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation.

Additionally, when using a Linux 3.9 or newer kernel, it will set
PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout()
on any KM_PUSHPAGE or KM_NOSLEEP allocation.  This effectively allows
the spl_vmalloc() helper function to be used safely in a thread which
is responsible for IO.
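
Roughly, the marking hooks and the KM_SLEEP to GFP_NOIO conversion
could look like the following sketch; the cookie layout and the
conversion details are illustrative:

#include <linux/sched.h>        /* current, PF_FSTRANS */
#include <linux/gfp.h>

typedef struct {
        unsigned int    saved_flags;
} fstrans_cookie_t;

/* Mark the current thread as processing a transaction. */
static inline fstrans_cookie_t
spl_fstrans_mark(void)
{
        fstrans_cookie_t cookie;

        cookie.saved_flags = current->flags & PF_FSTRANS;
        current->flags |= PF_FSTRANS;

        return (cookie);
}

static inline void
spl_fstrans_unmark(fstrans_cookie_t cookie)
{
        current->flags &= ~PF_FSTRANS;
        current->flags |= cookie.saved_flags;
}

/*
 * Clearing __GFP_IO and __GFP_FS from GFP_KERNEL is equivalent to
 * GFP_NOIO, so a KM_SLEEP allocation made under PF_FSTRANS cannot
 * recurse into the filesystem or block layer.
 */
static inline gfp_t
fstrans_adjust_gfp_sketch(gfp_t gfp)
{
        if (current->flags & PF_FSTRANS)
                gfp &= ~(__GFP_IO | __GFP_FS);

        return (gfp);
}

A consumer brackets its transaction context with spl_fstrans_mark()
and spl_fstrans_unmark() and can then use KM_SLEEP freely inside it.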

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The comment above the Linux 3.16 kernel's clear_bit() states:

/**
 * clear_bit - Clears a bit in memory
 * @nr: Bit to clear
 * @addr: Address to start counting from
 *
 * clear_bit() is atomic and may not be reordered.  However, it does
 * not contain a memory barrier, so if it is used for locking purposes,
 * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
 * in order to ensure changes are visible on other processors.
 */

This comment does not make sense in the context of x86 because x86 maps the
operations to barrier(), which is a compiler barrier. However, it does make
sense to me when I consider architectures that reorder around atomic
instructions. In such situations, a processor is allowed to execute the
wake_up_bit() before clear_bit() and we have a race. There are a few
architectures that suffer from this issue.

In such situations, the other processor would wake-up, see the bit is still
taken and go to sleep, while the one responsible for waking it up will
assume that it did its job and continue.

This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to
smp_mb__{before,after}_clear_bit() on older kernels and changes our code to
leverage it in a manner consistent with the mainline kernel.
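
A compatibility shim of the kind described, together with the usage
pattern it enables, might look like this sketch.  The real patch uses a
configure-time check rather than a bare #ifndef, and BIT_BUSY is an
illustrative bit name:

#include <linux/bitops.h>
#include <linux/wait.h>

/*
 * Fall back to the older *_clear_bit() barriers when the generic
 * smp_mb__{before,after}_atomic() names are not available.
 */
#ifndef smp_mb__after_atomic
#define smp_mb__before_atomic() smp_mb__before_clear_bit()
#define smp_mb__after_atomic()  smp_mb__after_clear_bit()
#endif

#define BIT_BUSY        0       /* illustrative lock bit */

static void
release_and_wake_sketch(unsigned long *flags)
{
        clear_bit(BIT_BUSY, flags);
        /* Order the clear before the waiter's check on weakly ordered CPUs. */
        smp_mb__after_atomic();
        wake_up_bit(flags, BIT_BUSY);
}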

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Many people have noticed that the kmem cache implementation is slow
to release its memory.  This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty.  Unused
objects which are cached in the magazines will still prevent a slab
from being freed.
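
The idea in sketch form; the SPL's real slab and magazine structures
are more involved, and the names here are illustrative:

#include <linux/slab.h>
#include <linux/vmalloc.h>

struct slab_sketch {
        unsigned int    ref;    /* objects currently allocated from this slab */
        void            *base;  /* backing memory from vmalloc() */
};

static void
slab_obj_free_sketch(struct slab_sketch *sks)
{
        /*
         * Previously an empty slab lingered until the cache was reaped;
         * now it is released as soon as its last object is returned.
         * Objects still sitting in per-CPU magazines keep ref > 0 and
         * therefore continue to pin their slab.
         */
        if (--sks->ref == 0) {
                vfree(sks->base);
                kfree(sks);
        }
}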

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The spl-module-parameters(5) man page was not kept up to date.
Refresh the man page so that it lists all the possible module options,
describes what they do, and justifies why the default values are
set the way they are.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce the threshold for detecting a kmem cache deadlock by 10x
from HZ to HZ/10.  The reduced value is still several orders of
magnitude larger than necessary to avoid being triggered incorrectly.
By reducing it we allow the system to resolve the issue more quickly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change is designed to improve the memory utilization of
slabs by more carefully setting their size.  The way the code
currently works is problematic for slabs which contain large
objects (>1MB).  This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.

The reason the existing code rounds up every slab is because it
assumes it will be backed by the buddy allocator.  Since the buddy
allocator can only perform power-of-two allocations, this is
desirable because it avoids wasting any space.  However, this
logic breaks down if the slab is backed by vmalloc(), which operates
at page-level granularity.  In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).

Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.
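
In sketch form, the sizing policy looks roughly like this; the real
spl_slab_size() also accounts for alignment and the per-slab and
per-object control structures, which are omitted here:

#include <linux/log2.h>         /* roundup_pow_of_two() */
#include <linux/mm.h>           /* PAGE_ALIGN() */

static size_t
slab_size_sketch(int kmem_backed, size_t obj_size, unsigned int objs_per_slab)
{
        size_t size = (size_t)objs_per_slab * obj_size;

        if (kmem_backed) {
                /* KMC_KMEM: the buddy allocator only hands out
                 * power-of-two allocations, so round up. */
                return (roundup_pow_of_two(size));
        }

        /* KMC_VMEM: vmalloc() works at page granularity, so the
         * minimum page-aligned size is sufficient. */
        return (PAGE_ALIGN(size));
}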

This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches.  But depending on the workload it may result in slabs
being allocated and freed more frequently.  In practice, this
has been shown to be a better default for most workloads.

Also the maximum slab size has been reduced to 4MB on 32-bit
systems.  Due to the limited virtual address space it's critical
that we be as frugal as possible.  A limit of 4MB still lets us
comfortably allocate a limited number of 1MB objects.

Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments.  Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Contributor Author

Thanks for the feedback, I've incorporated it and refreshed the patch stack for a fresh round of testing. As long as nothing unexpected comes out of the testing I'll merge this.

The kmem cache implementation always adds new slabs by dispatching a
task to the spl_kmem_cache taskq to perform the allocation.  This is
done because large slabs must be allocated using vmalloc().  It is
possible these allocations will block on IO because the GFP_NOIO flag
is not honored.  This can result in a deadlock.

Therefore, a deadlock detection strategy was implemented to deal with
this case.  When it is determined, by timeout, that the spl_kmem_cache
thread has deadlocked attempting to add a new slab, all callers
attempting to allocate from the cache fall back to using kmalloc(),
which does honor all passed flags.
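
In sketch form, the detection and fallback amount to something like
this; the structure, field names, and timeout handling are illustrative:

#include <linux/jiffies.h>
#include <linux/slab.h>

struct cache_sketch {
        unsigned long   grow_start;     /* jiffies when slab growth was queued */
        unsigned long   grow_timeout;   /* deadlock detection threshold */
        size_t          obj_size;
};

static void *
cache_alloc_or_fallback_sketch(struct cache_sketch *skc, gfp_t gfp)
{
        /*
         * If the taskq thread responsible for adding a new slab has been
         * stuck for longer than the timeout, assume it has deadlocked on
         * IO and allocate the object directly; kmalloc() honors GFP_NOIO
         * so it cannot deadlock the same way.
         */
        if (time_after(jiffies, skc->grow_start + skc->grow_timeout))
                return (kmalloc(skc->obj_size, gfp));

        return (NULL);  /* caller waits and retries the normal slab path */
}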

This logic was correct, but an optimization in the code allowed for a
deadlock.  Because only slabs backed by vmalloc() can deadlock in the
way described above, an optimization was made to only invoke this
deadlock detection code for vmalloc()-backed caches.  This had the
advantage of making it easy to distinguish these objects when they
were freed.

But this isn't strictly safe.  If all the spl_kmem_cache threads end
up deadlocked then we can't grow any of the other caches either.  This
can once again result in a deadlock if memory needs to be allocated
from one of these other caches to ensure forward progress.

The fix here is to remove the optimization which limits this fall back
allocation strategy to vmalloc()-backed caches.  Doing this means we
may need to take the cache lock in the spl_kmem_cache_free() call path.
This small cost is somewhat mitigated by ignoring objects with virtual
addresses which could not have been allocated by kmalloc().

For good measure the default number of spl_kmem_cache threads has been
increased from 1 to 4 (and made tunable).  This alone wouldn't resolve
the original issue since it's still possible for all the threads to be
deadlocked.  However, it does help responsiveness by ensuring that a
single deadlocked spl_kmem_cache thread can't block allocations from
all other caches until the timeout is reached.
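
A minimal sketch of exposing the thread count as a tunable; the
parameter name follows the spl_kmem_cache_* convention but is
illustrative here:

#include <linux/module.h>

static unsigned int spl_kmem_cache_kmem_threads = 4;    /* new default */
module_param(spl_kmem_cache_kmem_threads, uint, 0444);
MODULE_PARM_DESC(spl_kmem_cache_kmem_threads,
        "Number of spl_kmem_cache threads used to grow caches");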

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Contributor Author

I've added commit c65cc48 to this patch stack. It addresses a long-standing issue in the spl kmem cache implementation which can result in a deadlock. @ryao you may be interested in this.

I was able to reliably reproduce this issue while stress testing this patch stack by using a zvol as a swap device. I chose this workload because it's about the nastiest, most deadlock-prone thing you can do to the VM and the ZFS code. This kind of test gives me confidence that we haven't missed any of the corner cases.

After applying this fix I've been able to reliably push hundreds of GBs to a zvol swap device at an average rate of ~42 MB/s. Despite the heavy swapping, the machine also remains more responsive than with the existing kmem code.

Interestingly enough, for this workload the system behaves best when spl_kmem_cache_slab_limit=0 and the Linux slabs are disabled. I expect things will behave even better once we move to scatter-gather lists for the zio buffers. But this is an important step in that direction.

behlendorf added a commit to openzfs/zfs that referenced this pull request Jan 16, 2015
The core motivation behind these changes is to minimize the
memory management differences between ZFS on Linux and other
platforms.  This simplifies the process of porting changes to
Linux from other platforms.  This is good for code quality
and is expected to reduce the number of defects accidentally
introduced due to porting.  The following key Linux specific
changes have been reverted.

* KM_PUSHPAGE changed back to KM_SLEEP.  All contexts where
  it is unsafe to perform IO have been marked with PF_FSTRANS.
  This context specific mechanism is now used exclusively
  and the KM_PUSHPAGE mechanism has been retired.

* The KM_NODEBUG flag has been retired.  Allocations larger
  than 32K should use vmem_alloc()/vmem_free().  Depending
  on the size of the allocation either kmalloc() or vmalloc()
  will be used internally, but no warning will be printed.

* Pre-allocated vdev IO buffers and the dedicated SA spill
  block cache have been retired.  It is now safe and reliable
  to allocate buffers of the needed size without fear of
  deadlocking.  This reduces our memory footprint and paves
  the way for larger block sizes.

Depends on openzfs/spl#414.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2918
@behlendorf
Contributor Author

This has been merged as:

9099312 Merge branch 'kmem-rework'

rread pushed a commit to rread/lustre that referenced this pull request Mar 3, 2015
Due to some long overdue memory management cleanup in the ZoL kmem
implementation the definition of KM_SLEEP has changed.  This change
was expected to be transparent to consumers but it causes issues
for Lustre because it explicitly redefines KM_SLEEP.  This was
originally done to avoid overriding the Linux slab interfaces.

This change implements a more portable fix.  Instead of preventing
the inclusion of the kmem.h header by setting the guard, the
kmem_cache_* preprocessor macros are explicitly undefined to make
the Linux interface available.
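
In sketch form, the fix amounts to the following after the ZoL headers
have been pulled in; the exact set of macros to undefine is
illustrative:

/*
 * The SPL's kmem cache header maps these names onto its own slab
 * implementation; undefine them so the native Linux slab interface
 * is visible to Lustre again.
 */
#undef kmem_cache_create
#undef kmem_cache_destroy
#undef kmem_cache_alloc
#undef kmem_cache_free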

The related ZoL pull requests are as follows:

  openzfs/spl#414
  openzfs/zfs#2918

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Change-Id: Id1d19d7b4c440808b8b3fd042f687b10c1b869f3
Reviewed-on: http://review.whamcloud.com/13096
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
behlendorf added a commit to LLNL/lustre that referenced this pull request Mar 17, 2015
Due to some long overdue memory management cleanup in the ZoL kmem
implementation the definition of KM_SLEEP has changed.  This change
was expected to be transparent to consumers but it causes issues
for Lustre because it explicitly redefines KM_SLEEP.  This was
originally done to avoid overriding the Linux slab interfaces.

This change implements a more portable fix.  Instead of preventing
the inclusion of the kmem.h header by setting the guard, the
kmem_cache_* preprocessor macros are explicitly undefined to make
the Linux interface available.

The related ZoL pull requests are as follows:

  openzfs/spl#414
  openzfs/zfs#2918

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Change-Id: Id1d19d7b4c440808b8b3fd042f687b10c1b869f3
Reviewed-on: http://review.whamcloud.com/13096
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
@behlendorf behlendorf deleted the kmem-rework branch July 28, 2017 22:02