
Kmem rework (WIP) #414

Closed
wants to merge 11 commits into from

Conversation

behlendorf
Contributor

This pull request is not complete and should ONLY be used on a test system. I'm opening this pull request to get feedback on the general approach. It builds on the work @ryao has done in #369. It has only been lightly tested under CentOS 7 and I expect there may still be build failures for other distributions.

@ryao I'd appreciate your feedback on this, particularly on af7f2ab which significantly restructures the code. This patch stack still needs significant polish but the core ideas are all there. My commit comments contain notes for some of the areas which I still think need work.

  • All the previous patches in Kmem rework #369 have been refreshed so they apply cleanly.
  • The kmem_alloc/vmem_alloc interfaces have been unified.
  • The kmem_alloc/vmem_alloc implementation has been consolidated.
  • Split spl-kmem.c into spl-kmem.c, spl-vmem.c, spl-kmem-cache.c with respective headers.
  • Cstyle issues in the related source files have been fixed.

@dweeezil
Contributor

You're welcome. I'm going to try to work my way through the rest of this patch stack over the next couple of days.

@dweeezil
Contributor

I just finished reviewing all the commits in this patch stack and, save for my mostly documentation-related comments, everything LGTM.

@chrisrd
Contributor

chrisrd commented Dec 29, 2014

b3db9cb commit comment:
s/by there are a few caveats/but there are a few caveats/
s/it extraneous includes removed callers/with extraneous #includes removed, callers/
s/explicitly includes to access/explicitly included to access/
s/The kmem_virt() functions has/The kmem_virt() function has/

@behlendorf
Contributor Author

Thanks for all the feedback and testing, I'll be getting this refreshed right away. More comments are likely to follow as I work my way through the review comments.

ryao and others added 10 commits January 8, 2015 10:03
This reverts commit eb0f407 in
preparation for updating the kmem/vmem infrastructure to use the
PF_FSTRANS flag.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change introduces no functional changes to the memory management
interfaces.  It only restructures the existing code by separating the
kmem, vmem, and kmem cache implementations into separate source and
header files.

Splitting this functionality into separate files required the addition
of spl_vmem_{init,fini}() and spl_kmem_cache_{init,fini}() functions.
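
As a rough sketch of how the new per-subsystem hooks slot into module
init (ordering, error handling, and the spl_kmem_{init,fini}() names are
illustrative here, not necessarily what the patch does):

#include <linux/init.h>

extern int spl_kmem_init(void);         /* assumed counterpart for kmem */
extern void spl_kmem_fini(void);
extern int spl_vmem_init(void);
extern void spl_vmem_fini(void);
extern int spl_kmem_cache_init(void);
extern void spl_kmem_cache_fini(void);

static int __init
spl_init_sketch(void)
{
        int rc;

        /* Bring the split subsystems up in dependency order. */
        if ((rc = spl_kmem_init()) != 0)
                return (rc);
        if ((rc = spl_vmem_init()) != 0)
                goto out_kmem;
        if ((rc = spl_kmem_cache_init()) != 0)
                goto out_vmem;

        return (0);

out_vmem:
        spl_vmem_fini();
out_kmem:
        spl_kmem_fini();
        return (rc);
}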

Additionally, several minor changes to the #include's were required to
accommodate the removal of extraneous headers from kmem.h.

But again, while large, this patch introduces no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Address all cstyle issues in the kmem, vmem, and kmem_cache source
and headers.  This was done to make it easier to review subsequent
changes which will rework the kmem/vmem implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch achieves the following goals:

1. It replaces the preprocessor kmem flag to gfp flag mapping with
   proper translation logic (a rough sketch follows this list).  This
   eliminates the potential for surprises that were previously possible
   when kmem flags were mapped directly to gfp flags.

2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
   sized less than or equal to the newly-added spl_kmem_alloc_max
   parameter.  This ensures that small allocations will not contend
   on a single global lock, large allocations can still be handled,
   and potentially limited virtual address space will not be squandered.
   This behavior is entirely different from that under Illumos due to
   the different memory management strategies employed by the respective
   kernels.  However, it functionally provides the required semantics.

3. The --disable-debug-kmem, --enable-debug-kmem (default), and
   --enable-debug-kmem-tracking allocators have been unified into
   a single spl_kmem_alloc_impl() allocation function.  This was
   done to simplify the code and make it more maintainable.

4. Improve portability by exposing an implementation of the memory
   allocation functions that can be safely used in the same way
   they are used on Illumos.   Specifically, callers may safely
   use KM_SLEEP in contexts which perform filesystem IO.  This
   allows us to eliminate an entire class of Linux-specific changes
   which were previously required to avoid deadlocking the system.
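
To make points 1 and 2 concrete, here is a rough sketch of the
translation and routing.  The helper names, the threshold default, and
the exact flag handling are illustrative rather than the precise code
in the patch:

#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/gfp.h>
#include <sys/kmem.h>                   /* KM_SLEEP, KM_NOSLEEP, KM_PUSHPAGE */

/* Illustrative default; the patch exposes this as a module parameter. */
static size_t spl_kmem_alloc_max = 64 * 1024;

/* Point 1: translate kmem flags to gfp flags at runtime. */
static gfp_t
kmem_flags_to_gfp_sketch(int kmflags)
{
        gfp_t gfp;

        if (kmflags & KM_NOSLEEP)
                gfp = GFP_ATOMIC | __GFP_NORETRY;
        else
                gfp = GFP_KERNEL;

        if (kmflags & KM_PUSHPAGE)
                gfp |= __GFP_HIGH;      /* allowed to dip into reserves */

        return (gfp);
}

/* Point 2: small requests go to kmalloc(), large ones to vmalloc(). */
static void *
spl_alloc_sketch(size_t size, int kmflags)
{
        gfp_t gfp = kmem_flags_to_gfp_sketch(kmflags);

        if (size <= spl_kmem_alloc_max)
                return (kmalloc(size, gfp));

        /* vmalloc() may sleep, so this path is for KM_SLEEP callers. */
        return (__vmalloc(size, gfp | __GFP_HIGHMEM, PAGE_KERNEL));
}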

This change will be largely transparent to existing callers but there
are a few caveats:

1. Because the headers were refactored with extraneous #includes
   removed, callers may find they need to explicitly add additional
   #includes.  In particular, kmem_cache.h must now be explicitly
   included to access the SPL's kmem cache implementation.  This
   behavior is different from Illumos but it was done to avoid always
   masking the Linux slab functions when kmem.h is included.

2. Callers, like Lustre, which made assumptions about the definitions
   of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
   Other callers, such as ZFS, which made no such assumptions will
   not require changes.

3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO.  It retains
   its original meaning of allowing allocations to access reserved
   memory.  KM_PUSHPAGE callers can be converted back to KM_SLEEP.

4. The KM_NODEBUG flag has been retired and the default warning
   threshold increased to 32k.

5. The kmem_virt() function has been removed.  Callers which need to
   distinguish between a physical and virtual address should use
   is_vmalloc_addr().
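
For example, a caller that previously relied on kmem_virt() to decide
how a buffer was allocated could be updated along these lines (a
minimal sketch using the raw Linux interfaces):

#include <linux/mm.h>           /* is_vmalloc_addr() */
#include <linux/vmalloc.h>
#include <linux/slab.h>

static void
buf_free_sketch(void *buf)
{
        /* Replaces the retired kmem_virt() check. */
        if (is_vmalloc_addr(buf))
                vfree(buf);
        else
                kfree(buf);
}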

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
that is used to mark contexts which are processing transactions.  When
set, allocations in this context can dip into kernel memory reserves
to avoid deadlocks during writeback.  Linux 3.9 provided the additional
PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS
began using in 3.15.

This patch implements hooks for marking transactions via PF_FSTRANS.
When an allocation is performed in the context of PF_FSTRANS, any
KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation.

Additionally, when using a Linux 3.9 or newer kernel, it will set
PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout()
on any KM_PUSHPAGE or KM_NOSLEEP allocation.  This effectively allows
the spl_vmalloc() helper function to be used safely in a thread which
is responsible for IO.
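
Roughly, the marking hooks and the KM_SLEEP to GFP_NOIO conversion
could look like the following sketch; the cookie layout and the
conversion details are illustrative:

#include <linux/sched.h>        /* current, PF_FSTRANS */
#include <linux/gfp.h>

typedef struct {
        unsigned int    saved_flags;
} fstrans_cookie_t;

/* Mark the current thread as processing a transaction. */
static inline fstrans_cookie_t
spl_fstrans_mark(void)
{
        fstrans_cookie_t cookie;

        cookie.saved_flags = current->flags & PF_FSTRANS;
        current->flags |= PF_FSTRANS;

        return (cookie);
}

static inline void
spl_fstrans_unmark(fstrans_cookie_t cookie)
{
        current->flags &= ~PF_FSTRANS;
        current->flags |= cookie.saved_flags;
}

/*
 * Clearing __GFP_IO and __GFP_FS from GFP_KERNEL is equivalent to
 * GFP_NOIO, so a KM_SLEEP allocation made under PF_FSTRANS cannot
 * recurse into the filesystem or block layer.
 */
static inline gfp_t
fstrans_adjust_gfp_sketch(gfp_t gfp)
{
        if (current->flags & PF_FSTRANS)
                gfp &= ~(__GFP_IO | __GFP_FS);

        return (gfp);
}

A consumer brackets its transaction context with spl_fstrans_mark()
and spl_fstrans_unmark() and can then use KM_SLEEP freely inside it.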

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The comment above the Linux 3.16 kernel's clear_bit() states:

/**
 * clear_bit - Clears a bit in memory
 * @nr: Bit to clear
 * @addr: Address to start counting from
 *
 * clear_bit() is atomic and may not be reordered.  However, it does
 * not contain a memory barrier, so if it is used for locking purposes,
 * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
 * in order to ensure changes are visible on other processors.
 */

This comment does not make sense in the context of x86 because x86 maps the
operations to barrier(), which is a compiler barrier. However, it does make
sense to me when I consider architectures that reorder around atomic
instructions. In such situations, a processor is allowed to execute the
wake_up_bit() before clear_bit() and we have a race. There are a few
architectures that suffer from this issue.

In such situations, the other processor would wake-up, see the bit is still
taken and go to sleep, while the one responsible for waking it up will
assume that it did its job and continue.

This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to
smp_mb__{before,after}_clear_bit() on older kernels and changes our code to
leverage it in a manner consistent with the mainline kernel.
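
A compatibility shim of the kind described, together with the usage
pattern it enables, might look like this sketch.  The real patch uses a
configure-time check rather than a bare #ifndef, and BIT_BUSY is an
illustrative bit name:

#include <linux/bitops.h>
#include <linux/wait.h>

/*
 * Fall back to the older *_clear_bit() barriers when the generic
 * smp_mb__{before,after}_atomic() names are not available.
 */
#ifndef smp_mb__after_atomic
#define smp_mb__before_atomic() smp_mb__before_clear_bit()
#define smp_mb__after_atomic()  smp_mb__after_clear_bit()
#endif

#define BIT_BUSY        0       /* illustrative lock bit */

static void
release_and_wake_sketch(unsigned long *flags)
{
        clear_bit(BIT_BUSY, flags);
        /* Order the clear before the waiter's check on weakly ordered CPUs. */
        smp_mb__after_atomic();
        wake_up_bit(flags, BIT_BUSY);
}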

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Many people have noticed that the kmem cache implementation is slow
to release its memory.  This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty.  Unused
objects which are cached in the magazines will still prevent a slab
from being freed.
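
The idea in sketch form; the SPL's real slab and magazine structures
are more involved, and the names here are illustrative:

#include <linux/slab.h>
#include <linux/vmalloc.h>

struct slab_sketch {
        unsigned int    ref;    /* objects currently allocated from this slab */
        void            *base;  /* backing memory from vmalloc() */
};

static void
slab_obj_free_sketch(struct slab_sketch *sks)
{
        /*
         * Previously an empty slab lingered until the cache was reaped;
         * now it is released as soon as its last object is returned.
         * Objects still sitting in per-CPU magazines keep ref > 0 and
         * therefore continue to pin their slab.
         */
        if (--sks->ref == 0) {
                vfree(sks->base);
                kfree(sks);
        }
}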

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The spl-module-parameters(5) man page was not kept up to date.
Refresh the man page so that it lists all the possible module options,
describes what they do, and justifies why the default values are
set the way they are.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reduce the threshold for detecting a kmem cache deadlock by 10x
from HZ to HZ/10.  The reduced value is still several orders of
magnitude larger than necessary to avoid being triggered incorrectly.
By reducing it we allow the system to resolve the issue more quickly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This change is designed to improve the memory utilization of
slabs by more carefully setting their size.  The way the code
currently works is problematic for slabs which contain large
objects (>1MB).  This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.

The reason the existing code rounds up every slab is because it
assumes it will be backed by the buddy allocator.  Since the buddy
allocator can only perform power-of-two allocations, this is
desirable because it avoids wasting any space.  However, this
logic breaks down if the slab is backed by vmalloc(), which operates
at page-level granularity.  In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).

Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.
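
In sketch form, the sizing policy looks roughly like this; the real
spl_slab_size() also accounts for alignment and the per-slab and
per-object control structures, which are omitted here:

#include <linux/log2.h>         /* roundup_pow_of_two() */
#include <linux/mm.h>           /* PAGE_ALIGN() */

static size_t
slab_size_sketch(int kmem_backed, size_t obj_size, unsigned int objs_per_slab)
{
        size_t size = (size_t)objs_per_slab * obj_size;

        if (kmem_backed) {
                /* KMC_KMEM: the buddy allocator only hands out
                 * power-of-two allocations, so round up. */
                return (roundup_pow_of_two(size));
        }

        /* KMC_VMEM: vmalloc() works at page granularity, so the
         * minimum page-aligned size is sufficient. */
        return (PAGE_ALIGN(size));
}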

This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches.  But depending on the workload it may result in slabs
being allocated and freed more frequently.  In practice, this
has been shown to be a better default for most workloads.

Also the maximum slab size has been reduced to 4MB on 32-bit
systems.  Due to the limited virtual address space it's critical
that we be as frugal as possible.  A limit of 4MB still lets us
comfortably allocate a limited number of 1MB objects.

Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments.  Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Contributor Author

Thanks for the feedback, I've incorporated it and refreshed the patch stack for a fresh round of testing. As long as nothing unexpected comes out of the testing I'll merge this.

The kmem cache implementation always adds new slabs by dispatching a
task to the spl_kmem_cache taskq to perform the allocation.  This is
done because large slabs must be allocated using vmalloc().  It is
possible these allocations will block on IO because the GFP_NOIO flag
is not honored.  This can result in a deadlock.

Therefore, a deadlock detection strategy was implemented to deal with
this case.  When it is determined, by timeout, that the spl_kmem_cache
thread has deadlocked attempting to add a new slab, all callers
attempting to allocate from the cache fall back to using kmalloc(),
which does honor all passed flags.
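
In sketch form, the detection and fallback amount to something like
this; the structure, field names, and timeout handling are illustrative:

#include <linux/jiffies.h>
#include <linux/slab.h>

struct cache_sketch {
        unsigned long   grow_start;     /* jiffies when slab growth was queued */
        unsigned long   grow_timeout;   /* deadlock detection threshold */
        size_t          obj_size;
};

static void *
cache_alloc_or_fallback_sketch(struct cache_sketch *skc, gfp_t gfp)
{
        /*
         * If the taskq thread responsible for adding a new slab has been
         * stuck for longer than the timeout, assume it has deadlocked on
         * IO and allocate the object directly; kmalloc() honors GFP_NOIO
         * so it cannot deadlock the same way.
         */
        if (time_after(jiffies, skc->grow_start + skc->grow_timeout))
                return (kmalloc(skc->obj_size, gfp));

        return (NULL);  /* caller waits and retries the normal slab path */
}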

This logic was correct, but an optimization in the code allowed for a
deadlock.  Because only slabs backed by vmalloc() can deadlock in the
way described above, an optimization was made to only invoke this
deadlock detection code for vmalloc()-backed caches.  This had the
advantage of making it easy to distinguish these objects when they
were freed.

But this isn't strictly safe.  If all the spl_kmem_cache threads end
up deadlocked then we can't grow any of the other caches either.  This
can once again result in a deadlock if memory needs to be allocated
from one of these other caches to ensure forward progress.

The fix here is to remove the optimization which limits this fall back
allocation strategy to vmalloc()-backed caches.  Doing this means we
may need to take the cache lock in the spl_kmem_cache_free() call path.
This small cost is somewhat mitigated by ignoring objects with virtual
addresses which could not have been allocated by kmalloc().

For good measure the default number of spl_kmem_cache threads has been
increased from 1 to 4 (and made tunable).  This alone wouldn't resolve
the original issue since it's still possible for all the threads to be
deadlocked.  However, it does help responsiveness by ensuring that a
single deadlocked spl_kmem_cache thread can't block allocations from
all other caches until the timeout is reached.
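
A minimal sketch of exposing the thread count as a tunable; the
parameter name follows the spl_kmem_cache_* convention but is
illustrative here:

#include <linux/module.h>

static unsigned int spl_kmem_cache_kmem_threads = 4;    /* new default */
module_param(spl_kmem_cache_kmem_threads, uint, 0444);
MODULE_PARM_DESC(spl_kmem_cache_kmem_threads,
        "Number of spl_kmem_cache threads used to grow caches");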

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Contributor Author

I've added commit c65cc48 to this patch stack. It addresses a long-standing issue in the spl kmem cache implementation which can result in a deadlock. @ryao you may be interested in this.

I was able to reliably reproduce this issue while stress testing this patch stack by using a zvol as a swap device. I chose this workload because it's about the nastiest, most deadlock-prone thing you can do to the VM and the ZFS code. This kind of test gives me confidence that we haven't missed any of the corner cases.

After applying this fix I've been able to reliably push hundreds of GBs to a zvol swap device at an average rate of ~42 MB/s. Despite the heavy swapping, the machine also remains more responsive than with the existing kmem code.

Interestingly enough, for this workload the system behaves best when spl_kmem_cache_slab_limit=0 and the Linux slabs are disabled. I expect things will behave even better once we move to scatter-gather lists for the zio buffers. But this is an important step in that direction.

behlendorf added a commit to openzfs/zfs that referenced this pull request Jan 16, 2015
The core motivation behind these changes is to minimize the
memory management differences between ZFS on Linux and other
platforms.  This simplifies the process of porting changes to
Linux from other platforms.  This is good for code quality
and is expected to reduce the number of defects accidentally
introduced due to porting.  The following key Linux specific
changes have been reverted.

* KM_PUSHPAGE changed back to KM_SLEEP.  All contexts where
  it is unsafe to perform IO have been marked with PF_FSTRANS.
  This context specific mechanism is now used exclusively
  and the KM_PUSHPAGE mechanism has been retired.

* The KM_NODEBUG flag has been retired.  Allocations larger
  than 32K should use vmem_alloc()/vmem_free().  Depending
  on the size of the allocation either kmalloc() or vmalloc()
  will be used internally, but no warning will be printed.

* Pre-allocated vdev IO buffers and the dedicated SA spill
  block cache have been retired.  It is now safe and reliable
  to allocate buffers of the needed size without fear of
  deadlocking.  This reduces our memory footprint and paves
  the way for larger block sizes.

Depends on openzfs/spl#414.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2918
@behlendorf
Contributor Author

This has been merged as:

9099312 Merge branch 'kmem-rework'

rread pushed a commit to rread/lustre that referenced this pull request Mar 3, 2015
Due to some long overdue memory management cleanup in the ZoL kmem
implementation the definition of KM_SLEEP has changed.  This change
was expected to be transparent to consumers but it causes issues
for Lustre because it explicitly redefines KM_SLEEP.  This was
originally done to avoid overriding the Linux slab interfaces.

This change implements a more portable fix.  Instead of preventing
the inclusion of the kmem.h header by setting the guard, the
kmem_cache_* preprocessor macros are explicitly undefined to make
the Linux interface available.
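
In sketch form, the fix amounts to the following after the ZoL headers
have been pulled in; the exact set of macros to undefine is
illustrative:

/*
 * The SPL's kmem cache header maps these names onto its own slab
 * implementation; undefine them so the native Linux slab interface
 * is visible to Lustre again.
 */
#undef kmem_cache_create
#undef kmem_cache_destroy
#undef kmem_cache_alloc
#undef kmem_cache_free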

The related ZoL pull requests are as follows:

  openzfs/spl#414
  openzfs/zfs#2918

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Change-Id: Id1d19d7b4c440808b8b3fd042f687b10c1b869f3
Reviewed-on: http://review.whamcloud.com/13096
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
behlendorf added a commit to LLNL/lustre that referenced this pull request Mar 17, 2015
Due to some long overdue memory management cleanup in the ZoL kmem
implementation the definition of KM_SLEEP has changed.  This change
was expected to be transparent to consumers but it causes issues
for Lustre because it explicitly redefines KM_SLEEP.  This was
originally done to avoid overriding the Linux slab interfaces.

This change implements a more portable fix.  Instead of preventing
the inclusion of the kmem.h header by setting the guard, the
kmem_cache_* preprocessor macros are explicitly undefined to make
the Linux interface available.

The related ZoL pull requests are as follows:

  openzfs/spl#414
  openzfs/zfs#2918

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Change-Id: Id1d19d7b4c440808b8b3fd042f687b10c1b869f3
Reviewed-on: http://review.whamcloud.com/13096
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
@behlendorf behlendorf deleted the kmem-rework branch July 28, 2017 22:02