Kmem rework (WIP) #2918

behlendorf · 2014-11-21T01:39:35Z

This pull request is not complete and should ONLY be used on a test system. I'm opening this pull request to get feedback on the general approach. It builds on the work @ryao has done in #2796. It has only been lightly testing under CentOS 7 and expect there may still be build failures for other distributions.

@ryao I'd appreciate your feedback on this. This is largely your previous patch stack with a few extra patches to return KM_NODEBUG and convert KM_PUSHPAGE -> KM_SLEEP. It still requires extensive testing.

Depends on openzfs/spl#414

sempervictus · 2014-12-05T21:53:45Z

This appears to have a hard-to-debug issue: suspend/resume power management activities hang the host, making any sort of debug output rather difficult if not impossible to acquire. System exhibiting issues is a "C600/X79" chipset with a Xeon E5-2680 v2 (yes its a laptop).

Update: Reproduced several more times, high consistency, but not every time. Happens on 3.16.7 vanilla and with PF kernel patchset, running Ubuntu 14.04

behlendorf · 2014-12-09T23:56:47Z

@sempervictus Thanks for the feedback. This is a new issue issue, correct?

@ryao I've refreshed this patch stack and it's counterpart openzfs/spl#414. It's evolved considerably but at it's core it still encompasses the changes you suggested. I'll be working on polishing it further, making sure it doesn't introduce any regressions, and that it's actually a measurable improvement over the way things are today.

sempervictus · 2014-12-16T20:50:35Z

Rebuilt off latest, running several systems on this patchset now, including a pair of iSCSI (SCST) hosts.
Had to drop #2129 to do this, so i'm seeing significant increases in memory use after just a couple of hours. Arcstat claims 12G, looks like it ends up being ~16G if not more. This reduces memory available to SCST and NFS, and i've seen the ram use OOM me out of launching 'cat' to read a file. I'm hoping this does not translate to missed writes (so far so good).
However, no crashes as of yet, write performance is more consistent (benchmarks dont get such pretty numbers, but the shot grouping is much closer), and SCST exhibits less jitter on random IO workloads.

EDIT: Suspend/resume issue does not occur under 3.17.6 based kernel. If anyone else sees it i'll see if i can dig into it more, but for now i'd attribute it to my oddball kernel patch stack.

In order to avoid deadlocking in the IO pipeline it is critical that pageout be avoided during direct memory reclaim. This ensures that the pipeline threads can always make forward progress and never end up blocking on a DMU transaction. For this very reason Linux now provides the PF_FSTRANS flag which may be set in the process context. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

The initial port of ZFS to Linux required a way to identify virtual memory to make IO to virtual memory backed slabs work, so kmem_virt() was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is logically equivalent to kmem_virt(). Support for kernels before 2.6.26 was later dropped and more recently, support for kernels before Linux 2.6.32 has been dropped. We retire kmem_virt() in favor of is_vmalloc_addr() to cleanup the code. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress the large allocation warning have been replaced by vmem_alloc() as appropriate. The updated vmem_alloc() call will not print a warning regardless of the size of the allocation. A careful reader will notice that not all callers have been changed to vmem_alloc(). Some have only had the KM_NODEBUG flag removed. This was possible because the default warning threshold has been increased to 32k. This is desirable because it minimizes the need for Linux specific code changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

By marking DMU transaction processing contexts with PF_FSTRANS we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings us back in line with upstream. In some cases this means simply swapping the flags back. For others fnvlist_alloc() was replaced by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to fnvlist_alloc() which assumes KM_SLEEP. The one place KM_PUSHPAGE is kept is when allocating ARC buffers which allows us to dip in to reserved memory. This is again the same as upstream.

As part of the spl kmem/vmem refactoring the kmem_cache_* functions were split in to their own kmem_cache.h header. This was done in part so that kmem_* consumers would not be forced to include the kmem_cache_* functions which mask several Linux SLAB/SLAB functions. Because of this we now much explicitly include kmem_cache.h in the zfs_context.h. However, consumers such as Lustre which need access to the KM_FLAGS but not the kmem_cache_* functions can now safely just include kmem.h. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Commit 86dd0fd added preallocated I/O buffers. This is no longer required after the recent kmem changes designed to make our memory allocation interfaces behave more like those found on Illumos. A deadlock in this situation is no longer possible. However, these allocations still have the potential to be expensive. So a potential future optimization might be to perform then KM_NOSLEEP so that they either succeed of fail quicky. Either case is acceptable here because we can safely abort the aggregation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

behlendorf · 2014-12-16T22:11:08Z

@sempervictus Thanks for the feedback I appreciate it. Can I bother you to please test this patch stack again, I've just refreshed it and it's getting very close to a form in which it can be merged.

I've subjected the refreshed patch stack to a range of torture tests including using a zvol as a swap device and it's held up very well. However, I'd like to see some more real world usage and evidence that this makes things better. Getting this in will lay a chunk of the ground work for merging #2129 or something very similar to it.

At a high level what I'd expect from this change is the following:

Reduce the memory footprint slightly. Certain worst case behaviors which the SPL kmem cache could exhibit have been resolved and other optimizations made. This should slightly reduce the memory usages, but if that isn't your experience we'll need to explain why.
Reduce latency and improve consistency. The increased use of kmalloc() and reduced use of the GFP_NOIO flag should allow direct reclaim more flexibility and that is expected to translate in to more consistent behavior.
Brings the ZoL code back in sync with Illumos regarding how the KM_FLAGS can be used. This simplifies the work involved in porting patches.

The SA spill_cache was originally introduced to avoid the need to perform large kmem or vmem allocations. Instead a small dedicated cache of preallocated SA buffers was kept. This solution was viable while the maximum block size was limited to 128K. But with the planned increase of the maximum block size to 16M callers need to migrate to the zio_buf_alloc(). However, they should be aware this interface is expected to change again once the zio buffers are fully backed by scatter-gather lists. Alternately, if the callers know these buffers will never be large or be infrequently accessed they may kmem_alloc() or vmem_alloc() the needed temporary space. This change has the additional benegit of bringing the code back inline with the upstream Illumos source. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

behlendorf · 2014-12-18T17:19:03Z

This change and its openzfs/spl#414 counterpart have passed an extensive amount of local testing. My intention is to continue letting my internal stress and correctness testing run until Monday. If nothing new is turned up I'll merge these changes. It would be great to get any additional feedback and patch reviews before then.

edillmann · 2015-01-12T08:08:30Z

Hi, I give some try to this patch and got the following stacktraces :

The kernel is vanilla 3.18.1

[80237.642028] Large kmem_alloc(65792, 0x0), please file an issue at:
[80237.642028] https://github.com/zfsonlinux/zfs/issues/new
[80237.642030] CPU: 1 PID: 44020 Comm: mysqld Tainted: P OE 3.18.1-jave #2
[80237.642031] Hardware name: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 1.0c 10/08/2012
[80237.642032] 0000000000000000 ffff880e0cb9fb88 ffffffff8171ca68 0000000000000000
[80237.642034] 00000000000042d0 ffff880e0cb9fbb8 ffffffffa0327a29 00000000000100c0
[80237.642035] 0000000000000009 00000000004a0000 0000000000000001 ffff880e0cb9fbd8
[80237.642037] Call Trace:
[80237.642039] [] dump_stack+0x46/0x58
[80237.642044] [] spl_kmem_alloc+0xf9/0x180 [spl]
[80237.642063] [] zil_itx_create+0x37/0x70 [zfs]
[80237.642083] [] zfs_log_write+0x296/0x360 [zfs]
[80237.642102] [] zfs_write+0x60a/0xc20 [zfs]
[80237.642104] [] ? futex_wait_queue_me+0xed/0x140
[80237.642125] [] zpl_write+0x90/0xd0 [zfs]
[80237.642127] [] vfs_write+0xb7/0x1f0
[80237.642128] [] SyS_pwrite64+0x72/0xb0
[80237.642130] [] system_call_fastpath+0x16/0x1b
[80237.642141] Large kmem_alloc(65792, 0x0), please file an issue at:

edillmann · 2015-01-12T08:11:50Z

I got also this slightly different trace

[79995.452118] https://github.com/zfsonlinux/zfs/issues/new
[79995.452130] CPU: 7 PID: 44020 Comm: mysqld Tainted: P OE 3.18.1-jave #2
[79995.452133] Hardware name: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 1.0c 10/08/2012
[79995.452136] 0000000000000000 ffff880e0cb9fb88 ffffffff8171ca68 0000000000000001
[79995.452142] 00000000000042d0 ffff880e0cb9fbb8 ffffffffa0327a29 00000000000178c0
[79995.452146] 0000000000000009 0000000006248800 0000000000000001 ffff880e0cb9fbd8
[79995.452151] Call Trace:
[79995.452164] [] dump_stack+0x46/0x58
[79995.452185] [] spl_kmem_alloc+0xf9/0x180 [spl]
[79995.452251] [] zil_itx_create+0x37/0x70 [zfs]
[79995.452309] [] zfs_log_write+0x296/0x360 [zfs]
[79995.452365] [] zfs_write+0x60a/0xc20 [zfs]
[79995.452372] [] ? mutex_lock+0x16/0x37
[79995.452412] [] ? dmu_buf_hold_array_by_dnode+0x253/0x540 [zfs]
[79995.452414] [] ? mutex_lock+0x16/0x37
[79995.452442] [] zpl_write+0x90/0xd0 [zfs]
[79995.452445] [] vfs_write+0xb7/0x1f0
[79995.452447] [] SyS_pwrite64+0x72/0xb0
[79995.452450] [] ? int_check_syscall_exit_work+0x34/0x3d
[79995.452452] [] system_call_fastpath+0x16/0x1b
[79995.452797] Large kmem_alloc(93952, 0x0), please file an issue at:

behlendorf · 2015-01-16T18:28:12Z

@edillmann thank you. I'll fix these two harmless warning when this is merged.

behlendorf · 2015-01-16T23:51:14Z

This has been merged as:

6e9710f Merge branch 'kmem-rework'

Due to some long overdue memory management cleanup in the ZoL kmem implementation the definition of KM_SLEEP has changed. This change was expected to be transparent to consumers but it causes issues for Lustre because it explicitly redefines KM_SLEEP. This was originally done to avoid overriding the Linux slab interfaces. This change implements a more portable fix. Instead of preventing the inclusion of the kmem.h header by setting the guard. The kmem_cache_* preprocessor macros are explictly undefined to make the Linux interface available. The related ZoL pull requests are as follows: openzfs/spl#414 openzfs/zfs#2918 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Change-Id: Id1d19d7b4c440808b8b3fd042f687b10c1b869f3 Reviewed-on: http://review.whamcloud.com/13096 Tested-by: Jenkins Tested-by: Maloo <hpdd-maloo@intel.com> Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Reviewed-by: Isaac Huang <he.huang@intel.com> Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

behlendorf changed the title ~~Kmem rework~~ Kmem rework (WIP) Nov 21, 2014

behlendorf force-pushed the kmem-rework branch from 85507fe to 20fd2f7 Compare December 9, 2014 23:51

behlendorf added this to the 0.6.4 milestone Dec 9, 2014

behlendorf added Difficulty - Hard Component: Memory Management kernel memory management labels Dec 9, 2014

behlendorf force-pushed the kmem-rework branch from 20fd2f7 to 1299945 Compare December 9, 2014 23:58

behlendorf mentioned this pull request Dec 12, 2014

Kmem rework #2796

Closed

behlendorf and others added 6 commits December 16, 2014 13:43

behlendorf force-pushed the kmem-rework branch from 1299945 to ccbc620 Compare December 16, 2014 21:47

behlendorf force-pushed the kmem-rework branch from ccbc620 to 73e438c Compare December 17, 2014 01:02

This was referenced Dec 19, 2014

Another ZFS deadlock with 0.6.3-1, sysrq-t included #2864

Closed

crash in spl_kmem_cache_alloc after applying zfsonlinux/zfs@a3c1eb7 #2987

Closed

behlendorf closed this in 6e9710f Jan 16, 2015

kernelOfTruth mentioned this pull request Jan 17, 2015

Move ARC data buffers out of vmalloc #2129

Closed

behlendorf deleted the kmem-rework branch April 19, 2021 19:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kmem rework (WIP) #2918

Kmem rework (WIP) #2918

behlendorf commented Nov 21, 2014

sempervictus commented Dec 5, 2014

behlendorf commented Dec 9, 2014

sempervictus commented Dec 16, 2014

behlendorf commented Dec 16, 2014

behlendorf commented Dec 18, 2014

edillmann commented Jan 12, 2015

edillmann commented Jan 12, 2015

behlendorf commented Jan 16, 2015

behlendorf commented Jan 16, 2015

Kmem rework (WIP) #2918

Kmem rework (WIP) #2918

Conversation

behlendorf commented Nov 21, 2014

sempervictus commented Dec 5, 2014

behlendorf commented Dec 9, 2014

sempervictus commented Dec 16, 2014

behlendorf commented Dec 16, 2014

behlendorf commented Dec 18, 2014

edillmann commented Jan 12, 2015

edillmann commented Jan 12, 2015

behlendorf commented Jan 16, 2015

behlendorf commented Jan 16, 2015