Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kmem rework (WIP) #2918

Closed
wants to merge 7 commits into from
Closed

Kmem rework (WIP) #2918

wants to merge 7 commits into from

Conversation

behlendorf
Copy link
Contributor

This pull request is not complete and should ONLY be used on a test system. I'm opening this pull request to get feedback on the general approach. It builds on the work @ryao has done in #2796. It has only been lightly testing under CentOS 7 and expect there may still be build failures for other distributions.

@ryao I'd appreciate your feedback on this. This is largely your previous patch stack with a few extra patches to return KM_NODEBUG and convert KM_PUSHPAGE -> KM_SLEEP. It still requires extensive testing.

Depends on openzfs/spl#414

@behlendorf behlendorf changed the title Kmem rework Kmem rework (WIP) Nov 21, 2014
@sempervictus
Copy link
Contributor

This appears to have a hard-to-debug issue: suspend/resume power management activities hang the host, making any sort of debug output rather difficult if not impossible to acquire. System exhibiting issues is a "C600/X79" chipset with a Xeon E5-2680 v2 (yes its a laptop).

Update: Reproduced several more times, high consistency, but not every time. Happens on 3.16.7 vanilla and with PF kernel patchset, running Ubuntu 14.04

@behlendorf
Copy link
Contributor Author

@sempervictus Thanks for the feedback. This is a new issue issue, correct?

@ryao I've refreshed this patch stack and it's counterpart openzfs/spl#414. It's evolved considerably but at it's core it still encompasses the changes you suggested. I'll be working on polishing it further, making sure it doesn't introduce any regressions, and that it's actually a measurable improvement over the way things are today.

@sempervictus
Copy link
Contributor

Rebuilt off latest, running several systems on this patchset now, including a pair of iSCSI (SCST) hosts.
Had to drop #2129 to do this, so i'm seeing significant increases in memory use after just a couple of hours. Arcstat claims 12G, looks like it ends up being ~16G if not more. This reduces memory available to SCST and NFS, and i've seen the ram use OOM me out of launching 'cat' to read a file. I'm hoping this does not translate to missed writes (so far so good).
However, no crashes as of yet, write performance is more consistent (benchmarks dont get such pretty numbers, but the shot grouping is much closer), and SCST exhibits less jitter on random IO workloads.

EDIT: Suspend/resume issue does not occur under 3.17.6 based kernel. If anyone else sees it i'll see if i can dig into it more, but for now i'd attribute it to my oddball kernel patch stack.

behlendorf and others added 6 commits December 16, 2014 13:43
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim.  This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction.  For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to cleanup the code.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate.  The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.

A careful reader will notice that not all callers have been changed
to vmem_alloc().  Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k.  This is desirable because it minimizes the need
for Linux specific code changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes.  This brings
us back in line with upstream.  In some cases this means simply
swapping the flags back.  For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.

The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory.  This is again the
same as upstream.
As part of the spl kmem/vmem refactoring the kmem_cache_* functions
were split in to their own kmem_cache.h header.  This was done in
part so that kmem_* consumers would not be forced to include the
kmem_cache_* functions which mask several Linux SLAB/SLAB functions.

Because of this we now much explicitly include kmem_cache.h in the
zfs_context.h.  However, consumers such as Lustre which need access
to the KM_FLAGS but not the kmem_cache_* functions can now safely
just include kmem.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit 86dd0fd added preallocated I/O buffers.  This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos.  A
deadlock in this situation is no longer possible.

However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform then KM_NOSLEEP
so that they either succeed of fail quicky.  Either case is acceptable
here because we can safely abort the aggregation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Copy link
Contributor Author

@sempervictus Thanks for the feedback I appreciate it. Can I bother you to please test this patch stack again, I've just refreshed it and it's getting very close to a form in which it can be merged.

I've subjected the refreshed patch stack to a range of torture tests including using a zvol as a swap device and it's held up very well. However, I'd like to see some more real world usage and evidence that this makes things better. Getting this in will lay a chunk of the ground work for merging #2129 or something very similar to it.

At a high level what I'd expect from this change is the following:

  • Reduce the memory footprint slightly. Certain worst case behaviors which the SPL kmem cache could exhibit have been resolved and other optimizations made. This should slightly reduce the memory usages, but if that isn't your experience we'll need to explain why.
  • Reduce latency and improve consistency. The increased use of kmalloc() and reduced use of the GFP_NOIO flag should allow direct reclaim more flexibility and that is expected to translate in to more consistent behavior.
  • Brings the ZoL code back in sync with Illumos regarding how the KM_FLAGS can be used. This simplifies the work involved in porting patches.

The SA spill_cache was originally introduced to avoid the need to
perform large kmem or vmem allocations.  Instead a small dedicated
cache of preallocated SA buffers was kept.

This solution was viable while the maximum block size was limited
to 128K.  But with the planned increase of the maximum block size
to 16M callers need to migrate to the zio_buf_alloc().  However,
they should be aware this interface is expected to change again
once the zio buffers are fully backed by scatter-gather lists.

Alternately, if the callers know these buffers will never be large
or be infrequently accessed they may kmem_alloc() or vmem_alloc()
the needed temporary space.

This change has the additional benegit of bringing the code back
inline with the upstream Illumos source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf
Copy link
Contributor Author

This change and its openzfs/spl#414 counterpart have passed an extensive amount of local testing. My intention is to continue letting my internal stress and correctness testing run until Monday. If nothing new is turned up I'll merge these changes. It would be great to get any additional feedback and patch reviews before then.

@edillmann
Copy link
Contributor

Hi, I give some try to this patch and got the following stacktraces :

The kernel is vanilla 3.18.1

[80237.642028] Large kmem_alloc(65792, 0x0), please file an issue at:
[80237.642028] https://github.com/zfsonlinux/zfs/issues/new
[80237.642030] CPU: 1 PID: 44020 Comm: mysqld Tainted: P OE 3.18.1-jave #2
[80237.642031] Hardware name: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 1.0c 10/08/2012
[80237.642032] 0000000000000000 ffff880e0cb9fb88 ffffffff8171ca68 0000000000000000
[80237.642034] 00000000000042d0 ffff880e0cb9fbb8 ffffffffa0327a29 00000000000100c0
[80237.642035] 0000000000000009 00000000004a0000 0000000000000001 ffff880e0cb9fbd8
[80237.642037] Call Trace:
[80237.642039] [] dump_stack+0x46/0x58
[80237.642044] [] spl_kmem_alloc+0xf9/0x180 [spl]
[80237.642063] [] zil_itx_create+0x37/0x70 [zfs]
[80237.642083] [] zfs_log_write+0x296/0x360 [zfs]
[80237.642102] [] zfs_write+0x60a/0xc20 [zfs]
[80237.642104] [] ? futex_wait_queue_me+0xed/0x140
[80237.642125] [] zpl_write+0x90/0xd0 [zfs]
[80237.642127] [] vfs_write+0xb7/0x1f0
[80237.642128] [] SyS_pwrite64+0x72/0xb0
[80237.642130] [] system_call_fastpath+0x16/0x1b
[80237.642141] Large kmem_alloc(65792, 0x0), please file an issue at:

@edillmann
Copy link
Contributor

I got also this slightly different trace

[79995.452118] https://github.com/zfsonlinux/zfs/issues/new
[79995.452130] CPU: 7 PID: 44020 Comm: mysqld Tainted: P OE 3.18.1-jave #2
[79995.452133] Hardware name: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 1.0c 10/08/2012
[79995.452136] 0000000000000000 ffff880e0cb9fb88 ffffffff8171ca68 0000000000000001
[79995.452142] 00000000000042d0 ffff880e0cb9fbb8 ffffffffa0327a29 00000000000178c0
[79995.452146] 0000000000000009 0000000006248800 0000000000000001 ffff880e0cb9fbd8
[79995.452151] Call Trace:
[79995.452164] [] dump_stack+0x46/0x58
[79995.452185] [] spl_kmem_alloc+0xf9/0x180 [spl]
[79995.452251] [] zil_itx_create+0x37/0x70 [zfs]
[79995.452309] [] zfs_log_write+0x296/0x360 [zfs]
[79995.452365] [] zfs_write+0x60a/0xc20 [zfs]
[79995.452372] [] ? mutex_lock+0x16/0x37
[79995.452412] [] ? dmu_buf_hold_array_by_dnode+0x253/0x540 [zfs]
[79995.452414] [] ? mutex_lock+0x16/0x37
[79995.452442] [] zpl_write+0x90/0xd0 [zfs]
[79995.452445] [] vfs_write+0xb7/0x1f0
[79995.452447] [] SyS_pwrite64+0x72/0xb0
[79995.452450] [] ? int_check_syscall_exit_work+0x34/0x3d
[79995.452452] [] system_call_fastpath+0x16/0x1b
[79995.452797] Large kmem_alloc(93952, 0x0), please file an issue at:

@behlendorf
Copy link
Contributor Author

@edillmann thank you. I'll fix these two harmless warning when this is merged.

@behlendorf
Copy link
Contributor Author

This has been merged as:

6e9710f Merge branch 'kmem-rework'

rread pushed a commit to rread/lustre that referenced this pull request Mar 3, 2015
Due to some long overdue memory management cleanup in the ZoL kmem
implementation the definition of KM_SLEEP has changed.  This change
was expected to be transparent to consumers but it causes issues
for Lustre because it explicitly redefines KM_SLEEP.  This was
originally done to avoid overriding the Linux slab interfaces.

This change implements a more portable fix.  Instead of preventing
the inclusion of the kmem.h header by setting the guard.  The
kmem_cache_* preprocessor macros are explictly undefined to make
the Linux interface available.

The related ZoL pull requests are as follows:

  openzfs/spl#414
  openzfs/zfs#2918

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Change-Id: Id1d19d7b4c440808b8b3fd042f687b10c1b869f3
Reviewed-on: http://review.whamcloud.com/13096
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
behlendorf added a commit to LLNL/lustre that referenced this pull request Mar 17, 2015
Due to some long overdue memory management cleanup in the ZoL kmem
implementation the definition of KM_SLEEP has changed.  This change
was expected to be transparent to consumers but it causes issues
for Lustre because it explicitly redefines KM_SLEEP.  This was
originally done to avoid overriding the Linux slab interfaces.

This change implements a more portable fix.  Instead of preventing
the inclusion of the kmem.h header by setting the guard.  The
kmem_cache_* preprocessor macros are explictly undefined to make
the Linux interface available.

The related ZoL pull requests are as follows:

  openzfs/spl#414
  openzfs/zfs#2918

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Change-Id: Id1d19d7b4c440808b8b3fd042f687b10c1b869f3
Reviewed-on: http://review.whamcloud.com/13096
Tested-by: Jenkins
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
@behlendorf behlendorf deleted the kmem-rework branch April 19, 2021 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Memory Management kernel memory management
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants