Kmem rework (WIP) #2918
This appears to have a hard-to-debug issue: suspend/resume power management activities hang the host, making any sort of debug output difficult if not impossible to acquire. The system exhibiting the issue has a "C600/X79" chipset with a Xeon E5-2680 v2 (yes, it's a laptop). Update: Reproduced several more times with high consistency, but not every time. Happens on 3.16.7 vanilla and with the PF kernel patchset, running Ubuntu 14.04.
85507fe to 20fd2f7
@sempervictus Thanks for the feedback. This is a new issue, correct? @ryao I've refreshed this patch stack and its counterpart openzfs/spl#414. It has evolved considerably, but at its core it still encompasses the changes you suggested. I'll be working on polishing it further, making sure it doesn't introduce any regressions, and that it's actually a measurable improvement over the way things are today.
20fd2f7 to 1299945
Rebuilt off latest, running several systems on this patchset now, including a pair of iSCSI (SCST) hosts. EDIT: The suspend/resume issue does not occur under a 3.17.6-based kernel. If anyone else sees it I'll see if I can dig into it more, but for now I'd attribute it to my oddball kernel patch stack.
In order to avoid deadlocking in the IO pipeline it is critical that pageout be avoided during direct memory reclaim. This ensures that the pipeline threads can always make forward progress and never end up blocking on a DMU transaction. For this very reason Linux now provides the PF_FSTRANS flag which may be set in the process context. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
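The flag-based approach described above can be modeled in user space. This is only a minimal sketch of the idea, not the SPL or kernel implementation: `ctx_flags`, `CTX_FSTRANS`, `ALLOC_WAIT`, `ALLOC_PAGEOUT`, `fstrans_mark()`, `fstrans_unmark()`, and `effective_alloc_flags()` are all hypothetical names chosen for illustration.

```c
#include <assert.h>

/* Hypothetical stand-ins; none of these are real kernel or SPL symbols. */
#define CTX_FSTRANS   0x1u  /* models PF_FSTRANS set on the current task */
#define ALLOC_WAIT    0x2u  /* the allocation may block */
#define ALLOC_PAGEOUT 0x4u  /* reclaim may write pages back out */

static unsigned int ctx_flags;  /* models the per-task flag word */

/* Mark the context and return its previous state so marks can nest. */
static unsigned int fstrans_mark(void)
{
    unsigned int saved = ctx_flags & CTX_FSTRANS;
    ctx_flags |= CTX_FSTRANS;
    return saved;
}

/* Restore the state returned by the matching fstrans_mark(). */
static void fstrans_unmark(unsigned int saved)
{
    assert(ctx_flags & CTX_FSTRANS);
    ctx_flags = (ctx_flags & ~CTX_FSTRANS) | saved;
}

/* While the context is marked, strip the flag that permits pageout. */
static unsigned int effective_alloc_flags(unsigned int requested)
{
    if (ctx_flags & CTX_FSTRANS)
        return requested & ~ALLOC_PAGEOUT;
    return requested;
}
```

Returning the previous state from the mark call lets an inner mark/unmark pair leave an outer mark intact, so nested transaction contexts stay protected until the outermost unmark.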
The initial port of ZFS to Linux required a way to identify virtual memory to make IO to virtual memory backed slabs work, so kmem_virt() was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is logically equivalent to kmem_virt(). Support for kernels before 2.6.26 was later dropped and, more recently, support for kernels before Linux 2.6.32 has been dropped. We retire kmem_virt() in favor of is_vmalloc_addr() to clean up the code. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress the large allocation warning have been replaced by vmem_alloc() as appropriate. The updated vmem_alloc() call will not print a warning regardless of the size of the allocation. A careful reader will notice that not all callers have been changed to vmem_alloc(). Some have only had the KM_NODEBUG flag removed. This was possible because the default warning threshold has been increased to 32k. This is desirable because it minimizes the need for Linux-specific code changes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
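The difference between the two allocation paths can be illustrated with a small user-space sketch. It is not the SPL code: `sketch_kmem_alloc()`, `sketch_vmem_alloc()`, `large_alloc_warnings`, and the message text are hypothetical, and treating the 32k threshold as "strictly greater than" is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* The 32k default mentioned above; the strict comparison is an assumption. */
#define LARGE_ALLOC_WARN_THRESHOLD (32 * 1024)

static unsigned int large_alloc_warnings;  /* counter kept for illustration */

/* Hypothetical kmem-style allocator: warns when a request is large. */
static void *sketch_kmem_alloc(size_t size)
{
    assert(size > 0);  /* zero-size requests are a caller bug in this sketch */
    if (size > LARGE_ALLOC_WARN_THRESHOLD) {
        large_alloc_warnings++;
        fprintf(stderr, "sketch: large kmem-style allocation of %zu bytes\n",
            size);
    }
    return malloc(size);
}

/* Hypothetical vmem-style allocator: never warns, whatever the size. */
static void *sketch_vmem_alloc(size_t size)
{
    assert(size > 0);
    return malloc(size);
}
```

Under this model a caller whose requests stay at or below the threshold needs no special flag at all, which is why some callers could simply drop KM_NODEBUG instead of switching allocators.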
By marking DMU transaction processing contexts with PF_FSTRANS we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings us back in line with upstream. In some cases this means simply swapping the flags back. In others, fnvlist_alloc() was replaced by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to fnvlist_alloc(), which assumes KM_SLEEP. The one place KM_PUSHPAGE is kept is when allocating ARC buffers, which allows us to dip into reserved memory. This is again the same as upstream.
As part of the spl kmem/vmem refactoring the kmem_cache_* functions were split into their own kmem_cache.h header. This was done in part so that kmem_* consumers would not be forced to include the kmem_cache_* functions, which mask several Linux SLAB/SLUB functions. Because of this we now must explicitly include kmem_cache.h in zfs_context.h. However, consumers such as Lustre which need access to the KM_FLAGS but not the kmem_cache_* functions can now safely just include kmem.h. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit 86dd0fd added preallocated I/O buffers. This is no longer required after the recent kmem changes designed to make our memory allocation interfaces behave more like those found on Illumos. A deadlock in this situation is no longer possible. However, these allocations still have the potential to be expensive. So a potential future optimization might be to perform them KM_NOSLEEP so that they either succeed or fail quickly. Either case is acceptable here because we can safely abort the aggregation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
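The fail-fast idea suggested above might look roughly like the following user-space sketch. Everything here is hypothetical and for illustration only: `alloc_nosleep()`, `try_aggregate()`, `enum agg_result`, and the `budget` parameter (which stands in for "memory that is available right now") are not taken from the ZFS source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical outcome of an attempt to aggregate several small I/Os. */
enum agg_result { AGG_DONE, AGG_ABORTED };

/*
 * Hypothetical non-sleeping allocator: it fails immediately instead of
 * waiting. 'budget' models how much memory is available right now.
 */
static void *alloc_nosleep(size_t size, size_t budget)
{
    if (size == 0 || size > budget)
        return NULL;
    return malloc(size);
}

/*
 * Try to get one buffer for the aggregated I/O. On failure nothing is
 * held and the caller can issue the original I/Os one by one.
 */
static enum agg_result try_aggregate(size_t agg_size, size_t budget,
    void **bufp)
{
    assert(bufp != NULL);
    *bufp = alloc_nosleep(agg_size, budget);
    return (*bufp != NULL) ? AGG_DONE : AGG_ABORTED;
}
```

Because aggregation is only an optimization, the aborted path needs no retry loop; it just falls back to the non-aggregated behavior.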
1299945 to ccbc620
@sempervictus Thanks for the feedback, I appreciate it. Can I bother you to please test this patch stack again? I've just refreshed it and it's getting very close to a form in which it can be merged. I've subjected the refreshed patch stack to a range of torture tests, including using a zvol as a swap device, and it has held up very well. However, I'd like to see some more real-world usage and evidence that this makes things better. Getting this in will lay a chunk of the groundwork for merging #2129 or something very similar to it. At a high level what I'd expect from this change is the following:
The SA spill_cache was originally introduced to avoid the need to perform large kmem or vmem allocations. Instead a small dedicated cache of preallocated SA buffers was kept. This solution was viable while the maximum block size was limited to 128K. But with the planned increase of the maximum block size to 16M, callers need to migrate to zio_buf_alloc(). However, they should be aware this interface is expected to change again once the zio buffers are fully backed by scatter-gather lists. Alternatively, if the callers know these buffers will never be large or will be infrequently accessed, they may kmem_alloc() or vmem_alloc() the needed temporary space. This change has the additional benefit of bringing the code back in line with the upstream Illumos source. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
ccbc620 to 73e438c
This change and its openzfs/spl#414 counterpart have passed an extensive amount of local testing. My intention is to continue letting my internal stress and correctness testing run until Monday. If nothing new is turned up I'll merge these changes. It would be great to get any additional feedback and patch reviews before then.
Hi, I gave this patch a try and got the following stack traces. The kernel is vanilla 3.18.1. [80237.642028] Large kmem_alloc(65792, 0x0), please file an issue at:
I also got this slightly different trace: [79995.452118] https://github.com/zfsonlinux/zfs/issues/new
@edillmann thank you. I'll fix these two harmless warnings when this is merged.
This has been merged as: 6e9710f Merge branch 'kmem-rework'
Due to some long overdue memory management cleanup in the ZoL kmem implementation the definition of KM_SLEEP has changed. This change was expected to be transparent to consumers but it causes issues for Lustre because it explicitly redefines KM_SLEEP. This was originally done to avoid overriding the Linux slab interfaces. This change implements a more portable fix. Instead of preventing the inclusion of the kmem.h header by setting the guard, the kmem_cache_* preprocessor macros are explicitly undefined to make the Linux interface available. The related ZoL pull requests are as follows: openzfs/spl#414 openzfs/zfs#2918 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Change-Id: Id1d19d7b4c440808b8b3fd042f687b10c1b869f3 Reviewed-on: http://review.whamcloud.com/13096 Tested-by: Jenkins Tested-by: Maloo <hpdd-maloo@intel.com> Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Reviewed-by: Isaac Huang <he.huang@intel.com> Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com> Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>
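The undefine-the-macro technique described in this commit message can be shown with a tiny self-contained sketch. The names are hypothetical stand-ins: `demo_cache_alloc()` plays the role of a native function, `compat_cache_alloc()` the role of a compatibility implementation, and the `#define` models what a compatibility header might do. None of this is the actual SPL or Lustre code.

```c
#include <assert.h>

/* Stand-in for the native interface. */
static int demo_cache_alloc(int x)
{
    return x + 1;
}

/* Stand-in for the compatibility layer's implementation. */
static int compat_cache_alloc(int x)
{
    return x + 100;
}

/* What a compatibility header might do: redirect the native name. */
#define demo_cache_alloc(x) compat_cache_alloc(x)

/* Compiled while the macro is active, so this calls the compat version. */
static int via_compat(int x)
{
    return demo_cache_alloc(x);
}

/* A consumer that wants the native interface undefines the macro. */
#undef demo_cache_alloc

/* Compiled after the #undef, so this calls the native function. */
static int via_native(int x)
{
    assert(x >= 0);
    return demo_cache_alloc(x);
}
```

Compared with pre-setting an include guard, this keeps the rest of the header (such as flag definitions) visible to the consumer and only gives back the names it needs.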
This pull request is not complete and should ONLY be used on a test system. I'm opening this pull request to get feedback on the general approach. It builds on the work @ryao has done in #2796. It has only been lightly tested under CentOS 7 and I expect there may still be build failures for other distributions.
@ryao I'd appreciate your feedback on this. This is largely your previous patch stack with a few extra patches to return KM_NODEBUG and convert KM_PUSHPAGE -> KM_SLEEP. It still requires extensive testing.
Depends on openzfs/spl#414