Fix relocation overflows by implementing preallocation in the memory manager #1009
Conversation
Copied verbatim from llvm/llvm-project@f28c006a5895, files:

```
llvm/include/llvm/ExecutionEngine/SectionMemoryManager.h
llvm/lib/ExecutionEngine/SectionMemoryManager.cpp
```
This makes them compliant with our C++ style check.
Notes on the changes:

- The memory manager is added to the build system.
- The `LlvmliteMemoryManager` class is exported as a public interface.
- When creating an execution engine, we set it to use our memory manager.
Force-pushed cd0d357 to a7ae8c4.
I hit an assertion in the Numba test suite on an M2 system. Looking into which test caused this now.
To reproduce:
Looks like we somehow don't quite reserve enough space for code mem.
ffi/memorymanager.cpp (outdated)

```cpp
// Use the same calculation as allocateSection because we need to be able to
// satisfy it.
uintptr_t RequiredSize =
    Alignment * ((Size + Alignment - 1) / Alignment + 1);
```
This uses the same oversizing that's used in allocateSection (which is done to let it align the address as well), but it can only do it on the full reservation. I was hoping the caller would ensure their calculation had sufficient buffer for multiple calls to allocateDataSection/allocateCodeSection, but it looks like there may be circumstances where that's not true.
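As a sketch of that oversizing (Python, for illustration): the formula rounds the size up to a multiple of the alignment and then adds one extra alignment unit, so the block's start address can itself be aligned within it.

```python
def required_size(size: int, alignment: int) -> int:
    # Mirror of the C++ expression quoted above:
    #   Alignment * ((Size + Alignment - 1) / Alignment + 1)
    # Round up to a multiple of the alignment, then add one extra
    # alignment unit so the start address can be aligned inside the block.
    return alignment * ((size + alignment - 1) // alignment + 1)

# The case discussed later in this thread: a 16379-byte request with
# alignment 8 grows to 16392 bytes.
print(required_size(16379, 8))  # → 16392
```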
Ah, thanks for looking at this PR! Tracing through things a bit I just noticed https://github.com/numba/llvmlite/pull/1009/files#r1394577484
I think the reason for the discrepancy is that the reservation is made with the code section's alignment (4), but the actual allocation is made with the max of the code section's alignment (4) and the stub alignment (which is 8 on AArch64 on macOS, and maybe other platforms). So I think the code segment preallocation needs to be aligned to max(code section alignment, stub alignment) too.
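A minimal sketch of that proposed fix (hypothetical helper name; the stub alignment value is platform-dependent, 8 on AArch64 macOS per the comment above):

```python
def reservation_alignment(code_section_align: int, stub_align: int) -> int:
    # Sketch of the fix described above: the preallocation must use the
    # same alignment that the later allocation will use, i.e. the max of
    # the code section's alignment and the stub alignment.
    return max(code_section_align, stub_align)

# With a code section alignment of 4 and a stub alignment of 8 (AArch64
# on macOS), the reservation should be made with alignment 8.
print(reservation_alignment(4, 8))  # → 8
```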
```cpp
       "Alignment must be a power of two.");

uintptr_t RequiredSize =
    Alignment * ((Size + Alignment - 1) / Alignment + 1);
```
This computation is pushing a request for 16379 bytes with alignment 8 up to a size of 16392 bytes, which is larger than the 16384 bytes reserved, in the case where the assertion is being hit in the tests on M2.
A couple of other things I just noticed:

- For `reserveAllocationSpace()`, the requested code size is 16380 bytes with an align of 4
- When the `allocateDataSection()` call is made for code, the request is for 16379 bytes with an align of 8

This discrepancy leads to the request being just slightly over what was reserved for code - I think the next step is to look into why the alignment and requested sizes differ.
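The arithmetic behind this can be checked directly (Python sketch; `required_size` mirrors the C++ formula quoted above):

```python
def required_size(size, alignment):
    # Same over-allocation formula as allocateSection.
    return alignment * ((size + alignment - 1) // alignment + 1)

PAGE = 16384  # 16K pages on Apple Silicon

# Reservation: 16380 bytes requested with alignment 4 -> exactly one page.
reserved = required_size(16380, 4)
assert reserved == PAGE

# Allocation: 16379 bytes requested with alignment 8 -> 16392 bytes,
# which no longer fits in the 16384-byte reservation.
needed = required_size(16379, 8)
assert needed == 16392 and needed > reserved
```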
Output from the extra debugging prints I just pushed:

```
Code size / align: 0x3FFC / 4
ROData size / align: 0x0 / 1
RWData size / align: 0x2360 / 16
Reserving 0xC000 bytes
Code mem starts at 0x0000000132DDC000, size 0x4000
Rwdata mem starts at 0x0000000132DE0000, size 0x4000
Requested size / alignment: 0x3FFB / 8
Allocating 0x4008 bytes for CodeMem at
Assertion failed: (false && "All memory must be pre-allocated"), function allocateSection, file memorymanager.cpp, line 109.
```
61ae2b0 appears to resolve this issue and allow the test to run to completion.
```cpp
//
// This file implements the section-based memory manager used by the MCJIT
// execution engine and RuntimeDyld
//
```
Nit: maybe some more rationale here why we are switching to this.
Yes, agreed.
```cpp
StringRef SectionName) override;

/// Allocates a memory block of (at least) the given size suitable for
/// executable code.
```
Nit: executable code -> data.
(Looks like a copy-paste typo from the previous method.)
```cpp
// allocated due to page alignment, but if we have insufficient free memory
// for the request this can lead to allocating disparate memory that can
// violate the ARM ABI. Clear free memory so only the new allocations are
// used, but do not release allocated memory as it may still be in-use.
```
I had to read this a couple of times, but I think I am getting it now. Let me check my understanding/logic here; perhaps it can be used to make the problem description/solution a bit more crisp.

The objective is to allocate memory blocks that are "near" to each other. Keeping blocks near makes it less likely that the distance between different memory addresses becomes too large and, e.g., violates ARM ABI relocation restrictions. If a code/rodata/rwdata memory space has been allocated but not all of it is used (e.g. excess blocks that were allocated due to page alignment), then we mark all of that memory as being used by clearing the "free memory" in that space. This has the effect that the next allocation request is not going to try to scavenge free blocks from somewhere else, thus avoiding memory that is potentially "far away".
And further to my previous comment, freeing the free space is the crux of the solution. So I think we need to spend a little bit more time on discussing/documenting the alternatives and pros/cons. The alternative mentioned in the LLVM Discourse thread talks about potentially doing this in the `finalizeMemory()` method, but doing it here has the benefit of being less intrusive, at the cost of wasting some memory.
It wouldn't be too difficult to quantify the waste, I guess. In an experiment we could iterate over the free blocks and sum the sizes and print that for the numba test suite. Don't know if we are going to learn anything, but just an idea.
But given the simplicity of the approach, this definitely looks like the most appealing option. I am going to look a bit further into this though, to see what you mean by the "pending prefix indices" that you mentioned in the description and how that fits into the picture here.
Status update - with the commit 61ae2b0 I can get through the whole test suite (with the usually-skipped tests not skipped):

```diff
diff --git a/numba/tests/test_array_constants.py b/numba/tests/test_array_constants.py
index a33dacd49..386c1856b 100644
--- a/numba/tests/test_array_constants.py
+++ b/numba/tests/test_array_constants.py
@@ -141,7 +141,6 @@ class TestConstantArray(unittest.TestCase):
         out = cres.entry_point()
         self.assertEqual(out, 86)
 
-    @skip_m1_llvm_rtdyld_failure
     def test_too_big_to_freeze(self):
         """
         Test issue https://github.com/numba/numba/issues/2188 where freezing
diff --git a/numba/tests/test_stencils.py b/numba/tests/test_stencils.py
index 2a65c0370..1e2f8dc77 100644
--- a/numba/tests/test_stencils.py
+++ b/numba/tests/test_stencils.py
@@ -80,7 +80,6 @@ if not _32bit:  # prevent compilation on unsupported 32bit targets
         return a + 1
 
-@skip_m1_llvm_rtdyld_failure  # skip all stencil tests on m1
 class TestStencilBase(unittest.TestCase):
     _numba_parallel_test_ = False
```

resulting in:
I believe the failures are innocuous:
Still an issue on Linux AArch64, although this is maybe a latent bug in cleanup in Numba:
```cpp
// Look in the list of free memory regions and use a block there if one
// is available.
for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {
```
About:

> I don't understand how the memory allocation / mapping really works, in particular what pending memory is and pending prefix indices are

My understanding is that pending memory is memory that has been allocated but not yet "finalised".
I am not sure how important this prefix index is. It looks like a bit of bookkeeping to keep an index of the next free block.
Also my impression is that this whole loop is skipped because of the clearance on lines 137-139.
It turns out that this issue is unrelated to this PR - I need to raise a Numba issue shortly.
With the offending ctypes tests skipped, on Linux AArch64 the test results are quite similar to those on macOS:
So as far as I can tell, there are no outstanding issues with the implementation in this PR in its present form.
As a follow-up on the cause of those failures: they are all rooted in the warning produced about the llvmlite version not being recognized - not an actual issue.
The next thing on my to-do list is to align the changes in this branch / PR more closely with @MikaelSmith's changes in llvm/llvm-project#71968 and re-test - I think that the change for stub alignment to be taken into account might still be needed, but it would be best to check first.
The implementation of `reserveAllocationSpace()` now more closely follows that in llvm/llvm-project#71968, following some changes made there. The changes here include:

- Improved readability of debugging output
- Using a default alignment of 8 in `allocateSection()` to match the default alignment provided by the stub alignment during preallocation
- Replacing the "bespoke" `requiredPageSize()` function with computations using the LLVM `alignTo()` function
- Returning early from preallocation when no space is requested
- Reusing existing preallocations if there is enough space left over from the previous preallocation for all the required segments - this can happen quite frequently because allocations for each segment get rounded up to page sizes, which are usually either 4K or 16K, and many Numba-jitted functions require a lot less than this
- Removal of setting the near hints for memory blocks - this doesn't really have any use when all memory is preallocated and forced to be "near" to other memory
- Addition of extra asserts to validate alignment of allocated sections
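For reference, `llvm::alignTo(Value, Align)` rounds a value up to the next multiple of the alignment; a Python equivalent (for power-of-two alignments) looks like this:

```python
def align_to(value: int, align: int) -> int:
    # Equivalent of llvm::alignTo for power-of-two alignments:
    # round value up to the next multiple of align.
    assert align & (align - 1) == 0, "align must be a power of two"
    return (value + align - 1) & ~(align - 1)

# e.g. rounding section sizes up to a 16K page:
print(align_to(16379, 16384))  # → 16384
print(align_to(16385, 16384))  # → 32768
```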
Force-pushed 616a057 to 75b103c.
The default is to enable it on 64-bit ARM systems, since it solves the problem they encounter, and disable it elsewhere, to minimise the risk of an unintended side effect on platforms that don't need it. This can be overridden by manually specifying the value of `use_lmm` when creating the MCJIT compiler.
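The default-selection logic can be sketched as follows (illustrative only; `use_lmm` is the parameter name mentioned above, but this helper and its machine-name checks are assumptions):

```python
import platform

def default_use_lmm() -> bool:
    # Enable the llvmlite memory manager by default only on 64-bit ARM,
    # where it fixes the relocation overflows; leave it off elsewhere to
    # minimise the risk of side effects on platforms that don't need it.
    return platform.machine().lower() in ("arm64", "aarch64")

# A caller can still override this default by passing use_lmm explicitly
# when creating the MCJIT compiler.
```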
Force-pushed 3ee5574 to b673be6.
With my limited background knowledge on what's going on with the memory manager, I reviewed the C++ code based on whether I can understand it. The code is clear and well commented.
The license addition looks good.
Buildfarm has never been happier all thanks to this patch.
numba#9337 further unskipped more tests related to M1 RuntimeDyld issues, and I have run it on the farm. All M1 tests passed.
The PPC64 linker issue

The only outstanding problem is a compile failure on PPC64LE. On the Power machine in the buildfarm, both anaconda-distro and conda-forge packages are failing to link the memorymanager.o file with the error:

```
ld: /opt/conda/envs/cf/lib/libLLVMSupport.a(Error.cpp.o):(.data.rel.ro._ZTVN4llvm13ErrorInfoBaseE[_ZTVN4llvm13ErrorInfoBaseE]+0x40): undefined reference to `llvm::ErrorInfoBase::isA(void const*) const'
```
After some investigation and following the suggestion in https://support.xilinx.com/s/article/20068?language=en_US, I found that adding -mlongcall when compiling memorymanager.cpp fixes the problem. However, this "fix" may introduce some performance issues since it forces all jumps to be long jumps. Since GCC 9.5, the documentation on -mlongcall has this description:

> On PowerPC64 ELFv2 and 32-bit PowerPC systems with newer GNU linkers, GCC can generate long calls using an inline PLT call sequence (see -mpltseq). PowerPC with -mbss-plt and PowerPC64 ELFv1 (big-endian) do not support inline PLT calls.

This might be a case of the system linker being too old, or of newer GCC (>=9) being able to generate an alternative longcall sequence that avoids the issue.
We can fix this PPC problem in a separate PR, so it's not a blocker for this one. Adding -mlongcall is probably the easiest fix for now.
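As a sketch of that workaround (hypothetical build-script fragment, not llvmlite's actual build code), the flag would be added only for PowerPC targets:

```python
def extra_compile_args_for(machine: str) -> list:
    # Hypothetical: add -mlongcall only when compiling for PowerPC,
    # forcing long jumps so the out-of-range branch relocations that
    # break linking memorymanager.o do not occur. This may cost some
    # performance, since every call becomes a long call.
    if machine in ("ppc64", "ppc64le"):
        return ["-mlongcall"]
    return []

print(extra_compile_args_for("ppc64le"))  # → ['-mlongcall']
```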
Buildfarm has passed with the latest commit. This is ready for merge!!! Numba's buildfarm has never been happier as this PR will stop random failures on our arm64/aarch64 machines.
Thank you @gmarkall and everyone who reviewed this PR.
Nightly jobs have been [failing](https://github.com/rapidsai/cudf/actions/runs/7382855293/job/20083184931) with a numba segfault. This appears to be a longstanding issue with numba on aarch64, fixed by numba/llvmlite#1009. Technically, the issue exists already in our tests, but it appears that changes from numba 0.58 make the conditions for the issue much more likely to occur, hence the failures after removing the numba 0.58 version constraint recently. The issue should be fixed in numba 0.59. For now, however, we should skip things so that nightlies can be fixed.

Authors:
- https://github.com/brandon-b-miller

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: #14702
This implements a memory manager based on the MCJIT `SectionMemoryManager`, with a preallocation strategy that ensures all segments of an object are placed within a single block of mapped memory. This is intended to resolve the relocation overflow issues on AArch64 (numba/numba#8567, numba/numba#9001), which occur when the GOT segment is far from the code segment.

The changes are based on those by @MikaelSmith in llvm/llvm-project#71968 and his code in https://github.com/MikaelSmith/impala/blob/ac8561b6b69530f9fa2ff2ae65ec7415aa4395c6/be/src/codegen/mcjit-mem-mgr.cc - there is additional discussion / background in the LLVM Discourse thread and on the aforementioned Numba issues.
I believe this is now ready for some review - notes to reviewers: the `SectionMemoryManager` code was copied "as-standard" into llvmlite.

cc @sjoerdmeijer for review.