Fix relocation overflows by implementing preallocation in the memory manager #1009

Merged
merged 8 commits into from
Dec 7, 2023

@gmarkall gmarkall commented Nov 14, 2023

This implements a memory manager based on the MCJIT SectionMemoryManager, with a preallocation strategy that ensures all segments of an object are placed within a single block of mapped memory. This is intended to resolve the relocation overflow issues on AArch64 (numba/numba#8567, numba/numba#9001), which occur when the GOT segment is far from the code segment.
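For illustration, the range constraint behind a relocation overflow can be sketched as below. This assumes the failing relocation is page-relative with a signed 21-bit page immediate (the reach of an AArch64 ADRP instruction, roughly +/- 4 GiB); the function name and the addresses are made up for the example and are not taken from llvmlite or LLVM.

```python
# Assumption: the relocation encodes a signed 21-bit count of 4 KiB pages,
# as an AArch64 ADRP instruction does. Names and addresses are hypothetical.
PAGE = 0x1000


def page_delta_fits(code_addr, got_addr, imm_bits=21):
    """Return True if the page distance from code to GOT fits the immediate."""
    delta_pages = (got_addr // PAGE) - (code_addr // PAGE)
    lowest = -(1 << (imm_bits - 1))
    highest = (1 << (imm_bits - 1)) - 1
    return lowest <= delta_pages <= highest


code = 0x0000FFFFF7283000
near_got = code + PAGE          # GOT in the page right after the code
far_got = code - (5 << 30)      # GOT mapped 5 GiB below the code

print(page_delta_fits(code, near_got))
print(page_delta_fits(code, far_got))
```

Placing every segment of an object in one mapped block keeps the distance in the first case by construction, which is the point of the preallocation strategy.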

The changes are based on those by @MikaelSmith in llvm/llvm-project#71968 and his code in https://github.com/MikaelSmith/impala/blob/ac8561b6b69530f9fa2ff2ae65ec7415aa4395c6/be/src/codegen/mcjit-mem-mgr.cc - there is additional discussion / background in the LLVM Discourse thread and on the aforementioned Numba issues.

I believe this is now ready for some review - notes to reviewers:

  • Only the last commit is a substantial change; it adds the preallocation strategy. The others incorporate the SectionMemoryManager "as-standard" into llvmlite.
  • The changes here don't exactly match the ones in the PR to LLVM, but are substantially similar - as the review of that proceeds I expect to align this with the changes upstream as necessary / appropriate.
  • I don't understand how the memory allocation / mapping really works, in particular what pending memory is and pending prefix indices are - I had the idea to just clear the free memory vectors for each memory group which seemed to work (and was picked up in the LLVM PR) but I'm not sure this approach is 100% correct or could be made better.
  • Testing with this branch with the reproducer in https://github.com/gmarkall/numba-issue-9001 allows it to run apparently indefinitely on my Jetson AGX Xavier and Orin systems - previously it would crash at 10 or fewer iterations.
  • The memory manager is enabled for all platforms in this branch - this is really good for pipecleaning / exposing potential issues, but might not be what we want in production. For a final / ready PR, I'd expect to always build the memory manager and make it available, but enable it by default only on AArch64 systems.

cc @sjoerdmeijer for review.

Copied verbatim from llvm/llvm-project@f28c006a5895, files:

```
llvm/include/llvm/ExecutionEngine/SectionMemoryManager.h
llvm/lib/ExecutionEngine/SectionMemoryManager.cpp
```
This makes them compliant with our C++ style check.
Notes on the changes:

- The memory manager is added to the build system.
- The `LlvmliteMemoryManager` class is exported as a public interface.
- When creating an execution engine, we set it to use our memory
  manager.
@gmarkall gmarkall force-pushed the aarch64memorymanager branch from cd0d357 to a7ae8c4 Compare November 15, 2023 14:32
@gmarkall gmarkall added this to the v0.42.0-rc1 milestone Nov 15, 2023
@gmarkall gmarkall changed the title [WIP] AArch64 memory manager Fix relocation overflows by implementing preallocation in the memory manager Nov 15, 2023
@gmarkall
Member Author

I hit an assertion in the Numba test suite on an M2 system:

Assertion failed: (false && "All memory must be pre-allocated"), function allocateSection, file memorymanager.cpp, line 107.
Fatal Python error: Aborted

Looking into which test caused this now.

@gmarkall
Member Author

To reproduce:

python runtests.py numba.tests.test_array_reductions.TestArrayReductions.test_nanquantile_basic

@gmarkall
Member Author

Looks like we somehow don't quite reserve enough space for code mem - with the -debug-only=llvmlite-memory-manager flag set, I see:

Reserving 0xC000 bytes
Code mem starts at 0x0000000129BD0000, size 0x4000
Rwdata mem starts at 0x0x0000000129BD4000, size 0x4000
Allocating 0x4008 bytes for CodeMem at Assertion failed: (false && "All memory must be pre-allocated"), function allocateSection, file memorymanager.cpp, line 107.
Fatal Python error: Aborted

// Use the same calculation as allocateSection because we need to be able to
// satisfy it.
uintptr_t RequiredSize =
Alignment * ((Size + Alignment - 1) / Alignment + 1);

This uses the same oversizing that's used in allocateSection (which is done to let it align the address as well), but it can only do it on the full reservation. I was hoping the caller would ensure their calculation had sufficient buffer for multiple calls to allocateDataSection/allocateCodeSection, but it looks like there may be circumstances where that's not true.

Member Author

Ah, thanks for looking at this PR! Tracing through things a bit I just noticed https://github.com/numba/llvmlite/pull/1009/files#r1394577484

Member Author

I think the reason for the discrepancy is that the reservation is made with the alignment of the code section's alignment (4), but the actual allocation is made with the max of the alignment of the code section (4) and the stub alignment (which is 8 on AArch64 on macOS, and maybe other platforms). So I think the code segment preallocation needs to be aligned to max(code section alignment, stub alignment) too.
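A rough sketch of that idea follows. It reuses the oversizing expression quoted in this thread and assumes reservations are rounded up to 16 KiB pages (consistent with the 0x4000 segment sizes in the debug output); the helper names are hypothetical and this is not the code that was eventually committed.

```python
PAGE_SIZE = 0x4000  # assumption: 16 KiB pages, as on the M2 system in the log


def oversized(size, alignment):
    """The oversizing expression from allocateSection, with integer division."""
    return alignment * ((size + alignment - 1) // alignment + 1)


def round_up_to_page(size, page_size=PAGE_SIZE):
    """Round a byte count up to a whole number of pages."""
    return page_size * ((size + page_size - 1) // page_size)


def code_reservation(code_size, code_align, stub_align):
    """Reserve code space using the larger of the two alignments."""
    align = max(code_align, stub_align)
    return round_up_to_page(oversized(code_size, align))


# Ignoring the stub alignment: one 16 KiB page is reserved.
print(hex(code_reservation(0x3FFC, 4, 4)))
# Taking the stub alignment (8) into account: two pages are reserved.
print(hex(code_reservation(0x3FFC, 4, 8)))
# The later request of 0x3FFB bytes at alignment 8 needs this much:
print(hex(oversized(0x3FFB, 8)))
```

Under these assumptions the first reservation is too small for the 0x4008-byte request seen in the log, while the second one has room for it.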

"Alignment must be a power of two.");

uintptr_t RequiredSize =
Alignment * ((Size + Alignment - 1) / Alignment + 1);
Member Author

This computation is pushing a request for 16379 bytes with alignment 8 up to a size of 16392 bytes, which is larger than the 16384 bytes reserved, in the case where the assertion is being hit in the tests on M2.
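The arithmetic can be checked with a few lines of Python that mirror the C++ expression above (the function and variable names here are only for illustration):

```python
def required_size(size, alignment):
    # Same expression as the C++ snippet; // is integer division.
    return alignment * ((size + alignment - 1) // alignment + 1)


requested = 16379  # 0x3FFB, the size passed to the allocation call
reserved = 16384   # 0x4000, the space reserved for code

needed = required_size(requested, 8)
# (16379 + 7) // 8 = 2048, plus 1 gives 2049, times 8 gives 16392.
print(needed, hex(needed), needed > reserved)
```

This prints `16392 0x4008 True`: the request ends up 8 bytes larger than the reservation, which is what trips the assertion.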

Member Author

A couple of other things I just noticed:

  • For reserveAllocationSpace(), the requested code size is 16380 bytes with an align of 4
  • When the allocateDataSection() call is made for code, the request is for 16379 bytes with an align of 8

This discrepancy leads to the request being just slightly over what was reserved for code - I think the next step is to look into why the alignment and requested sizes differ.

Member Author

Output from the extra debugging prints I just pushed:

Code size / align: 0x3FFC / 4
ROData size / align: 0x0 / 1
RWData size / align: 0x2360 / 16
Reserving 0xC000 bytes
Code mem starts at 0x0000000132DDC000, size 0x4000
Rwdata mem starts at 0x0x0000000132DE0000, size 0x4000
Requested size / alignment: 0x3FFB / 8
Allocating 0x4008 bytes for CodeMem at Assertion failed: (false && "All memory must be pre-allocated"), function allocateSection, file memorymanager.cpp, line 109.

Member Author

61ae2b0 appears to resolve this issue and allow the test to run to completion.

//
// This file implements the section-based memory manager used by the MCJIT
// execution engine and RuntimeDyld
//

Nit: maybe some more rationale here why we are switching to this.

Member Author

Yes, agreed.

StringRef SectionName) override;

/// Allocates a memory block of (at least) the given size suitable for
/// executable code.

Nit: executable code -> data.
(looks like a copy-paste typo from the previous method)

// allocated due to page alignment, but if we have insufficient free memory
// for the request this can lead to allocating disparate memory that can
// violate the ARM ABI. Clear free memory so only the new allocations are
// used, but do not release allocated memory as it may still be in-use.

I had to read this a couple of times, but I think I am getting it now. Let me check my understanding/logic here; perhaps it can be used to make the problem description / solution a bit more crisp.

The objective is to allocate memory (blocks) that are "near" to each other. Keeping blocks near makes it less likely that the distance between different memory addresses becomes too large and, for example, violates ARM ABI relocation restrictions. If a code/rodata/rwdata memory space has been allocated but not all of it is used (e.g. excess blocks that were allocated due to page alignment), then we mark all memory in that space as used by clearing its "free memory" here. As a result, the next allocation request will not try to scavenge free blocks from somewhere else, and so cannot pick up memory that is potentially "far away".

And further to my previous comment, freeing the free space is the crux of the solution. So I think we need to spend a little bit more time on discussing/documenting the alternatives and pros/cons. The alternative mentioned in the LLVM discourse thread talks about potentially doing this in the finalizeMemory() method, but doing it here has the benefit of being less intrusive, at the cost of wasting some memory.

It wouldn't be too difficult to quantify the waste, I guess. In an experiment we could iterate over the free blocks and sum the sizes and print that for the numba test suite. Don't know if we are going to learn anything, but just an idea.

But given the simplicity of the approach, this definitely looks like the most appealing. I am going to look a bit further into this though, to see what you mean by the "pending prefix indices" that you mentioned in the description and how they fit into the picture here.
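As a toy model of the idea discussed here, the following sketch shows a per-segment group whose leftover free blocks are discarded, and how the resulting waste could be summed up. The class and function names are hypothetical stand-ins and do not correspond to the C++ types in the memory manager.

```python
from dataclasses import dataclass, field


@dataclass
class FreeBlock:
    address: int
    size: int


@dataclass
class MemoryGroup:
    # Hypothetical stand-in for one segment's group (code, rodata or rwdata).
    free_mem: list = field(default_factory=list)


def wasted_bytes(group):
    """Sum the sizes of the free blocks that would be given up."""
    return sum(block.size for block in group.free_mem)


def discard_free_memory(group):
    """Forget leftover free blocks so later requests cannot reuse them.

    The mapped memory itself is not released, only the bookkeeping.
    Returns the number of bytes that were written off.
    """
    wasted = wasted_bytes(group)
    group.free_mem.clear()
    return wasted


# Made-up example: a 0x1000-byte page with 0x12C bytes in use.
example = MemoryGroup([FreeBlock(address=0x1000012C, size=0xED4)])
print(hex(wasted_bytes(example)))
```

Logging the value returned by `discard_free_memory` at each reservation would be one way to run the experiment suggested above.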

@gmarkall
Member Author

Status update - with the commit 61ae2b0 I can get through the whole test suite (with the usually-skipped tests not skipped):

diff --git a/numba/tests/test_array_constants.py b/numba/tests/test_array_constants.py
index a33dacd49..386c1856b 100644
--- a/numba/tests/test_array_constants.py
+++ b/numba/tests/test_array_constants.py
@@ -141,7 +141,6 @@ class TestConstantArray(unittest.TestCase):
         out = cres.entry_point()
         self.assertEqual(out, 86)
 
-    @skip_m1_llvm_rtdyld_failure
     def test_too_big_to_freeze(self):
         """
         Test issue https://github.com/numba/numba/issues/2188 where freezing
diff --git a/numba/tests/test_stencils.py b/numba/tests/test_stencils.py
index 2a65c0370..1e2f8dc77 100644
--- a/numba/tests/test_stencils.py
+++ b/numba/tests/test_stencils.py
@@ -80,7 +80,6 @@ if not _32bit: # prevent compilation on unsupported 32bit targets
         return a + 1
 
 
-@skip_m1_llvm_rtdyld_failure   # skip all stencil tests on m1
 class TestStencilBase(unittest.TestCase):
 
     _numba_parallel_test_ = False

resulting in:

======================================================================
FAIL: test_no_accidental_warnings (numba.tests.test_import.TestNumbaImport)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gmarkall/work/numbadev/numba/numba/tests/test_import.py", line 103, in test_no_accidental_warnings
    run_in_subprocess(code, flags)
  File "/Users/gmarkall/work/numbadev/numba/numba/tests/support.py", line 1121, in run_in_subprocess
    raise AssertionError(msg % (popen.returncode, err.decode()))
AssertionError: process failed with code 1: stderr follows
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/gmarkall/work/numbadev/numba/numba/__init__.py", line 230, in <module>
    _ensure_llvm()
  File "/Users/gmarkall/work/numbadev/numba/numba/__init__.py", line 169, in _ensure_llvm
    warnings.warn("llvmlite version format not recognized!")
UserWarning: llvmlite version format not recognized!



======================================================================
FAIL: test_unsafe_import_in_registry (numba.tests.test_np_functions.TestRegistryImports)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gmarkall/work/numbadev/numba/numba/tests/test_np_functions.py", line 6172, in test_unsafe_import_in_registry
    self.assertEquals(b"", error.strip())
AssertionError: b'' != b'/Users/gmarkall/work/numbadev/numba/numba[126 chars]d!")'

======================================================================
FAIL: test_repr_long_list_ipython (numba.tests.test_typedlist.TestTypedList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gmarkall/work/numbadev/numba/numba/tests/test_typedlist.py", line 563, in test_repr_long_list_ipython
    self.assertEqual(expected, err)
AssertionError: 'ListType[int64]([0, 1, 2, 3, 4, 5, 6, 7, [4867 chars]..])' != '/Users/gmarkall/work/numbadev/numba/numba[5040 chars]..])'
Diff is 10176 characters long. Set self.maxDiff to None to see it.

======================================================================
FAIL: test_repr_long_list_ipython (numba.tests.test_typedlist.TestTypedList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/gmarkall/work/numbadev/numba/numba/tests/support.py", line 909, in tearDown
    self.memory_leak_teardown()
  File "/Users/gmarkall/work/numbadev/numba/numba/tests/support.py", line 884, in memory_leak_teardown
    self.assert_no_memory_leak()
  File "/Users/gmarkall/work/numbadev/numba/numba/tests/support.py", line 893, in assert_no_memory_leak
    self.assertEqual(total_alloc, total_free)
AssertionError: 2 != 1

----------------------------------------------------------------------
Ran 10387 tests in 1000.715s

FAILED (failures=4, skipped=639, expected failures=13)

I believe the failures are innocuous:

  • One because the LLVM version string is not as expected (probably due to slight edits to the llvmlite examples I had in my tree)
  • One is because the memory leak test gives a spurious fail when a related test fails
  • The other two I need to look into, but I'm pretty sure they're due to some local weirdness.

@gmarkall
Member Author

Still an issue on Linux AArch64, although this is maybe a latent bug in cleanup in Numba:

$ gdb --args python runtests.py numba.tests.test_ctypes.TestCTypesUseCases.test_python_call_back -v -m
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
(No debugging symbols found in python)
(gdb) run
Starting program: /home/gmarkall/mambaforge/envs/numbadev/bin/python runtests.py numba.tests.test_ctypes.TestCTypesUseCases.test_python_call_back -v -m
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[Detaching after vfork from child process 40625]
[New Thread 0xffffee7901e0 (LWP 40626)]
[New Thread 0xffffedf8f1e0 (LWP 40627)]
[New Thread 0xffffeb78e1e0 (LWP 40628)]
[New Thread 0xffffe8f8d1e0 (LWP 40629)]
[New Thread 0xffffe478c1e0 (LWP 40630)]
[New Thread 0xffffe3f8b1e0 (LWP 40631)]
[New Thread 0xffffdf78a1e0 (LWP 40632)]
[New Thread 0xffffdef891e0 (LWP 40633)]
[New Thread 0xffffda7881e0 (LWP 40634)]
[New Thread 0xffffd7f871e0 (LWP 40635)]
[New Thread 0xffffd57861e0 (LWP 40636)]
[Detaching after vfork from child process 40637]
[Detaching after vfork from child process 40638]
[Detaching after vfork from child process 40639]
[Detaching after vfork from child process 40640]
[Detaching after vfork from child process 40641]
/home/gmarkall/numbadev/numba/numba/__init__.py:169: UserWarning: llvmlite version format not recognized!
  warnings.warn("llvmlite version format not recognized!")
Parallel: 1. Serial: 0
[Thread 0xffffdef891e0 (LWP 40633) exited]
[Thread 0xffffda7881e0 (LWP 40634) exited]
[Thread 0xffffe3f8b1e0 (LWP 40631) exited]
[Thread 0xffffedf8f1e0 (LWP 40627) exited]
[Thread 0xffffee7901e0 (LWP 40626) exited]
[Thread 0xffffdf78a1e0 (LWP 40632) exited]
[Thread 0xffffd57861e0 (LWP 40636) exited]
[Thread 0xffffd7f871e0 (LWP 40635) exited]
[Thread 0xffffe478c1e0 (LWP 40630) exited]
[Thread 0xffffe8f8d1e0 (LWP 40629) exited]
[Thread 0xffffeb78e1e0 (LWP 40628) exited]
[Detaching after fork from child process 40642]
[Detaching after fork from child process 40643]
[Detaching after fork from child process 40644]
[Detaching after fork from child process 40645]
[Detaching after fork from child process 40646]
[Detaching after fork from child process 40647]
[Detaching after fork from child process 40648]
[Detaching after fork from child process 40649]
[Detaching after fork from child process 40650]
[Detaching after fork from child process 40651]
[Detaching after fork from child process 40652]
[Detaching after fork from child process 40653]
[New Thread 0xffffd57861e0 (LWP 40654)]
[New Thread 0xffffd7f871e0 (LWP 40655)]
[New Thread 0xffffda7881e0 (LWP 40656)]
Code size / align: 0x4 / 4
ROData size / align: 0x0 / 1
RWData size / align: 0x0 / 1
Reserving 0x3000 bytes
Code mem starts at 0x0000FFFFF7286000, size 0x1000
Code size / align: 0x128 / 4
ROData size / align: 0x130 / 16
RWData size / align: 0x0 / 1
Reserving 0x3000 bytes
Code mem starts at 0x0000FFFFF7283000, size 0x1000
Rodata mem starts at 0x0x0000FFFFF7284000, size 0x1000
Requested size / alignment: 0x128 / 4
Allocating 0x12C bytes for CodeMem at 0x0000FFFFF7283000
Requested size / alignment: 0xE0 / 16
Allocating 0xF0 bytes for RODataMem at 0x0000FFFFF7284000
Requested size / alignment: 0x48 / 8
Allocating 0x50 bytes for RODataMem at 0x0000FFFFF72840E0
Code size / align: 0xD6C / 4
ROData size / align: 0x5B0 / 16
RWData size / align: 0xB0 / 8
Reserving 0x3000 bytes
Code mem starts at 0x0000FFFFEDC8E000, size 0x1000
Rodata mem starts at 0x0x0000FFFFEDC8F000, size 0x1000
Rwdata mem starts at 0x0x0000FFFFEDC90000, size 0x1000
Requested size / alignment: 0xD6C / 4
Allocating 0xD70 bytes for CodeMem at 0x0000FFFFEDC8E000
Requested size / alignment: 0x48D / 16
Allocating 0x4A0 bytes for RODataMem at 0x0000FFFFEDC8F000
Requested size / alignment: 0x8 / 8
Allocating 0x10 bytes for RWDataMem at 0x0000FFFFEDC90000
Requested size / alignment: 0x114 / 8
Allocating 0x120 bytes for RODataMem at 0x0000FFFFEDC8F490
Requested size / alignment: 0x8 / 8
Allocating 0x10 bytes for RWDataMem at 0x0000FFFFEDC90008
Requested size / alignment: 0x30 / 8
Allocating 0x38 bytes for RWDataMem at 0x0000FFFFEDC90010
test_python_call_back (numba.tests.test_ctypes.TestCTypesUseCases) ... ok
[Thread 0xffffd57861e0 (LWP 40654) exited]
[Thread 0xffffda7881e0 (LWP 40656) exited]
[Thread 0xffffd7f871e0 (LWP 40655) exited]

----------------------------------------------------------------------
Ran 1 test in 0.645s

OK

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000fffff73efb48 in dlfree () from /home/gmarkall/mambaforge/envs/numbadev/lib/python3.10/lib-dynload/../../libffi.so.8
(gdb) bt
#0  0x0000fffff73efb48 in dlfree () from /home/gmarkall/mambaforge/envs/numbadev/lib/python3.10/lib-dynload/../../libffi.so.8
#1  0x0000fffff7417768 in CThunkObject_dealloc ()
   from /home/gmarkall/mambaforge/envs/numbadev/lib/python3.10/lib-dynload/_ctypes.cpython-310-aarch64-linux-gnu.so
#2  0x0000aaaaaab3aa88 in free_keys_object ()
#3  0x0000aaaaaab3b398 in dict_dealloc ()
#4  0x0000fffff741060c in PyCFuncPtr_clear ()
   from /home/gmarkall/mambaforge/envs/numbadev/lib/python3.10/lib-dynload/_ctypes.cpython-310-aarch64-linux-gnu.so
#5  0x0000fffff74106c4 in PyCFuncPtr_dealloc ()
   from /home/gmarkall/mambaforge/envs/numbadev/lib/python3.10/lib-dynload/_ctypes.cpython-310-aarch64-linux-gnu.so
#6  0x0000aaaaaab63548 in subtype_dealloc ()
#7  0x0000aaaaaab3aa88 in free_keys_object ()
#8  0x0000aaaaaab3f0b4 in dict_tp_clear ()
#9  0x0000aaaaaac19af8 in gc_collect_main ()
#10 0x0000aaaaaac1a954 in _PyGC_CollectNoFail ()
#11 0x0000aaaaaabef3a0 in finalize_modules ()
#12 0x0000aaaaaabf25c4 in Py_FinalizeEx ()
#13 0x0000aaaaaabf35ec in Py_Exit ()
#14 0x0000aaaaaabf9058 in _PyErr_PrintEx ()
#15 0x0000aaaaaabf98e4 in _PyRun_SimpleFileObject ()
#16 0x0000aaaaaabf9bf0 in _PyRun_AnyFileObject ()
#17 0x0000aaaaaab0f888 in Py_RunMain ()
#18 0x0000aaaaaab0fec4 in Py_BytesMain ()
#19 0x0000fffff7d52e10 in __libc_start_main (main=0xaaaaaab04f90 <main>, argc=5, argv=0xfffffffff208, init=<optimised out>, fini=<optimised out>, 
    rtld_fini=<optimised out>, stack_end=<optimised out>) at ../csu/libc-start.c:308
#20 0x0000aaaaaab0e6b8 in _start ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 


// Look in the list of free memory regions and use a block there if one
// is available.
for (FreeMemBlock &FreeMB : MemGroup.FreeMem) {

About

I don't understand how the memory allocation / mapping really works, in particular what pending memory is and pending prefix indices are

My understanding is that pending memory is memory that has been allocated but not yet "finalised".
I am not sure how important this prefix index is. It looks like a bit of bookkeeping to keep an index of the next free block.

Also, my impression is that this whole loop is skipped because of the clearing on lines 137 - 139.

@gmarkall
Member Author

Still an issue on Linux AArch64, although this is maybe a latent bug in cleanup in Numba:

It turns out that this issue is unrelated to this PR - I need to raise a Numba issue shortly.

@gmarkall
Member Author

With the offending ctypes tests skipped, on Linux AArch64 the test results are quite similar to those on macOS:

======================================================================
FAIL: test_no_accidental_warnings (numba.tests.test_import.TestNumbaImport)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba/numba/tests/test_import.py", line 103, in test_no_accidental_warnings
    run_in_subprocess(code, flags)
  File "/home/gmarkall/numbadev/numba/numba/tests/support.py", line 1121, in run_in_subprocess
    raise AssertionError(msg % (popen.returncode, err.decode()))
AssertionError: process failed with code 1: stderr follows
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/gmarkall/numbadev/numba/numba/__init__.py", line 230, in <module>
    _ensure_llvm()
  File "/home/gmarkall/numbadev/numba/numba/__init__.py", line 169, in _ensure_llvm
    warnings.warn("llvmlite version format not recognized!")
UserWarning: llvmlite version format not recognized!



======================================================================
FAIL: test_unsafe_import_in_registry (numba.tests.test_np_functions.TestRegistryImports)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba/numba/tests/test_np_functions.py", line 6172, in test_unsafe_import_in_registry
    self.assertEquals(b"", error.strip())
AssertionError: b'' != b'/home/gmarkall/numbadev/numba/numba/__ini[120 chars]d!")'

======================================================================
FAIL: test_repr_long_list_ipython (numba.tests.test_typedlist.TestTypedList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba/numba/tests/test_typedlist.py", line 563, in test_repr_long_list_ipython
    self.assertEqual(expected, err)
AssertionError: 'ListType[int64]([0, 1, 2, 3, 4, 5, 6, 7, [4867 chars]..])' != '/home/gmarkall/numbadev/numba/numba/__ini[5034 chars]..])'
Diff is 10164 characters long. Set self.maxDiff to None to see it.

======================================================================
FAIL: test_repr_long_list_ipython (numba.tests.test_typedlist.TestTypedList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba/numba/tests/support.py", line 909, in tearDown
    self.memory_leak_teardown()
  File "/home/gmarkall/numbadev/numba/numba/tests/support.py", line 884, in memory_leak_teardown
    self.assert_no_memory_leak()
  File "/home/gmarkall/numbadev/numba/numba/tests/support.py", line 893, in assert_no_memory_leak
    self.assertEqual(total_alloc, total_free)
AssertionError: 2 != 1

----------------------------------------------------------------------
Ran 11867 tests in 4344.252s

FAILED (failures=4, skipped=592, expected failures=24)

So as far as I can tell, there are no outstanding issues with the implementation in this PR in its present form.

@gmarkall
Member Author

As a follow-up on the cause of those fails - they all stem from the warning that the llvmlite version format is not recognized, so they are not an actual issue.

Member Author

@gmarkall gmarkall left a comment

The next thing on my to-do list is to align the changes in this branch / PR more closely with @MikaelSmith's changes in llvm/llvm-project#71968 and re-test - I think that the change for stub alignment to be taken into account might still be needed, but it would be best to check first.

The implementation of `reserveAllocationSpace()` now more closely
follows that in llvm/llvm-project#71968,
following some changes made there.

The changes here include:

- Improved readability of debugging output
- Using a default alignment of 8 in `allocateSection()` to match the
  default alignment provided by the stub alignment during preallocation.
- Replacing the "bespoke" `requiredPageSize()` function with
  computations using the LLVM `alignTo()` function.
- Returning early from preallocation when no space is requested.
- Reusing existing preallocations if there is enough space left over
  from the previous preallocation for all the required segments - this
  can happen quite frequently because allocations for each segment get
  rounded up to page sizes, which are usually either 4K or 16K, and many
  Numba-jitted functions require a lot less than this.
- Removal of setting the near hints for memory blocks - this doesn't
  really have any use when all memory is preallocated, and forced to be
  "near" to other memory.
- Addition of extra asserts to validate alignment of allocated sections.
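
The page rounding and the reuse check described above can be sketched as follows. This is a simplified model in Python, not the C++ implementation: `align_to` mimics the rounding behaviour of LLVM's `alignTo()`, while `can_reuse`, the segment names and the byte counts are assumptions made for the example.

```python
def align_to(value, align):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


def can_reuse(remaining, required):
    """Check that every segment's leftover space covers the new request.

    Both arguments map a segment name to a number of bytes.
    """
    return all(remaining.get(seg, 0) >= need for seg, need in required.items())


PAGE = 0x1000  # 4K pages; 16K pages work the same way

# A small function needs far less than a page, so most of it is left over.
reserved = align_to(0x12C, PAGE)
left_over = {"code": reserved - 0x12C, "rodata": 0xE00, "rwdata": PAGE}

# The next object fits into what is left, so no new mapping is needed.
next_request = {"code": 0xD6C, "rodata": 0x5B0, "rwdata": 0xB0}
print(can_reuse(left_over, next_request))
```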
@gmarkall gmarkall force-pushed the aarch64memorymanager branch from 616a057 to 75b103c Compare November 22, 2023 13:15
@sklam sklam added the Pending BuildFarm For PRs that have been reviewed but pending a push through our buildfarm label Dec 5, 2023
The default is to enable it on 64-bit ARM systems, since it solves the
problem they encounter, and disable it elsewhere, to minimise the risk
of an unintended side effect on platforms that don't need it.

This can be overridden by manually specifying the value of `use_lmm`
when creating the MCJIT compiler.
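
A minimal sketch of how such a platform default might be chosen is shown below. The helper name and the use of `platform.machine()` are assumptions for illustration; only the `use_lmm` parameter itself comes from the description above.

```python
import platform


def default_use_lmm(machine=None):
    """Hypothetical helper: True on 64-bit ARM systems, False elsewhere."""
    if machine is None:
        machine = platform.machine()
    return machine in ("arm64", "aarch64")


print(default_use_lmm("aarch64"))
print(default_use_lmm("x86_64"))

# An explicit value overrides the default when the engine is created, e.g.
#   engine = llvmlite.binding.create_mcjit_compiler(module, target_machine,
#                                                   use_lmm=True)
# where module and target_machine are placeholders for objects built elsewhere.
```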
@gmarkall gmarkall force-pushed the aarch64memorymanager branch from 3ee5574 to b673be6 Compare December 6, 2023 16:52
Member

@sklam sklam left a comment

With my limited background knowledge on what's going on with the memory manager, I reviewed the C++ code based on whether I can understand it. The code is clear and well commented.

The license addition looks good.

Buildfarm has never been happier all thanks to this patch.

numba#9337 unskipped more tests related to M1 RuntimeDyld issues, and I have run it on the farm. All M1 tests passed.

The PPC64 linker issue

The only outstanding problem is a build failure on PPC64LE. On the Power machine in the buildfarm, both the anaconda-distro and conda-forge packages fail to link the memorymanager.o file with the error:

ld: /opt/conda/envs/cf/lib/libLLVMSupport.a(Error.cpp.o):(.data.rel.ro._ZTVN4llvm13ErrorInfoBaseE[_ZTVN4llvm13ErrorInfoBaseE]+0x40): undefined reference to `llvm::ErrorInfoBase::isA(void const*) const'

After some investigation, and following the suggestion in https://support.xilinx.com/s/article/20068?language=en_US, I found that adding -mlongcall when compiling memorymanager.cpp fixes the problem. However, this "fix" may introduce some performance issues since it forces all jumps to be long jumps. Since GCC 9.5, the doc on -mlongcall has this description:

On PowerPC64 ELFv2 and 32-bit PowerPC systems with newer GNU linkers, GCC can generate long calls using an inline PLT call sequence (see -mpltseq). PowerPC with -mbss-plt and PowerPC64 ELFv1 (big-endian) do not support inline PLT calls.

This might be a case of the system linker being too old, or it may be that newer GCC (>=9) can generate an alternative long-call sequence that avoids the issue.

We can fix this PPC problem in a separate PR so it's not a blocker for this PR. Adding -mlongcall is probably the easiest fix for now.

Member

@sklam sklam left a comment

Buildfarm has passed with the latest commit. This is ready for merge!!! Numba's buildfarm has never been happier as this PR will stop random failures on our arm64/aarch64 machines.

Thank you @gmarkall and everyone who reviewed this PR.

@sklam sklam added BuildFarm Passed For PRs that have been through the buildfarm and passed 5 - Ready to merge and removed 3 - Ready for Review Pending BuildFarm For PRs that have been reviewed but pending a push through our buildfarm labels Dec 7, 2023
@sklam sklam merged commit 53488e9 into numba:main Dec 7, 2023
20 checks passed
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Jan 5, 2024
Nightly jobs have been [failing](https://github.com/rapidsai/cudf/actions/runs/7382855293/job/20083184931) with a numba segfault.

This appears to be a longstanding issue with numba on aarch64 fixed by numba/llvmlite#1009. Technically, the issue exists already in our tests, but it appears that changes from numba 0.58 make the conditions for the issue to occur much more likely, hence the failures occurring after removing the numba 0.58 version constraint recently. The issue should be fixed in numba 0.59.

For now however we should skip things so that nightlies can be fixed.

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #14702