test_arena_mmap can fail on linux due to OOM conditions #9634
Is it necessary for the test to allocate 100G of vmem? It looks like it is testing that the memory map will lazily allocate the couple of bytes at the tail when the pages there are written. Wouldn't it be enough to allocate much less? Is the intent behind 100G to ensure that no test runner could possibly have that amount of physical memory? 100G isn't all that much in that case.
On the failing runs both soft and hard …
I tried a few things, but so far, configuration-wise, I don't see anything particularly damning on the CI runners. Can we change the test to be… less prone to experiencing something like this?
Okay, so I think I figured this out. By default linux uses heuristic overcommit (vm.overcommit_memory = 0), which refuses a single allocation that obviously exceeds what RAM plus swap could ever back. On the CI runners these specific fields are:
Experimentally, 15GiB is almost exactly what it is possible to allocate with a single mmap call; any more and it will fail. If I run the tests with …
So here we have it: this test needs to be rewritten.
Issue #9634 has been solved since.
I have been seeing the test_arena_mmap test fail on GHA runners due to an OOM condition: https://github.com/near/nearcore/actions/runs/6403874087/job/17383298104

This test attempts to allocate a memory map 100GB in size:
nearcore/core/store/src/trie/mem/arena/mod.rs
Lines 269 to 270 in 6f3a721
The comment is partly true, partly false – even though this is indeed 100GB of virtual memory, the kernel can still refuse to honour this allocation for other reasons (the ENOMEM conditions from mmap(2)): no memory is available; the process’s maximum number of mappings would have been exceeded; or the process’s RLIMIT_DATA limit would have been exceeded.
I’m not exactly sure which of these conditions is being triggered, but I would imagine it is either the last one or the first one (the kernel is trying to be smart: seeing that there’s no way for it to ever have that amount of memory, it refuses to set up the page tables needed to represent 100G of memory.)
cc @robin-near