test_arena_mmap can fail on linux due to OOM conditions #9634

Closed
nagisa opened this issue Oct 4, 2023 · 4 comments · Fixed by #10132

Comments

@nagisa
Collaborator

nagisa commented Oct 4, 2023

I have been seeing the test_arena_mmap test fail on GHA runners due to an OOM condition: https://github.com/near/nearcore/actions/runs/6403874087/job/17383298104

This test attempts to allocate a memory map 100GB in size:

// 100GB is a lot, but it's all virtual memory so it's fine in 64-bit.
let mut arena5 = super::ArenaMemory::new(size_100gb);

The comment is partly true, partly false – even though this is indeed 100GB of virtual memory, the kernel can still refuse to honour this allocation for other reasons:

  • ENOMEM No memory is available.
  • ENOMEM The process's maximum number of mappings would have been exceeded. This error can also occur for munmap(), when unmapping a region in the middle of an existing mapping, since this results in two smaller mappings on either side of the region being unmapped.
  • ENOMEM (since Linux 4.7) The process's RLIMIT_DATA limit, described in getrlimit(2), would have been exceeded.

I’m not exactly sure which of these conditions is being triggered, but I would imagine it is either the last one or the first one (the kernel is trying to be smart: it sees that there’s no way it could ever back that amount of memory, so it refuses to set up the page tables needed to represent 100G of memory.)

cc @robin-near

@nagisa
Collaborator Author

nagisa commented Oct 4, 2023

Is it necessary for the test to allocate 100G of vmem? It looks like it is testing that the memory map lazily allocates the couple of bytes at the tail when the pages there are written. Wouldn’t it be enough to allocate much less?

Is the intent behind 100G to ensure that no test runner will possibly have that amount of physical memory? 100G isn’t that much in that case.
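A minimal sketch of what a smaller version of that check might look like: it maps an anonymous region, writes to the very last byte, and reads it back. This is not the nearcore test or the ArenaMemory API. The function name map_and_touch_tail is hypothetical, the flag constants are the Linux values for x86_64 and aarch64 hard-coded here as an assumption (real code would normally take them from the libc crate), and the `unsafe extern` block assumes Rust 1.82 or newer on a 64-bit Linux target.

```rust
use std::ffi::c_void;
use std::io;
use std::os::raw::c_int;

// Assumed Linux constants (x86_64 / aarch64); normally provided by the `libc` crate.
const PROT_READ: c_int = 0x1;
const PROT_WRITE: c_int = 0x2;
const MAP_PRIVATE: c_int = 0x02;
const MAP_ANONYMOUS: c_int = 0x20;

unsafe extern "C" {
    // `offset` is off_t, which is 64 bits wide on 64-bit Linux.
    fn mmap(
        addr: *mut c_void,
        len: usize,
        prot: c_int,
        flags: c_int,
        fd: c_int,
        offset: i64,
    ) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

/// Hypothetical helper: maps `len` bytes of anonymous memory, writes a marker
/// to the last byte, reads it back, and unmaps the region again.
/// Returns the OS error (for example ENOMEM) if the kernel refuses the mapping.
fn map_and_touch_tail(len: usize) -> io::Result<u8> {
    assert!(len > 0, "mapping length must be non-zero");
    let ptr = unsafe {
        mmap(
            std::ptr::null_mut(),
            len,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    // MAP_FAILED is (void *) -1.
    if ptr as isize == -1 {
        return Err(io::Error::last_os_error());
    }
    // Only the page holding the last byte is touched here.
    let value = unsafe {
        let tail = (ptr as *mut u8).add(len - 1);
        tail.write(0xAB);
        tail.read()
    };
    unsafe { munmap(ptr, len) };
    Ok(value)
}
```

Calling map_and_touch_tail(64 << 20) exercises the same tail write on a 64 MiB mapping, a size that is far less likely to be refused than 100GB.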

@nagisa
Collaborator Author

nagisa commented Oct 5, 2023

On the failing runs both the soft and hard RLIMIT_DATA limits are unset.

@nagisa
Collaborator Author

nagisa commented Oct 5, 2023

I tried a few things, but so far, configuration-wise, I don't see anything particularly damning on the CI runners. Can we change the test to be… less prone to experiencing something like this?

@nagisa
Collaborator Author

nagisa commented Oct 5, 2023

Okay, so I think I figured this out. By default, Linux uses vm.overcommit_memory = 0 for overcommit handling, which means…

Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

and

The current overcommit limit and amount committed are viewable in /proc/meminfo as CommitLimit and Committed_AS respectively.

On the CI runners these specific fields are:

CommitLimit:    16427252 kB
Committed_AS:    2272992 kB
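For anyone who wants to check these two fields on their own machine, here is a small sketch that pulls them out of /proc/meminfo-style text and reports the difference. The helper names meminfo_kb and commit_headroom_kb are made up for illustration. Note that, as far as I understand the kernel documentation, CommitLimit is only strictly enforced in mode 2, so in the default heuristic mode this difference is a rough indicator rather than an exact allocation limit.

```rust
/// Returns the value (in kB) of a `/proc/meminfo` field such as "CommitLimit".
/// `meminfo` is the text of the file; lines look like `CommitLimit:    16427252 kB`.
fn meminfo_kb(meminfo: &str, field: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        let (name, rest) = line.split_once(':')?;
        if name.trim() != field {
            return None;
        }
        // The first whitespace-separated token after the colon is the number.
        rest.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// CommitLimit minus Committed_AS, in kB.
/// Negative when the system has already committed more than the limit.
fn commit_headroom_kb(meminfo: &str) -> Option<i64> {
    let limit = meminfo_kb(meminfo, "CommitLimit")? as i64;
    let committed = meminfo_kb(meminfo, "Committed_AS")? as i64;
    Some(limit - committed)
}
```

On a live system the input would come from std::fs::read_to_string("/proc/meminfo"). Fed the two CI runner lines above, commit_headroom_kb returns Some(14154260), i.e. roughly 13.5 GiB.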

Experimentally, 15GiB is almost exactly the most that can be allocated with a single mmap call. Any more and it will fail.

If I run the tests with vm.overcommit_memory = 1, which lifts the limit entirely, things work out okay. But I don't think we can expect average developers to fiddle with these options, and this test would fail on the majority of developers’ systems. For example, on my laptop:

CommitLimit:     8867052 kB
Committed_AS:   13229840 kB
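To see which overcommit mode a given machine is in, the setting can be read from /proc/sys/vm/overcommit_memory. The sketch below interprets that value. The type and function names are hypothetical, and huge_mapping_may_be_refused is only a coarse rule of thumb based on the mode descriptions quoted above, not something the kernel exposes.

```rust
use std::fs;

/// The three documented values of vm.overcommit_memory.
#[derive(Debug, PartialEq)]
enum OvercommitMode {
    Heuristic, // 0: obvious overcommits are refused (the default)
    Always,    // 1: always overcommit
    Never,     // 2: strict accounting against CommitLimit
}

/// Parses the contents of /proc/sys/vm/overcommit_memory (e.g. "0\n").
fn parse_overcommit_mode(raw: &str) -> Option<OvercommitMode> {
    match raw.trim() {
        "0" => Some(OvercommitMode::Heuristic),
        "1" => Some(OvercommitMode::Always),
        "2" => Some(OvercommitMode::Never),
        _ => None,
    }
}

/// Reads the mode from the running system; None if the file is missing or unreadable.
fn current_overcommit_mode() -> Option<OvercommitMode> {
    let raw = fs::read_to_string("/proc/sys/vm/overcommit_memory").ok()?;
    parse_overcommit_mode(&raw)
}

/// Rule of thumb: only mode 1 lets arbitrarily large mappings through on
/// overcommit grounds; other limits (e.g. rlimits) can still apply.
fn huge_mapping_may_be_refused(mode: OvercommitMode) -> bool {
    mode != OvercommitMode::Always
}
```

A test that genuinely needs a very large mapping could call current_overcommit_mode() and skip itself unless the result is Some(OvercommitMode::Always), though shrinking the mapping avoids the dependency on system configuration altogether.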

So here we have it, this test needs to be rewritten.

@robin-near robin-near linked a pull request Nov 14, 2023 that will close this issue
Ekleog-NEAR added a commit to Ekleog-NEAR/nearcore that referenced this issue Nov 15, 2023
Issue near#9634 has been solved since.
github-merge-queue bot pushed a commit that referenced this issue Nov 15, 2023