
Application crash when reaching vm.map_count limit #90230

Closed
ayende opened this issue Aug 9, 2023 · 12 comments

@ayende
Contributor

ayende commented Aug 9, 2023

Description

We are tracking what looks like a memory/fragmentation leak; see #89776.

As a result of that, we have run into the vm.max_map_count limit:

sudo cat /proc/$(pidof Raven.Server)/maps | wc -l
65406

This was set to 65535, and we got several crashes from the finalizer.

Aug 06 19:56:28 vm9e618664fb audit[32018]: ANOM_ABEND auid=4294967295 uid=1001 gid=1001 ses=4294967295 pid=32018 comm=2E4E45542046696E616C697A6572 exe="/ravendb/RavenDB/Server/Raven.Server" sig=11 res=1
Aug 06 19:56:28 vm9e618664fb kernel: .NET Finalizer[32047]: segfault at 440 ip 00007fa6d3f8cb62 sp 00007fa6d0439ff8 error 6 in libc-2.27.so[7fa6d3dfe000+1e7000]
Aug 06 19:56:28 vm9e618664fb kernel: Code: 1c 26 00 0f 87 07 01 00 00 c5 fe 6f 06 c5 fe 6f 4e 20 c5 fe 6f 56 40 c5 fe 6f 5e 60 48 81 c6 80 00 00 00 48 81 ea 80 00 00 00 <c5> fd 7f 07 c5 fd 7f 4f 20 c5 fd 7f 57 40 c5 fd 7f 5f 60 48 81 c7
Aug 06 19:56:29 vm9e618664fb systemd[1]: ravendb.service: Main process exited, code=killed, status=11/SEGV
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]: Fatal error. The RW block to unmap was not found
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Sparrow.Server.Platform.PalHelper.ThrowLastError(FailCodes, Int32, System.String)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.Paging.RvnMemoryMapPager.AllocateMorePages(Int64)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.Scratch.ScratchBufferFile.Allocate(Voron.Impl.LowLevelTransaction, Int32, Int32)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.Scratch.ScratchBufferPool.Allocate(Voron.Impl.LowLevelTransaction, Int32)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.LowLevelTransaction.AllocatePage(Int32, Int64, System.Nullable`1<Voron.Page>, Boolean)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.LowLevelTransaction.AllocatePage(Int32, System.Nullable`1<Int64>, System.Nullable`1<Voron.Page>, Boolean)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.LowLevelTransaction.AllocateOverflowRawPage(Int64, Int32 ByRef, System.Nullable`1<Int64>, System.Nullable`1<Voron.Page>, Boolean)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Data.BTrees.Tree+StreamToPageWriter.AllocateNextPage()
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Data.BTrees.Tree+StreamToPageWriter.Write(System.IO.Stream)

I believe that this is related to this: #80580

Reproduction Steps

Run for a long while under load; the number of mappings will increase until the limit is reached.

Expected behavior

Should not crash

Actual behavior

It crashes

Regression?

We did not see this in .NET 6.0.

Known Workarounds

Increase vm.max_map_count to a much higher value (e.g. sudo sysctl -w vm.max_map_count=262144).

Configuration

Linux, x64, Ubuntu, .NET 7.0

Other information

The stack trace is really strange: we attempt to allocate memory and fail. That should be a handled exception, yet we are dying with a segmentation fault.

Given that this is thrown from the executable allocator, I wonder if the runtime is possibly trying to JIT the method, or maybe tier it up, and then failing.

Note that it is possible for munmap to fail on Linux (if it would increase the number of mappings). This should probably be handled without killing the process.
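
A minimal standalone C demonstration of the munmap behavior described above (this illustrates the kernel semantics only; it is not RavenDB or runtime code): punching a hole in the middle of a mapping forces the kernel to split one VMA into two, so an unmap can increase the mapping count, and once the process already sits at vm.max_map_count the munmap call itself fails with ENOMEM.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    // One contiguous anonymous mapping of three pages -> a single VMA.
    char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Unmap only the middle page: the kernel must split the single VMA
    // into two, so unmapping *increases* the mapping count. With the
    // process already at vm.max_map_count, this call fails with ENOMEM.
    if (munmap(p + page, page) != 0) {
        fprintf(stderr, "munmap failed: %s\n", strerror(errno));
        return 1;
    }
    puts("hole punched: one VMA became two (see /proc/self/maps)");
    return 0;
}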

@ghost added the untriaged label Aug 9, 2023
@dotnet-issue-labeler bot added the needs-area-label label Aug 9, 2023
@vcsjones added the area-VM-coreclr label and removed the needs-area-label label Aug 9, 2023
@mangod9 removed the untriaged label Aug 10, 2023
@mangod9 added this to the 8.0.0 milestone Aug 10, 2023
@mangod9
Member

mangod9 commented Aug 10, 2023

I assume this issue doesn't repro with W^X disabled? Is there a dump that can be shared to debug further? It appears the original mapping return-code issue is fixed in .NET 8 and might be ported to 7 too. Are you testing on the latest servicing release?

@ayende
Contributor Author

ayende commented Aug 10, 2023

I'm afraid we don't have a dump, only those logs. This is a production instance, so we just bumped vm.max_map_count to alleviate the issue.
@gregolsky - can you answer regarding W^X and the servicing release?

@gregolsky

gregolsky commented Aug 14, 2023

The runtime version is 7.0.8. AFAIK WriteXorExecute is enabled there by default? I'm afraid we cannot run any experiments with it disabled on the system in question.

We can try to repro on another one, though, by artificially reducing the number of available maps.
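
As background on why the W^X mode discussed above increases mapping pressure, here is a generic sketch of the dual-mapping technique (a minimal illustration of the general pattern, not the actual CoreCLR executable allocator; the "jit-region" name is just illustrative): each code region is backed by shared memory that is mapped twice, once writable for the JIT to emit into and once executable to run from, so every region costs two VMAs. If I recall correctly, the runtime knob for disabling W^X is the DOTNET_EnableWriteXorExecute=0 environment variable.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long size = sysconf(_SC_PAGESIZE);
    // Anonymous shared memory backing one JIT code region.
    int fd = memfd_create("jit-region", MFD_CLOEXEC);
    if (fd < 0 || ftruncate(fd, size) != 0) { perror("memfd"); return 1; }

    // Writable view: machine code is emitted through this mapping...
    void *rw = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    // ...and executed through a separate read+execute view of the same pages.
    void *rx = mmap(NULL, size, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    if (rw == MAP_FAILED || rx == MAP_FAILED) { perror("mmap"); return 1; }

    printf("RW view %p, RX view %p: two VMAs for one code region\n", rw, rx);
    return 0;
}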

@janvorli
Member

Is it possible that you have a growing number of dynamically created assemblies, as in #80580 (comment)?

@ayende
Contributor Author

ayende commented Aug 15, 2023

Not likely; we aren't really generating many new assemblies on the fly, and none at all in the scenario we tested.

@janvorli
Member

@ayende re-reading the issue description, I am not sure I understand this:

The stack trace is really strange, we are attempting to allocate memory and then fail.
That is a handled exception, but we are dying with segmentation fault.

The failure to map memory as RW in the executable allocator is a fatal fail-fast; it is not an exception. The stack trace is the managed stack at the time the fail-fast happened. At that point, we don't have any option other than to fail fast. There are more than a hundred places all over the source base where we need to modify or write executable code, and there is no way to recover from that at the majority of them.

The fact that you get a SIGSEGV after the fail-fast message is printed is strange, though. I wonder if it could be related to the incorrect checks of the mmap return value that were fixed in .NET 8 (#77952, #78069) but not ported to .NET 7. It is actually quite possible that this is the case.
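
For readers following along, this is the general shape of that class of bug (a hedged illustration of a typical mmap return-value mistake, not the actual runtime code from those PRs): mmap reports failure by returning MAP_FAILED, i.e. (void *) -1, not NULL, so a NULL check lets a failed mapping through as a seemingly valid pointer and the first write through it raises SIGSEGV.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    // Deliberately invalid request (length 0) so that mmap fails.
    void *p = mmap(NULL, 0, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == NULL)        // WRONG: never true, the failure value is (void *)-1
        puts("NULL check fired");
    if (p == MAP_FAILED)  // RIGHT: the documented failure value
        puts("failure detected correctly");
    return 0;
}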

@gregolsky

gregolsky commented Aug 15, 2023 via email

@mangod9
Member

mangod9 commented Aug 22, 2023

@gregolsky, have you tried the scenario on .NET 8 to confirm that these fixes indeed work in your case?

@ayende
Contributor Author

ayende commented Aug 27, 2023

What would be the expected scenario here?
Ideally, I would rather get a proper error message / diagnostics so that we can detect this in production.

The usual metrics (memory usage, etc.) are not a problem in this case.
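
Until the runtime surfaces a clearer error, one way to get the kind of production-side diagnostics asked for above (a minimal sketch; the 90% alert threshold is an arbitrary choice, not anything the runtime defines) is to compare the process's VMA count from /proc/self/maps against the /proc/sys/vm/max_map_count limit:

#include <stdio.h>

// Count the lines of a /proc file; each line of /proc/self/maps is one VMA.
static long count_lines(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    long n = 0;
    for (int c; (c = fgetc(f)) != EOF; )
        if (c == '\n') n++;
    fclose(f);
    return n;
}

int main(void)
{
    long vmas = count_lines("/proc/self/maps");
    long limit = 0;
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
    if (f) { fscanf(f, "%ld", &limit); fclose(f); }

    printf("mappings: %ld, limit: %ld\n", vmas, limit);
    if (limit > 0 && vmas > limit * 9 / 10)  // alert at 90% of the limit
        fprintf(stderr, "warning: approaching vm.max_map_count\n");
    return 0;
}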

@mangod9 modified the milestones: 8.0.0, 9.0.0 Sep 5, 2023
@mangod9
Member

mangod9 commented Jul 24, 2024

@janvorli, I believe there was a change to update the error message for these conditions?

@janvorli
Member

Yes, the message was updated in #102458

@mangod9
Member

mangod9 commented Jul 24, 2024

OK, closing this issue based on that.

@mangod9 closed this as completed Jul 24, 2024
@github-actions bot locked and limited conversation to collaborators Aug 24, 2024