
Application crash when reaching vm.map_count limit #90230

Closed
ayende opened this issue Aug 9, 2023 · 12 comments

@ayende
Contributor

ayende commented Aug 9, 2023

Description

We are tracking what looks like a memory/fragmentation leak; see #89776.

As a result of that, we have run into the vm.max_map_count limit:

sudo cat /proc/$(pidof Raven.Server)/maps | wc -l
65406

This was set to 65535, and we got several crashes from the finalizer.

Aug 06 19:56:28 vm9e618664fb audit[32018]: ANOM_ABEND auid=4294967295 uid=1001 gid=1001 ses=4294967295 pid=32018 comm=2E4E45542046696E616C697A6572 exe="/ravendb/RavenDB/Server/Raven.Server" sig=11 res=1
Aug 06 19:56:28 vm9e618664fb kernel: .NET Finalizer[32047]: segfault at 440 ip 00007fa6d3f8cb62 sp 00007fa6d0439ff8 error 6 in libc-2.27.so[7fa6d3dfe000+1e7000]
Aug 06 19:56:28 vm9e618664fb kernel: Code: 1c 26 00 0f 87 07 01 00 00 c5 fe 6f 06 c5 fe 6f 4e 20 c5 fe 6f 56 40 c5 fe 6f 5e 60 48 81 c6 80 00 00 00 48 81 ea 80 00 00 00 <c5> fd 7f 07 c5 fd 7f 4f 20 c5 fd 7f 57 40 c5 fd 7f 5f 60 48 81 c7
Aug 06 19:56:29 vm9e618664fb systemd[1]: ravendb.service: Main process exited, code=killed, status=11/SEGV
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]: Fatal error. The RW block to unmap was not found
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Sparrow.Server.Platform.PalHelper.ThrowLastError(FailCodes, Int32, System.String)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.Paging.RvnMemoryMapPager.AllocateMorePages(Int64)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.Scratch.ScratchBufferFile.Allocate(Voron.Impl.LowLevelTransaction, Int32, Int32)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.Scratch.ScratchBufferPool.Allocate(Voron.Impl.LowLevelTransaction, Int32)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.LowLevelTransaction.AllocatePage(Int32, Int64, System.Nullable`1<Voron.Page>, Boolean)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.LowLevelTransaction.AllocatePage(Int32, System.Nullable`1<Int64>, System.Nullable`1<Voron.Page>, Boolean)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Impl.LowLevelTransaction.AllocateOverflowRawPage(Int64, Int32 ByRef, System.Nullable`1<Int64>, System.Nullable`1<Voron.Page>, Boolean)
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Data.BTrees.Tree+StreamToPageWriter.AllocateNextPage()
Aug 06 22:03:51 vm9e618664fb cloud.sh[17259]:    at Voron.Data.BTrees.Tree+StreamToPageWriter.Write(System.IO.Stream)

I believe that this is related to this: #80580

Reproduction Steps

Run for a long while under load; the number of mappings will increase until the limit is reached.

Expected behavior

Should not crash

Actual behavior

It crashes

Regression?

We did not see this in .NET 6.0.

Known Workarounds

Increase vm.max_map_count to a much higher value (e.g. sudo sysctl -w vm.max_map_count=262144).

Configuration

Linux, x64, Ubuntu, .NET 7.0

Other information

The stack trace is really strange: we attempt to allocate memory and fail. That should be a handled exception, yet we are dying with a segmentation fault.

Given that this is thrown from the executable allocator, I wonder if the runtime is possibly trying to JIT the method, or maybe tier it up, and then failing.

Note that it is possible for munmap to fail on Linux (if it would increase the number of mappings). This should probably be handled without killing the process.
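
A minimal standalone C demonstration of the munmap behavior described above (this illustrates the kernel semantics only; it is not RavenDB or runtime code): punching a hole in the middle of a mapping forces the kernel to split one VMA into two, so an unmap can increase the mapping count, and once the process already sits at vm.max_map_count the munmap call itself fails with ENOMEM.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    // One contiguous anonymous mapping of three pages -> a single VMA.
    char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Unmap only the middle page: the kernel must split the single VMA
    // into two, so unmapping *increases* the mapping count. With the
    // process already at vm.max_map_count, this call fails with ENOMEM.
    if (munmap(p + page, page) != 0) {
        fprintf(stderr, "munmap failed: %s\n", strerror(errno));
        return 1;
    }
    puts("hole punched: one VMA became two (see /proc/self/maps)");
    return 0;
}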

@ghost added the untriaged label Aug 9, 2023
@dotnet-issue-labeler bot added the needs-area-label label Aug 9, 2023
@vcsjones added the area-VM-coreclr label and removed the needs-area-label label Aug 9, 2023
@mangod9 removed the untriaged label Aug 10, 2023
@mangod9 added this to the 8.0.0 milestone Aug 10, 2023
@mangod9
Member

mangod9 commented Aug 10, 2023

I assume this issue doesn't repro with W^X disabled? Is there a dump that can be shared to debug further? It appears the original mapping return-code issue is fixed in .NET 8 and might be ported to 7 too. Are you testing on the latest servicing release?

@ayende
Contributor Author

ayende commented Aug 10, 2023

I'm afraid we don't have a dump, only those logs. This is a production instance, so we just bumped vm.max_map_count to alleviate the issue.
@gregolsky - can you answer regarding W^X and the servicing release?

@gregolsky

gregolsky commented Aug 14, 2023

The runtime version is 7.0.8. AFAIK WriteXorExecute is enabled there by default? I'm afraid we cannot run any experiments with it disabled on the system in question.

We can try to repro on another one, though, by artificially reducing the number of available maps.
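
As background on why the W^X mode discussed above increases mapping pressure, here is a generic sketch of the dual-mapping technique (a minimal illustration of the general pattern, not the actual CoreCLR executable allocator; the "jit-region" name is just illustrative): each code region is backed by shared memory that is mapped twice, once writable for the JIT to emit into and once executable to run from, so every region costs two VMAs. If I recall correctly, the runtime knob for disabling W^X is the DOTNET_EnableWriteXorExecute=0 environment variable.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long size = sysconf(_SC_PAGESIZE);
    // Anonymous shared memory backing one JIT code region.
    int fd = memfd_create("jit-region", MFD_CLOEXEC);
    if (fd < 0 || ftruncate(fd, size) != 0) { perror("memfd"); return 1; }

    // Writable view: machine code is emitted through this mapping...
    void *rw = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    // ...and executed through a separate read+execute view of the same pages.
    void *rx = mmap(NULL, size, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    if (rw == MAP_FAILED || rx == MAP_FAILED) { perror("mmap"); return 1; }

    printf("RW view %p, RX view %p: two VMAs for one code region\n", rw, rx);
    return 0;
}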

@janvorli
Member

Is it possible that you have a growing number of dynamically created assemblies, as in #80580 (comment)?

@ayende
Contributor Author

ayende commented Aug 15, 2023

Not likely; we aren't really generating many new assemblies on the fly, and none at all in the scenario we tested.

@janvorli
Member

@ayende re-reading the issue description, I am not sure I understand this:

The stack trace is really strange, we are attempting to allocate memory and then fail.
That is a handled exception, but we are dying with segmentation fault.

The failure to map memory as RW in the executable allocator is a fatal fail-fast; it is not an exception. The stack trace is the managed stack at the time the fail-fast happened. At that point, we don't have any option other than to fail fast. There are more than a hundred places all over the source base where we need to modify or write executable code, and there is no way to recover from that at the majority of them.

The fact that you get a SIGSEGV after the fail-fast message is printed is strange, though. I wonder if it could be related to the incorrect checks of the mmap return value that were fixed in .NET 8 (#77952, #78069) but not ported to .NET 7. It is actually quite possible that this is the case.
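
For readers following along, this is the general shape of that class of bug (a hedged illustration of a typical mmap return-value mistake, not the actual runtime code from those PRs): mmap reports failure by returning MAP_FAILED, i.e. (void *) -1, not NULL, so a NULL check lets a failed mapping through as a seemingly valid pointer and the first write through it raises SIGSEGV.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    // Deliberately invalid request (length 0) so that mmap fails.
    void *p = mmap(NULL, 0, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == NULL)        // WRONG: never true, the failure value is (void *)-1
        puts("NULL check fired");
    if (p == MAP_FAILED)  // RIGHT: the documented failure value
        puts("failure detected correctly");
    return 0;
}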

@gregolsky

gregolsky commented Aug 15, 2023 via email

@mangod9
Member

mangod9 commented Aug 22, 2023

@gregolsky, have you tried the scenario on .NET 8 to confirm that these fixes indeed work in your case?

@ayende
Contributor Author

ayende commented Aug 27, 2023

What would be the expected scenario here?
Ideally, I would rather get a proper error message / diagnostics so that we can detect this in production.

The usual metrics (memory usage, etc.) are not a problem in this case.
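
Until the runtime surfaces a clearer error, one way to get the kind of production-side diagnostics asked for above (a minimal sketch; the 90% alert threshold is an arbitrary choice, not anything the runtime defines) is to compare the process's VMA count from /proc/self/maps against the /proc/sys/vm/max_map_count limit:

#include <stdio.h>

// Count the lines of a /proc file; each line of /proc/self/maps is one VMA.
static long count_lines(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    long n = 0;
    for (int c; (c = fgetc(f)) != EOF; )
        if (c == '\n') n++;
    fclose(f);
    return n;
}

int main(void)
{
    long vmas = count_lines("/proc/self/maps");
    long limit = 0;
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
    if (f) { fscanf(f, "%ld", &limit); fclose(f); }

    printf("mappings: %ld, limit: %ld\n", vmas, limit);
    if (limit > 0 && vmas > limit * 9 / 10)  // alert at 90% of the limit
        fprintf(stderr, "warning: approaching vm.max_map_count\n");
    return 0;
}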

@mangod9 modified the milestones: 8.0.0, 9.0.0 Sep 5, 2023
@mangod9
Member

mangod9 commented Jul 24, 2024

@janvorli, I believe there was a change to update the error message for these conditions?

@janvorli
Member

Yes, the message was updated in #102458

@mangod9
Member

mangod9 commented Jul 24, 2024

OK, closing this issue based on that.

@mangod9 closed this as completed Jul 24, 2024
@github-actions bot locked and limited conversation to collaborators Aug 24, 2024