.NET Core 2.0.3 runs OOM in docker #9261
cc @swgillespie |
@emanuelbalea, I was wondering whether you have really tried the 2.0.3 runtime. The SDK (CLI) 2.0.3 is a completely unrelated thing that still contains runtime 2.0.0. Could you please share where you got the coreclr bits you were using? |
@janvorli , @emanuelbalea said he was using the latest docker nightly (https://github.com/dotnet/coreclr/issues/13489#issuecomment-343390478).
@emanuelbalea is this the image tag you are using? I'm not sure if there is an image that has a patched runtime. Perhaps you can try one of the 2.1 tags? Or create an image yourself. |
@janvorli and @tmds you are right, the docker image might not be 2.0.3. Sorry about that; I got confused by the numbering scheme and thought it was back in line with the clr version. I will post the tag number as soon as I get to work. I will try 2.1 and, if that fails, create my own image. Thanks for all the help; I will update in a couple of hours. |
Update: the nightly docker images are on 2.0.0, even the preview ones. I made myself a new image based on those and will update in a few hours. |
Using the latest nightly of 2.0.4 it works as expected inside custom docker image and ec2 container service in my dev environment. @tmds feel free to close this. Thanks for help :) |
@emanuelbalea no problem. It was a good verification to see the OOM with the 2.0.0 runtime and 2.0.4 no longer going OOM. |
@janvorli I'm trying to verify docker containers won't crash due to OOM conditions.
using System;
using System.Collections.Generic;

namespace oom
{
    class Program
    {
        static void Main(string[] args)
        {
            // Keep every buffer reachable so the GC can never reclaim it.
            var list = new List<byte[]>();
            int i = 0;
            while (true)
            {
                try
                {
                    System.Console.WriteLine(i++);
                    var buffer = CreateBuffer();
                    list.Add(buffer);
                }
                catch (Exception e)
                {
                    System.Console.WriteLine(e.Message);
                    return;
                }
            }
        }

        static byte[] CreateBuffer()
        {
            var buffer = new byte[1024 * 1024]; // 1 MiB
            // Touch every byte so the pages actually become resident.
            for (int j = 0; j < buffer.Length; j++)
            {
                buffer[j] = 1;
            }
            return buffer;
        }
    }
}
oom.csproj
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
  </PropertyGroup>
</Project>
Dockerfile
This process gets killed (after allocating some 10ish buffers) due to out of memory, as shown in dmesg:
Shouldn't it throw |
The full dmesg - on kill -:
|
@tmds The GC will generally only throw an I looked into this for a while (since some GC functional tests were getting repeatedly killed by the OOM killer instead of failing in a predictable way) and I didn't find a good solution. Disabling the OOM killer entirely is bad because the kernel will simply refuse to schedule the memory-heavy process, so we'll never get the processor time to actually do a GC. This problem isn't unique to .NET, JVMs also don't always get a chance to throw |
@swgillespie thank you, that is very interesting to know. Do you know a test I can use to validate the runtime is taking into account the docker memory limit? |
I'm not sure how tracing in containers works (@brianrob would know) but if you can collect a trace and then view it with PerfView (https://github.com/Microsoft/perfview), you should see the GC aggressively compacting the heap as it approaches the Docker memory limit. You could also use a debugger and set a breakpoint on |
@tmds, LTTng-UST should work inside of a container with the default seccomp profile so you should be able to collect a trace of the GC behavior. Probably the easiest thing to do is to follow the instructions at https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md#collecting-in-a-docker-container, which should make it possible to use the standard non-container workflow once you have a privileged shell (assuming you can get one). |
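Before collecting a trace, it can help to confirm which memory limit the container actually exposes to processes running inside it. The sketch below is an illustration, not the runtime's own code: the function names `parse_memory_limit` and `read_container_memory_limit` are hypothetical, and it assumes the cgroup filesystem is mounted at the usual `/sys/fs/cgroup` location (cgroup v2 file `memory.max`, cgroup v1 file `memory/memory.limit_in_bytes`).

```cpp
#include <cctype>
#include <cstdint>
#include <exception>
#include <fstream>
#include <string>

// Hypothetical helper: parse the text of a cgroup memory-limit file.
// Returns the limit in bytes, or 0 when the text carries no numeric limit
// (cgroup v2 writes "max" when the container is unlimited).
std::uint64_t parse_memory_limit(const std::string& text) {
    std::string trimmed = text;
    while (!trimmed.empty() &&
           std::isspace(static_cast<unsigned char>(trimmed.back()))) {
        trimmed.pop_back();
    }
    if (trimmed.empty() ||
        !std::isdigit(static_cast<unsigned char>(trimmed[0]))) {
        return 0;
    }
    try {
        std::size_t consumed = 0;
        unsigned long long value = std::stoull(trimmed, &consumed);
        return consumed == trimmed.size() ? value : 0;
    } catch (const std::exception&) {
        return 0;  // not a number, or out of range
    }
}

// Hypothetical helper: try the cgroup v2 file first, then the v1 file.
// Both paths assume the default cgroup mount point.
std::uint64_t read_container_memory_limit() {
    const char* paths[] = {
        "/sys/fs/cgroup/memory.max",
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",
    };
    for (const char* path : paths) {
        std::ifstream file(path);
        std::string line;
        if (file && std::getline(file, line)) {
            std::uint64_t limit = parse_memory_limit(line);
            if (limit != 0) {
                return limit;
            }
        }
    }
    return 0;
}
```

If the value read inside the container matches the limit passed to Docker, the limit is at least visible to the process; whether the GC honors it is what the trace would then show. Note that under cgroup v1 an unlimited container reports a very large number rather than `max`.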
We're running into OOM with 2.0.0. We just upgraded to 2.0.3 and will see whether that fixes the problem. @emanuelbalea / @tmds, you mentioned the nightly 2.0.4 docker image fixes this problem. Where can I find it? The best I could find was microsoft/aspnetcore-nightly, but that only contains 2.0.1, while microsoft/aspnetcore already contains 2.0.3. Thank you for any hints or more details on which specific docker tag you used. |
@thoean since this issue was created, 2.0.3 has been released, so the official images at https://hub.docker.com/r/microsoft/dotnet/ contain the fix. |
Thanks @tmds. Upgrading to the 2.0.3 docker image seems to have fixed the problem on our side. Thank you. Should this issue be closed? |
@swgillespie I wonder, are there minimal size requirements for the runtime to establish heaps? It would be meaningful to have some guidelines. For example: if I create a docker container with server gc and it has 4 logical cpus, how much memory should I at least allocate to that? |
@tmds The GC will commit the ephemeral segment on startup, so for server GC with four logical CPUs you can figure that you'll have at minimum four ephemeral segments resident. The size of this varies a little based on processor topology (in particular, L1 cache size) but the defaults are (from here: https://docs.microsoft.com/en-us/dotnet/standard/garbage-collection/fundamentals)
If we fail to commit a heap segment (or part of a heap segment), we'll throw an |
@swgillespie Thanks for taking time to explain these things. When I start a 100MB container on a 64-bit system, it doesn't crash. So the runtime is not actually trying to ensure those amounts of memory are available. I've been looking a bit in gc.cpp, one function caught my attention because it is using
// Get the max gen0 heap size, making sure it conforms.
size_t GCHeap::GetValidGen0MaxSize(size_t seg_size)
{
    size_t gen0size = static_cast<size_t>(GCConfig::GetGen0Size());
    if ((gen0size == 0) || !g_theGCHeap->IsValidGen0MaxSize(gen0size))
    {
#ifdef SERVER_GC
        // performance data seems to indicate halving the size results
        // in optimal perf. Ask for adjusted gen0 size.
        gen0size = max(GCToOSInterface::GetLargestOnDieCacheSize(FALSE)/GCToOSInterface::GetLogicalCpuCount(),(256*1024));
        // if gen0 size is too large given the available memory, reduce it.
        // Get true cache size, as we don't want to reduce below this.
        size_t trueSize = max(GCToOSInterface::GetLargestOnDieCacheSize(TRUE)/GCToOSInterface::GetLogicalCpuCount(),(256*1024));
        dprintf (2, ("cache: %Id-%Id, cpu: %Id",
            GCToOSInterface::GetLargestOnDieCacheSize(FALSE),
            GCToOSInterface::GetLargestOnDieCacheSize(TRUE),
            GCToOSInterface::GetLogicalCpuCount()));
        // if the total min GC across heaps will exceed 1/6th of available memory,
        // then reduce the min GC size until it either fits or has been reduced to cache size.
        while ((gen0size * gc_heap::n_heaps) > GCToOSInterface::GetPhysicalMemoryLimit() / 6)
        {
            gen0size = gen0size / 2;
            if (gen0size <= trueSize)
            {
                gen0size = trueSize;
                break;
            }
        }
#else //SERVER_GC
        gen0size = max((4*GCToOSInterface::GetLargestOnDieCacheSize(TRUE)/5),(256*1024));
#endif //SERVER_GC
    }
    // Generation 0 must never be more than 1/2 the segment size.
    if (gen0size >= (seg_size / 2))
        gen0size = seg_size / 2;
    return (gen0size);
}
There are two things I find interesting here:
|
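To experiment with the sizing logic quoted above outside the runtime, it can be restated as two pure functions. This is a sketch under stated assumptions rather than the runtime's code: the struct `Gen0Inputs` and the functions `server_gen0` and `workstation_gen0` are hypothetical names, the `GCToOSInterface` queries are replaced by plain parameters, and the configured-size path (`GCConfig::GetGen0Size`) is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the values the quoted function queries.
struct Gen0Inputs {
    std::size_t adjusted_cache;  // stands in for GetLargestOnDieCacheSize(FALSE)
    std::size_t true_cache;      // stands in for GetLargestOnDieCacheSize(TRUE)
    std::size_t logical_cpus;    // stands in for GetLogicalCpuCount()
    std::size_t n_heaps;         // stands in for gc_heap::n_heaps
    std::size_t physical_limit;  // stands in for GetPhysicalMemoryLimit()
    std::size_t seg_size;        // segment size passed to the function
};

const std::size_t kMinGen0 = 256 * 1024;

// Server GC branch: start from the per-CPU adjusted cache size, then halve
// while the total across heaps exceeds 1/6 of the memory limit, never going
// below the per-CPU true cache size.
std::size_t server_gen0(const Gen0Inputs& in) {
    assert(in.logical_cpus > 0);
    std::size_t gen0 = std::max(in.adjusted_cache / in.logical_cpus, kMinGen0);
    const std::size_t true_size =
        std::max(in.true_cache / in.logical_cpus, kMinGen0);
    while (gen0 * in.n_heaps > in.physical_limit / 6) {
        gen0 /= 2;
        if (gen0 <= true_size) {
            gen0 = true_size;
            break;
        }
    }
    // Generation 0 must never be more than half the segment size.
    return std::min(gen0, in.seg_size / 2);
}

// Workstation GC branch: 4/5 of the true cache size; the memory limit is
// never consulted.
std::size_t workstation_gen0(const Gen0Inputs& in) {
    const std::size_t gen0 = std::max(4 * in.true_cache / 5, kMinGen0);
    return std::min(gen0, in.seg_size / 2);
}
```

For example, with a 45 MB true cache and a 256 MB segment, `workstation_gen0` returns 36 MB whether `physical_limit` is 100 MB or 100 GB, because that branch ignores the limit. In `server_gen0`, the true per-CPU cache size acts as a floor, so the limit can shrink gen0 only down to that value.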
@tmds The runtime reserves (in the virtual memory sense) that amount of memory on startup. Linux is happy to hand out 4GB of virtual address space on startup even if your container has a 100MB resident memory limit; it'll only complain when your resident set starts bumping up against 100MB. It is really interesting to me that workstation GC doesn't ever look at |
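The difference between reserving address space and making memory resident can be seen with plain `mmap` on Linux. The sketch below is a generic illustration of the reserve/commit pattern, not the runtime's allocator: the helper names `reserve`, `commit`, and `release` are hypothetical, and it assumes a Linux system where `MAP_NORESERVE` and anonymous mappings are available.

```cpp
#include <sys/mman.h>

#include <cstddef>

// Reserve a range of virtual addresses. PROT_NONE makes the range
// inaccessible, so no physical pages back it yet.
void* reserve(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Make part of a reserved range usable. The address must be page-aligned.
// Pages still become resident only when they are first written.
bool commit(void* addr, std::size_t bytes) {
    return mprotect(addr, bytes, PROT_READ | PROT_WRITE) == 0;
}

// Return the whole range to the OS.
void release(void* addr, std::size_t bytes) {
    munmap(addr, bytes);
}
```

Reserving a range far larger than a container's memory limit normally succeeds; the limit is enforced against resident pages, which is why the failure shows up later, when committed pages are actually touched.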
@swgillespie thoughts on this? |
@tmds That's what I'm saying here:
|
What are typical values of |
e.g. if this is 45MB, then a container with 1 CPU (GetLogicalCpuCount) and 100MB (GetPhysicalMemoryLimit) will have Workstation gen0 of 36MB and Server gen0 of 22.5MB. |
@tmds I just tried it on the beefiest machine I could find and got 30MB for |
I'm not sure, I think GetPhysicalMemoryLimit is a value in bytes.
I don't think they are related. The table values are in the INITIAL_ALLOC and LHEAP_ALLOC defines which get adjusted for processor count in get_valid_segment_size. |
Yeah, I don't know for sure. At any rate, I do think it's weird to not look at the physical memory limit at all when using workstation GC; Maoni probably has some thoughts too on that. |
@Maoni0 can you please take a look at this: https://github.com/dotnet/coreclr/issues/14991#issuecomment-348428003? |
@swgillespie can you please ping @Maoni0 to take a look at this issue? |
@tmds Maoni is currently out of the office; she'll be back in about a week. |
@Maoni0, can you please take a look at https://github.com/dotnet/coreclr/issues/14991#issuecomment-348428003? |
@tmds Sorry, I was out for a long time at the end of last year and missed some conversations. I believe that when people added the check for the physical memory limit, they were tuning for server workloads. Server machines generally had much larger caches than typical client machines, and workstation GC generally also did GCs more frequently, so it was sufficient to add the check only for server GC. I don't see any reason why we shouldn't check the physical memory limit for workstation GC if we have configurations that warrant it. Feel free to propose a change. |
PR dotnet/coreclr#15975 makes some changes based on https://github.com/dotnet/coreclr/issues/14991#issuecomment-348428003 |
As reported by @emanuelbalea here: https://github.com/dotnet/coreclr/issues/13489#issuecomment-343416765