Server memory increases when the client program repeats ActorSelection and Terminated #3430
Thanks for the profiling dump, it's very helpful. We are investigating this issue.
Ah, I see you have updated your starting post. Yeah, I came to the same conclusion: massive amounts of byte[] are being rooted inside the DotNetty codebase, and at first glance their PoolThreadCache seems to be the culprit. I'm still doing some experimentation to see whether it's really being rooted, or whether it just needs some time to be GC'ed. For example, I wonder what happens when you do a manual GC.Collect.
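One quick way to check whether those byte[] chunks are genuinely rooted or just waiting for a collection is to force a full, blocking GC (including large-object-heap compaction) and compare snapshots before and after. A minimal sketch of that kind of probe (the helper name and the MB formatting are just illustrative):

```csharp
using System;
using System.Runtime;

static class GcProbe
{
    // Forces a full blocking collection, including large-object-heap compaction,
    // so a follow-up memory snapshot only shows objects that are actually rooted.
    public static void ForceFullCollect()
    {
        long before = GC.GetTotalMemory(forceFullCollection: false);

        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true);
        GC.WaitForPendingFinalizers();
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true);

        long after = GC.GetTotalMemory(forceFullCollection: true);
        Console.WriteLine($"Managed heap: {before / 1024 / 1024} MB -> {after / 1024 / 1024} MB");
    }
}
```

If memory stays high even after this, the byte[] arrays are still reachable from somewhere (e.g. a thread-local cache) rather than merely uncollected.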
@iris4865 how long did you let your sample run before getting into the GB memory range? What I have done is clone your repo and then:
In all cases memory usage was stable. Can you provide more specifics about your test environment?
@Danthar I tried a manual GC.Collect. The number of GC.Collect calls has increased, but memory has still increased. Profiling result image (after 1 hour 55 minutes):
This is very detailed, thank you! Looks to me like an issue with how buffers are being released in DotNetty... I think this may have been fixed in their v0.4.8 branch, but I'll need to confirm.
Ah, I didn't dig deep enough into that memory dump. Totally missed that it ran for over 2 hours. Very odd behavior though, for it to balloon like that after 2 hours.
It's interesting to note that at a certain point in time we start seeing these types of errors, and that seems to correlate strongly with the sudden increase in memory consumption. Whether it's the cause or a symptom, I don't know. /cc @Aaronontheweb
Alright @iris4865, after some internal debate we decided we are sticking with the conclusion that it's something in DotNetty.
Thank you for the suggestion. I wonder about your two options.
I will also try to test it with the new config flag.
@iris4865 haven't tested it myself yet - I'm working on a custom diagnostic transport.
https://github.com/Aaronontheweb/Akka.Remote.DiagnosticDotNettyTransport - finished the first cut of my diagnostic transport. I'm running the code sample you sent me with it now and I'm going to leave it for a couple of hours. So far, I'm not seeing anything that looks like a memory leak - the dumps we're taking of the buffer pools look like they're empty most of the time. I'll let you know what I turn up, though. The large object heap remains steady at 32 MB, which is where all of the pooled buffers live. FYI, I'm running on .NET 4.6.1 since I don't have 4.7.1 installed.
Alrighty, so the system was totally stable for about the first hour or so, but then something happened, randomly by the looks of it, that created a massive spike in memory consumption very quickly. Memory usage jumped to about 1.8 GB and then leveled off.
It took about 10 minutes to go from ~100 MB total memory to 1.8 GB total memory, and it looks like nearly all of it went into the large object heap. I'm going to analyze the snapshot in a second, but here's the key bit: it took a random event of some kind to suddenly trigger this massive increase in memory consumption. This tells me that we're looking at a race condition somewhere inside DotNetty's buffer pooling code.
The relevant DotNetty metric reporting code that produced the data I captured in the previous comment can be found here:

```csharp
public string DumpStats()
{
int heapArenasLen = this.heapArenas?.Length ?? 0;
StringBuilder buf = new StringBuilder(512)
.Append(heapArenasLen)
.Append(" heap arena(s):")
.Append(StringUtil.Newline);
if (heapArenasLen > 0)
{
// ReSharper disable once PossibleNullReferenceException
foreach (PoolArena<byte[]> a in this.heapArenas)
{
buf.Append(a);
}
}
int directArenasLen = this.directArenas?.Length ?? 0;
buf.Append(directArenasLen)
.Append(" direct arena(s):")
.Append(StringUtil.Newline);
if (directArenasLen > 0)
{
// ReSharper disable once PossibleNullReferenceException
foreach (PoolArena<byte[]> a in this.directArenas)
{
buf.Append(a);
}
}
return buf.ToString();
}
```

Looks to me like the issue is that, rather suddenly, DotNetty started allocating much larger heap pool buffers than it had been doing before. The average pool size jumped up to about 16 MB right at the mark when memory consumption started climbing...
When I start the app clean the very first time, the total amount of buffer capacity is meant to be capped at roughly 16 MB total.
That clip from above is the readout for all 32 of the individual heap buffers... When the memory spike happened, we were reading about 3x the amount of data inside the first chunk compared to that same chunk at startup. There's also a correlation with the number of allocated objects - it's roughly 3x higher during the period of high memory consumption than it was during the "steady state" of the first hour or so of execution.
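For anyone wanting to capture readouts like the ones discussed above on a schedule, a small sketch along these lines works; it assumes `DumpStats()` is exposed publicly on the allocator (as in the snippet earlier) and uses the shared `PooledByteBufferAllocator.Default` instance, which may not be the exact allocator a given Akka.NET version wires into the transport:

```csharp
using System;
using System.Threading;
using DotNetty.Buffers;

static class AllocatorMonitor
{
    // Logs the pooled allocator's arena statistics at a fixed interval so that a
    // sudden growth in chunk sizes (like the 16 MB jump described above) shows up
    // in the log output with a timestamp.
    public static Timer Start(TimeSpan interval)
    {
        var allocator = PooledByteBufferAllocator.Default;
        return new Timer(_ =>
        {
            Console.WriteLine($"[{DateTime.UtcNow:O}] {allocator.DumpStats()}");
        }, null, TimeSpan.Zero, interval);
    }
}
```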
When I took my first memory snapshot earlier... there were only two [...]
I have a theory that this might be something we introduced in 1.3.7 as a result of my trying to work around some other DotNetty buffer-related issues in 1.3.6. I'm going to test it now using my instrumented transport and let you know what I find.
@iris4865 and I are on the same team. We tested the proposed option.
OK, this is caused by a bug I introduced in v1.3.6. I was able to apply a fix for it in my DiagnosticDotNettyTransport project and demonstrate that the issue did not occur again within 15-16 hours of continuous use. PR to come to Akka.NET shortly.
Welp, I'm not so sure my original theory was right. I left a comment on #3436 about why. I'm running a new test using your reproduction sample with the latest DotNetty v0.4.8 bits. Going to let it run for 24 hours and let you know what I find!
I tested with the pooling config. Memory usage is stable when the pooling setting is disabled, but I think this is only a temporary solution. Profiling link: https://raw.githubusercontent.com/iris4865/IssueStorages/master/AkkaRemote_1.3.6/ActorSelection/ActorSelection-Akka_1.3.6-DotNetty-0.4.6-pool%3Dfalse.dmw @Aaronontheweb
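For reference, a sketch of how that pooling switch can be applied from HOCON when constructing the server's ActorSystem. The key name `enable-pooling` under `akka.remote.dot-netty.tcp` is an assumption here and should be verified against the Remote.conf of the Akka.NET version in use:

```csharp
using Akka.Actor;
using Akka.Configuration;

class Server
{
    static void Main()
    {
        // "enable-pooling" is the assumed flag name for turning DotNetty's pooled
        // byte-buffer allocator off; check it against your Akka.NET version.
        var config = ConfigurationFactory.ParseString(@"
            akka {
                actor.provider = remote
                remote.dot-netty.tcp {
                    hostname = ""127.0.0.1""
                    port = 8081
                    enable-pooling = false
                }
            }");

        var system = ActorSystem.Create("Server", config);
        system.WhenTerminated.Wait();
    }
}
```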
@Aaronontheweb
@iris4865 looks like my fixes in #3436 really did help - I do see some slight increases in memory pool usage over time from DotNetty, but we're talking 32 MB to 76 MB over the span of 36 hours, not the 1.6 GB we were looking at earlier. I was able to reproduce this using both v0.4.6 and v0.4.8. I think this is because the singleton I'm using in #3436 actually behaves differently than the allocator we construct today. What we currently use (in 1.3.6 only): https://github.com/Azure/DotNetty/blob/dev/src/DotNetty.Buffers/PooledByteBufferAllocator.cs#L132. What we are reverting back to using in 1.3.7, if #3436 gets merged in, is a configuration that prefers direct buffers when available and thus utilizes the pool less frequently, depending on the runtime.
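To make the distinction concrete, here is a rough illustration of the two configurations being contrasted; it is a sketch of the idea rather than the exact code paths in Akka.Remote, and the per-instance constructor arguments are deliberately left at their defaults rather than the ones 1.3.6 actually passes:

```csharp
using DotNetty.Buffers;

static class AllocatorChoice
{
    // Roughly the 1.3.6 behaviour: a freshly constructed pooled allocator whose
    // arenas are private to that instance, so chunks it caches cannot be reused
    // by anything else in the process.
    public static IByteBufferAllocator PerInstance() => new PooledByteBufferAllocator();

    // Roughly the 1.3.7 / #3436 behaviour: the process-wide shared singleton,
    // whose arenas are shared by every channel that uses it and which prefers
    // direct buffers where the runtime supports them.
    public static IByteBufferAllocator Shared() => PooledByteBufferAllocator.Default;
}
```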
I've been taking a look at the DotNetty single-thread event executor and wondering whether, between what we do with Akka.Remote and what DotNetty does internally, it's possible that we're churning threads - i.e. killing off less-used threads and creating new threads later when there's more work to be done. https://github.com/Azure/DotNetty/blob/dev/src/DotNetty.Common/Concurrency/SingleThreadEventExecutor.cs - source for the STEE in DotNetty. Looks like they keep the threads and park them, periodically, with timeouts. The reason I mention this is that DotNetty's internal shared resources, such as buffer pools, operate using thread-local storage, and I'm pretty sure that the custom [...] So it's possible those actors could drive up memory consumption if those old [...] Compounding this to some extent... DotNetty also uses a custom [...] When using direct buffers, we avoid all of this - because nothing is pooled into thread-local storage. I think I'm going to need to do some testing against some of these threading scenarios and see if I can demonstrate a leak with the thread-local storage one way or another.
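A contrived sketch of the thread-churn concern, using plain `ThreadLocal<T>` from the BCL rather than DotNetty's own thread-local machinery to keep it self-contained: every thread that touches the "pool" gets its own arena-sized cache, and those cached chunks stay alive for as long as the owning thread does, so spawning and parking many threads multiplies the retained buffer memory.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class ThreadLocalCacheDemo
{
    // Stand-in for a per-thread buffer cache: each thread that allocates gets
    // its own 16 MB "chunk", mimicking an arena-sized thread-local cache.
    private static readonly ThreadLocal<byte[]> Cache =
        new ThreadLocal<byte[]>(() => new byte[16 * 1024 * 1024], trackAllValues: true);

    public static void Main()
    {
        var ready = new CountdownEvent(32);
        var threads = new List<Thread>();

        for (int i = 0; i < 32; i++)
        {
            var t = new Thread(() =>
            {
                Cache.Value[0] = 1;                 // touching the cache creates another chunk
                ready.Signal();
                Thread.Sleep(Timeout.Infinite);      // parked thread keeps its cache alive
            }) { IsBackground = true };
            t.Start();
            threads.Add(t);
        }

        ready.Wait();
        // 32 parked threads => 32 chunks (~512 MB) retained by thread-local caches.
        Console.WriteLine($"Cached chunks alive: {Cache.Values.Count}");
    }
}
```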
Closed it because I forgot to update my commit message :p
OK, this makes me feel like I'm not crazy - looks like gRPC ran into a similar problem with [...] That's exactly the sort of behavior I'm seeing during these long-running memory tests. It's not as severe when running with the changes from #3436, which BTW are available in the latest nightlies now @iris4865 - you can get them here: http://getakka.net/community/getting-access-to-nightly-builds.html
@Aaronontheweb test project with nightly version |
@iris4865 so it looks like the changes there had a positive effect. That's probably going to be the best we can do in the short run - I think we're going to want to write some tests to check for leaks in DotNetty's pooling code, but there's not much we can do to influence their release cycle other than submit pull requests and ask nicely. Worth noting: it's possible that memory consumption will increase over time when using Akka.Remote, since we do things like cache deserialized ActorPaths. When you're churning through new connections the way your reproduction sample does, that cache will slowly fill up over time. That could also be responsible for a small gradual memory increase (2-4 MB), but nothing dramatic like a big 16 MB chunk (which is the size of the heap arenas DotNetty allocates). I'm going to open some issues on the DotNetty project about this, but I think we've resolved the major issues in the short run - do you agree?
@Aaronontheweb Let's meet again in the next issue. ^^
Sounds like a plan! Closing.
Akka version: 1.3.6
Akka.Remote: 1.3.6 (with DotNetty 0.4.6)
Platform: .NET Framework 4.7
Project link: https://github.com/iris4865/IssueStorages/tree/master/AkkaRemote_1.3.6/ActorSelection
Server memory usage increases when the client program connects and disconnects. At the start of the program, memory usage was 35.85 MB; after the ActorSelection test it had increased to 1.88 GB. I did profiling with JetBrains dotMemory 2018.1. I think that DotNetty.Buffers is not being garbage collected.
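For readers who don't want to clone the linked project, a condensed sketch of the client-side pattern it exercises: build a remoting-enabled ActorSystem, resolve a remote actor via ActorSelection, send it a message, then tear the whole system down so the server observes the association drop (and any watchers receive Terminated), and repeat. The server address, port, and actor name below are placeholders, not the exact values from the repository.

```csharp
using System;
using Akka.Actor;
using Akka.Configuration;

class ClientLoop
{
    static void Main()
    {
        var config = ConfigurationFactory.ParseString(@"
            akka {
                actor.provider = remote
                remote.dot-netty.tcp {
                    hostname = ""127.0.0.1""
                    port = 0
                }
            }");

        for (var i = 0; ; i++)
        {
            var system = ActorSystem.Create("Client", config);

            // Placeholder address and actor name - substitute the server's real path.
            var selection = system.ActorSelection(
                "akka.tcp://Server@127.0.0.1:8081/user/echo");
            var remote = selection.ResolveOne(TimeSpan.FromSeconds(3)).Result;

            remote.Tell($"ping {i}");

            // Terminating the whole system closes the remote connection, so the
            // server side sees the disconnect on every iteration.
            system.Terminate().Wait();
        }
    }
}
```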
P.S. I think Akka version 1.1.3 (with Akka.Remote 1.1.3 and Helios 2.1.2) does not have this problem, so I am using version 1.1.3 temporarily.