
Memory leak when polling ssl endpoint #109600

Open
CoenraadS opened this issue Nov 7, 2024 · 8 comments

Comments

CoenraadS commented Nov 7, 2024

This issue is a follow up of #108741

Unmanaged memory growth occurs when polling a single ssl endpoint.

It occurs on an internal device which I do not have permission to share on a public network. I tested against other public websites but could not reproduce it; I did, however, capture a pcap. The leak occurs on an arm32 Linux system and amounts to about 10MB per hour with a 1s poll interval.

Setting DOTNET_SYSTEM_NET_SECURITY_TLSCACHESIZE=100 fixed it.
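
For completeness, a sketch of applying the workaround from inside the app rather than from the launch environment (this is an assumption of mine, not something verified in this issue): it only works if it runs before the first HTTPS request, since the cache size appears to be read just once. Exporting the variable in the shell, systemd unit, or container environment before the process starts is the more reliable route.

// Sketch: apply the TLS cache size workaround programmatically (assumption, not verified here).
// Must run before the first HTTPS request; setting the variable in the launch
// environment before the process starts is the safer option.
Environment.SetEnvironmentVariable("DOTNET_SYSTEM_NET_SECURITY_TLSCACHESIZE", "100");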

pcap:

capture.zip

OpenSSL 1.1.1t  7 Feb 2023
dotnet 8.0.403

Code

/// <summary>
/// This sample aims to reproduce an issue with unmanaged memory growth related to the HttpClient TLSCache
/// https://github.com/dotnet/runtime/issues/108741
/// Note that the issue was only observed on arm32 devices, and could not be reproduced in a Windows environment.
/// Build args: dotnet publish ./memoryleak.csproj -r linux-arm -c Release --self-contained true -o ./publish
/// =====
/// The app will print the managed heap size, which eventually stabilizes.
/// However, running a tool such as 'htop' shows that resident memory continues to grow by about 10MB an hour.
/// </summary>
class Program
{
    private static readonly HttpClient client;

    static Program()
    {
        var handler = new HttpClientHandler
        {
            ServerCertificateCustomValidationCallback = HttpClientHandler.DangerousAcceptAnyServerCertificateValidator
        };
        client = new HttpClient(handler);
    }

    static async Task Main(string[] args)
    {
        while (true)
        {
            try
            {
                using HttpResponseMessage response = await client.GetAsync("https://192.168.140.238:443/"); // This IP refers to an internal system and will not work.
            }
            catch (Exception ex)
            {
                Console.WriteLine($"[{DateTime.Now}] Error: {ex.Message}");
            }

            Console.WriteLine($"[{DateTime.Now}]: {GC.GetTotalMemory(true)}");
            await Task.Delay(1000);
        }
    }
}
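
As a side note, a small addition of my own (not part of the original sample) would also print the process working set, so the unmanaged growth is visible from the app itself rather than only from htop; Environment.WorkingSet reports the resident set size of the current process in bytes. This would replace the existing Console.WriteLine in the loop:

// Log managed heap and working set side by side to make the resident-memory growth visible.
long managed = GC.GetTotalMemory(forceFullCollection: true);
long workingSet = Environment.WorkingSet; // resident set size of the current process, in bytes
Console.WriteLine($"[{DateTime.Now}] managed: {managed / 1024.0 / 1024.0:F1} MB, working set: {workingSet / 1024.0 / 1024.0:F1} MB");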

Build:

dotnet publish ./memoryleak.csproj -r linux-arm -c Release --self-contained true -o ./publish
janvorli (Member) commented Nov 7, 2024

cc: @rzikm

@rzikm rzikm self-assigned this Nov 7, 2024
rzikm (Member) commented Nov 8, 2024

> Note: although the issue occurs on arm-linux, this pcap was captured from Windows. I hope the packets are the same.

This makes the capture not very useful, as the two TLS stack implementations are very different underneath. We need a capture from as close a configuration as you can get.

The captures also all seem to be TLS resumes; can you make sure that the first (non-resume) exchange is captured as well?

@rzikm rzikm added the needs-author-action An issue or pull request that requires more info or actions from the author. label Nov 8, 2024
CoenraadS (Author) commented Nov 8, 2024 via email

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Nov 8, 2024
wfurt (Member) commented Nov 11, 2024

If DOTNET_SYSTEM_NET_SECURITY_TLSCACHESIZE "fixes" the problem, it is probably not a leak. The default cache is pretty large, and it may be more visible on ARM as those systems are typically much smaller and have fewer resources.

As @rzikm mentioned, the implementations are very different on each platform. You can try to run it on WSL2 @CoenraadS to see if this is truly ARM specific. I may be able to give it a shot next week on my Raspberry.

CoenraadS (Author) commented Nov 18, 2024

@wfurt @rzikm I updated the original issue with a pcap capture from the arm device.

If it is not a bug, then I suppose the discussion of what an appropriate cache size is for a small device becomes more nuanced, and I'm OK to close the issue in that case. My only remark is that on a device with e.g. 512 MB of RAM (and less actually available), I would expect the default cache size (I'm unsure what it is) to be too large.

I noticed in the pcap that there are many New Session Ticket messages (with a 2-hour lifespan), so I'm wondering if the growth is related to that.

Some measurements (manually reading used memory from htop while running the code from the OP):

Time     Used memory
3:20 PM  42.6 MB
3:42 PM  59.9 MB
3:50 PM  68.3 MB
3:54 PM  72.3 MB

(I didn't have time to run a longer test, but this was just to confirm that the growth happens quite rapidly: the app uses ~36MB on startup, so within an hour it has basically doubled, all while only polling a single endpoint.)
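
Regarding the New Session Ticket observation above, one way to test whether client-side session caching is what grows (a sketch of my own, not something suggested in this thread) is to build the client on SocketsHttpHandler and disable TLS resumption; SslClientAuthenticationOptions.AllowTlsResume requires .NET 8 or later. If resident memory stays flat with resumption disabled, the growth is likely in the session/ticket cache.

using System.Net.Http;
using System.Net.Security;

// Sketch: disable TLS resumption so no session tickets are cached on the client side.
var handler = new SocketsHttpHandler
{
    SslOptions = new SslClientAuthenticationOptions
    {
        AllowTlsResume = false,                                          // requires .NET 8+
        RemoteCertificateValidationCallback = delegate { return true; }  // accept-any, test only
    }
};
var client = new HttpClient(handler);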

@rzikm rzikm modified the milestone: 10.0.0 Nov 18, 2024
@rzikm rzikm removed the untriaged New issue has not been triaged by the area owner label Nov 18, 2024
@rzikm rzikm removed their assignment Nov 18, 2024
@milen-denev

I had a .NET-based reverse proxy which started leaking memory after upgrading to version 9.0. The server was reaching 100% RAM usage within a few hours; this very same program had no issues on version 8.

rzikm (Member) commented Nov 25, 2024

@milen-denev does setting DOTNET_SYSTEM_NET_SECURITY_TLSCACHESIZE=100 work around the issue? Are you able to provide a minimal reproducible solution which we can run locally? Are you also running on an ARM CPU?

So far I did not have time to dig deeper into the issue, but having more data can only be beneficial.

@milen-denev

> @milen-denev does setting DOTNET_SYSTEM_NET_SECURITY_TLSCACHESIZE=100 work around the issue? Are you able to provide a minimal reproducible solution which we can run locally? Are you also running on an ARM CPU?
>
> So far I did not have time to dig deeper into the issue, but having more data can only be beneficial.

  1. I will try my best to test with this env var, though it has since been replaced by a Rust-based solution.
  2. I will try my best to find which part was leaking and upload a minimal reproducible solution.
  3. It was running on an AMD Epyc 4th gen x86 CPU.
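
For point 1, a quick way to confirm the variable actually reaches the process (easy to get wrong under systemd or in containers); this snippet is an illustration of my own, not something requested here:

// Print the configured TLS cache size; null means the variable was not inherited by the process.
Console.WriteLine(Environment.GetEnvironmentVariable("DOTNET_SYSTEM_NET_SECURITY_TLSCACHESIZE") ?? "<not set>");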
