-
Notifications
You must be signed in to change notification settings - Fork 29.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TLSWrap objects grow with HTTP keep-alive enabled until out of memory error occurs #33468
Comments
I've checked versions 13.0.0, 13.8.0 and 13.9.0 but all fail to close connections after the keep-alive timeout has elapsed, meaning that they bypass the test entirely, and leak unclosed sockets instead of closed ones. I went back to 12.6.3 and it behaves as expected with sockets closed after the keep-alive timeout elapses and does not retain TLSWrap objects. When it completes the snapshot only references a single TLSWrap object, which would be expected when both the client and server are destroyed. Conversely the 13.x versions above retain the TLSWrap objects even after client, server and all sockets are closed, which suggests that the retention was before 9fdb6e6. |
Ran into what appears to be the same or similar issue when updating node.js to 14.3 (from 8.x) in one of our services. In our case we're also using keep-alive and to exacerbate the issue, we deliberately destroy sockets 10 minutes after first use to ensure we're responding to DNS changes of servers that may not necessarily close connections. We've set up the service to take heapsnapshots every hour. Here's a comparison of one of the workers between hour 1 and hour 2 of operation, June 2 00:48Z and 01:48Z respectively. In this case I don't have an hour 3, because a graceful restart was necessary ahead of any OOM issues. I compiled and deployed a package of v14.x with 6f6bf01 cherry-picked on top. This appears to have resolved the issue, and the service thankfully made it through the night without intervention. Certainly some slightly different load profiles due to difference in time, but meaningful nonetheless. Here's a comparison of one of the workers between hour 1 and hour 9 of operation. Thank you @addaleax, and thanks @mgilbey for reporting the issue - hope it's fixed for you as well. |
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots through their associated `CleanupHookCallback`s which were stored on the `Environment`; however, this is inaccurate, because: - Edges in heap dumps imply a keeps-alive relationship, but cleanup hooks do not keep the `BaseObject`s that they point to alive. - It loses information about whether `BaseObject` instances are GC roots: Even weak `BaseObject`s are now, practically speaking, showing up as hanging off a GC root when that isn’t actually the case (e.g. in the description of nodejs#33468). Thus, this is a partial revert of f59ec2a. Refs: nodejs#33468 Refs: nodejs#27018
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots through their associated `CleanupHookCallback`s which were stored on the `Environment`; however, this is inaccurate, because: - Edges in heap dumps imply a keeps-alive relationship, but cleanup hooks do not keep the `BaseObject`s that they point to alive. - It loses information about whether `BaseObject` instances are GC roots: Even weak `BaseObject`s are now, practically speaking, showing up as hanging off a GC root when that isn’t actually the case (e.g. in the description of #33468). Thus, this is a partial revert of f59ec2a. Refs: #33468 Refs: #27018 PR-URL: #33809 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Since I upgraded from 12.16.3 to 12.17 I have had a memory leak in production that I am trying to figure out. I took a memory snapshot in the staging environment and I can see 898 TLSWrap objects with the same 4 retainers as mentioned by @mgilbey. I couldn't validate yet that is the leak because I was only able to take one snapshot but if it is, it was introduced in 12.17. |
We also started seeing this after upgrading from 12.16 to 12.17. |
I have done further testing on this issue. I have taken a memory snapshot in after 24 hours and I see 64000 TLSWrap objects using about 400MB of memory. In previous experiments I have seen that the number of these TLSWrap objects is always growing. I also tested that even 10 mins after I stopped all traffic to my service, and Http keepalive should have expired, the TLSWrap objects were still around. Is there any chances we could get this fixed? |
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots through their associated `CleanupHookCallback`s which were stored on the `Environment`; however, this is inaccurate, because: - Edges in heap dumps imply a keeps-alive relationship, but cleanup hooks do not keep the `BaseObject`s that they point to alive. - It loses information about whether `BaseObject` instances are GC roots: Even weak `BaseObject`s are now, practically speaking, showing up as hanging off a GC root when that isn’t actually the case (e.g. in the description of #33468). Thus, this is a partial revert of f59ec2a. Refs: #33468 Refs: #27018 PR-URL: #33809 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
@christhegrand @abartomeu I think it’s a good sign that for @omsmith this has been fixed through e8cc269. I’ve opened a backport PR for v12.x to expedite getting the fix in: #33962 It would be very helpful if you could build Node.js from source for that PR and confirm that it does solve the issue for you as well. |
Hi @addaleax, we have tried the patched version for 12h and the memory trend seems similar to that in v12.16.3. That alone would justify promoting this fix. I will leave it running for at least one more day and will update it. |
About 30h running and memory continues to be similar to that in v12.16.3. I haven't noticed any other change in any other performance parameters (CPU usage, response time, etc) @addaleax, could you kindly comment how long would this usually take to get to an official nodejs version? |
I've been debugging a memory leak in our Node.js app after upgrading from Node 10 to Node 14. I was close to giving up when stumbling upon this issue. Downgrading from 14.4.0 to 12.6.0 fixed it for us. |
We recently upgraded from 12.16 to 12.18 and noticed the memory leak. |
@karicem Just to make sure, are you confident that it is this one or is there a chance it’s a separate bug?
See #33962 – probably yes.
According to nodejs/Release#494, that would be July 14th. However, as you can see in the PR, we’re discussing potentially releasing this earlier than that. |
We haven't rebuilt from source with the patch but we have some data that seems to indicate it will. We have one app running with 12.18 that does not use keep-alive and it does not show signs of growing memory usage. When we look at memory usage of other apps running 12.18 that are using keep-alive it keeps increasing, at different rates depending on the app. |
Opened a v12.18.2 release proposal (#34077) which contains this patch. Aiming for it to be released next Tuesday, 30th June. Here's a release candidate build that you may be able to test with - https://nodejs.org/download/rc/v12.18.2-rc.0/ |
Thanks, we will have a look. |
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots through their associated `CleanupHookCallback`s which were stored on the `Environment`; however, this is inaccurate, because: - Edges in heap dumps imply a keeps-alive relationship, but cleanup hooks do not keep the `BaseObject`s that they point to alive. - It loses information about whether `BaseObject` instances are GC roots: Even weak `BaseObject`s are now, practically speaking, showing up as hanging off a GC root when that isn’t actually the case (e.g. in the description of #33468). Thus, this is a partial revert of f59ec2a. Refs: #33468 Refs: #27018 PR-URL: #33809 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Confirming that v12.18.2 and v14.5.0 no longer exhibit this behaviour. Thank you for the fix! I've updated the reproduction repository with new certificates as the original expired if people want to test other versions. |
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots through their associated `CleanupHookCallback`s which were stored on the `Environment`; however, this is inaccurate, because: - Edges in heap dumps imply a keeps-alive relationship, but cleanup hooks do not keep the `BaseObject`s that they point to alive. - It loses information about whether `BaseObject` instances are GC roots: Even weak `BaseObject`s are now, practically speaking, showing up as hanging off a GC root when that isn’t actually the case (e.g. in the description of #33468). Thus, this is a partial revert of f59ec2a. Refs: #33468 Refs: #27018 PR-URL: #33809 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots through their associated `CleanupHookCallback`s which were stored on the `Environment`; however, this is inaccurate, because: - Edges in heap dumps imply a keeps-alive relationship, but cleanup hooks do not keep the `BaseObject`s that they point to alive. - It loses information about whether `BaseObject` instances are GC roots: Even weak `BaseObject`s are now, practically speaking, showing up as hanging off a GC root when that isn’t actually the case (e.g. in the description of #33468). Thus, this is a partial revert of f59ec2a. Refs: #33468 Refs: #27018 PR-URL: #33809 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Platform:
Darwin <host> 19.4.0 Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64
and node docker imagenode:13.13.0-alpine
Subsystem: https
What steps will reproduce the bug?
I have created a script to reproduce at https://github.com/mgilbey/node-tls-wrap
The script runs a TLS server and https Agent to be self contained, but only the Agent is needed to reproduce the problem with a HTTPS server to connect to. The classes are only used to help identify which objects are holding references in the heap snapshots.
The script will open 100 connections, send and receive two HTTP requests on each connection (demonstrating keep-alive functions as expected) and then lets the sockets close due to the keep-alive timeout. This repeats five times, leaving 501 TLSWrap objects in memory. The scripts forces a garbage collection, and then collects a heap dump before exiting.
The agent reports the expected number of active / free sockets (never more than 100).
How often does it reproduce? Is there a required condition?
The script will consistently reproduce the issue.
I believe the salient points for reproduction are:
What is the expected behavior?
TLSWrap and TLSSocket objects are garbage collected when the socket from the agent is closed due to the keep-alive timeout, and no further keep-alive socket re-use is possible.
What do you see instead?
TLSWrap and TLSSocket objects will grow in memory until the old-space heap limits are reached at e.g. 10 or 300 MB limits. The number of TLSWrap and TLSSocket objects in memory will exceed the maxSockets, maxFreeSockets or maxCachedSessions Agent options.
An allocation timeline and heap snapshot are also included in the repository.
Each TLSWrap object has four retainers, although none appear to be the https Agent.
Additional information
To demonstrate the process crashing due to heap limits, run with
--max-old-space-size=10
and increase the number of iterations on line 128 to 50. You can increase heap and the number of iterations should you be more patient.The issue does not reproduce with the http Agent.
The cause appears to be very similar to resolved issue #19859 that affected the http Agent, particularly the reference to the open sockets being less than the maxSockets option.
The text was updated successfully, but these errors were encountered: