Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tls: fix leak of WriteWrap+TLSWrap combination #9626

Closed
wants to merge 1 commit into from

Conversation

indutny
Copy link
Member

@indutny indutny commented Nov 15, 2016

Checklist
  • make -j8 test (UNIX), or vcbuild test nosign (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

tls

Description of change

Writing data to TLSWrap instance during handshake will result in it
being queued in write_item_queue_. This queue won't get cleared up
until the end of the handshake.

Technically, it gets cleared on ~TLSWrap invocation, however this
won't ever happen because every WriteWrap holds a reference to the
TLSWrap through JS object, meaning that they are doomed to be alive
for eternity.

To breach this dreadful contract a knight shall embark from the
close function to kill the dragon of memory leak with his magic
spear of destroySSL.

destroySSL cleans up write_item_queue_ and frees SSL structure,
both are good for memory usage.

PR-URL: #9586
Reviewed-By: Ben Noordhuis info@bnoordhuis.nl


Backport of #9586, same PR should apply to v6.x too.

cc @thealphanerd @nodejs/lts

Writing data to TLSWrap instance during handshake will result in it
being queued in `write_item_queue_`. This queue won't get cleared up
until the end of the handshake.

Technically, it gets cleared on `~TLSWrap` invocation, however this
won't ever happen because every `WriteWrap` holds a reference to the
`TLSWrap` through JS object, meaning that they are doomed to be alive
for eternity.

To breach this dreadful contract a knight shall embark from the
`close` function to kill the dragon of memory leak with his magic
spear of `destroySSL`.

`destroySSL` cleans up `write_item_queue_` and frees `SSL` structure,
both are good for memory usage.

PR-URL: nodejs#9586
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@nodejs-github-bot nodejs-github-bot added tls Issues and PRs related to the tls subsystem. v4.x labels Nov 15, 2016
@MylesBorins
Copy link
Contributor

I've added the ctc-agenda tag to discuss how quickly we should be cutting v4 and v6 if this lands. If there is urgency we don't exactly have time to wait until the next LTS meeting

@subeeshcbabu-zz
Copy link

Thx @indutny 👍 . We are eagerly waiting for this release. It would be awesome, if there could be a possibility to release the patch at an earlier timeframe.

@rvagg
Copy link
Member

rvagg commented Nov 16, 2016

@indutny you need to help us come to an understanding of the severity of this so we can judge how urgent it is for LTS release lines to get pushed out, i.e. whether we need to pull our next planned release forward from the 6th of December.

My guess here is that it's not super urgent as you'd need to either be making lots of TLS client connections with this pattern of write-before-handshake-complete during the lifetime of a process or writing lots of data in that brief window or a combination of both situations. So it's either going to be unlikely to be experienced by a lot of people or if it is the leak will be small enough to not require urgent action on our end.

@indutny
Copy link
Member Author

indutny commented Nov 16, 2016

@rvagg this is a pretty serious bug in my opinion. Doing TLS client requests with some ETIMEOUT (which are usually inevitable) will result in constant leak of memory.

This resulted in many "out of memory" crashes for us. I don't have any data about either leaks or crashes at other companies.

@rvagg
Copy link
Member

rvagg commented Nov 16, 2016

@nodejs/lts we need to make a decision here. I'm of the opinion that we continue with our current schedule and roll this in to updates for v4 and v6 on the 6th as planned. That's only 3 weeks away. Doing it earlier will either mean rushing it out with less time for RCs or overlapping with Node Interactive. @thealphanerd @jasnell @Fishrock123 what think you?

@indutny
Copy link
Member Author

indutny commented Nov 16, 2016

@thealphanerd @jasnell @Fishrock123 I would like to emphasize how critical this is for my company. It practically stops adoption of v4, because it crashes in production.

@MylesBorins
Copy link
Contributor

I personally would like to see this land in a v7 release ASAP. We can then cut an R.C. with this commit included next week. This would then give 2 weeks for the change to bake in both the R.C. and the v7 release before we officially release.

@indutny does that seem reasonable?

@indutny
Copy link
Member Author

indutny commented Nov 16, 2016

@thealphanerd RC should be fine for our use case. @subeeshcbabu could you confirm this?

@subeeshcbabu-zz
Copy link

Yeah rc works for us.

@subeeshcbabu-zz
Copy link

We are constantly able to reproduce this leak, in the event of any connect timeout. I am surprised that others have not reported this yet.

@mstuart
Copy link

mstuart commented Nov 17, 2016

+1 This leak is a huge problem for our app. It can't stay alive more than 1 day... app starts up ~300mb and grows to 1.4gb with just a few hours of traffic until it's restarted.

Really looking forward to landing this fix! It'll be huge for those using node at scale!

Thanks @indutny.

@bluepnume
Copy link

Thanks for the fix, this is causing huge problems in our app with restarts every hour. Looking forward to trying out the patch.

@jayalane
Copy link

+1 very useful for stability and nicely implemented change.

@dmikey
Copy link

dmikey commented Nov 17, 2016

👍 could use this as well. Thanks!

@andreabondi
Copy link

+1 useful one!

});

c.write('hello', common.mustCall((err) => {
assert.equal(err.code, 'ECANCELED');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer strictEqual

@aashishRamamurthy
Copy link

+1

@grawk
Copy link

grawk commented Nov 17, 2016

Just want to reiterate the impact of the current behavior. Across all our application pools we are seeing roughly 400-700 restarts a day due to this. If we could get a rc by early next week it could smooth our holiday traffic considerably.

@MylesBorins
Copy link
Contributor

@MylesBorins
Copy link
Contributor

CI Looks good. BSD Failure expected

@thefourtheye as this is a backport I'm not going to land the nit, but have someone submitting a first PR against master to get it done 🎉

bf09078028

@indutny indutny deleted the backport/tls-leak-fix-4 branch November 18, 2016 19:18
@addaleax addaleax mentioned this pull request Nov 22, 2016
2 tasks
Fishrock123 added a commit that referenced this pull request Nov 22, 2016
This is a security release impacting Windows 10 users.

Notable changes:

* crypto: The `Decipher` methods `setAuthTag()` and `setAAD` now return
`this`. (Kirill Fomichev) #9398
* dns: Implemented `{ttl: true}` for `resolve4()` and `resolve6()`.
(Ben Noordhuis) #9296 &
#9296
* libuv: Upgrade to v1.10.1 (cjihrig)
#9647
  - Fixed a potential buffer overflow when writing data to console on
Windows 10. (CVE-2016-9551)
* process: Added a new `external` property to the data returned by
`memoryUsage()`. (Fedor Indutny)
#9587
* tls: Fixed a memory leak when writes were queued on TLS connection
that was destroyed during handshake. (Fedor Indutny)
#9626
* V8 (dep): Upgrade to v5.4.500.43 (Michaël Zasso)
#9697
* v8: The data returned by `getHeapStatistics()` now includes three new
fields: `malloced_memory`, `peak_malloced_memory`, and
`does_zap_garbage`. (Gareth Ellis)
#8610

PR-URL: #9745
Fishrock123 added a commit that referenced this pull request Nov 22, 2016
This is a security release impacting Windows 10 users.

Notable changes:

* crypto: The `Decipher` methods `setAuthTag()` and `setAAD` now return
`this`. (Kirill Fomichev) #9398
* dns: Implemented `{ttl: true}` for `resolve4()` and `resolve6()`.
(Ben Noordhuis) #9296 &
#9296
* libuv: Upgrade to v1.10.1 (cjihrig)
#9647
  - Fixed a potential buffer overflow when writing data to console on
Windows 10. (CVE-2016-9551)
* process: Added a new `external` property to the data returned by
`memoryUsage()`. (Fedor Indutny)
#9587
* tls: Fixed a memory leak when writes were queued on TLS connection
that was destroyed during handshake. (Fedor Indutny)
#9626
* V8 (dep): Upgrade to v5.4.500.43 (Michaël Zasso)
#9697
* v8: The data returned by `getHeapStatistics()` now includes three new
fields: `malloced_memory`, `peak_malloced_memory`, and
`does_zap_garbage`. (Gareth Ellis)
#8610

PR-URL: #9745
@AdriVanHoudt
Copy link
Contributor

Is there an ETA on when this will land on v4?

@MylesBorins
Copy link
Contributor

the rc went out yesterday. the intent is to ship it on the 6th of December

On Wed, Nov 23, 2016, 2:24 PM Adri Van Houdt notifications@github.com
wrote:

Is there an ETA on when this will land on v4?


You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub
#9626 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAecV3kavr_y09L3DqfEFQWTJAV_hz0bks5rA9w1gaJpZM4Ky-dm
.

@indutny
Copy link
Member Author

indutny commented Nov 23, 2016

@AdriVanHoudt it landed on v4 already, and is released in the latest RC.

@AdriVanHoudt
Copy link
Contributor

Ok cool! 👌 I'm seeing a memory leak in production only (yey) and hoping this will fix it

@indutny
Copy link
Member Author

indutny commented Nov 23, 2016

Please let us know if it will or won't

@AdriVanHoudt
Copy link
Contributor

@indutny I can try running the RC this weekend, problem is only production and high amounts of traffic. Is the RC stable, I know what it stands for but there are no like major things that are not working right?

@MylesBorins
Copy link
Contributor

MylesBorins commented Nov 23, 2016 via email

@AdriVanHoudt
Copy link
Contributor

I looked at them and seems fine indeed, I even found a commit I did in there haha, let me run in staging for a few hours and then production and I'll come back if everything is on fire or my memory is smooth sailing ;)

@AdriVanHoudt
Copy link
Contributor

I think it is fair to say that the issue is fixed (graph is memory, last drop is reboot with new rc version)

Thank you so much for finding and fixing this @indutny! And for the backport to v4 @thealphanerd ! 🚀

@indutny
Copy link
Member Author

indutny commented Nov 24, 2016

Hooray!

imyller added a commit to imyller/meta-nodejs that referenced this pull request Nov 28, 2016
This is a security release impacting Windows 10 users.

    Notable changes:

    * crypto: The `Decipher` methods `setAuthTag()` and `setAAD` now return
    `this`. (Kirill Fomichev) nodejs/node#9398
    * dns: Implemented `{ttl: true}` for `resolve4()` and `resolve6()`.
    (Ben Noordhuis) nodejs/node#9296 &
    nodejs/node#9296
    * libuv: Upgrade to v1.10.1 (cjihrig)
    nodejs/node#9647
      - Fixed a potential buffer overflow when writing data to console on
    Windows 10. (CVE-2016-9551)
    * process: Added a new `external` property to the data returned by
    `memoryUsage()`. (Fedor Indutny)
    nodejs/node#9587
    * tls: Fixed a memory leak when writes were queued on TLS connection
    that was destroyed during handshake. (Fedor Indutny)
    nodejs/node#9626
    * V8 (dep): Upgrade to v5.4.500.43 (Michaël Zasso)
    nodejs/node#9697
    * v8: The data returned by `getHeapStatistics()` now includes three new
    fields: `malloced_memory`, `peak_malloced_memory`, and
    `does_zap_garbage`. (Gareth Ellis)
    nodejs/node#8610

Signed-off-by: Ilkka Myller <ilkka.myller@nodefield.com>
@Trott Trott removed the ctc-agenda label Nov 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tls Issues and PRs related to the tls subsystem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.