Regression in idle socket handling #24980

Closed

timcosta opened this issue Dec 12, 2018 · 13 comments
Assignees
Labels
http Issues or PRs related to the http subsystem.

Comments

@timcosta
Contributor

timcosta commented Dec 12, 2018

  • Version: v8.14.0
  • Platform: macOS 10.13.3, node:carbon LTS docker image
  • Subsystem: http

I'd like to report a possible regression introduced to the http module between versions 8.9.4 and 8.14.0.

Sockets that are opened without transferring any data are closed immediately after data is finally transmitted, once they have idled for more than 40 seconds.

Reproduction available here: https://github.com/timcosta/node_tcp_regression_test
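
For reference, the failure mode can be reproduced roughly along these lines (a minimal sketch, not the exact code from the repository above; the port and timings are illustrative):

```js
const http = require('http');
const net = require('net');

const server = http.createServer((req, res) => res.end('ok'));

server.listen(0, () => {
  const { port } = server.address();
  const socket = net.connect(port, '127.0.0.1');

  socket.on('connect', () => {
    // Open the connection, then idle for longer than 40 seconds
    // before sending any bytes of the request.
    setTimeout(() => {
      socket.write('GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n');
    }, 45000);
  });

  socket.on('data', (chunk) => console.log('response:', chunk.toString()));
  // On affected versions the server answers the late request with an RST,
  // which surfaces here as ECONNRESET instead of a 200 response.
  socket.on('error', (err) => console.log('socket error:', err.code));
  socket.on('close', () => server.close());
});
```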

We ran a tcpdump on our servers that were 504ing, and saw that node is responding with an ACK followed almost immediately by a duplicate ACK with an additional RST on the same socket.

Timeouts are set to 60 seconds on the client (AWS ELB) and 2 minutes on the node server (hapi.js).

I'm filing this as a Node core issue because the error can be reproduced with both hapi and the bare Node http module, as can be seen in this Travis build: https://travis-ci.com/timcosta/node_tcp_regression_test/builds/94440224

The error is not fully consistent on Travis for versions 8.14.0, 10.14.2, and 11.4.0, but the build consistently passes on v8.9.4, which leads me to believe there is a possible regression.

cc: @jtymann @dstreby

@Trott
Member

Trott commented Dec 12, 2018

@nodejs/http

@lpinca
Member

lpinca commented Dec 12, 2018

Could this be caused by eb43bc04b1?

@lpinca lpinca added the http Issues or PRs related to the http subsystem. label Dec 12, 2018
@timcosta
Contributor Author

Seems likely @lpinca. The timings and behavior match up. That change seems to have broken Node.js back ends fronted by AWS ELBs with default settings, though, as this timeout is lower than the ELB default of 60 seconds.

@lpinca
Member

lpinca commented Dec 12, 2018

I see; that change was part of a security release, so it didn't go through the normal release cycle. I'm not sure why 40 seconds was chosen as the default value, but it can be customised.

@timcosta
Contributor Author

Hm, okay. I'd propose the default value be changed to something greater than 60 seconds, as that's the default timeout for ELBs, and this issue likely broke Node in a default configuration behind ELBs for more than just us.

@lpinca
Member

lpinca commented Dec 15, 2018

cc: @mcollina

@mcollina
Member

Currently we start waiting for the headers as soon as we receive a connection; however, we could start on the first byte instead, which would solve the issue at hand. I'll see if I can code something up to address this.

Note that this is configurable via https://nodejs.org/api/http.html#http_server_headerstimeout, so you can increase it to 60s, which solves your immediate issue.
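
For example (a minimal sketch; the 65s value is just illustrative, anything above the ELB's 60s idle timeout would do):

```js
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('hello');
});

// Raise the headers timeout (in milliseconds) above the ELB's 60s idle
// timeout so idle pre-request sockets are not reset before the ELB gives up.
server.headersTimeout = 65 * 1000;

server.listen(3000);
```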

We picked 40 seconds because it is the default Apache uses.

cc @nodejs/lts @MylesBorins

@MylesBorins MylesBorins self-assigned this Dec 18, 2018
@thomasjungblut

Does #26166 look related to you guys? I was just looking through other connection RST issues here.

@mcollina mcollina assigned mcollina and unassigned MylesBorins Mar 5, 2019
@thomasjungblut

thomasjungblut commented Mar 17, 2019

Cross-posting this with #26166:

We tried setting the headers timeout to 0s on top of node:dubnium-jessie-slim (currently at v10.15.3), but to no avail. We also tried on top of node:11 and still get these pesky 504s. Setting the idle timeout to 30s on the ELB did not seem to help either.

Furthermore, we have a similar issue using ALBs and WebSockets, which gives us 502s in that scenario.

Edit: we ultimately fixed it by sidecaring a single-host Go reverse proxy in front of the Node.js process.

@ezekg

ezekg commented Jul 16, 2019

@thomasjungblut how did sidecaring a Go reverse proxy fix your issue? I'm not sure I understand. I've been dealing with this ELB <-> Node issue for weeks and have not been able to find a solution. For a while I thought it was an unlikely k8s bug causing an RST packet, but after applying several patches to k8s for different problems, nothing worked.

Currently, my idle timeout setup looks like this (the high numbers are recommended by Google's load balancer documentation, which I used as a reference; see the sketch after this list for how they map onto the Node API):

  • Cloudflare idle timeout: 300s (not configurable)
  • ELB idle timeout: 600s
  • Server keepalive timeout: 620s
  • Server headers timeout: 621s
  • Server timeout: 120s (should this be higher?)

This should technically work, right? But it doesn't, and it doesn't make sense why I'm seeing so many 504s and connection resets. Any ideas?
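
For reference, here is roughly how those server-side values map onto the Node API (a sketch assuming a plain http.Server; the handler and port are placeholders):

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

// All values are in milliseconds.
server.keepAliveTimeout = 620 * 1000; // above the 600s ELB idle timeout
server.headersTimeout = 621 * 1000;   // above keepAliveTimeout
server.setTimeout(120 * 1000);        // the 120s socket timeout in question

server.listen(3000);
```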

@timcosta
Contributor Author

@ezekg you can read a bit more about how I solved it here: https://www.timcosta.io/how-we-found-a-tcp-hangup-issue-between-aws-elbs-and-node-js/

There are code snippets in that article to help you figure out exactly when the socket timeout is occurring, which will tell you which of the timeouts you are hitting.

TL;DR, though: all of your server timeouts need to be above the ELB timeout, so my guess is that yes, your server timeout needs to be higher.
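
If it helps, a hypothetical way to instrument this (not the exact snippets from the article) is to log the server's timeout and socket close events, which shows which timeout is actually firing and how old each connection is when it dies:

```js
const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

server.on('connection', (socket) => {
  const remote = `${socket.remoteAddress}:${socket.remotePort}`;
  const openedAt = Date.now();
  socket.on('close', (hadError) => {
    const age = ((Date.now() - openedAt) / 1000).toFixed(1);
    console.log(`socket ${remote} closed after ${age}s, hadError=${hadError}`);
  });
});

// Fired when a socket exceeds server.timeout without activity.
server.on('timeout', (socket) => {
  console.log('server timeout fired for', socket.remoteAddress);
});

server.listen(3000);
```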

@thomasjungblut

@ezekg Apparently Go closes connections properly with a FIN, and it deals with the RST packets from Node.js somewhat gracefully in that regard.

timcosta added a commit to timcosta/node that referenced this issue Oct 22, 2019
@Trott Trott closed this as completed in e17403e Dec 14, 2019
MylesBorins pushed a commit that referenced this issue Dec 17, 2019
Fixes: #24980
Refs: eb43bc04b1

PR-URL: #30071
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Trivikram Kamat <trivikr.dev@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Rich Trott <rtrott@gmail.com>
targos pushed a commit to targos/node that referenced this issue Apr 25, 2020
targos pushed a commit that referenced this issue Apr 28, 2020
@ismailyagci

For anyone who has had the same problem: after working on it for 15 days, I found the source of the problem. You can fix it by adding this code: "server.headersTimeout = 7200000;".
