
ECONNRESET errors after a while #179

Closed
karlhorky opened this issue Apr 28, 2021 · 17 comments · Fixed by #259
Labels: bug (Something isn't working)
Milestone: v2

Comments

@karlhorky (Contributor)

Hi @porsager !

Similar to #43, I'm encountering ECONNRESET errors with a PostgreSQL service on Render.com, but with a very low number of connections:

Apr 27 10:22:53 PM  node:internal/process/promises:246
Apr 27 10:22:53 PM            triggerUncaughtException(err, true /* fromPromise */);
Apr 27 10:22:53 PM            ^
Apr 27 10:22:53 PM  
Apr 27 10:22:53 PM  Error: read ECONNRESET    at TCP.onStreamRead (node:internal/stream_base_commons:211:20) {
Apr 27 10:22:53 PM    errno: -104,
Apr 27 10:22:53 PM    code: 'ECONNRESET',
Apr 27 10:22:53 PM    syscall: 'read'
Apr 27 10:22:53 PM  }
Apr 27 10:22:54 PM  error Command failed with exit code 1.

It seems like after a certain amount of time, the connection to PostgreSQL gets into a state where the ECONNRESET errors start happening. Once it's in that state, all queries fail for a while (a few minutes?). But then it eventually "wakes up" again (I guess Postgres.js eventually figures out the connection is dead and reconnects at that point).

I'm using postgres@2.0.0-beta.5 - I assume this beta version isn't unstable enough to cause something like this...? I never had any issues like this with Heroku's PostgreSQL service.

I've also asked for assistance from the Render.com support team over here: https://community.render.com/t/econnreset-with-node-js-and-postgresql/1498

@karlhorky (Contributor, Author)

If the connection is being terminated after a while of inactivity, maybe another approach would be to try the idle_timeout option...

@karlhorky (Contributor, Author)

Seems like Render.com is still on PostgreSQL 11.7 (maybe to be upgraded in the summer), although I guess that would not be the cause...

@karlhorky (Contributor, Author)

karlhorky commented Apr 28, 2021

Ok, some updates:

  1. connect_timeout: 10 had no effect
  2. idle_timeout: 2 has the service working again now 🤯

So I guess Render.com terminates idle PostgreSQL connections on the server side... I'll ask about this on the community page.

This might be a good thing to add to the Postgres.js docs too, in case this is a more common configuration...
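
For reference, here is a minimal sketch of the configuration that got the service working (assuming the connection details come from a DATABASE_URL environment variable; both timeouts are in seconds):

const postgres = require('postgres')

// Close idle connections after 2 seconds, well before the server-side
// termination kicks in. connect_timeout: 10 alone had no effect.
const sql = postgres(process.env.DATABASE_URL, {
  connect_timeout: 10,
  idle_timeout: 2,
})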

@porsager (Owner)

Hi Karl.

Yeah, this is probably a common scenario.

Could be included with the part you already contributed about lambdas etc. 🙂 https://github.com/porsager/postgres#idle-timeout

@karlhorky (Contributor, Author)

Ok, I've added a quick note there then: #180

I guess I'll close this issue for now, since there's nothing else that's really actionable unless there's more info from the Render.com support team.

@karlhorky (Contributor, Author)

karlhorky commented May 12, 2021

So @joshi4 from the Render.com team helped debug the issue further in the topic I opened on the community site.

Copying here for visibility:

...the server will close the connection when it hasn't seen any activity for approximately 2 hours and 11 minutes. That number comes from a combination of the following kernel settings:

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200 

If there's no activity on a TCP connection for 2 hours, the server sends up to 9 keep-alive probes, one every 75 seconds. If the connection peer (the client) doesn't reply to any of the probes, the server terminates the connection. It seems that replying to the empty keep-alive ACK is optional in the spec.

To test this out, I issued requests to the db 3 hours apart and noticed a bunch of ECONNRESET in my logs.

May 07 19:49:20Z test-conn-reset worker-95lpp Error Error: read ECONNRESET    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
May 07 19:49:20Z test-conn-reset worker-95lpp   errno: -104,
May 07 19:49:20Z test-conn-reset worker-95lpp   code: 'ECONNRESET',
May 07 19:49:20Z test-conn-reset worker-95lpp   syscall: 'read'
May 07 19:49:20Z test-conn-reset worker-95lpp }
May 07 19:49:20Z test-conn-reset worker-95lpp Error Error: read ECONNRESET    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
May 07 19:49:20Z test-conn-reset worker-95lpp   errno: -104,
May 07 19:49:20Z test-conn-reset worker-95lpp   code: 'ECONNRESET',
May 07 19:49:20Z test-conn-reset worker-95lpp   syscall: 'read'
May 07 19:49:20Z test-conn-reset worker-95lpp }
May 07 19:49:20Z test-conn-reset worker-95lpp Error Error: read ECONNRESET    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
May 07 19:49:20Z test-conn-reset worker-95lpp   errno: -104,
May 07 19:49:20Z test-conn-reset worker-95lpp   code: 'ECONNRESET',
May 07 19:49:20Z test-conn-reset worker-95lpp   syscall: 'read'
May 07 19:49:20Z test-conn-reset worker-95lpp }
May 07 19:49:20Z test-conn-reset worker-95lpp Error Error: read ECONNRESET    at TCP.onStreamRead (internal/stream_base_commons.js:209:20) {
May 07 19:49:20Z test-conn-reset worker-95lpp   errno: -104,
May 07 19:49:20Z test-conn-reset worker-95lpp   code: 'ECONNRESET',
May 07 19:49:20Z test-conn-reset worker-95lpp   syscall: 'read'
May 07 19:49:20Z test-conn-reset worker-95lpp }

I think the best way to handle this scenario is to do what you are already doing, which is to set an idle timeout. Setting the idle timeout to something less than 2 hours and 11 minutes should help you avoid this issue.
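
(For reference, those kernel settings add up: 7200 s + 9 × 75 s = 7875 s, i.e. roughly 2 hours and 11 minutes, which matches the cutoff described above.)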

What I ended up with was this configuration (a 2-hour idle timeout), which seems to work so far:

const sql = postgres({ idle_timeout: 7200 });

@karlhorky (Contributor, Author)

Updates:

  1. Ok, just failed again with the idle_timeout: 7200 - trying a new, lower value (idle_timeout: 6000)
  2. idle_timeout: 6000 is also intermittently failing, will try 3000

@karlhorky (Contributor, Author)

Update:

Just had another issue with 3000, going to reduce this to a low number of seconds like idle_timeout: 100.

It's unfortunate that we can't keep a persistent connection to the database with Render.com PostgreSQL.

@joshi4

joshi4 commented Jul 6, 2021

Following up from Karl's original issue.

I wanted to summarize a few of my observations based on the code in the following two repositories:

  1. postgresjs-conn-test (Node.js, using Postgres.js)
  2. conn-reset-go (Go)

I created a database in Render's Europe region and deployed both of the above repos as background workers.

For the postgresjs-conn-test repo, I was able to consistently trigger the ECONNRESET log by setting the idleTimeout to 1 hour and querying the db every 50 minutes. I observed that I got the ECONNRESET on every other run, i.e. once every 100 minutes.

I have yet to see a similar error in the conn-reset-go service. Both apps are pointing at the exact same database and make the same number of requests in the same time intervals.

I also tested the same setup in Render's US region, and I haven't been able to replicate the issue there with either the Node or the Go app.

Putting everything together, it seems to me that the most robust solution is to handle this on the client side. I notice that Go's standard library does account for such behavior here:

// QueryContext executes a query that returns rows, typically a SELECT.
// The args are for any placeholder parameters in the query.
func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}

This issue might also be related to #121.

Note: I currently work at Render.

@porsager (Owner)

porsager commented Jul 7, 2021

I've just released beta.6 which simply sets keepalive from the client to 60 seconds (a postgres db connection is fairly high bandwidth already, so I feel 60 seconds is a good start). I'm curious if something as simple as this will do? Would you mind giving it a go?

If not, we could look into doing what the Go library does, which appears to retry when this connection failure happens.
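
For context, client-side TCP keepalive at the raw Node socket level looks roughly like this (the host and port are placeholders; this is an illustration of the mechanism, not the actual Postgres.js source):

const net = require('net')

// Ask the OS to start sending TCP keep-alive probes on this socket
// after 60 seconds of inactivity, so a silently dropped connection
// gets detected instead of lingering forever.
const socket = net.connect(5432, 'db.example.com')
socket.setKeepAlive(true, 60 * 1000)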

@karlhorky karlhorky reopened this Jul 7, 2021
@karlhorky (Contributor, Author)

karlhorky commented Jul 7, 2021

I've just released beta.6 which simply sets keepalive from the client to 60 seconds (a postgres db connection is fairly high bandwidth already, so I feel 60 seconds is a good start). I'm curious if something as simple as this will do? Would you mind giving it a go?

Nice! Going to deploy a new version with beta.6 then. Am I understanding correctly that I can remove the idle_timeout config?

@porsager (Owner)

porsager commented Jul 7, 2021

It depends on the idle handling on the server you connect to. If there's a load balancer in between (e.g. pgbouncer) it might not respect TCP keepalives, but only care about the time since the last query (I don't know, but you could most likely find that information somewhere).

If my understanding of the issue is correct, what happens is that the server closes the TCP socket abruptly, so the connected Postgres.js client doesn't know it's gone. Then, when the next query is executed, the write fails because the connection is closed, and hence the query errors.

Now, if the server has a rule that it always closes a connection after a specific amount of time (regardless of idle time), there's not much else we can do but try to reconnect on failure. That would require some logic around backoff + retries, but it's definitely doable.
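
A minimal sketch of what such retry logic could look like, layered on top of an existing Postgres.js sql instance (the withRetry helper and the error codes it checks are illustrative assumptions, not part of the Postgres.js API):

// Hypothetical helper: retry a query a few times with exponential
// backoff when the connection was reset or closed underneath us.
async function withRetry(queryFn, retries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await queryFn()
    } catch (err) {
      // Only retry errors that look like a dead connection.
      const retriable = err.code === 'ECONNRESET' || err.code === 'CONNECTION_CLOSED'
      if (!retriable || attempt >= retries) throw err
      // Back off 100 ms, 200 ms, 400 ms, ... before trying again.
      await new Promise(resolve => setTimeout(resolve, 100 * 2 ** attempt))
    }
  }
}

// Usage: const users = await withRetry(() => sql`select * from users`)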

@karlhorky (Contributor, Author)

Ok, beta.6 is now deployed, and the configuration for idle_timeout has been removed - let's see!

@karlhorky (Contributor, Author)

Haven't had any issues since the deploy... I'll keep watching!

@porsager (Owner)

porsager commented Jul 9, 2021

That's great @karlhorky - Thanks for testing!

@karlhorky (Contributor, Author)

karlhorky commented Jul 18, 2021

Still no incidents since the deploy (21 days now)

@karlhorky (Contributor, Author)

Maybe I'll wait for this feature to land in non-beta v2 and then we can close this.

@karlhorky karlhorky changed the title ECONNRESET after a while, reconnecting takes a few minutes :( ECONNRESET errors after a while Jul 30, 2021
@porsager porsager added the bug Something isn't working label Jul 30, 2021
@porsager porsager added bug Something isn't working and removed bug Something isn't working labels Sep 10, 2021
@porsager porsager added this to the v2 milestone Sep 19, 2021
@porsager porsager mentioned this issue Jan 11, 2022 · Merged