At time DB reconnect fail with read tcp [::1]:49528->[::1]:5432: read: connection reset by peer #835

meetme2meat · 2019-03-11T11:10:11Z

Well, I have a sample script which I have been using to understand whether or how the lib/pq handle the reconnect.

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	"github.com/lib/pq"
)

func main() {
	url := "postgres://werain:@localhost/postgres?sslmode=disable&application_name=reconnect"
	connStr, err := pq.ParseURL(url)
	conn, err := sql.Open("postgres", connStr)
	if err != nil {
		log.Fatal(err)
	}
	for {
		result, err := conn.Exec("select 1;")
		if err != nil {
			fmt.Println(err)
			time.Sleep(10 * time.Second)
		}
		fmt.Println(result.RowsAffected())
		time.Sleep(5 * time.Second)
	}
}

I see the reconnect working sometime but there are time when I see the following error

read tcp [::1]:49528->[::1]:5432: read: connection reset by peer

Now I'm not sure why it works sometimes and does not work the other time. Is this something need to handle by the individual client(i.e reconnect) here?

NOTE: I'm not entirely sure if this issue is database/sql or lib/pq please help me guide through this

The text was updated successfully, but these errors were encountered:

GeertJohan · 2019-03-19T16:56:02Z

Having the same issue here. I believe this is a regression that has been introduced about a month ago, I never had tcp connection issues stopping pq from opening a transaction or starting a query before. In my case the occurs quite often on opening a new transaction. The underlying connection has timed out on the postgres side and then there's the "read: connection reset by peer" error.

gnuletik · 2020-05-21T17:02:44Z

@GeertJohan See #871 (comment) for workarounds.

GeertJohan · 2020-05-21T17:40:56Z

I have since moved to pgx, as the readme also advertises. V4 of pgx also.corretly handles connection issues.

Pq has been awesome, it helped me a lot and I'm super grateful to everyone who contributed to.pq. Thanks!! That said, it seems pgx has better maintenance and updates now, and more future-proof.

Lekensteyn · 2021-03-23T16:32:44Z

I believe that the read: connection reset by peer issue can occur in the following scenario:

lib/pq opens a connection earlier due to a query.
Some time passes. The router/load balancer/database server closes the TCP connection (e.g. due to idleness or a server restart), but without notifying the client.
The application queries again, lib/pq writes a query to the TCP socket. Since the client thinks that the TCP socket is open, this operation succeeds.
The server sends a TCP RST back since the connection was closed on the server side.
lib/pq reads the response, but fails with connection reset by peer since the socket was marked as closed via the TCP RST.

As pointed out by @gnuletik above, this issue is made worse by #702 (comment). Instead of allowing an operation to be retried immediately by database/sql, the following happens:

The first query fails due to a bad network connection. The connection is subsequently marked as bad in lib/pq.
The underlying network error is returned to database/sql which is propagated to the client.
The application tries a query again through database/sql -> lib/pq. Due to (1), lib/pq will initially return driver.ErrBadConn to database/sql which ensures a reconnection before database/sql retries the query.
The second query succeeds.

In other words: if a network error occurs, one query will result an error, but the next one should succeed.

If I am not mistaken, pgx suffers from the same issue (jackc/pgx#74). However by using the pgxpool connection pool, connection have a maximum lifetime (one hour per https://github.com/jackc/pgx/blob/v4.10.0/pgxpool/pool.go#L17), so the problem is somewhat alleviated. (Don't mistake the pgx "health checks" for actively probing the connection of a TCP socket, it only checks the lifetime, see https://github.com/jackc/pgx/blob/v4.10.0/pgxpool/pool.go#L318-L356)

An analysis of lib/pq conn.go:

func (cn *conn) send(m *writeBuf) {
	n, err := cn.c.Write(m.wrap())
	if err != nil {
		if n == 0 { // this case was added in lib/pq 1.9.0
			err = &safeRetryError{Err: err}
		}
		panic(err)
	}
}

func (cn *conn) prepareTo(q, stmtName string) *stmt {
	st := &stmt{cn: cn, name: stmtName}

	b := cn.writeBuf('P')
	b.string(st.name)
	b.string(q)
	b.int16(0)

	b.next('D')
	b.byte('S')
	b.string(st.name)

	b.next('S')
	// If the local socket is closed, this panics with a safeRetryError{Err: "write: connection reset by peer"}
	// if the local socket is open, this one will always succeed even if the socket is remotely closed.
	cn.send(b)

	// If the remote socket is closed, this will panic with "read: connection reset by peer"
	// see conn.readParseResponse -> conn.recv1 -> conn.recv1Buf ->
	// conn->recvMessage (which panics if conn.recvMessage -> io.ReadFull(cn.buf, x) fails)
	cn.readParseResponse()
	st.paramTyps, st.colNames, st.colTyps = cn.readStatementDescribeResponse()
	st.colFmts, st.colFmtData = decideColumnFormats(st.colTyps, cn.disablePreparedBinaryResult)
	cn.readReadyForQuery()
	return st
}

conn.prepareTo is called for conn.exec, conn.Prepare, etc. These callers rely on conn.errRecover (see below) to convert a panic to an error type. For safeRetryError, it will return driver.ErrBadConn which allows database/sql to close the connection and retry. In many other error cases, the network error will be returned as-is but the connection will be marked as bad (cn.setBad):

pq/error.go

Lines 481 to 518 in 072e83d

    
           func (cn *conn) errRecover(err *error) { 
        
           	e := recover() 
        
           	switch v := e.(type) { 
        
           	case nil: 
        
           		// Do nothing 
        
           	case runtime.Error: 
        
           		cn.setBad() 
        
           		panic(v) 
        
           	case *Error: 
        
           		if v.Fatal() { 
        
           			*err = driver.ErrBadConn 
        
           		} else { 
        
           			*err = v 
        
           		} 
        
           	case *net.OpError: 
        
           		cn.setBad() 
        
           		*err = v 
        
           	case *safeRetryError: 
        
           		cn.setBad() 
        
           		*err = driver.ErrBadConn 
        
           	case error: 
        
           		if v == io.EOF || v.(error).Error() == "remote error: handshake failure" { 
        
           			*err = driver.ErrBadConn 
        
           		} else { 
        
           			*err = v 
        
           		} 
        
           	default: 
        
           		cn.setBad() 
        
           		panic(fmt.Sprintf("unknown error: %#v", e)) 
        
           	} 
        
           	// Any time we return ErrBadConn, we need to remember it since *Tx doesn't 
        
           	// mark the connection bad in database/sql. 
        
           	if *err == driver.ErrBadConn { 
        
           		cn.setBad() 
        
           	} 
        
           }

There is however one problem, some errors (such as those implementing the error interface) do not use setBad and can therefore cause a connection to be stuck forever. Not sure what the history is behind that. I will file a PR to call setBad for error too.

the postgres driver had a regression where it wouldn't drop dead connections resulting in all queries failing with tcp errors. The driver never recovers and restarting your process is the only workaround. The issue was resolved in 1.9.0 of the driver related discussion: lib/pq#835 Signed-off-by: Taylor Silva <tsilva@pivotal.io>

agis · 2022-05-23T13:56:04Z

@Lekensteyn thanks for this analysis. So, if I understand properly, lib/pq still suffers from this issue, in which the first query will fail? If so, do you have a proposed fix by any chance?

Isn't DB.SetConnMaxLifetime() and/or DB.SetConnMaxIdleTime() sufficient, supposedly?

EDIT: Can this be related to go-sql-driver/mysql/issues/257 as well?

hechaoli · 2022-06-24T21:30:36Z

I've seen the same problem when connecting to a psql DB in docker. Somehow I can't reproduce the problem locally but it happens on CI occasionally. The code is like

func connect(url string) (*sql.DB, error) {
  db, err := sql.Open("postgres", url)
  if err != nil {
    return err, nil
  }
  query := db.QueryRowContext(ctx, "SELECT current_database()")
  var name string
  err = query.Scan(&name)
  if err != nil {
    return nil, errors.Wrap(err, "Could not select current database") // <- This error happens occasionally.
  }
  return db, nil
}

Occasionally on CI, I've seen this "Could not select current database: read tcp 127.0.0.1:46768->127.0.0.1:41973: read: connection reset by peer" error.

Many PRs referencing this issue just does an upgrade of the lib/pq version. Does that really fix the problem given that this issue is still open? I tried upgrading to 1.10.0 but still saw the problem.

zekth · 2023-06-12T18:10:42Z

@agis

Isn't DB.SetConnMaxLifetime() and/or DB.SetConnMaxIdleTime() sufficient, supposedly?

I wish this would solve this issue but if you use a service mesh which monitors the connections that doesn't change it. Let's assume you have a low traffic service, sometimes your pool can be using a connection that has been shut by the mesh and the pool is not aware of it, so you'll have to know what the mesh value is to adjust it in your configuration.

Sufficient kind of, acceptable at a consumer level? Not really, especially because the errors it not explicit enough for most of the lib users.

cruzluna · 2023-10-23T01:56:43Z

The work around that worked for me:

var db *pgx.Conn
	for i := 0; i < 20; i++ {
		db, err = pgx.Connect(ctx, dsn.String())
		if err == nil {
			break
		}

		time.Sleep(500 * time.Millisecond)
	}

Here is the code that gave me the error originally:

 db, err = pgx.Connect(ctx, dsn.String())
	 if err != nil {
	 	t.Fatalf("Couldn't connect to database: %s", err)
	 }

This was referenced Feb 22, 2021

Digital Ocean Managed Postgre Primary Failover #1026

Closed

write: connection reset by peer should close the connection #1028

Closed

Lekensteyn mentioned this issue Mar 17, 2021

Connection reset by peer when dabase server cpu usage reaches 100% #613

Open

Lekensteyn mentioned this issue Mar 23, 2021

Intermitten db connection error with aws rds #702

Open

taylorsilva mentioned this issue Mar 25, 2021

release/6.7.x: Bump lib/pq to 1.10.0 concourse/concourse#6746

Merged

11 tasks

andyesys mentioned this issue Aug 30, 2021

Connection reset by peer (upgrade lib/pq) brocaar/chirpstack-network-server#547

Closed

2 tasks

hechaoli mentioned this issue Jun 25, 2022

Fix VPC service metrics test Netflix/titus-executor#912

Merged

SaadAhmedGit added a commit to SaadAhmedGit/formify-backend that referenced this issue Dec 23, 2023

Fix gh CI failing tests lib/pq#835 (comment)

56465c8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

At time DB reconnect fail with read tcp [::1]:49528->[::1]:5432: read: connection reset by peer #835

At time DB reconnect fail with read tcp [::1]:49528->[::1]:5432: read: connection reset by peer #835

meetme2meat commented Mar 11, 2019 •

edited

Loading

GeertJohan commented Mar 19, 2019

gnuletik commented May 21, 2020

GeertJohan commented May 21, 2020

Lekensteyn commented Mar 23, 2021

agis commented May 23, 2022 •

edited

Loading

hechaoli commented Jun 24, 2022

zekth commented Jun 12, 2023

cruzluna commented Oct 23, 2023

At time DB reconnect fail with read tcp [::1]:49528->[::1]:5432: read: connection reset by peer #835

At time DB reconnect fail with read tcp [::1]:49528->[::1]:5432: read: connection reset by peer #835

Comments

meetme2meat commented Mar 11, 2019 • edited Loading

GeertJohan commented Mar 19, 2019

gnuletik commented May 21, 2020

GeertJohan commented May 21, 2020

Lekensteyn commented Mar 23, 2021

agis commented May 23, 2022 • edited Loading

hechaoli commented Jun 24, 2022

zekth commented Jun 12, 2023

cruzluna commented Oct 23, 2023

meetme2meat commented Mar 11, 2019 •

edited

Loading

agis commented May 23, 2022 •

edited

Loading