Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

At time DB reconnect fail with read tcp [::1]:49528->[::1]:5432: read: connection reset by peer #835

Open
meetme2meat opened this issue Mar 11, 2019 · 8 comments

Comments

@meetme2meat
Copy link

meetme2meat commented Mar 11, 2019

Well, I have a sample script which I have been using to understand whether or how the lib/pq handle the reconnect.

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	"github.com/lib/pq"
)

func main() {
	url := "postgres://werain:@localhost/postgres?sslmode=disable&application_name=reconnect"
	connStr, err := pq.ParseURL(url)
	conn, err := sql.Open("postgres", connStr)
	if err != nil {
		log.Fatal(err)
	}
	for {
		result, err := conn.Exec("select 1;")
		if err != nil {
			fmt.Println(err)
			time.Sleep(10 * time.Second)
		}
		fmt.Println(result.RowsAffected())
		time.Sleep(5 * time.Second)
	}
}

I see the reconnect working sometime but there are time when I see the following error

read tcp [::1]:49528->[::1]:5432: read: connection reset by peer

Now I'm not sure why it works sometimes and does not work the other time. Is this something need to handle by the individual client(i.e reconnect) here?

NOTE: I'm not entirely sure if this issue is database/sql or lib/pq please help me guide through this

@GeertJohan
Copy link

Having the same issue here. I believe this is a regression that has been introduced about a month ago, I never had tcp connection issues stopping pq from opening a transaction or starting a query before. In my case the occurs quite often on opening a new transaction. The underlying connection has timed out on the postgres side and then there's the "read: connection reset by peer" error.

@gnuletik
Copy link

@GeertJohan See #871 (comment) for workarounds.

@GeertJohan
Copy link

I have since moved to pgx, as the readme also advertises. V4 of pgx also.corretly handles connection issues.

Pq has been awesome, it helped me a lot and I'm super grateful to everyone who contributed to.pq. Thanks!! That said, it seems pgx has better maintenance and updates now, and more future-proof.

@Lekensteyn
Copy link

I believe that the read: connection reset by peer issue can occur in the following scenario:

  1. lib/pq opens a connection earlier due to a query.
  2. Some time passes. The router/load balancer/database server closes the TCP connection (e.g. due to idleness or a server restart), but without notifying the client.
  3. The application queries again, lib/pq writes a query to the TCP socket. Since the client thinks that the TCP socket is open, this operation succeeds.
  4. The server sends a TCP RST back since the connection was closed on the server side.
  5. lib/pq reads the response, but fails with connection reset by peer since the socket was marked as closed via the TCP RST.

As pointed out by @gnuletik above, this issue is made worse by #702 (comment). Instead of allowing an operation to be retried immediately by database/sql, the following happens:

  1. The first query fails due to a bad network connection. The connection is subsequently marked as bad in lib/pq.
  2. The underlying network error is returned to database/sql which is propagated to the client.
  3. The application tries a query again through database/sql -> lib/pq. Due to (1), lib/pq will initially return driver.ErrBadConn to database/sql which ensures a reconnection before database/sql retries the query.
  4. The second query succeeds.

In other words: if a network error occurs, one query will result an error, but the next one should succeed.

If I am not mistaken, pgx suffers from the same issue (jackc/pgx#74). However by using the pgxpool connection pool, connection have a maximum lifetime (one hour per https://github.com/jackc/pgx/blob/v4.10.0/pgxpool/pool.go#L17), so the problem is somewhat alleviated. (Don't mistake the pgx "health checks" for actively probing the connection of a TCP socket, it only checks the lifetime, see https://github.com/jackc/pgx/blob/v4.10.0/pgxpool/pool.go#L318-L356)


An analysis of lib/pq conn.go:

func (cn *conn) send(m *writeBuf) {
	n, err := cn.c.Write(m.wrap())
	if err != nil {
		if n == 0 { // this case was added in lib/pq 1.9.0
			err = &safeRetryError{Err: err}
		}
		panic(err)
	}
}

func (cn *conn) prepareTo(q, stmtName string) *stmt {
	st := &stmt{cn: cn, name: stmtName}

	b := cn.writeBuf('P')
	b.string(st.name)
	b.string(q)
	b.int16(0)

	b.next('D')
	b.byte('S')
	b.string(st.name)

	b.next('S')
	// If the local socket is closed, this panics with a safeRetryError{Err: "write: connection reset by peer"}
	// if the local socket is open, this one will always succeed even if the socket is remotely closed.
	cn.send(b)

	// If the remote socket is closed, this will panic with "read: connection reset by peer"
	// see conn.readParseResponse -> conn.recv1 -> conn.recv1Buf ->
	// conn->recvMessage (which panics if conn.recvMessage -> io.ReadFull(cn.buf, x) fails)
	cn.readParseResponse()
	st.paramTyps, st.colNames, st.colTyps = cn.readStatementDescribeResponse()
	st.colFmts, st.colFmtData = decideColumnFormats(st.colTyps, cn.disablePreparedBinaryResult)
	cn.readReadyForQuery()
	return st
}

conn.prepareTo is called for conn.exec, conn.Prepare, etc. These callers rely on conn.errRecover (see below) to convert a panic to an error type. For safeRetryError, it will return driver.ErrBadConn which allows database/sql to close the connection and retry. In many other error cases, the network error will be returned as-is but the connection will be marked as bad (cn.setBad):

pq/error.go

Lines 481 to 518 in 072e83d

func (cn *conn) errRecover(err *error) {
e := recover()
switch v := e.(type) {
case nil:
// Do nothing
case runtime.Error:
cn.setBad()
panic(v)
case *Error:
if v.Fatal() {
*err = driver.ErrBadConn
} else {
*err = v
}
case *net.OpError:
cn.setBad()
*err = v
case *safeRetryError:
cn.setBad()
*err = driver.ErrBadConn
case error:
if v == io.EOF || v.(error).Error() == "remote error: handshake failure" {
*err = driver.ErrBadConn
} else {
*err = v
}
default:
cn.setBad()
panic(fmt.Sprintf("unknown error: %#v", e))
}
// Any time we return ErrBadConn, we need to remember it since *Tx doesn't
// mark the connection bad in database/sql.
if *err == driver.ErrBadConn {
cn.setBad()
}
}

There is however one problem, some errors (such as those implementing the error interface) do not use setBad and can therefore cause a connection to be stuck forever. Not sure what the history is behind that. I will file a PR to call setBad for error too.

taylorsilva added a commit to concourse/concourse that referenced this issue Mar 25, 2021
the postgres driver had a regression where it wouldn't drop dead
connections resulting in all queries failing with tcp errors. The driver
never recovers and restarting your process is the only workaround. The
issue was resolved in 1.9.0 of the driver

related discussion: lib/pq#835

Signed-off-by: Taylor Silva <tsilva@pivotal.io>
@agis
Copy link

agis commented May 23, 2022

@Lekensteyn thanks for this analysis. So, if I understand properly, lib/pq still suffers from this issue, in which the first query will fail? If so, do you have a proposed fix by any chance?

Isn't DB.SetConnMaxLifetime() and/or DB.SetConnMaxIdleTime() sufficient, supposedly?

EDIT: Can this be related to go-sql-driver/mysql/issues/257 as well?

@hechaoli
Copy link

I've seen the same problem when connecting to a psql DB in docker. Somehow I can't reproduce the problem locally but it happens on CI occasionally. The code is like

func connect(url string) (*sql.DB, error) {
  db, err := sql.Open("postgres", url)
  if err != nil {
    return err, nil
  }
  query := db.QueryRowContext(ctx, "SELECT current_database()")
  var name string
  err = query.Scan(&name)
  if err != nil {
    return nil, errors.Wrap(err, "Could not select current database") // <- This error happens occasionally.
  }
  return db, nil
}

Occasionally on CI, I've seen this "Could not select current database: read tcp 127.0.0.1:46768->127.0.0.1:41973: read: connection reset by peer" error.

Many PRs referencing this issue just does an upgrade of the lib/pq version. Does that really fix the problem given that this issue is still open? I tried upgrading to 1.10.0 but still saw the problem.

@zekth
Copy link

zekth commented Jun 12, 2023

@agis

Isn't DB.SetConnMaxLifetime() and/or DB.SetConnMaxIdleTime() sufficient, supposedly?

I wish this would solve this issue but if you use a service mesh which monitors the connections that doesn't change it. Let's assume you have a low traffic service, sometimes your pool can be using a connection that has been shut by the mesh and the pool is not aware of it, so you'll have to know what the mesh value is to adjust it in your configuration.

Sufficient kind of, acceptable at a consumer level? Not really, especially because the errors it not explicit enough for most of the lib users.

@cruzluna
Copy link

The work around that worked for me:

var db *pgx.Conn
	for i := 0; i < 20; i++ {
		db, err = pgx.Connect(ctx, dsn.String())
		if err == nil {
			break
		}

		time.Sleep(500 * time.Millisecond)
	}

Here is the code that gave me the error originally:

 db, err = pgx.Connect(ctx, dsn.String())
	 if err != nil {
	 	t.Fatalf("Couldn't connect to database: %s", err)
	 }

SaadAhmedGit added a commit to SaadAhmedGit/formify-backend that referenced this issue Dec 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants