handling dead connections and broken pipes #74

Closed

emicklei opened this issue Apr 2, 2015 · 19 comments

Comments

@emicklei

emicklei commented Apr 2, 2015

Based on our experiences with this package, we have come up with what I consider a workaround for handling certain errors returned. In our apps, we use the ConnPool and therefore have to handle dead connections. More recently, we found that the underlying connection can report a broken pipe and EOF. We think these errors are recoverable by the app by re-acquiring a new connection.

The function below is what I currently use in all Dao functions; an example of its use follows.

func doWithConnection(pool PgxConnPool, maxRetries int, work func(conn *pgx.Conn) error) error {
    conn, err := pool.Acquire()

    if err == pgx.ErrDeadConn {
        if maxRetries > 0 {
            glog.Debugf("%v, try acquiring a non-dead connection", err)
            return doWithConnection(pool, maxRetries-1, work)
        }
    }
    // abort if still err
    if err != nil {
        return err
    }
    defer pool.Release(conn)

    // use the connection
    err = work(conn)

    // all ok?
    if err == nil {
        return nil
    }

    // if not, can we retry?
    canRetry := (strings.HasSuffix(err.Error(), "broken pipe") ||
        strings.HasSuffix(err.Error(), "EOF"))

    if canRetry && maxRetries > 0 {
        glog.Debugf("failed to do work with connection because:[%v], retry:%d", err, maxRetries)
        return doWithConnection(pool, maxRetries-1, work)
    }

    // give up
    return err
}

Example:

func (d PgPropertyDao) DeleteKey(del PropertyAccess) error {

    return doWithConnection(d.pool, 10, func(conn *pgx.Conn) error {
        _, err := conn.Exec("select * from delete_key_api($1,$2,$3,$4,$5);",
            del.Label,
            del.Version,
            del.Key,
            del.Who,
            del.Why)
        if err != nil {
            glog.Errorf("DeleteKey -> unable to delete_key_api(%#v):%v", del, err)
        }
        return err
    })
}

The question is whether the pgx package should handle this internally and, if not, whether this is the correct way to do so.

Thanks.

@jackc
Owner

jackc commented Apr 3, 2015

Are you connecting over a very unstable network? Lost connections should be a very rare occurrence under normal circumstances.

Especially this code below:

    conn, err := pool.Acquire()

    if err == pgx.ErrDeadConn {
        if maxRetries > 0 {
            glog.Debugf("%v, try acquiring a non-dead connection", err)
            return doWithConnection(pool, maxRetries-1, work)
        }
    }

pool.Acquire should only return that error if it needed to establish a new connection and it failed -- which, assuming it previously was able to connect, would likely indicate a network outage. If it's not giving you a good connection for any other reason, then that is potentially an area for improvement in pgx.

As far as the actual retry logic, it absolutely needs to wrap the call to work in a transaction. Otherwise, if work failed in the middle of executing multiple commands it would repeat already executed commands.
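For illustration only, here is one way the doWithConnection helper above could wrap the work callback in a transaction (a rough, untested sketch; it keeps transaction control to plain SQL via conn.Exec rather than any higher-level helper):

func doWithConnectionTx(pool PgxConnPool, maxRetries int, work func(conn *pgx.Conn) error) error {
    conn, err := pool.Acquire()
    if err != nil {
        return err
    }
    defer pool.Release(conn)

    // Run work inside a transaction so that a connection dying halfway
    // through does not leave already-executed statements behind on retry.
    if _, err = conn.Exec("begin"); err != nil {
        return err
    }

    if err = work(conn); err != nil {
        conn.Exec("rollback") // best effort; the connection may already be dead

        canRetry := strings.HasSuffix(err.Error(), "broken pipe") ||
            strings.HasSuffix(err.Error(), "EOF")
        if canRetry && maxRetries > 0 {
            return doWithConnectionTx(pool, maxRetries-1, work)
        }
        return err
    }

    _, err = conn.Exec("commit")
    return err
}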

I don't think pgx should handle retry logic internally. There are many potential error cases that need careful consideration. A generalized retry framework that worked as expected in the majority of situations would be difficult to get right. For most users, lost connections that can immediately be recovered from by reconnecting are rare enough that it is better to have the operation fail than to attempt an auto-redo and potentially cause unexpected side-effects.

@emicklei
Author

emicklei commented Apr 7, 2015

I can confirm that at least part of these problems are caused by unreliable networking (a wireless connection for development).

Especially during development (with interruptions, meetings, etc.), it can happen that my service running locally will not get any requests for an hour. I recall that these situations were not handled by the package, resulting in stale connections.

I will reconsider the handling of failures on a type-by-type basis and will test them individually. For this purpose, I created a small tcp-proxy service with an HTTP interface to inject such failure situations (closing the client, closing the remote, delaying response data...)

@badoet

badoet commented Jun 13, 2015

This occurrence is very common if you connect to GCE Cloud SQL:
Cloud SQL will terminate idle connections after a few minutes.
In other SQL drivers, you can set the idle connection limit to 0 - not a great solution.
Wondering if pgx has this idle connection parameter for the connection pooling as well.

@jackc
Owner

jackc commented Jun 13, 2015

I have not used GCE Cloud SQL, so I don't know how its idle detection works, but pgx does utilize TCP keep-alive by default.

https://github.com/jackc/pgx/blob/master/conn.go#L143

As far as the general case of lost connection and automatic retry, my opinion is still that retry logic would tend to have application-specific requirements and a generalized framework in the database driver layer would be very difficult to get right.

I haven't given this much thought, but possibly the database driver or connection pool could reduce the number of lost-connection failures the application has to deal with by occasionally "pinging" the server with "select 1". That would tend to detect some/most dead connections before the application encounters an error. Not sure of all the implications of that, though.
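As a rough sketch of what that could look like at the application level today (not something pgx does for you), using only Acquire, Release and Exec:

func keepAlive(pool *pgx.ConnPool, interval time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            conn, err := pool.Acquire()
            if err != nil {
                continue // pool couldn't hand out a connection; try again next tick
            }
            // If the connection is dead, this query fails, the connection marks
            // itself as not alive, and Release discards it from the pool.
            conn.Exec("select 1")
            pool.Release(conn)
        }
    }
}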

@gsalgado

The database/sql compatible pgx driver does return ErrBadConn as required, and when the stdlib's database/sql APIs get that error from the underlying driver, they do retry the operations (currently, up to 10 times). It'd be nice if the native pgx APIs behaved the same way.

@jackc
Owner

jackc commented Aug 1, 2015

Really surprising to me that database/sql has retry logic built in. I investigated what database/sql is doing, and was able to produce the type of errors I was concerned about (see go_database_sql_retry_bug for my test case). I reported this bug on the Go project. You can check there for more details and to see where the Go project ends up on this issue.

I still hold to the opinion that an automatic retry system cannot be correct. If there is a conflict between programmer ease-of-use and correctness, I will choose correctness. Automatic retry can sabotage correctness.

That said, I'm not opposed to an opt-in system for retries. It might be pretty simple to have a retry wrapper object around ConnPool. Then the underlying system is correct, but this retry object can be used as desired.

There's also still the possibility of automatically "pinging" the server with "select 1" on occasion. This could both keep idle connections from being killed, and detect dead connections and reconnect in the library layer without the application's notice.
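As a sketch of what that opt-in wrapper might look like (RetryPool and Do are made-up names, not anything in pgx, and this is untested):

type RetryPool struct {
    Pool       *pgx.ConnPool
    MaxRetries int
}

// Do runs work, retrying on a fresh connection if the previous one died.
// Only use this for operations known to be idempotent.
func (r *RetryPool) Do(work func(conn *pgx.Conn) error) error {
    var err error
    for attempt := 0; attempt <= r.MaxRetries; attempt++ {
        var conn *pgx.Conn
        conn, err = r.Pool.Acquire()
        if err != nil {
            return err // couldn't get any connection; retrying won't help
        }
        err = work(conn)
        alive := conn.IsAlive()
        r.Pool.Release(conn) // Release discards the connection if it is no longer alive
        if err == nil || alive {
            // Success, or an error unrelated to a dead connection: stop here.
            return err
        }
        // The connection died during work; loop and retry with another one.
    }
    return err
}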

@josephglanville
Contributor

Would it be reasonable to change the behaviour of Acquire so that it will try to find a live connection?

Currently, if you have a connection pool with a number of connections in it which are then disconnected (due to, say, the Postgres server going away), they will remain in the pool and will be returned to users via Acquire. Users then have to check conn.IsAlive, call pool.Release (which will de-allocate the dead connection), and continue to acquire connections this way until a new connection is established with the backend.

It seems to me it would be both safe and reasonable for the connection pool to encapsulate this behaviour, attempting to exhaust the pool for a live connection before trying to establish a new one and returning an error if that fails, thus only ever returning connections that were live at the time of acquisition.

I might put together a pull request to better illustrate the behaviour.

@jackc
Owner

jackc commented Sep 10, 2015

The ConnPool checks IsAlive when a connection is released (https://github.com/jackc/pgx/blob/master/conn_pool.go#L114). So bad connections are never returned to the pool. Conn.IsAlive only checks to see if the connection has marked itself as dead (https://github.com/jackc/pgx/blob/master/conn.go#L677). So if the connection was returned to the pool successfully, then the connection will always think it is alive on checkout.

The fundamental problem is there does not appear to be a way to determine if a net.Conn has been dropped without actually calling Read or Write.

So I don't know a way to ensure that the underlying connection is valid on ConnPool.Acquire. Where retry logic could be safely introduced is in ConnPool.Begin. That function could retry with a new connection when its internal Conn.Begin fails.

As an aside, my reply about a possible error in database/sql's retry logic was not entirely correct. database/sql has a narrow, correct case where it will do a retry. The bug appears to be in the pq library. It returns the error code that signals to database/sql that it can do a retry when it is not safe to do so.

@josephglanville
Contributor

Ahh I see, that makes sense. I was under the impression Conn.IsAlive checked if the socket was still good but I guess I should have validated that assumption.

I guess I will write a wrapper that checks out a connection, attempts Begin and continues to do that until either Begin succeeds or Acquire fails (which would indicate we couldn't open a new connection).
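Roughly like this, I suppose (untested sketch; I'm assuming Conn.Begin returns (*pgx.Tx, error) here):

func beginWithRetry(pool *pgx.ConnPool) (*pgx.Conn, *pgx.Tx, error) {
    for {
        conn, err := pool.Acquire()
        if err != nil {
            // Couldn't even open a new connection, so give up.
            return nil, nil, err
        }
        tx, err := conn.Begin()
        if err == nil {
            return conn, tx, nil
        }
        // Begin failed, most likely because this pooled connection was dead.
        // Releasing it discards it (IsAlive is now false), then try the next one.
        pool.Release(conn)
    }
}

A cap on the number of attempts would probably be sensible in practice.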

@jackc
Owner

jackc commented Sep 10, 2015

If you are starting a transaction immediately after checking out a connection, consider using ConnPool.Transaction instead. It would save you a few lines of code.

Also, ConnPool.Begin could be altered to safely retry automatically -- I just haven't gotten around to it.

If that change was made then you may not need to use a wrapper (a wrapper still could be valuable for retrying operations known to be idempotent).

@jackc
Owner

jackc commented Sep 13, 2015

Commits 4868929 through 93aa2b2 (thanks @josephglanville) add retry logic to ConnPool Begin. I think this is as good as we can get for safely and automatically retrying on dead connections. I'm going to close this issue. If someone thinks of another case we can safely handle, feel free to reopen this issue or create a new one.

@tejasmanohar
Contributor

@jackc Do normal (non-pooled) connections ever retry? I'd like to use some SET SESSION CHARACTERISTICS, which is risky if there are retries.

@jackc
Owner

jackc commented Oct 20, 2018

No. The only retry logic in pgx is Begin() through a connection pool as that is always safe.

@tejasmanohar
Contributor

Great, thanks

@matthewmueller

matthewmueller commented Feb 18, 2019

I just wanted to add that I ran into dead connections today within a Lambda function connecting to RDS. It's the first time it's happened in months of operation. This is definitely not a problem with this library, just something we need to make sure we handle on our end.

This is only a problem if you connect outside of the lambda handler; if you connect and disconnect within each run, you wouldn't run into this. I'm also just connecting to the database directly rather than acquiring a connection from the pool. This might be a mistake, but it seemed pointless to do pooling inside lambdas.

The way this surfaced was in the form of two errors:

set tcp $IP:$PORT: use of closed network connection
write tcp $IP1:$PORT1->$IP2:5432: write: connection timed out

If anyone knows the best way to handle this situation please let me know!
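The best I've come up with so far (a sketch only, untested; getConn and connConfig are my own names) is to keep the connection outside the handler but verify it with a cheap query at the start of each invocation, reconnecting if it has gone away:

var (
    conn       *pgx.Conn
    connConfig pgx.ConnConfig // filled in from the environment at cold start
)

func getConn() (*pgx.Conn, error) {
    if conn != nil {
        if _, err := conn.Exec("select 1"); err == nil {
            return conn, nil // connection is still good
        }
        conn.Close() // dead connection; discard it and reconnect below
        conn = nil
    }
    c, err := pgx.Connect(connConfig)
    if err != nil {
        return nil, err
    }
    conn = c
    return conn, nil
}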

@MOZGIII
Contributor

MOZGIII commented Feb 18, 2019

I have similar issues, also with RDS. No Lambda in my case, just a traditional long-lived deployment. We're using a connection pool.
I don't understand what's going on with the connection.

@matthewmueller

You know, I haven't had a chance to test this yet, but I think using pooling with a max connections setting of 1 would resolve this.
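Something along these lines is what I have in mind (untested, and I'm going from memory on the ConnPoolConfig field names):

func newPool(connConfig pgx.ConnConfig) (*pgx.ConnPool, error) {
    // A pool capped at a single connection, so the pool's dead-connection
    // handling (dropping dead connections on Release, retrying Begin) is
    // reused without actually holding more than one connection per container.
    return pgx.NewConnPool(pgx.ConnPoolConfig{
        ConnConfig:     connConfig, // same config used for the direct connection today
        MaxConnections: 1,
    })
}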

@MOZGIII
Contributor

MOZGIII commented Feb 22, 2019

Seems like it should; however, it won't fit our needs in terms of performance.

@MOZGIII
Contributor

MOZGIII commented Feb 22, 2019

Also, the last commit is about just this error, and we've already updated our pgx everywhere. You should try that too!
