handling dead connections and broken pipes #74

Closed

emicklei opened this issue Apr 2, 2015 · 19 comments

Comments

@emicklei

emicklei commented Apr 2, 2015

Based on our experiences with this package, we have come up with what I consider a workaround for handling certain errors returned. In our apps, we use the ConnPool and therefore have to handle dead connections. More recently, we found that the underlying connection can report a broken pipe and EOF. We think these errors are recoverable by the app by re-acquiring a new connection.

The function below is what I currently use in all Dao functions; an example of its use follows.

func doWithConnection(pool PgxConnPool, maxRetries int, work func(conn *pgx.Conn) error) error {
    conn, err := pool.Acquire()

    if err == pgx.ErrDeadConn {
        if maxRetries > 0 {
            glog.Debugf("%v, try acquiring a non-dead connection", err)
            return doWithConnection(pool, maxRetries-1, work)
        }
    }
    // abort if still err
    if err != nil {
        return err
    }
    defer pool.Release(conn)

    // use the connection
    err = work(conn)

    // all ok?
    if err == nil {
        return nil
    }

    // if not, can we retry?
    canRetry := (strings.HasSuffix(err.Error(), "broken pipe") ||
        strings.HasSuffix(err.Error(), "EOF"))

    if canRetry && maxRetries > 0 {
        glog.Debugf("failed to do work with connection because:[%v], retry:%d", err, maxRetries)
        return doWithConnection(pool, maxRetries-1, work)
    }

    // give up
    return err
}

Example:

func (d PgPropertyDao) DeleteKey(del PropertyAccess) error {

    return doWithConnection(d.pool, 10, func(conn *pgx.Conn) error {
        _, err := conn.Exec("select * from delete_key_api($1,$2,$3,$4,$5);",
            del.Label,
            del.Version,
            del.Key,
            del.Who,
            del.Why)
        if err != nil {
            glog.Errorf("DeleteKey -> unable to delete_key_api(%#v):%v", del, err)
        }
        return err
    })
}

The question is whether the pgx package should handle this internally and, if not, whether this is the correct way to do so.

Thanks.

@jackc
Owner

jackc commented Apr 3, 2015

Are you connecting over a very unstable network? Lost connections should be a very rare occurrence under normal circumstances.

Especially this code below:

    conn, err := pool.Acquire()

    if err == pgx.ErrDeadConn {
        if maxRetries > 0 {
            glog.Debugf("%v, try acquiring a non-dead connection", err)
            return doWithConnection(pool, maxRetries-1, work)
        }
    }

pool.Acquire should only return that error if it needed to establish a new connection and it failed -- which, assuming it previously was able to connect, would likely indicate a network outage. If it's not giving you a good connection for any other reason, then that is potentially an area for improvement in pgx.

As far as the actual retry logic, it absolutely needs to wrap the call to work in a transaction. Otherwise, if work failed in the middle of executing multiple commands it would repeat already executed commands.
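For illustration only, here is one way the doWithConnection helper above could wrap the work callback in a transaction (a rough, untested sketch; it keeps transaction control to plain SQL via conn.Exec rather than any higher-level helper):

func doWithConnectionTx(pool PgxConnPool, maxRetries int, work func(conn *pgx.Conn) error) error {
    conn, err := pool.Acquire()
    if err != nil {
        return err
    }
    defer pool.Release(conn)

    // Run work inside a transaction so that a connection dying halfway
    // through does not leave already-executed statements behind on retry.
    if _, err = conn.Exec("begin"); err != nil {
        return err
    }

    if err = work(conn); err != nil {
        conn.Exec("rollback") // best effort; the connection may already be dead

        canRetry := strings.HasSuffix(err.Error(), "broken pipe") ||
            strings.HasSuffix(err.Error(), "EOF")
        if canRetry && maxRetries > 0 {
            return doWithConnectionTx(pool, maxRetries-1, work)
        }
        return err
    }

    _, err = conn.Exec("commit")
    return err
}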

I don't think pgx should handle retry logic internally. There are many potential error cases that need careful consideration. A generalized retry framework that worked as expected in the majority of situations would be difficult to get right. For most users, lost connections that can immediately be recovered from by reconnecting are rare enough that it is better to have the operation fail than to attempt an auto-redo and potentially cause unexpected side-effects.

@emicklei
Author

emicklei commented Apr 7, 2015

I can confirm that at least part of these problems are caused by unreliable networking (a wireless connection for development).

Especially during development (with interruptions, meetings, etc.), it can happen that my service running locally will not get any requests for an hour. I recall that these situations were not handled by the package, resulting in stale connections.

I will reconsider the handling of failures on a type-by-type basis and will test them individually. For this purpose, I created a small tcp-proxy service with an HTTP interface to inject such failure situations (closing the client, closing the remote, delaying response data...)

@badoet

badoet commented Jun 13, 2015

This occurrence is very common if you connect to GCE Cloud SQL:
Cloud SQL will terminate idle connections after a few minutes.
In other SQL drivers, you can set the idle connection limit to 0 - not a great solution.
Wondering if pgx has this idle connection parameter for the connection pooling as well.

@jackc
Owner

jackc commented Jun 13, 2015

I have not used GCE Cloud SQL, so I don't know how its idle detection works, but pgx does utilize TCP keep-alive by default.

https://github.com/jackc/pgx/blob/master/conn.go#L143

As far as the general case of lost connection and automatic retry, my opinion is still that retry logic would tend to have application-specific requirements and a generalized framework in the database driver layer would be very difficult to get right.

I haven't given this much thought, but possibly the database driver or connection pool could reduce the number of lost-connection failures the application has to deal with by occasionally "pinging" the server with "select 1". That would tend to detect some/most dead connections before the application encounters an error. Not sure of all the implications of that, though.
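As a rough sketch of what that could look like at the application level today (not something pgx does for you), using only Acquire, Release and Exec:

func keepAlive(pool *pgx.ConnPool, interval time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            conn, err := pool.Acquire()
            if err != nil {
                continue // pool couldn't hand out a connection; try again next tick
            }
            // If the connection is dead, this query fails, the connection marks
            // itself as not alive, and Release discards it from the pool.
            conn.Exec("select 1")
            pool.Release(conn)
        }
    }
}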

@gsalgado

The database/sql compatible pgx driver does return ErrBadConn as required, and when the stdlib's database/sql APIs get that error from the underlying driver, they do retry the operations (currently, up to 10 times). It'd be nice if the native pgx APIs behaved the same way.

@jackc
Owner

jackc commented Aug 1, 2015

Really surprising to me that database/sql has retry logic built in. I investigated what database/sql is doing, and was able to produce the type of errors I was concerned about (see go_database_sql_retry_bug for my test case). I reported this bug on the Go project. You can check there for more details and to see where the Go project ends up on this issue.

I still hold to the opinion that an automatic retry system cannot be correct. If there is a conflict between programmer ease-of-use and correctness, I will choose correctness. Automatic retry can sabotage correctness.

That said, I'm not opposed to an opt-in system for retries. It might be pretty simple to have a retry wrapper object around ConnPool. Then the underlying system is correct, but this retry object can be used as desired.

There's also still the possibility of automatically "pinging" the server with "select 1" on occasion. This could both keep idle connections from being killed, and detect dead connections and reconnect in the library layer without the application's notice.
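As a sketch of what that opt-in wrapper might look like (RetryPool and Do are made-up names, not anything in pgx, and this is untested):

type RetryPool struct {
    Pool       *pgx.ConnPool
    MaxRetries int
}

// Do runs work, retrying on a fresh connection if the previous one died.
// Only use this for operations known to be idempotent.
func (r *RetryPool) Do(work func(conn *pgx.Conn) error) error {
    var err error
    for attempt := 0; attempt <= r.MaxRetries; attempt++ {
        var conn *pgx.Conn
        conn, err = r.Pool.Acquire()
        if err != nil {
            return err // couldn't get any connection; retrying won't help
        }
        err = work(conn)
        alive := conn.IsAlive()
        r.Pool.Release(conn) // Release discards the connection if it is no longer alive
        if err == nil || alive {
            // Success, or an error unrelated to a dead connection: stop here.
            return err
        }
        // The connection died during work; loop and retry with another one.
    }
    return err
}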

@josephglanville
Contributor

Would it be reasonable to change the behaviour of Acquire so that it will try to find a live connection?

Currently, if you have a connection pool with a number of connections in it which are then disconnected (due to, say, the Postgres server going away), they will remain in the pool and will be returned to users via Acquire. Users then have to check conn.IsAlive, call pool.Release (which will de-allocate the dead connection), and continue to acquire connections this way until a new connection is established with the backend.

It seems to me it would be both safe and reasonable for the connection pool to encapsulate this behaviour, attempting to exhaust the pool for a live connection before trying to establish a new one and returning an error if that fails, thus only ever returning connections that were live at the time of acquisition.

I might put together a pull request to better illustrate the behaviour.

@jackc
Owner

jackc commented Sep 10, 2015

The ConnPool checks IsAlive when a connection is released (https://github.com/jackc/pgx/blob/master/conn_pool.go#L114). So bad connections are never returned to the pool. Conn.IsAlive only checks to see if the connection has marked itself as dead (https://github.com/jackc/pgx/blob/master/conn.go#L677). So if the connection was returned to the pool successfully, then the connection will always think it is alive on checkout.

The fundamental problem is there does not appear to be a way to determine if a net.Conn has been dropped without actually calling Read or Write.

So I don't know a way to ensure that the underlying connection is valid on ConnPool.Acquire. Where retry logic could be safely introduced is in ConnPool.Begin. That function could retry with a new connection when its internal Conn.Begin fails.

As an aside, my reply about a possible error in database/sql's retry logic was not entirely correct. database/sql has a narrow, correct case where it will do a retry. The bug appears to be in the pq library. It returns the error code that signals to database/sql that it can do a retry when it is not safe to do so.

@josephglanville
Contributor

Ahh I see, that makes sense. I was under the impression Conn.IsAlive checked if the socket was still good but I guess I should have validated that assumption.

I guess I will write a wrapper that checks out a connection, attempts Begin and continues to do that until either Begin succeeds or Acquire fails (which would indicate we couldn't open a new connection).
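Roughly like this, I suppose (untested sketch; I'm assuming Conn.Begin returns (*pgx.Tx, error) here):

func beginWithRetry(pool *pgx.ConnPool) (*pgx.Conn, *pgx.Tx, error) {
    for {
        conn, err := pool.Acquire()
        if err != nil {
            // Couldn't even open a new connection, so give up.
            return nil, nil, err
        }
        tx, err := conn.Begin()
        if err == nil {
            return conn, tx, nil
        }
        // Begin failed, most likely because this pooled connection was dead.
        // Releasing it discards it (IsAlive is now false), then try the next one.
        pool.Release(conn)
    }
}

A cap on the number of attempts would probably be sensible in practice.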

@jackc
Owner

jackc commented Sep 10, 2015

If you are starting a transaction immediately after checking out a connection, consider using ConnPool.Transaction instead. It would save you a few lines of code.

Also, ConnPool.Begin could be altered to safely retry automatically -- I just haven't gotten around to it.

If that change was made then you may not need to use a wrapper (a wrapper still could be valuable for retrying operations known to be idempotent).

@jackc
Owner

jackc commented Sep 13, 2015

Commits 4868929 through 93aa2b2 (thanks @josephglanville) add retry logic to ConnPool Begin. I think this is as good as we can get for safely and automatically retrying on dead connections. I'm going to close this issue. If someone thinks of another case we can safely handle, feel free to reopen this issue or create a new one.

@tejasmanohar
Contributor

@jackc Do normal (non-pooled) connections ever retry? I'd like to use some SET SESSION CHARACTERISTICS, which is risky if there are retries.

@jackc
Owner

jackc commented Oct 20, 2018

No. The only retry logic in pgx is Begin() through a connection pool as that is always safe.

@tejasmanohar
Contributor

Great, thanks

@matthewmueller

matthewmueller commented Feb 18, 2019

I just wanted to add that I ran into dead connections today within a Lambda function connecting to RDS. It's the first time it's happened in months of operation. This is definitely not a problem with this library, just something we need to make sure we handle on our end.

This is only a problem if you connect outside of the lambda handler; if you connect and disconnect within each run, you wouldn't run into this. I'm also just connecting to the database directly rather than acquiring a connection from the pool. This might be a mistake, but it seemed pointless to do pooling inside lambdas.

The way this surfaced was in the form of two errors:

set tcp $IP:$PORT: use of closed network connection
write tcp $IP1:$PORT1->$IP2:5432: write: connection timed out

If anyone knows the best way to handle this situation please let me know!
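The best I've come up with so far (a sketch only, untested; getConn and connConfig are my own names) is to keep the connection outside the handler but verify it with a cheap query at the start of each invocation, reconnecting if it has gone away:

var (
    conn       *pgx.Conn
    connConfig pgx.ConnConfig // filled in from the environment at cold start
)

func getConn() (*pgx.Conn, error) {
    if conn != nil {
        if _, err := conn.Exec("select 1"); err == nil {
            return conn, nil // connection is still good
        }
        conn.Close() // dead connection; discard it and reconnect below
        conn = nil
    }
    c, err := pgx.Connect(connConfig)
    if err != nil {
        return nil, err
    }
    conn = c
    return conn, nil
}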

@MOZGIII
Contributor

MOZGIII commented Feb 18, 2019

I have similar issues, also with RDS. No Lambda in my case, just a traditional long-lived deployment. We're using a connection pool.
I don't understand what's going on with the connection.

@matthewmueller

You know, I haven't had a chance to test this yet, but I think using pooling with a max connections setting of 1 would resolve this.
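Something along these lines is what I have in mind (untested, and I'm going from memory on the ConnPoolConfig field names):

func newPool(connConfig pgx.ConnConfig) (*pgx.ConnPool, error) {
    // A pool capped at a single connection, so the pool's dead-connection
    // handling (dropping dead connections on Release, retrying Begin) is
    // reused without actually holding more than one connection per container.
    return pgx.NewConnPool(pgx.ConnPoolConfig{
        ConnConfig:     connConfig, // same config used for the direct connection today
        MaxConnections: 1,
    })
}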

@MOZGIII
Contributor

MOZGIII commented Feb 22, 2019

Seems like it should; however, it won't fit our needs in terms of performance.

@MOZGIII
Contributor

MOZGIII commented Feb 22, 2019

Also, the last commit is about just this error, and we've already updated our pgx everywhere. You should try that too!
