Panic in (*Pool).SendBatch: close of closed channel #1920

Closed
ytzur opened this issue Feb 26, 2024 · 5 comments

ytzur commented Feb 26, 2024

Describe the bug
We have a Go service that uses *Pool from vendor/github.com/jackc/pgx/v5/pgxpool/pool.go to perform write statements (insert/update/delete) on tables in a Postgres instance. (Technically the connection is established to a pgbouncer instance, which in turn connects to the DB instance itself, but I'm not sure how relevant that is.)
The write statements are performed using the pgx Batch construct and the (*Pool).SendBatch method.
We use pgx v5.4.3 and Go 1.21.5, and our Postgres DB instance runs version 15.4.
We have several hundred instances of the service, all running roughly the same code in this area for at least 3 months, connected indirectly via pgbouncer to several different Postgres instances. Today, for the first time, one of the instances encountered a panic with the following stack trace:

panic: close of closed channel

goroutine 23998 [running]:
github.com/jackc/pgx/v5/pgconn.(*PgConn).receiveMessage(0xc016b5fb00)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:546 +0x165
github.com/jackc/pgx/v5/pgconn.(*Pipeline).GetResults(0xc016b5fc48)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:1997 +0x52
github.com/jackc/pgx/v5/pgconn.(*Pipeline).Close(0xc016b5fc48)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:2093 +0xf6
github.com/jackc/pgx/v5.(*Conn).sendBatchExtendedWithDescription.func1()
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/conn.go:1068 +0x2b
github.com/jackc/pgx/v5.(*Conn).sendBatchExtendedWithDescription(0xc0b4fbb9e0, {0x82a4ba0?, 0xba6bd60}, 0xc016918078, {0xc0914abe20, 0x3, 0xc077cd6410?}, {0x82b2c20, 0xc068b078f0})
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/conn.go:1086 +0xa0a
github.com/jackc/pgx/v5.(*Conn).sendBatchQueryExecModeCacheStatement(0xc0b4fbb9e0, {0x82a4ba0?, 0xba6bd60}, 0xc016918078)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/conn.go:1008 +0x426
github.com/jackc/pgx/v5.(*Conn).SendBatch(0xc0b4fbb9e0, {0x82a4ba0?, 0xba6bd60?}, 0xc016918078)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/conn.go:913 +0x5a7
github.com/jackc/pgx/v5/pgxpool.(*Conn).SendBatch(0xc00a1ae1e0?, {0x82a4ba0?, 0xba6bd60?}, 0xc01f7cda40?)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgxpool/conn.go:95 +0x33
github.com/jackc/pgx/v5/pgxpool.(*Pool).SendBatch(0xc02e4628e8?, {0x82a4ba0, 0xba6bd60}, 0xc02e462828?)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgxpool/pool.go:645 +0x45
...

The part omitted after the ellipsis is our application code; the first call into the pgx package is the call to (*Pool).SendBatch.

To Reproduce
How we initialize the client:

// connectionString := "<connection string>"
config, err := pgxpool.ParseConfig(connectionString)
if err != nil {
    return nil, err
}
config.MaxConnIdleTime = maxWriteConnIdleTime
writeClient, err := pgxpool.NewWithConfig(context.Background(), config)
if err != nil {
    return nil, err
}
return writeClient, nil

A schematic example of how we create a batch, add a query to the batch and invoke the client's SendBatch method:

func update() error {
    batch := &pgx.Batch{}
    batch.Queue(query, arg_1, ..., arg_n)
    return client.SendBatch(context.Background(), batch).Close()
}

Several invocations of similar versions of the above update() function run concurrently in separate goroutines, with all goroutines sharing a pointer to the same *Pool instance.
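
For illustration, a minimal, self-contained sketch of the usage pattern described above; the connection string, table, query, and the number of goroutines are placeholders, not the actual application code:

package main

import (
    "context"
    "sync"

    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgxpool"
)

// update queues a single write statement into a batch and sends it through
// the shared pool; Close releases the underlying connection and returns the
// first error encountered, if any.
func update(pool *pgxpool.Pool, query string, args ...any) error {
    batch := &pgx.Batch{}
    batch.Queue(query, args...)
    return pool.SendBatch(context.Background(), batch).Close()
}

func main() {
    pool, err := pgxpool.New(context.Background(), "<connection string>")
    if err != nil {
        panic(err)
    }
    defer pool.Close()

    // All goroutines share the same *pgxpool.Pool pointer, mirroring the
    // concurrent access described in the report.
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            _ = update(pool, "UPDATE items SET updated_at = now() WHERE id = $1", 1)
        }()
    }
    wg.Wait()
}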

Expected behavior
No panic.

Actual behavior
A panic happens.

Version

  • Go: 1.21.5
  • PostgreSQL: PostgreSQL 15.4 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
  • pgx: github.com/jackc/pgx/v5 v5.4.3

Additional context
N/A

ytzur added the bug label Feb 26, 2024

jackc commented Feb 27, 2024

That path is only hit when the PostgreSQL server returns a FATAL error. If I had to guess I would say that checking if the connection is already closed in that path would resolve the issue. But I'm not sure how the connection could already be closed there...
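
(For context on the panic itself, this is the general Go hazard rather than pgconn's actual code: closing an already-closed channel always panics, so any close that is reachable from more than one code path needs a guard. A minimal sketch of the usual pattern, with made-up names:)

package example

import "sync"

// conn is a made-up stand-in, not pgconn.PgConn; it only illustrates the
// "close of closed channel" hazard and the usual guard against it.
type conn struct {
    closeOnce sync.Once
    done      chan struct{}
}

// markClosed is safe to call from multiple code paths: the channel is closed
// exactly once, so later calls are no-ops instead of panicking.
func (c *conn) markClosed() {
    c.closeOnce.Do(func() { close(c.done) })
}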

ytzur commented Feb 27, 2024

But I'm not sure how the connection could already be closed there...

@jackc To add to what I mentioned above: after further investigation we found that this happened when our Postgres DB instance was restarted as part of maintenance by GCP (the instance is managed by GCP).
In the pgbouncer logs we see entries like the following from around the time the panic and the DB instance restart happened:

closing because: server conn crashed? (age=188s)

In addition, as I said, we are using the same pool object concurrently from several goroutines sharing a pointer to it. Is it possible that this occurred because of some unresolved race condition that can happen when the PostgreSQL server/pgbouncer returns a FATAL error during concurrent access to the pool by several goroutines?

jackc added a commit that referenced this issue Mar 1, 2024
Otherwise, it might be possible to panic when closing the pipeline if it
tries to read a connection that should be closed but still has a fatal
error on the wire.

#1920

jackc commented Mar 1, 2024

I think I have resolved the issue in 2e84dcc, but with something so rare it is hard to be sure. It appears that the problem could occur in pipeline mode (which is used by a batch) if the connection was killed while reading the responses. Closing the pipeline would then cause it to read the remaining expected responses. If the connection was dead but somehow there was still a FATAL error on the wire, that could trigger the problem.

I was only able to induce this by mocking a PostgreSQL server and purposely sending two FATAL error messages in a row. To my knowledge the real PostgreSQL server never does this. But perhaps PgBouncer does...

Anyway, regardless of exactly how that state is reached, I think this should fix it.
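
As a rough caller-side illustration of what the fix means in practice (assumed application code, not part of pgx itself): with the patch, a connection that dies while the batch results are being read should surface as an error from Close rather than a panic.

package example

import (
    "context"
    "fmt"

    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgxpool"
)

// sendBatch sends a batch through the shared pool. With the fix, a FATAL
// error left on the wire of a dead connection is reported as an error here
// instead of panicking.
func sendBatch(ctx context.Context, pool *pgxpool.Pool, batch *pgx.Batch) error {
    if err := pool.SendBatch(ctx, batch).Close(); err != nil {
        // The pool drops the broken connection when it is released, so a
        // subsequent SendBatch gets a fresh connection.
        return fmt.Errorf("send batch: %w", err)
    }
    return nil
}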

jackc closed this as completed Mar 1, 2024

ytzur commented Mar 1, 2024

@jackc Thanks! How can we use a version of the library with this fix? Is there a new release planned and, if so, when can we expect it?

jackc commented Mar 1, 2024

The next patch release will probably be some time next week.
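
(Until that release is tagged, a specific commit can also be pinned with Go modules, e.g. go get github.com/jackc/pgx/v5@2e84dcc, which the go tool resolves to a pseudo-version.)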
