Panic in (*Pool).SendBatch: close of closed channel #1920

Closed
ytzur opened this issue Feb 26, 2024 · 5 comments

ytzur commented Feb 26, 2024

Describe the bug
We have a Go service that uses *Pool from vendor/github.com/jackc/pgx/v5/pgxpool/pool.go to perform write statements (insert/update/delete) on tables in a Postgres instance. (Technically the connection is established to a pgbouncer instance, which in turn connects to the DB instance itself, but I'm not sure how relevant that is.)
The write statements are performed using the pgx Batch construct and the (*Pool).SendBatch method.
We use pgx v5.4.3 and Go 1.21.5, and our Postgres DB instance runs version 15.4.
We have several hundred instances of the service, all running roughly the same code in this area for at least 3 months, connected indirectly via pgbouncer to several different Postgres instances. Today, for the first time, one of the instances encountered a panic with the following stack trace:

panic: close of closed channel

goroutine 23998 [running]:
github.com/jackc/pgx/v5/pgconn.(*PgConn).receiveMessage(0xc016b5fb00)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:546 +0x165
github.com/jackc/pgx/v5/pgconn.(*Pipeline).GetResults(0xc016b5fc48)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:1997 +0x52
github.com/jackc/pgx/v5/pgconn.(*Pipeline).Close(0xc016b5fc48)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgconn/pgconn.go:2093 +0xf6
github.com/jackc/pgx/v5.(*Conn).sendBatchExtendedWithDescription.func1()
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/conn.go:1068 +0x2b
github.com/jackc/pgx/v5.(*Conn).sendBatchExtendedWithDescription(0xc0b4fbb9e0, {0x82a4ba0?, 0xba6bd60}, 0xc016918078, {0xc0914abe20, 0x3, 0xc077cd6410?}, {0x82b2c20, 0xc068b078f0})
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/conn.go:1086 +0xa0a
github.com/jackc/pgx/v5.(*Conn).sendBatchQueryExecModeCacheStatement(0xc0b4fbb9e0, {0x82a4ba0?, 0xba6bd60}, 0xc016918078)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/conn.go:1008 +0x426
github.com/jackc/pgx/v5.(*Conn).SendBatch(0xc0b4fbb9e0, {0x82a4ba0?, 0xba6bd60?}, 0xc016918078)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/conn.go:913 +0x5a7
github.com/jackc/pgx/v5/pgxpool.(*Conn).SendBatch(0xc00a1ae1e0?, {0x82a4ba0?, 0xba6bd60?}, 0xc01f7cda40?)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgxpool/conn.go:95 +0x33
github.com/jackc/pgx/v5/pgxpool.(*Pool).SendBatch(0xc02e4628e8?, {0x82a4ba0, 0xba6bd60}, 0xc02e462828?)
        /go/src/twistlock/vendor/github.com/jackc/pgx/v5/pgxpool/pool.go:645 +0x45
...

The part omitted after the ellipsis is our application code; the first call into the pgx package is the call to (*Pool).SendBatch.

To Reproduce
How we initialize the client:

// connectionString := "<connection string>"
config, err := pgxpool.ParseConfig(connectionString)
if err != nil {
    return nil, err
}
config.MaxConnIdleTime = maxWriteConnIdleTime
writeClient, err := pgxpool.NewWithConfig(context.Background(), config)
if err != nil {
    return nil, err
}
return writeClient, nil

A schematic example of how we create a batch, add a query to the batch and invoke the client's SendBatch method:

func update() error {
    batch := &pgx.Batch{}
    batch.Queue(query, arg_1, ..., arg_n)
    return client.SendBatch(context.Background(), batch).Close()
}

Several invocations of similar versions of the above update() function run concurrently in separate goroutines, with all goroutines sharing a pointer to the same *Pool instance.
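
For illustration, a minimal, self-contained sketch of the usage pattern described above; the connection string, table, query, and the number of goroutines are placeholders, not the actual application code:

package main

import (
    "context"
    "sync"

    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgxpool"
)

// update queues a single write statement into a batch and sends it through
// the shared pool; Close releases the underlying connection and returns the
// first error encountered, if any.
func update(pool *pgxpool.Pool, query string, args ...any) error {
    batch := &pgx.Batch{}
    batch.Queue(query, args...)
    return pool.SendBatch(context.Background(), batch).Close()
}

func main() {
    pool, err := pgxpool.New(context.Background(), "<connection string>")
    if err != nil {
        panic(err)
    }
    defer pool.Close()

    // All goroutines share the same *pgxpool.Pool pointer, mirroring the
    // concurrent access described in the report.
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            _ = update(pool, "UPDATE items SET updated_at = now() WHERE id = $1", 1)
        }()
    }
    wg.Wait()
}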

Expected behavior
No panic.

Actual behavior
A panic happens.

Version

  • Go: 1.21.5
  • PostgreSQL: PostgreSQL 15.4 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
  • pgx: github.com/jackc/pgx/v5 v5.4.3

Additional context
N/A

ytzur added the bug label Feb 26, 2024

jackc commented Feb 27, 2024

That path is only hit when the PostgreSQL server returns a FATAL error. If I had to guess I would say that checking if the connection is already closed in that path would resolve the issue. But I'm not sure how the connection could already be closed there...
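
(For context on the panic itself, this is the general Go hazard rather than pgconn's actual code: closing an already-closed channel always panics, so any close that is reachable from more than one code path needs a guard. A minimal sketch of the usual pattern, with made-up names:)

package example

import "sync"

// conn is a made-up stand-in, not pgconn.PgConn; it only illustrates the
// "close of closed channel" hazard and the usual guard against it.
type conn struct {
    closeOnce sync.Once
    done      chan struct{}
}

// markClosed is safe to call from multiple code paths: the channel is closed
// exactly once, so later calls are no-ops instead of panicking.
func (c *conn) markClosed() {
    c.closeOnce.Do(func() { close(c.done) })
}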

ytzur commented Feb 27, 2024

But I'm not sure how the connection could already be closed there...

@jackc To add to what I mentioned above: after further investigation we found that this happened when our Postgres DB instance was restarted as part of maintenance by GCP (the instance is managed by GCP).
In the pgbouncer logs we see entries like the following from around the time the panic and the DB instance restart happened:

closing because: server conn crashed? (age=188s)

In addition, as I said, we are using the same pool object concurrently from several goroutines sharing a pointer to it. Is it possible that this occurred because of some unresolved race condition that can happen when the PostgreSQL server/pgbouncer returns a FATAL error during concurrent access to the pool by several goroutines?

jackc added a commit that referenced this issue Mar 1, 2024
Otherwise, it might be possible to panic when closing the pipeline if it
tries to read a connection that should be closed but still has a fatal
error on the wire.

#1920

jackc commented Mar 1, 2024

I think I have resolved the issue in 2e84dcc, but with something so rare it is hard to be sure. It appears that the problem could occur in pipeline mode (which is used by a batch) if the connection was killed while reading the responses. Closing the pipeline would then cause it to read the remaining expected responses. If the connection was dead but somehow there was still a FATAL error on the wire, that could trigger the problem.

I was only able to induce this by mocking a PostgreSQL server and purposely sending two FATAL error messages in a row. To my knowledge the real PostgreSQL server never does this. But perhaps PgBouncer does...

Anyway, regardless of exactly how that state is reached, I think this should fix it.
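
As a rough caller-side illustration of what the fix means in practice (assumed application code, not part of pgx itself): with the patch, a connection that dies while the batch results are being read should surface as an error from Close rather than a panic.

package example

import (
    "context"
    "fmt"

    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgxpool"
)

// sendBatch sends a batch through the shared pool. With the fix, a FATAL
// error left on the wire of a dead connection is reported as an error here
// instead of panicking.
func sendBatch(ctx context.Context, pool *pgxpool.Pool, batch *pgx.Batch) error {
    if err := pool.SendBatch(ctx, batch).Close(); err != nil {
        // The pool drops the broken connection when it is released, so a
        // subsequent SendBatch gets a fresh connection.
        return fmt.Errorf("send batch: %w", err)
    }
    return nil
}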

jackc closed this as completed Mar 1, 2024

ytzur commented Mar 1, 2024

@jackc Thanks! How can we use a version of the library with this fix? Is there a new release planned and, if so, when can we expect it?

jackc commented Mar 1, 2024

The next patch release will probably be some time next week.
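
(Until that release is tagged, a specific commit can also be pinned with Go modules, e.g. go get github.com/jackc/pgx/v5@2e84dcc, which the go tool resolves to a pseudo-version.)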
