Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-20.2: sql/pgwire: fix statement buffer memory leak when using suspended portals #67370

Merged
merged 1 commit into from
Jul 9, 2021

Conversation

joesankey
Copy link
Contributor

Backport 1/1 commits from #67135.

/cc @cockroachdb/release


The connection statement buffer grows indefinitely when the client uses the
execute portal with limit feature of the Postgres protocol, eventually causing
the node to crash out of memory. Any long running query that uses the limit
feature will cause this memory leak such as the EXPERIMENTAL CHANGEFEED FOR
statement. The execute portal with limit feature of the Postgres protocol is
used by the JDBC Postgres driver to fetch a limited number of rows at a time.

The leak is caused by commands accumulating in the buffer and never getting
cleared out. The client sends 2 commands every time it wants to receive more
rows:

  • Execute {"Portal": "C_1", "MaxRows": 1}
  • Sync

The server processes the commands and leaves them in the buffer, every
iteration causes 2 more commands to leak.

A similar memory leak was fixed by #48859, however the execute with limit
feature is implemented by a side state machine in limitedCommandResult. The
cleanup routine added by #48859 is never executed for suspended portals as
they never return to the main conn_executor loop.

After this change the statement buffer gets trimmed to reclaim memory after
each client command is processed in the limitedCommandResult side state
machine. The StmtBuf.Ltrim function was changed to be public visibility to
enable this. While this is not ideal, it does scope the fix to the
limitedCommandResult side state machine and could be addressed when the
limitedCommandResult functionality is refactored into the conn_executor.

Added a unit test which causes the leak, used the PGWire client in the test as
neither the pg or pgx clients use execute with limit, so cant be used to
demonstrate the leak. Also tested the fix in a cluster by following the steps
outlined in #66849.

Resolves: #66849

See also: #48859

Release note (bug fix): fix statement buffer memory leak when using suspended
portals.

@rafiss

@blathers-crl
Copy link

blathers-crl bot commented Jul 8, 2021

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

I have added a few people who may be able to assist in reviewing:

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Jul 8, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rafiss
Copy link
Collaborator

rafiss commented Jul 8, 2021

It looks like you may need to add this testutil function:

func (p *PGTest) SendOneLine(line string) error {

Also, the comment I left on the 21.1 backport PR holds here: I believe the test will fail because it depends on the change in #63677, but this change was not backported. Is it possible for the test to be updated to explicitly close the portal?

@blathers-crl
Copy link

blathers-crl bot commented Jul 8, 2021

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@joesankey
Copy link
Contributor Author

It looks like you may need to add this testutil function:

func (p *PGTest) SendOneLine(line string) error {

Also, the comment I left on the 21.1 backport PR holds here: I believe the test will fail because it depends on the change in #63677, but this change was not backported. Is it possible for the test to be updated to explicitly close the portal?

Explicitly closed the portal and back ported test utils.

@@ -141,7 +141,7 @@ func RunTest(t *testing.T, path, addr, user string) {
}
}

func parseMessages(s string) []pgproto3.BackendMessage {
func ParseMessages(s string) []pgproto3.BackendMessage {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you will also need to add comments to these functions now that they are exported (in order to pass linting checks)

// ParseMessages parses a string containing multiple pgproto3 messages separated
// by the newline symbol. See testdata for examples ("until" or "receive"
// commands).
// MsgsToJSONWithIgnore converts the pgproto3 messages to JSON format. The
// second argument can specify how to adjust the messages (e.g. to make them
// more deterministic) if needed, see testdata for examples.

…tals

The connection statement buffer grows indefinitely when the client uses the
execute portal with limit feature of the Postgres protocol, eventually causing
the node to crash out of memory. Any long running query that uses the limit
feature will cause this memory leak such as the `EXPERIMENTAL CHANGEFEED FOR`
statement. The execute portal with limit feature of the Postgres protocol is
used by the JDBC Postgres driver to fetch a limited number of rows at a time.

The leak is caused by commands accumulating in the buffer and never getting
cleared out. The client sends 2 commands every time it wants to receive more
rows:

- `Execute {"Portal": "C_1", "MaxRows": 1}`
- `Sync`

The server processes the commands and leaves them in the buffer, every
iteration causes 2 more commands to leak.

A similar memory leak was fixed by cockroachdb#48859, however the execute with limit
feature is implemented by a side state machine in limitedCommandResult. The
cleanup routine added by cockroachdb#48859 is never executed for suspended portals as
they never return to the main conn_executor loop.

After this change the statement buffer gets trimmed to reclaim memory after
each client command is processed in the limitedCommandResult side state
machine. The StmtBuf.Ltrim function was changed to be public visibility to
enable this. While this is not ideal, it does scope the fix to the
limitedCommandResult side state machine and could be addressed when the
limitedCommandResult functionality is refactored into the conn_executor.

Added a unit test which causes the leak, used the PGWire client in the test as
neither the pg or pgx clients use execute with limit, so cant be used to
demonstrate the leak. Also tested the fix in a cluster by following the steps
outlined in cockroachdb#66849.

Resolves: cockroachdb#66849

See also: cockroachdb#48859

Release note (bug fix): fix statement buffer memory leak when using
suspended portals
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! thanks

@rafiss rafiss merged commit 3d16171 into cockroachdb:release-20.2 Jul 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-community Originated from the community X-blathers-triaged blathers was able to find an owner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants