
release-23.1: sql: update connExecutor logic for pausable portals #101026

Merged: 8 commits merged into cockroachdb:release-23.1 on Apr 10, 2023

Conversation

ZhouXing19 (Collaborator) commented Apr 10, 2023

Backport 8/8 commits from #99663


This PR adds limited support for multiple active portals. Portals satisfying all of the following restrictions can now be paused and resumed (i.e., have other queries interleaved between their executions):

  1. Not an internal query;
  2. Read-only query;
  3. No sub-queries or post-queries.

Such a portal will only have its statement executed with a non-distributed plan.

This feature is gated by a session variable, multiple_active_portals_enabled. When it's set to true, all portals that satisfy the restrictions above automatically become "pausable" when created via the pgwire Bind stmt.

The core idea of this implementation is

  1. Add a switchToAnotherPortal status to the result-consumption state machine. When we receive an ExecPortal message for a different portal, we simply return control to the connExecutor; a minimal sketch of this idea follows this list. (sql: add switchToAnotherPortal signal for result consumer #99052)
  2. Persist the flow, queryID, span, and instrumentationHelper for the portal, and reuse them when we re-execute the portal. This ensures we continue fetching rather than starting all over. (sql: enable resumption of a flow for pausable portals #99173)
  3. To enable 2, we need to delay the clean-up of resources until we close the portal. For this we introduced stacks of cleanup functions. (This PR)
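
For illustration only, here is a minimal, self-contained sketch of idea 1. The identifiers (`consumerStatus`, `execPortalCmd`, `nextStatus`) are assumptions made for the sketch, not the actual connExecutor code:

``` go
package main

import "fmt"

// Hypothetical types mirroring idea 1: the result consumer reports a dedicated
// status when the incoming ExecPortal targets a different portal, so the
// connExecutor regains control instead of draining the current portal.
type consumerStatus int

const (
	statusContinue consumerStatus = iota
	statusDone
	statusSwitchToAnotherPortal
)

type execPortalCmd struct {
	portalName string
}

// nextStatus decides what the consumer of curPortal should do when a new
// ExecPortal command arrives.
func nextStatus(curPortal string, cmd execPortalCmd) consumerStatus {
	if cmd.portalName != curPortal {
		// Pause the current portal; the executor will switch to the other one.
		return statusSwitchToAnotherPortal
	}
	return statusContinue
}

func main() {
	fmt.Println(nextStatus("p1", execPortalCmd{portalName: "p2"}) == statusSwitchToAnotherPortal) // true
}
```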

Note that we kept the implementation of the original "un-pausable" portals, as we'd like to limit this new functionality to a small set of statements. Eventually some of the old code paths (e.g. the limitedCommandResult's lifecycle) should be replaced with the new code.

Also, we don't support distributed plans yet, as that involves much more complicated changes. See the "Start with an entirely local plan" section in the design doc. Support for this will come as a follow-up.

Epic: CRDB-17622

Release note (sql change): Initial support for multiple active portals. With the session variable multiple_active_portals_enabled set to true, portals satisfying all of the following restrictions can now be executed in an interleaved manner: 1. Not an internal query; 2. Read-only query; 3. No sub-queries or post-queries. Such a portal will only have its statement executed with an entirely local plan.

Release justification: this is the implementation of an important feature.

With the introduction of pausable portals, the comment for `limitedCommandResult`
needs to be updated to reflect the current behavior.

Release note: None
This change introduces a new session variable for a preview feature. When set to `true`,
all non-internal portals with read-only [`SELECT`](../v23.1/selection-queries.html)
queries without sub-queries or post-queries can be paused and resumed in an interleaved
manner, but are executed with a local plan.

Release note (SQL change): Added the session variable `multiple_active_portals_enabled`.
This setting gates a preview feature. When set to `true`, it allows
multiple portals to be open at the same time, with their execution interleaved
with each other. In other words, these portals can be paused. The underlying
statement for a pausable portal must be a read-only `SELECT` query without
sub-queries or post-queries, and such a portal is always executed with a local
plan.
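
For illustration, a client could opt in per session with a plain `SET` statement. Below is a minimal sketch using Go's `database/sql`; the driver choice and connection string are placeholders, not part of this change:

``` go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any PostgreSQL-compatible driver works here
)

func main() {
	ctx := context.Background()

	// The connection string is illustrative.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pin a single connection so the session variable applies to the same
	// session that later creates the portals.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Opt in to the preview feature; portals created afterwards via pgwire
	// Bind become pausable if they satisfy the restrictions above.
	if _, err := conn.ExecContext(ctx, "SET multiple_active_portals_enabled = true"); err != nil {
		log.Fatal(err)
	}
}
```
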
…e persistence

This commit is part of the implementation of multiple active portals. In order to
execute portals in an interleaved manner, certain resources need to be persisted and their
clean-up must be delayed until the portal is closed. Additionally, these resources
don't need to be set up again when the portal is re-executed.

To achieve this, we store the cleanup steps in the `cleanup` function stacks in
`portalPauseInfo`, and they are called when any of the following events occur:

1. SQL transaction is committed or rolled back
2. Connection executor is closed
3. An error is encountered when executing the portal
4. The portal is explicitly closed by the user

The cleanup functions should be called in the same order as for a normal (un-pausable) portal.
Since a portal's execution follows the `execPortal() -> execStmtInOpenState() ->
dispatchToExecutionEngine() -> flow.Run()` function flow, we categorize the cleanup
functions into 4 "layers", which are stored accordingly in `PreparedPortal.pauseInfo`.
The cleanup is always LIFO, following the

- `resumableFlow.cleanup`
- `dispatchToExecutionEngine.cleanup`
- `execStmtInOpenState.cleanup`
- `exhaustPortal.cleanup`

order. Additionally, if an error occurs in any layer, we clean up the current and
subsequent layers. For example, if an error occurs in `execStmtInOpenState()`, we
perform `resumableFlow.cleanup` and `dispatchToExecutionEngine.cleanup` (the subsequent
layers) and then `execStmtInOpenState.cleanup` (the current layer) before returning the
error to `execPortal()`, where `exhaustPortal.cleanup` will eventually be called. This is
to maintain the previous clean-up process for portals as much as possible.
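
A minimal sketch of this layered, LIFO cleanup, with hypothetical names; the real stacks live in `portalPauseInfo` and use CockroachDB's own types:

``` go
package main

import "fmt"

// cleanupStack is a sketch of a per-layer cleanup stack: functions are pushed
// in setup order and run in LIFO order.
type cleanupStack struct {
	fns []func()
}

func (s *cleanupStack) push(f func()) { s.fns = append(s.fns, f) }

func (s *cleanupStack) run() {
	for i := len(s.fns) - 1; i >= 0; i-- {
		s.fns[i]()
	}
	s.fns = nil
}

func main() {
	// Hypothetical layers named after the list above.
	var exhaustPortal, execStmtInOpenState, dispatchToExecutionEngine, resumableFlow cleanupStack

	exhaustPortal.push(func() { fmt.Println("exhaustPortal.cleanup") })
	execStmtInOpenState.push(func() { fmt.Println("execStmtInOpenState.cleanup") })
	dispatchToExecutionEngine.push(func() { fmt.Println("dispatchToExecutionEngine.cleanup") })
	resumableFlow.push(func() { fmt.Println("resumableFlow.cleanup") })

	// Closing the portal runs the layers innermost-first, matching the LIFO
	// order in the list above.
	for _, layer := range []*cleanupStack{
		&resumableFlow, &dispatchToExecutionEngine, &execStmtInOpenState, &exhaustPortal,
	} {
		layer.run()
	}
}
```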

We also pass the `PreparedPortal` as a reference to the planner in
`execStmtInOpenState()`, so that the portal's flow can be set and reused.

Release note: None
When executing or cleaning up a pausable portal, we may encounter an error and
need to run the corresponding clean-up steps; those steps must see the latest
`retErr` and `retPayload` rather than the values captured when the cleanup
functions were created.

To address this, we use `portal.pauseInfo.retErr` and `.retPayload` to record the
latest error and payload. They need to be updated on each execution.

Specifically,

1. If the error happens during portal execution, we ensure `portal.pauseInfo`
records the error by adding the following code to the main body of
`execStmtInOpenState()`:

``` go
defer func() {
	updateRetErrAndPayload(retErr, retPayload)
}()
```

Note that this defer must always be registered _after_ the defer that runs the cleanups, so that it executes first and the cleanups see the updated values.

2. If the error occurs during a certain cleanup step for the pausable portal,
we ensure that cleanup steps after it can see the error by always having
`updateRetErrAndPayload(retErr, retPayload)` run at the end of each cleanup step.
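
A minimal sketch of this pattern; `pauseInfo` and `runCleanupStep` below are illustrative stand-ins for `portal.pauseInfo` and the real `updateRetErrAndPayload` helper, not the actual code:

``` go
package main

import (
	"errors"
	"fmt"
)

// pauseInfo records the latest error and payload observed during execution or
// cleanup, so later cleanup steps can act on them.
type pauseInfo struct {
	retErr     error
	retPayload any
}

// updateRetErrAndPayload mirrors the helper described above: it runs at the
// end of each cleanup step (and is deferred in the main execution path).
func (p *pauseInfo) updateRetErrAndPayload(retErr error, retPayload any) {
	p.retErr = retErr
	p.retPayload = retPayload
}

// runCleanupStep runs one cleanup step and always ends it by recording the
// latest error and payload, as pattern 2 requires.
func (p *pauseInfo) runCleanupStep(name string, step func(prevErr error) (error, any)) {
	retErr, retPayload := step(p.retErr)
	p.updateRetErrAndPayload(retErr, retPayload)
	fmt.Printf("after %s: latest err = %v\n", name, p.retErr)
}

func main() {
	p := &pauseInfo{}

	// The first step fails; the error is recorded at the end of the step.
	p.runCleanupStep("flow cleanup", func(prevErr error) (error, any) {
		return errors.New("flow shutdown failed"), nil
	})

	// The second step sees the error left by the first step and keeps it.
	p.runCleanupStep("result cleanup", func(prevErr error) (error, any) {
		return prevErr, nil
	})
}
```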

Release note: None
This commit adds several restrictions to pausable portals to ensure that they
work properly with the current changes to the consumer-receiver model.
Specifically, pausable portals must meet the following criteria:

1. Not be internal queries
2. Be read-only queries
3. Not contain sub-queries or post-queries
4. Only use local plans

These restrictions are necessary because the current changes to the
consumer-receiver model only consider the local push-based case.
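
For illustration, these restrictions amount to a simple predicate; `portalMeta` and `isPausable` below are hypothetical names for the sketch, not the actual check in the connExecutor:

``` go
package main

import "fmt"

// portalMeta captures the properties of a portal that the restrictions above
// depend on.
type portalMeta struct {
	isInternal        bool
	isReadOnly        bool
	hasSubOrPostQuery bool
	isLocalPlan       bool
}

// isPausable mirrors the four restrictions: not internal, read-only, no
// sub-queries or post-queries, and a local plan only.
func isPausable(m portalMeta) bool {
	return !m.isInternal && m.isReadOnly && !m.hasSubOrPostQuery && m.isLocalPlan
}

func main() {
	fmt.Println(isPausable(portalMeta{isReadOnly: true, isLocalPlan: true}))  // true
	fmt.Println(isPausable(portalMeta{isReadOnly: true, isLocalPlan: false})) // false: distributed plan
}
```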

Release note: None
When resuming a portal, we always reset the planner. However, we still need the
planner to respect the outer txn's state, as we did in cockroachdb#98120.

Release note: None
We now only support multiple active portals with local plans, so we explicitly
disable the feature for this test for now.

Release note: None
blathers-crl bot commented Apr 10, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

cockroach-teamcity (Member) commented:

This change is Reviewable

rafiss (Collaborator) left a comment

thanks for getting it through!

ZhouXing19 merged commit 57c6f6e into cockroachdb:release-23.1 on Apr 10, 2023
yuzefovich (Member) commented:

Great work on getting this in! 🎉
