Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-2.0: sql: prioritize retryable errors in synchronizeParallelStmts #23308

Merged

Conversation

nvanbenschoten
Copy link
Member

At the moment, parallel statement execution works by sending batches
concurrently through a single client.Txn. This make the handling of
retryable errors tricky because it's difficult to know when its safe
to prepare the transaction state for a retry. Our approach to this is
far from optimal, and relies on a mess of locking in both client.Txn
and TxnCoordSender. This works well enough to prevent anything from
seriously going wrong (#17197), but can result in some confounding error
behavior when statements operate in the context of transaction epochs
that they weren't expecting.

The ideal situation would be for all statements with a handle to a txn
to always work under the same txn epoch at a single point in time. Any
retryable error seen by these statements would be propagated up through
client.Txn without changing any state (and without yet being converted
to a HandledRetryableTxnError), and only after the statements have all
been synchronized would the retryable error be used to update the txn
and prepare for the retry attempt. This would require a change like #22615.
I've created a POC for this approach, but it is way to invasive to
cherry-pick.

So with our current state of things, we need to do a better job catching
errors caused by concurrent retries. In the past we've tried to carefully
determine which errors could be a symptom of a concurrent retry and ignore
them. I now think this was a mistake, as this process of inferring which
errors could be caused by a txn retry is fraught for failure. We now
always return retryable errors from synchronizeParallelStmts when they
exist. The reasoning for this is that if an error was a symptom of the
txn retry, it will not be present during the next txn attempt. If it was
not and instead was a legitimate query execution error, we expect to
hit it again on the next txn attempt and the behavior will mirror that
where the statement throwing the execution error was not even run before
the parallel queue hit the retryable error.

Release note: None

At the moment, parallel statement execution works by sending batches
concurrently through a single `client.Txn`. This make the handling of
retryable errors tricky because it's difficult to know when its safe
to prepare the transaction state for a retry. Our approach to this is
far from optimal, and relies on a mess of locking in both `client.Txn`
and `TxnCoordSender`. This works well enough to prevent anything from
seriously going wrong (cockroachdb#17197), but can result in some confounding error
behavior when statements operate in the context of transaction epochs
that they weren't expecting.

The ideal situation would be for all statements with a handle to a txn
to always work under the same txn epoch at a single point in time. Any
retryable error seen by these statements would be propagated up through
`client.Txn` without changing any state (and without yet being converted
to a `HandledRetryableTxnError`), and only after the statements have all
been synchronized would the retryable error be used to update the txn
and prepare for the retry attempt. This would require a change like cockroachdb#22615.
I've created a POC for this approach, but it is way to invasive to
cherry-pick.

So with our current state of things, we need to do a better job catching
errors caused by concurrent retries. In the past we've tried to carefully
determine which errors could be a symptom of a concurrent retry and ignore
them. I now think this was a mistake, as this process of inferring which
errors could be caused by a txn retry is fraught for failure. We now
always return retryable errors from synchronizeParallelStmts when they
exist. The reasoning for this is that if an error was a symptom of the
txn retry, it will not be present during the next txn attempt. If it was
not and instead was a legitimate query execution error, we expect to
hit it again on the next txn attempt and the behavior will mirror that
where the statement throwing the execution error was not even run before
the parallel queue hit the retryable error.

Release note: None
@nvanbenschoten nvanbenschoten requested review from bdarnell, andreimatei and a team March 1, 2018 23:53
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten changed the title sql: prioritize retryable errors in synchronizeParallelStmts release-2.0: sql: prioritize retryable errors in synchronizeParallelStmts Mar 1, 2018
@andreimatei
Copy link
Contributor

:lgtm:


Review status: 0 of 3 files reviewed at latest revision, all discussions resolved, some commit checks pending.


Comments from Reviewable

@nvanbenschoten nvanbenschoten merged commit d18aa3c into cockroachdb:release-2.0 Mar 2, 2018
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/23294 branch March 2, 2018 00:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants