release-2.0: sql: prioritize retryable errors in synchronizeParallelStmts #23308

nvanbenschoten · 2018-03-01T23:53:02Z

At the moment, parallel statement execution works by sending batches
concurrently through a single client.Txn. This makes the handling of
retryable errors tricky because it's difficult to know when it's safe
to prepare the transaction state for a retry. Our approach to this is
far from optimal, and relies on a mess of locking in both client.Txn
and TxnCoordSender. This works well enough to prevent anything from
seriously going wrong (#17197), but can result in some confounding error
behavior when statements operate in the context of transaction epochs
that they weren't expecting.

The ideal situation would be for all statements with a handle to a txn
to always work under the same txn epoch at a single point in time. Any
retryable error seen by these statements would be propagated up through
client.Txn without changing any state (and without yet being converted
to a HandledRetryableTxnError), and only after the statements have all
been synchronized would the retryable error be used to update the txn
and prepare for the retry attempt. This would require a change like #22615.
I've created a POC for this approach, but it is way too invasive to
cherry-pick.

So with our current state of things, we need to do a better job catching
errors caused by concurrent retries. In the past we've tried to carefully
determine which errors could be a symptom of a concurrent retry and ignore
them. I now think this was a mistake, as this process of inferring which
errors could be caused by a txn retry is fraught with risk. We now
always return retryable errors from synchronizeParallelStmts when they
exist. The reasoning for this is that if an error was a symptom of the
txn retry, it will not be present during the next txn attempt. If it was
not, and was instead a legitimate query execution error, we expect to
hit it again on the next txn attempt, and the behavior will mirror the
case where the statement that threw the execution error was never run
before the parallel queue hit the retryable error.

Release note: None

andreimatei · 2018-03-02T00:12:39Z

Review status: 0 of 3 files reviewed at latest revision, all discussions resolved, some commit checks pending.

nvanbenschoten requested review from bdarnell, andreimatei and a team March 1, 2018 23:53

nvanbenschoten changed the title ~~sql: prioritize retryable errors in synchronizeParallelStmts~~ release-2.0: sql: prioritize retryable errors in synchronizeParallelStmts Mar 1, 2018

bdarnell approved these changes Mar 1, 2018

nvanbenschoten merged commit d18aa3c into cockroachdb:release-2.0 Mar 2, 2018

nvanbenschoten deleted the nvanbenschoten/23294 branch March 2, 2018 00:43
