-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-19.2: storage: write to AbortSpan less, clean up AbortSpan more #45553
Merged
nvanbenschoten
merged 6 commits into
cockroachdb:release-19.2
from
nvanbenschoten:backport19.2-42765
Mar 2, 2020
Merged
release-19.2: storage: write to AbortSpan less, clean up AbortSpan more #45553
nvanbenschoten
merged 6 commits into
cockroachdb:release-19.2
from
nvanbenschoten:backport19.2-42765
Mar 2, 2020
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…st.go This mirrors `cmd_truncate_log.go`.
Up until this point, we had no testing around this. For instance, not a single test would catch it if we removed the deletion path from SetAbortSpan. Release note: None
…ction request Fixes cockroachdb#29128. Before this change, an EndTransaction request sent to rollback a transaction record would not remove any AbortSpan entries, even if its own Poison flag was set to false. This allowed AbortSpan entries to leak. This commit fixes this behavior by removing the AbortSpan entry in this case. There were concerns raised in cockroachdb#29128 about this being safe. It turns out that we already do this on every Range except the transaction record's Range, so this isn't introducing any new concerns. Release note (bug fix): AbortSpan records are now cleaned up more aggresively when doing so is known to be safe.
This reduces the frequency of AbortSpan entries that can be abandoned even without a transaction coordinator failure. Specifically, it protects against the case where intent resolution races with a transaction coordinator cleaning up its own transaction record and intents. This can happen for both aborted and committed transactions. In the first case, a pusher might find a transaction's intent and then find its record to be aborted after that transaction had cleanly rolled back its own intents. Even though the transaction's coordinator had already cleaned up and potentially "unpoisoned" AbortSpans, the pusher would happily re-introduce AbortSpan records when it goes to resolve the intents that were already cleaned up. These AbortSpan entries would be fully abandoned and have to wait out the GC. Similarly, in the second case, the transaction might have committed. Here, the pushee might hit an intent and the txn coordinator might clean up and auto-GC its txn record before the pushee arrives at the txn record. Once the pushee gets there, it would mistake the txn for aborted (by design) and proceed to write an AbortSpan record where the intent it had once observed had been (not by design). We can tell both of these cases by simply recognizing whether intent resolution actually succeeds. If intent resolution doesn't find an intent, then we might be in either case. That's fine, because we only need to ever poison the abort span if we actually remove an intent that could confuse a zombie transaction. Release note: None
Return an error if not. Release note: None
LGTM |
tbg
approved these changes
Mar 2, 2020
Purely a rename. Release note: None
nvanbenschoten
force-pushed
the
backport19.2-42765
branch
from
March 2, 2020 21:41
de7eeec
to
8ae6eac
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport 7/7 commits from #42765.
/cc @cockroachdb/release
This PR introduces two improvements to our handling of the AbortSpan. It also introduces the first set of comprehensive testing around AbortSpan entry state transitions, which was sorely missing.
This comes after a few different customer issues that at least at first glance appeared to be AbortSpan leaks. There's still more to do to resolve those, mostly by improving GC, but anything we can do here on the frontend to reduce the number of AbortSpan entries that need to be GCed in the first place helps.
Clean up span on non-poisoning, aborting EndTransaction request
Fixes #29128.
Before this change, an EndTransaction request sent to rollback a transaction record would not remove any AbortSpan entries, even if its own Poison flag was set to false. This allowed AbortSpan entries to leak. This commit fixes this behavior by removing the AbortSpan entry in this case.
There were concerns raised in #29128 about this being safe. It turns out that we already do this on every Range except the transaction record's Range during normal intent resolution, so this isn't introducing any new concerns.
Only write AbortSpan entries if intents are removed
This reduces the frequency of AbortSpan entries that can be abandoned even without a transaction coordinator failure. Specifically, it protects against the case where intent resolution races with a transaction coordinator cleaning up its own transaction record and intents. This can happen for both aborted and committed transactions.
In the first case, a pusher might find a transaction's intent and then find its record to be aborted after that transaction had cleanly rolled back its own intents. Even though the transaction's coordinator had already cleaned up and potentially "unpoisoned" AbortSpans, the pusher would happily re-introduce AbortSpan records when it goes to resolve the intents that were already cleaned up. These AbortSpan entries would be fully abandoned and have to wait out the GC.
Similarly, in the second case, the transaction might have committed. Here, the pushee might hit an intent and the txn coordinator might clean up and auto-GC its txn record before the pushee arrives at the txn record. Once the pushee gets there, it would mistake the txn for aborted (by design) and proceed to write an AbortSpan record where the intent it had once observed had been (not by design).
We can tell both of these cases by simply recognizing whether intent resolution actually succeeds. If intent resolution doesn't find an intent, then we might be in either case. That's fine, because we only need to ever poison the abort span if we actually remove an intent that could confuse a zombie transaction.