-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: bug in single delete optimization for separated intents #69891
Comments
Add a `sequenceGenerator` instance with a transition map that generates the following sequence of operations, similar to the problematic sequence generated for cockroachdb/cockroach#69414: ``` ((SET -> GET)+ -> DELETE -> GET)+ -> SINGLEDEL -> (GET)+ ``` See also cockroachdb/cockroach#69891.
The test uses the case of savepoint rollbacks to trigger the bug. Informs cockroachdb#69891 Release justification: non-production code change Release note: None
Isn't cockroach/pkg/kv/kvserver/replica_destroy.go Lines 137 to 151 in 3357d2e
and note how it passes cockroach/pkg/storage/engine.go Lines 1003 to 1007 in 6e34358
The other caller of cockroach/pkg/kv/kvserver/store_split.go Lines 101 to 102 in 793b4c8
This is easy to fix of course - we need to pass |
|
The test uses the case of savepoint rollbacks to trigger the bug. It is marked as skipped. Informs cockroachdb#69891 Release justification: non-production code change Release note: None
The snippet I linked is a special code path, where a split realizes that the right-hand side it is about to create has already been rebalanced away (and so it becomes another instance of replicaGC).
Wouldn't this be taken care of by the same strategy we discussed for avoiding the original anomaly? We make sure that
Yeah, maybe we are technically OK here? Let's say replicaGC can really make a random intent show up again (as it can). This technically doesn't really matter until this intent shows up in a Replica. For a replica to exist again, a snapshot has to apply, and from the looks of it a snapshot will put a blanket rangedel over the entire keyspace of the replica, which nukes the intent anyway. So yeah, anomaly might be benign, but I still wouldn't take my chances with it, who knows what fallout unexpected KVs in "unassigned" keyspace have. |
Agreed about using the same strategy. But we would need to fix this for the 21.2 beta which will then need to use I've added this to the list of things to do, and it includes having a unit test that creates conditions like range movement while there are live intents. Can we get someone from KV to write such a test -- this is likely to be generally useful and will really increase our confidence in the code? |
Documenting the current plan for the work item:
|
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set _at most once_. Undefined behavior results outside of this requirement. cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a ket has been Set at most once, violating the key assumption for SingleDeletes. Consider the following sequence (where `=>` implies "happens-before"): ``` a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4 ``` If the middle two operations meet in a single compaction, the intermediate sequence would become: ``` a.SET.1 => a.SET.3 => a.SINGLEDEL.4 ``` A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`. This undefined behavior was demonstrated in cockroachdb#1252. A solution, outlined in cockroachdb#1255, works as follows: - A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction - The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set` - A `SET => SINGLEDEL` continues to cause both to be consumed - A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected. This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics. This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions. This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version). Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set _at most once_. Undefined behavior results outside of this requirement. cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a ket has been Set at most once, violating the key assumption for SingleDeletes. Consider the following sequence (where `=>` implies "happens-before"): ``` a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4 ``` If the middle two operations meet in a single compaction, the intermediate sequence would become: ``` a.SET.1 => a.SET.3 => a.SINGLEDEL.4 ``` A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`. This undefined behavior was demonstrated in cockroachdb#1252. A solution, outlined in cockroachdb#1255, works as follows: - A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction - The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set` - A `SET => SINGLEDEL` continues to cause both to be consumed - A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected. This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics. This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions. This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version). Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
@tbg @andy-kimball |
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set _at most once_. Undefined behavior results outside of this requirement. cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a ket has been Set at most once, violating the key assumption for SingleDeletes. Consider the following sequence (where `=>` implies "happens-before"): ``` a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4 ``` If the middle two operations meet in a single compaction, the intermediate sequence would become: ``` a.SET.1 => a.SET.3 => a.SINGLEDEL.4 ``` A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`. This undefined behavior was demonstrated in cockroachdb#1252. A solution, outlined in cockroachdb#1255, works as follows: - A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction - The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set` - A `SET => SINGLEDEL` continues to cause both to be consumed - A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected. This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics. This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions. This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version). Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set _at most once_. Undefined behavior results outside of this requirement. cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a ket has been Set at most once, violating the key assumption for SingleDeletes. Consider the following sequence (where `=>` implies "happens-before"): ``` a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4 ``` If the middle two operations meet in a single compaction, the intermediate sequence would become: ``` a.SET.1 => a.SET.3 => a.SINGLEDEL.4 ``` A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`. This undefined behavior was demonstrated in cockroachdb#1252. A solution, outlined in cockroachdb#1255, works as follows: - A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction - The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set` - A `SET => SINGLEDEL` continues to cause both to be consumed - A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected. This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics. This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions. This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version). Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set _at most once_. Undefined behavior results outside of this requirement. cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a key has been Set at most once, violating the key assumption for SingleDeletes. Consider the following sequence (where `=>` implies "happens-before"): ``` a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4 ``` If the middle two operations meet in a single compaction, the intermediate sequence would become: ``` a.SET.1 => a.SET.3 => a.SINGLEDEL.4 ``` A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`. This undefined behavior was demonstrated in cockroachdb#1252. A solution, outlined in cockroachdb#1255, works as follows: - A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction. - The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set`. - A `SET => SINGLEDEL` continues to cause both to be consumed. - A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected. This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics. This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions. This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version). Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set _at most once_. Undefined behavior results outside of this requirement. cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a key has been Set at most once, violating the key assumption for SingleDeletes. Consider the following sequence (where `=>` implies "happens-before"): ``` a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4 ``` If the middle two operations meet in a single compaction, the intermediate sequence would become: ``` a.SET.1 => a.SET.3 => a.SINGLEDEL.4 ``` A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`. This undefined behavior was demonstrated in cockroachdb#1252. A solution, outlined in cockroachdb#1255, works as follows: - A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction. - The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set`. - A `SET => SINGLEDEL` continues to cause both to be consumed. - A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected. This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics. This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions. This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version). Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
I believe Serverless will be bootstrapped in production from an 21.2 beta, but @andy-kimball should confirm. |
Serverless is currently running on existing 21.1 clusters. We will upgrade those to the 21.2 beta, and then later to 21.2 GA. Both of those upgrades are expected to work smoothly and without downtime. |
This blocks the beta further then. I'll open a discussion privately to see if we can work around that still. |
Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
…luster Specifically, if the cluster version indicates that there can be nodes with the broken SingleDelete logic for separated intent resolution, the MVCCMetadata.TxnDidNotUpdateMeta field will never be set to true. See code comments and cockroachdb#69891 (comment) Informs cockroachdb#69891 Release justification: fix for a release blocker that causes incorrect behavior for transactional writes. Release note: None
…luster Specifically, if the cluster version indicates that there can be nodes with the broken SingleDelete logic for separated intent resolution, the MVCCMetadata.TxnDidNotUpdateMeta field will never be set to true. See code comments and #69891 (comment) Informs #69891 Release justification: fix for a release blocker that causes incorrect behavior for transactional writes. Release note: None
Synced with Sumeer offline:
thanks! |
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set _at most once_. Undefined behavior results outside of this requirement. cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a key has been Set at most once, violating the key assumption for SingleDeletes. Consider the following sequence (where `=>` implies "happens-before"): ``` a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4 ``` If the middle two operations meet in a single compaction, the intermediate sequence would become: ``` a.SET.1 => a.SET.3 => a.SINGLEDEL.4 ``` A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`. This undefined behavior was demonstrated in cockroachdb#1252. A solution, outlined in cockroachdb#1255, works as follows: - A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction. - The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set`. - A `SET => SINGLEDEL` continues to cause both to be consumed. - A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected. This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics. This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions. This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version). Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set _at most once_. Undefined behavior results outside of this requirement. cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a key has been Set at most once, violating the key assumption for SingleDeletes. Consider the following sequence (where `=>` implies "happens-before"): ``` a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4 ``` If the middle two operations meet in a single compaction, the intermediate sequence would become: ``` a.SET.1 => a.SET.3 => a.SINGLEDEL.4 ``` A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`. This undefined behavior was demonstrated in #1252. A solution, outlined in #1255, works as follows: - A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction. - The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set`. - A `SET => SINGLEDEL` continues to cause both to be consumed. - A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected. This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics. This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions. This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version). Addresses #1255, cockroachdb/cockroach#69891.
This is a backwards incompatible change in Pebble to provide SingleDelete semantics that are clean and robust to programming error. This is being done post v21.2-beta but before GA since this was deemed a GA blocker and not a release blocker. It allows for upgrading beta clusters to the GA version. For more on these discussions see the following threads cockroachdb/pebble#1255 (comment), https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600 (Cockroach Labs internal link). Informs cockroachdb/pebble#1255 Informs cockroachdb#69891 Release note: None Release justification: high-severity bug fix
c6751b08 metamorphic: add keyManager abstraction and SetWithDelete bug fix 96dc1efe internal/base: add SetWithDelete key kind 5db9bf33 base: add `InternalKeyKindSeparator` a2d58e95 db: improve use-after-free test coverage for compactions a3b61ae3 testdata: remove values from deletes 99414d59 db: support nil logger in EventListener.EnsureDefaults e0860f77 compaction: fix nil pointer during errored compactions Informs cockroachdb#69891 Release note: none Release justification: high-severity bug fix
This is a backwards incompatible change in Pebble to provide SingleDelete semantics that are clean and robust to programming error. This is being done post v21.2-beta but before GA since this was deemed a GA blocker and not a release blocker. It allows for upgrading beta clusters to the GA version. For more on these discussions see the following threads cockroachdb/pebble#1255 (comment), https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600 (Cockroach Labs internal link). Informs cockroachdb/pebble#1255 Informs cockroachdb#69891 Release note: None Release justification: high-severity bug fix
70586: sql,server: enable tenant status server to work with persisted SQL stats r=maryliag,dhartunian a=Azhng Previsouly, tenant status server was not able to work with the new persisted SQL Stats. This result in SQL Stats RPC not able to return or reset persisted SQL stats. This commit addresses this issue to update tenant status server's implementation to be able to work with persisted SQL Stats. Resolves #70585 #70529 #68177 Release Justification: Bug fixes and low-risk updates to new functionality Release note (bug fix): Statement/Transaction Page now is able to display and reset persisted SQL Stats. 70707: clusterversion,storage: cluster version that uses Pebble's SetWithDelete r=sumeerbhola a=sumeerbhola This is a backwards incompatible change in Pebble to provide SingleDelete semantics that are clean and robust to programming error. This is being done post v21.2-beta but before GA since this was deemed a GA blocker and not a release blocker. It allows for upgrading beta clusters to the GA version. For more on these discussions see the following threads cockroachdb/pebble#1255 (comment), https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600 (Cockroach Labs internal link). Informs cockroachdb/pebble#1255 Informs #69891 Release note: None Release justification: high-severity bug fix Co-authored-by: Azhng <archer.xn@gmail.com> Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
This is a backwards incompatible change in Pebble to provide SingleDelete semantics that are clean and robust to programming error. This is being done post v21.2-beta but before GA since this was deemed a GA blocker and not a release blocker. It allows for upgrading beta clusters to the GA version. For more on these discussions see the following threads cockroachdb/pebble#1255 (comment), https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600 (Cockroach Labs internal link). Informs cockroachdb/pebble#1255 Informs cockroachdb#69891 Release note: None Release justification: high-severity bug fix
This is a backwards incompatible change in Pebble to provide SingleDelete semantics that are clean and robust to programming error. This is being done post v21.2-beta but before GA since this was deemed a GA blocker and not a release blocker. It allows for upgrading beta clusters to the GA version. For more on these discussions see the following threads cockroachdb/pebble#1255 (comment), https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600 (Cockroach Labs internal link). Informs cockroachdb/pebble#1255 Informs cockroachdb#69891 Release note: None Release justification: high-severity bug fix
This is a backwards incompatible change in Pebble to provide SingleDelete semantics that are clean and robust to programming error. This is being done post v21.2-beta but before GA since this was deemed a GA blocker and not a release blocker. It allows for upgrading beta clusters to the GA version. For more on these discussions see the following threads cockroachdb/pebble#1255 (comment), https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600 (Cockroach Labs internal link). Informs cockroachdb/pebble#1255 Informs cockroachdb#69891 Release note: None Release justification: high-severity bug fix
After an existing test related PR is merged, the only remaining item is:
This is going to included in improvements to the MVCC metamorphic test. |
@sumeerbhola don't we also need to remove (and default to cockroach/pkg/kv/kvserver/replica_raftstorage.go Lines 721 to 727 in 793b4c8
as per #69891 (comment)? |
regarding The original problem was assumption [A2] from #69891 (comment) being violated.
We realized we could possibly have: I now think we were mistaken about the original problem too. Since we apply a RANGEDEL before adding back the range (#69891 (comment) and the 2 comments after that), what we really have is Also, from a performance perspective I think we are doing the right thing by using |
The test uses the case of savepoint rollbacks to trigger the bug. The bug is now fixed, so this test succeeds. Informs cockroachdb#69891 Release justification: non-production code change Release note: None
69902: storage: add randomized test to trigger intent single deletion bug r=sumeerbhola a=sumeerbhola The test uses the case of savepoint rollbacks to trigger the bug. Informs #69891 Release justification: non-production code change Release note: None Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
70515: sql: add telemetry for ON UPDATE r=ZhouXing19 a=ZhouXing19 This commit is to add telemetry for a column created / altered with `ON UPDATE` syntax. Release Note: None 71290: metamorphic: Increase test surface area, run many more operations r=sumeerbhola a=itsbilal This PR contains the following major changes (all in separate commits): 1) A move from on-disk to memFS and a move from 1,000 operations per run to 100,000 to take advantage of the faster performance of memFS 2) Addition of savepoints and txn aborts 3) A move from 2 standard 1 random configs to 18 standard and 10 random configs Implements a large part of #70935, but not all of it. Remaining changes will be in a future PR. Fallout of discussion in #69891. Release notes: None. Co-authored-by: Jane Xing <zhouxing@uchicago.edu> Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
The SINGLEDEL optimization for separated intents is used for removing the intent when possible. It is important since
SET=>SINGLEDEL
(SET followed by/happens before SINGLEDEL) will result in both disappearing from Pebble when they meet in a compaction. In contrastSET=>DEL
will cause the SET to disappear, but the DEL will typically fall all the way to L6 before being elided.<key>@0
), separated intents use a different key for each txn. This is done to allow for lock contention reduction in the future. With interleaved intents we had<key>@0.SET=><key>@0.DEL=><key>@0.SET=><key>@0.DEL
, so even if the latest DEL had to fall all the way to L6, the older DELs would vanish because of the newer SET (this is potentially important for keys with high write rates). With separated intents and high write rates, where each set and delete pair will be a different key, the SINGLEDEL optimization is important for not resulting in more garbage relative to interleaved intents.The optimization is performed by tracking a
TxnDidNotUpdateMeta
bool in the intent'sMVCCMetadata
proto, which defaults to false for legacy code, and starts as true for non-legacy code. If theMVCCMetadata
is updated using another SET, this value transitions to false. When resolving the intent we use SINGLEDEL if this value is true, else DEL. There are two correctness assumptions made here:MVCCMetadata
for a txn is never deleted and recreated during the lifetime of a transaction. This is important since the history ofTxnDidNotUpdateMeta
would be lost if it were deleted and recreated.SET=>RANGEDEL=>SET=>SINGLEDEL
if the range is removed before the intent is resolved, then added back, and then the intent is resolved. If the RANGEDEL were replaced by a DEL, we would have the bug described below.Assumption [A1] is violated, by intent resolution for a non-finalized txn which has (a) rolled back savepoints that cause the intent to be removed, (b) the txn epoch has been incremented and the intent is from an older epoch.
Due to this violation we can see sequences like
SET=>SET=>DEL=>SET=>SINGLEDEL
.To understand why this causes a problem, note that compactions only see a continuous subsequence of the sequence of operations on a key — they can be missing operations newer and older. A compaction can see the
DEL=>SET
and the SET will consume the DEL. The reasoning being that it is either (a) the current latest SET in the LSM in which case the DEL is no longer relevant, or (b) something later has deleted the whole key, in which the DEL is also no longer needed. The SINGLEDEL violates (b). The result of this compaction will cause the LSM to haveSET=>SET=>SET=>SINGLEDEL
. And if this SET and SINGLEDEL later meet in a compaction, both will vanish. Now all the deleted SETs from before will reappear incorrectly.List of things we need to do to fix this:
CockroachDB randomized unit test that reproduces bug under all possible intent resolution scenarios @sumeerbholaRange movement or any other range related operations may also trigger this bug -- see storage: bug in single delete optimization for separated intents #69891 (comment)SET=>DEL=>SET=>SINGLEDEL
andSET=>SINGLEDEL=>SET=>SINGLEDEL
.The text was updated successfully, but these errors were encountered: