Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: bug in single delete optimization for separated intents #69891

Closed
6 tasks done
sumeerbhola opened this issue Sep 7, 2021 · 17 comments
Closed
6 tasks done

storage: bug in single delete optimization for separated intents #69891

sumeerbhola opened this issue Sep 7, 2021 · 17 comments
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@sumeerbhola
Copy link
Collaborator

sumeerbhola commented Sep 7, 2021

The SINGLEDEL optimization for separated intents is used for removing the intent when possible. It is important since

  • A combination of SET=>SINGLEDEL (SET followed by/happens before SINGLEDEL) will result in both disappearing from Pebble when they meet in a compaction. In contrast SET=>DEL will cause the SET to disappear, but the DEL will typically fall all the way to L6 before being elided.
  • Unlike interleaved intents, where we reused the same key for intents (<key>@0), separated intents use a different key for each txn. This is done to allow for lock contention reduction in the future. With interleaved intents we had <key>@0.SET=><key>@0.DEL=><key>@0.SET=><key>@0.DEL, so even if the latest DEL had to fall all the way to L6, the older DELs would vanish because of the newer SET (this is potentially important for keys with high write rates). With separated intents and high write rates, where each set and delete pair will be a different key, the SINGLEDEL optimization is important for not resulting in more garbage relative to interleaved intents.

The optimization is performed by tracking a TxnDidNotUpdateMeta bool in the intent's MVCCMetadata proto, which defaults to false for legacy code, and starts as true for non-legacy code. If the MVCCMetadata is updated using another SET, this value transitions to false. When resolving the intent we use SINGLEDEL if this value is true, else DEL. There are two correctness assumptions made here:

  • [A1] The MVCCMetadata for a txn is never deleted and recreated during the lifetime of a transaction. This is important since the history of TxnDidNotUpdateMeta would be lost if it were deleted and recreated.
  • [A2] CockroachDB range drops (due to the range being moved to another node) are done using a RANGEDEL across the whole range and not using individual DELs for the data to be removed. So we can get a sequence of: SET=>RANGEDEL=>SET=>SINGLEDEL if the range is removed before the intent is resolved, then added back, and then the intent is resolved. If the RANGEDEL were replaced by a DEL, we would have the bug described below.

Assumption [A1] is violated, by intent resolution for a non-finalized txn which has (a) rolled back savepoints that cause the intent to be removed, (b) the txn epoch has been incremented and the intent is from an older epoch.
Due to this violation we can see sequences like SET=>SET=>DEL=>SET=>SINGLEDEL.
To understand why this causes a problem, note that compactions only see a continuous subsequence of the sequence of operations on a key — they can be missing operations newer and older. A compaction can see the DEL=>SET and the SET will consume the DEL. The reasoning being that it is either (a) the current latest SET in the LSM in which case the DEL is no longer relevant, or (b) something later has deleted the whole key, in which the DEL is also no longer needed. The SINGLEDEL violates (b). The result of this compaction will cause the LSM to have SET=>SET=>SET=>SINGLEDEL. And if this SET and SINGLEDEL later meet in a compaction, both will vanish. Now all the deleted SETs from before will reappear incorrectly.

List of things we need to do to fix this:

@sumeerbhola sumeerbhola added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-transactions Relating to MVCC and the transactional model. labels Sep 7, 2021
nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 7, 2021
Add a `sequenceGenerator` instance with a transition map that generates
the following sequence of operations, similar to the problematic
sequence generated for cockroachdb/cockroach#69414:

```
((SET -> GET)+ -> DELETE -> GET)+ -> SINGLEDEL -> (GET)+
```

See also cockroachdb/cockroach#69891.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 7, 2021
The test uses the case of savepoint rollbacks to trigger the bug.

Informs cockroachdb#69891

Release justification: non-production code change
Release note: None
@tbg
Copy link
Member

tbg commented Sep 8, 2021

Isn't [A2] also violated? The code that deletes the replica data is here

func (r *Replica) destroyRaftMuLocked(ctx context.Context, nextReplicaID roachpb.ReplicaID) error {
startTime := timeutil.Now()
ms := r.GetMVCCStats()
batch := r.Engine().NewUnindexedBatch(true /* writeOnly */)
defer batch.Close()
clearRangeIDLocalOnly := !r.IsInitialized()
if err := r.preDestroyRaftMuLocked(
ctx,
r.Engine(),
batch,
nextReplicaID,
clearRangeIDLocalOnly,
false, /* mustUseClearRange */
); err != nil {

and note how it passes false for mustUseClearRange, which translates into calling into this method:

// ClearRangeWithHeuristic clears the keys from start (inclusive) to end
// (exclusive). Depending on the number of keys, it will either use ClearRawRange
// or clear individual keys. It works with EngineKeys, so don't expect it to
// find and clear separated intents if [start, end) refers to MVCC key space.
func ClearRangeWithHeuristic(reader Reader, writer Writer, start, end roachpb.Key) error {

The other caller of destroyRaftMuLocked (during splits) similarly passes false (so nobody passes true anywhere):

const mustUseClearRange = false
if err := clearRangeData(&split.RightDesc, readWriter, readWriter, rangeIDLocalOnly, mustUseClearRange); err != nil {

This is easy to fix of course - we need to pass true in both places - but possibly at the expense of subjecting pebble to a higher frequency of rangedels.

@sumeerbhola
Copy link
Collaborator Author

  • Are we deleting any range local or global data when splitting a range? If not we may be ok there since those are the only keys with intents.
  • Passing true for those cases would also require a change to 21.1, which is troublesome. I had been thinking of the range tombstone we place here
    if err := msstw.currSST.ClearRawRange(
    msstw.keyRanges[msstw.currRange].Start.Key, msstw.keyRanges[msstw.currRange].End.Key); err != nil {
    msstw.currSST.Close()
    return errors.Wrap(err, "failed to clear range on sst file writer")
    }
    which I had been assuming happens whenever we add a range. Is that a correct assumption?

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 8, 2021
The test uses the case of savepoint rollbacks to trigger the bug.
It is marked as skipped.

Informs cockroachdb#69891

Release justification: non-production code change
Release note: None
@tbg
Copy link
Member

tbg commented Sep 9, 2021

Are we deleting any range local or global data when splitting a range? If not we may be ok there since those are the only keys with intents.

The snippet I linked is a special code path, where a split realizes that the right-hand side it is about to create has already been rebalanced away (and so it becomes another instance of replicaGC).

Passing true for those cases would also require a change to 21.1, which is troublesome.

Wouldn't this be taken care of by the same strategy we discussed for avoiding the original anomaly? We make sure that SingleDel isn't used in the first place until 21.1 is out of the picture, by overriding TxnDidNotUpdateMeta to false until a 21.2 cluster version gate is active.

I had been thinking of the range tombstone we place here

Yeah, maybe we are technically OK here? Let's say replicaGC can really make a random intent show up again (as it can). This technically doesn't really matter until this intent shows up in a Replica. For a replica to exist again, a snapshot has to apply, and from the looks of it a snapshot will put a blanket rangedel over the entire keyspace of the replica, which nukes the intent anyway. So yeah, anomaly might be benign, but I still wouldn't take my chances with it, who knows what fallout unexpected KVs in "unassigned" keyspace have.

@sumeerbhola
Copy link
Collaborator Author

Wouldn't this be taken care of by the same strategy we discussed for avoiding the original anomaly? We make sure that SingleDel isn't used in the first place until 21.1 is out of the picture, by overriding TxnDidNotUpdateMeta to false until a 21.2 cluster version gate is active.
...
who knows what fallout unexpected KVs in "unassigned" keyspace have.

Agreed about using the same strategy. But we would need to fix this for the 21.2 beta which will then need to use mustUseClearRange=true, even in a cluster with only 21.2 beta nodes, since we are deeming the Pebble-level fix as a GA blocker and not a release blocker.
And I agree that there can be unexpected fallout. One example is that the intentInterleavingIter enters an error state if it finds an intent without a provisional value. We should not really be scanning a range that is no longer local, but this would still make me nervous.

I've added this to the list of things to do, and it includes having a unit test that creates conditions like range movement while there are live intents. Can we get someone from KV to write such a test -- this is likely to be generally useful and will really increase our confidence in the code?

@sumeerbhola
Copy link
Collaborator Author

Documenting the current plan for the work item:
CockroachDB correctness when migrating from 21.1 to 21.2

  • We are fixing the intent resolution code path in 21.2-beta to use SingleDelete more conservatively. The 21.2-GA will likely include Pebble changes to make the current usage of SingleDelete correct.
  • There is a problem if someone upgrades from 21.1 to 21.2-beta or 21.2-GA, described next.
  • Problem: 21.1 nodes will not write separated intents while they are the leaseholder for a range. However they can become the leaseholder for a range after a separated intent was written (in a mixed version cluster). Hence they can resolve separated intents. The logic in 21.1 for using SingleDelete when resolving intents is similarly buggy, and the Pebble code included in 21.1 will not make this buggy usage correct.
  • Solution (by @tbg): Change code in 21.2 to never set txnDidNotUpdateMeta=true when writing separated intents, until the cluster version is at the latest version defined in clusterversion when the buggy code was fixed in 21.2. So 21.1 code will never use SingleDelete when resolving these separated intents (since the only separated intents being written are by 21.2 nodes).

nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 14, 2021
Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set _at most once_. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a ket has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just
`a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This
incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in cockroachdb#1252.

A solution, outlined in cockroachdb#1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially
  consumed a Delete or SingleDelete in a compaction
- The existing Set key kind now represents a promise that it has not
  consumed a Delete or SingleDelete. This is what will be written when a
  caller uses `pebble.Writer.Set`
- A `SET => SINGLEDEL` continues to cause both to be consumed
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`,
  reflecting the fact that there may be deleted entries under the
  SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the `SET => SINGLEDEL`
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 14, 2021
Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set _at most once_. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a ket has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just
`a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This
incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in cockroachdb#1252.

A solution, outlined in cockroachdb#1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially
  consumed a Delete or SingleDelete in a compaction
- The existing Set key kind now represents a promise that it has not
  consumed a Delete or SingleDelete. This is what will be written when a
  caller uses `pebble.Writer.Set`
- A `SET => SINGLEDEL` continues to cause both to be consumed
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`,
  reflecting the fact that there may be deleted entries under the
  SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the `SET => SINGLEDEL`
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
@sumeerbhola
Copy link
Collaborator Author

@tbg @andy-kimball
we did not reach a conclusion if the migration issue discussed in #69891 (comment) was a release blocker for beta, or only a GA blocker. It depends on whether serverless is starting from an existing 21.1 cluster.

nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 14, 2021
Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set _at most once_. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a ket has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just
`a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This
incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in cockroachdb#1252.

A solution, outlined in cockroachdb#1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially
  consumed a Delete or SingleDelete in a compaction
- The existing Set key kind now represents a promise that it has not
  consumed a Delete or SingleDelete. This is what will be written when a
  caller uses `pebble.Writer.Set`
- A `SET => SINGLEDEL` continues to cause both to be consumed
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`,
  reflecting the fact that there may be deleted entries under the
  SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the `SET => SINGLEDEL`
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 14, 2021
Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set _at most once_. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a ket has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just
`a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This
incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in cockroachdb#1252.

A solution, outlined in cockroachdb#1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially
  consumed a Delete or SingleDelete in a compaction
- The existing Set key kind now represents a promise that it has not
  consumed a Delete or SingleDelete. This is what will be written when a
  caller uses `pebble.Writer.Set`
- A `SET => SINGLEDEL` continues to cause both to be consumed
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`,
  reflecting the fact that there may be deleted entries under the
  SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the `SET => SINGLEDEL`
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 15, 2021
Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set _at most once_. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a key has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just
`a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This
incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in cockroachdb#1252.

A solution, outlined in cockroachdb#1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially
  consumed a Delete or SingleDelete in a compaction.
- The existing Set key kind now represents a promise that it has not
  consumed a Delete or SingleDelete. This is what will be written when a
  caller uses `pebble.Writer.Set`.
- A `SET => SINGLEDEL` continues to cause both to be consumed.
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`,
  reflecting the fact that there may be deleted entries under the
  SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the `SET => SINGLEDEL`
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
nicktrav added a commit to nicktrav/pebble that referenced this issue Sep 15, 2021
Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set _at most once_. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a key has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just
`a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This
incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in cockroachdb#1252.

A solution, outlined in cockroachdb#1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially
  consumed a Delete or SingleDelete in a compaction.
- The existing Set key kind now represents a promise that it has not
  consumed a Delete or SingleDelete. This is what will be written when a
  caller uses `pebble.Writer.Set`.
- A `SET => SINGLEDEL` continues to cause both to be consumed.
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`,
  reflecting the fact that there may be deleted entries under the
  SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the `SET => SINGLEDEL`
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
@tbg
Copy link
Member

tbg commented Sep 15, 2021

I believe Serverless will be bootstrapped in production from an 21.2 beta, but @andy-kimball should confirm.

@andy-kimball
Copy link
Contributor

andy-kimball commented Sep 15, 2021

Serverless is currently running on existing 21.1 clusters. We will upgrade those to the 21.2 beta, and then later to 21.2 GA. Both of those upgrades are expected to work smoothly and without downtime.

@tbg
Copy link
Member

tbg commented Sep 15, 2021

This blocks the beta further then. I'll open a discussion privately to see if we can work around that still.

@tbg tbg added the GA-blocker label Sep 15, 2021
@blathers-crl
Copy link

blathers-crl bot commented Sep 15, 2021

Hi @tbg, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 20, 2021
…luster

Specifically, if the cluster version indicates that there can
be nodes with the broken SingleDelete logic for separated
intent resolution, the MVCCMetadata.TxnDidNotUpdateMeta
field will never be set to true.

See code comments and cockroachdb#69891 (comment)

Informs cockroachdb#69891

Release justification: fix for a release blocker that causes incorrect
behavior for transactional writes.

Release note: None
aliher1911 pushed a commit that referenced this issue Sep 20, 2021
…luster

Specifically, if the cluster version indicates that there can
be nodes with the broken SingleDelete logic for separated
intent resolution, the MVCCMetadata.TxnDidNotUpdateMeta
field will never be set to true.

See code comments and #69891 (comment)

Informs #69891

Release justification: fix for a release blocker that causes incorrect
behavior for transactional writes.

Release note: None
@celiala
Copy link
Collaborator

celiala commented Sep 21, 2021

Synced with Sumeer offline:

thanks!

@celiala celiala removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 21, 2021
sumeerbhola pushed a commit to sumeerbhola/pebble that referenced this issue Sep 23, 2021
Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set _at most once_. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a key has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just
`a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This
incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in cockroachdb#1252.

A solution, outlined in cockroachdb#1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially
  consumed a Delete or SingleDelete in a compaction.
- The existing Set key kind now represents a promise that it has not
  consumed a Delete or SingleDelete. This is what will be written when a
  caller uses `pebble.Writer.Set`.
- A `SET => SINGLEDEL` continues to cause both to be consumed.
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`,
  reflecting the fact that there may be deleted entries under the
  SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the `SET => SINGLEDEL`
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
sumeerbhola pushed a commit to cockroachdb/pebble that referenced this issue Sep 23, 2021
Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set _at most once_. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a key has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just
`a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This
incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in #1252.

A solution, outlined in #1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially
  consumed a Delete or SingleDelete in a compaction.
- The existing Set key kind now represents a promise that it has not
  consumed a Delete or SingleDelete. This is what will be written when a
  caller uses `pebble.Writer.Set`.
- A `SET => SINGLEDEL` continues to cause both to be consumed.
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`,
  reflecting the fact that there may be deleted entries under the
  SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the `SET => SINGLEDEL`
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses #1255, cockroachdb/cockroach#69891.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 24, 2021
This is a backwards incompatible change in Pebble to provide
SingleDelete semantics that are clean and robust to programming
error.

This is being done post v21.2-beta but before GA since this
was deemed a GA blocker and not a release blocker. It allows
for upgrading beta clusters to the GA version. For more on these
discussions see the following threads
cockroachdb/pebble#1255 (comment),
https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600
(Cockroach Labs internal link).

Informs cockroachdb/pebble#1255
Informs cockroachdb#69891

Release note: None

Release justification: high-severity bug fix
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 24, 2021
c6751b08 metamorphic: add keyManager abstraction and SetWithDelete bug fix
96dc1efe internal/base: add SetWithDelete key kind
5db9bf33 base: add `InternalKeyKindSeparator`
a2d58e95 db: improve use-after-free test coverage for compactions
a3b61ae3 testdata: remove values from deletes
99414d59 db: support nil logger in EventListener.EnsureDefaults
e0860f77 compaction: fix nil pointer during errored compactions

Informs cockroachdb#69891

Release note: none
Release justification: high-severity bug fix
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 24, 2021
This is a backwards incompatible change in Pebble to provide
SingleDelete semantics that are clean and robust to programming
error.

This is being done post v21.2-beta but before GA since this
was deemed a GA blocker and not a release blocker. It allows
for upgrading beta clusters to the GA version. For more on these
discussions see the following threads
cockroachdb/pebble#1255 (comment),
https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600
(Cockroach Labs internal link).

Informs cockroachdb/pebble#1255
Informs cockroachdb#69891

Release note: None

Release justification: high-severity bug fix
craig bot pushed a commit that referenced this issue Sep 25, 2021
70586: sql,server: enable tenant status server to work with persisted SQL stats r=maryliag,dhartunian a=Azhng

Previsouly, tenant status server was not able to work with the new
persisted SQL Stats. This result in SQL Stats RPC not able to return
or reset persisted SQL stats.
This commit addresses this issue to update tenant status server's
implementation to be able to work with persisted SQL Stats.

Resolves #70585 #70529 #68177

Release Justification: Bug fixes and low-risk updates to new functionality

Release note (bug fix): Statement/Transaction Page now is able to
display and reset persisted SQL Stats.

70707: clusterversion,storage: cluster version that uses Pebble's SetWithDelete r=sumeerbhola a=sumeerbhola

This is a backwards incompatible change in Pebble to provide
SingleDelete semantics that are clean and robust to programming
error.

This is being done post v21.2-beta but before GA since this
was deemed a GA blocker and not a release blocker. It allows
for upgrading beta clusters to the GA version. For more on these
discussions see the following threads
cockroachdb/pebble#1255 (comment),
https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600
(Cockroach Labs internal link).

Informs cockroachdb/pebble#1255
Informs #69891

Release note: None

Release justification: high-severity bug fix

Co-authored-by: Azhng <archer.xn@gmail.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 26, 2021
This is a backwards incompatible change in Pebble to provide
SingleDelete semantics that are clean and robust to programming
error.

This is being done post v21.2-beta but before GA since this
was deemed a GA blocker and not a release blocker. It allows
for upgrading beta clusters to the GA version. For more on these
discussions see the following threads
cockroachdb/pebble#1255 (comment),
https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600
(Cockroach Labs internal link).

Informs cockroachdb/pebble#1255
Informs cockroachdb#69891

Release note: None

Release justification: high-severity bug fix
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 27, 2021
This is a backwards incompatible change in Pebble to provide
SingleDelete semantics that are clean and robust to programming
error.

This is being done post v21.2-beta but before GA since this
was deemed a GA blocker and not a release blocker. It allows
for upgrading beta clusters to the GA version. For more on these
discussions see the following threads
cockroachdb/pebble#1255 (comment),
https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600
(Cockroach Labs internal link).

Informs cockroachdb/pebble#1255
Informs cockroachdb#69891

Release note: None

Release justification: high-severity bug fix
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 27, 2021
This is a backwards incompatible change in Pebble to provide
SingleDelete semantics that are clean and robust to programming
error.

This is being done post v21.2-beta but before GA since this
was deemed a GA blocker and not a release blocker. It allows
for upgrading beta clusters to the GA version. For more on these
discussions see the following threads
cockroachdb/pebble#1255 (comment),
https://cockroachlabs.slack.com/archives/C4A9ALLRL/p1631213490022600
(Cockroach Labs internal link).

Informs cockroachdb/pebble#1255
Informs cockroachdb#69891

Release note: None

Release justification: high-severity bug fix
@sumeerbhola
Copy link
Collaborator Author

sumeerbhola commented Sep 27, 2021

After an existing test related PR is merged, the only remaining item is:

  • CockroachDB randomized unit test that reproduces bug under all possible intent resolution scenarios.

This is going to included in improvements to the MVCC metamorphic test.
(edit: see #70935)

@tbg
Copy link
Member

tbg commented Sep 29, 2021

@sumeerbhola don't we also need to remove (and default to true) the mustClearRange argument in

func clearRangeData(
desc *roachpb.RangeDescriptor,
reader storage.Reader,
writer storage.Writer,
rangeIDLocalOnly bool,
mustClearRange bool,
) error {

as per #69891 (comment)?

@sumeerbhola
Copy link
Collaborator Author

regarding mustClearRange, the problem goes away with the new SingleDelete semantics in 21.2

The original problem was assumption [A2] from #69891 (comment) being violated.

[A2] CockroachDB range drops (due to the range being moved to another node) are done using a RANGEDEL across the whole range and not using individual DELs for the data to be removed. So we can get a sequence of: SET=>RANGEDEL=>SET=>SINGLEDEL if the range is removed before the intent is resolved, then added back, and then the intent is resolved. If the RANGEDEL were replaced by a DEL, we would have the bug described below.

We realized we could possibly have:
SET => (DEL => SET) => SINGLEDEL
which will now work correctly since it will transition to
SET => (SETWITHDEL => SINGLEDEL)
SET => DEL

I now think we were mistaken about the original problem too. Since we apply a RANGEDEL before adding back the range (#69891 (comment) and the 2 comments after that), what we really have is
SET => DEL => RANGEDEL => SET => SINGLEDEL
The RANGEDEL works in a special way and will prevent the DEL and SET that are in the middle from combining.
And before the RANGEDEL, while the range is not at this node, we simply have
SET => DEL
which is also correct prior to the recent semantic change in SingleDelete.

Also, from a performance perspective I think we are doing the right thing by using mustClearRange=false. The code paths related to RANGEDELs are more complicated, and there is a very real possibility that narrow RANGEDELs would be inefficient. This is something we would need to benchmark before changing (I've created cockroachdb/pebble#1295)

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Oct 5, 2021
The test uses the case of savepoint rollbacks to trigger the bug.
The bug is now fixed, so this test succeeds.

Informs cockroachdb#69891

Release justification: non-production code change
Release note: None
craig bot pushed a commit that referenced this issue Oct 6, 2021
69902: storage: add randomized test to trigger intent single deletion bug r=sumeerbhola a=sumeerbhola

The test uses the case of savepoint rollbacks to trigger the bug.

Informs #69891

Release justification: non-production code change
Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Oct 27, 2021
70515: sql: add telemetry for ON UPDATE r=ZhouXing19 a=ZhouXing19

This commit is to add telemetry for a column
created / altered with `ON UPDATE` syntax.

Release Note: None

71290: metamorphic: Increase test surface area, run many more operations r=sumeerbhola a=itsbilal

This PR contains the following major changes (all in separate commits):

1) A move from on-disk to memFS and a move from 1,000 operations per run to 100,000 to take advantage of the faster performance of memFS
2) Addition of savepoints and txn aborts
3) A move from 2 standard 1 random configs to 18 standard and 10 random configs

Implements a large part of #70935, but not all of it. Remaining changes will be
in a future PR.

Fallout of discussion in #69891.

Release notes: None.

Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

No branches or pull requests

5 participants