Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/base: add SETWITHDEL key kind #1261

Merged
merged 1 commit into from
Sep 16, 2021

Conversation

nicktrav
Copy link
Contributor

@nicktrav nicktrav commented Sep 9, 2021

Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set at most once. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a ket has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where => implies "happens-before"):

a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

a.SET.1 => a.SET.3 => a.SINGLEDEL.4

A second compaction of this intermediate sequence would result in just
a.SET.1, as a.SET.3 and a.SINGLEDEL.4 will be elided. This
incorrectly "resurrects" a.SET.1.

This undefined behavior was demonstrated in #1252.

A solution, outlined in #1255, works as follows:

  • A new key kind, SetWithDelete, represents a Set that has potentially
    consumed a Delete or SingleDelete in a compaction
  • The existing Set key kind now represents a promise that it has not
    consumed a Delete or SingleDelete. This is what will be written when a
    caller uses pebble.Writer.Set
  • A SET => SINGLEDEL continues to cause both to be consumed
  • A SETWITHDEL => SINGLEDEL is "upgraded" into a regular DEL,
    reflecting the fact that there may be deleted entries under the
    SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the SET => SINGLEDEL
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses #1255, cockroachdb/cockroach#69891.


TODOs:

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 6 files at r1.
Reviewable status: 5 of 6 files reviewed, 1 unresolved discussion (waiting on @nicktrav)


compaction_iter.go, line 465 at r1 (raw file):

	// SETWITHDEL. This guards against a SINGLEDEL above us removing the
	// original SET and erroneously exposing keys underneath is.
	if i.iterKey.Kind() == InternalKeyKindDelete {

I think we need to loop until we get a sameStripeNonSkippable or newStripe because we could have DEL => SET => SET'. Say we are at SET' at the start of this method, then step to SET as the next record and emit SET. Which would be incorrect since the skip=true will cause the DEL to also be skipped.

Also, we have a slightly bigger code-structuring problem here in that we don't want to stop if we see a rangedelete which causes sameStripeNonSkippable to be true, since there may be a DEL underneath that which we need to know about now in order to return SETWITHDEL. Note that the current code will leave skip=true and continue skipping after returning the rangedelete. The following comment in the code explains this

 `			// Although, note that `skip` may already be true before reaching here
			// due to an earlier key in the stripe. Then it is fine to leave it set
			// to true, as the earlier key must have had a higher sequence number.

Changing the code structure is going to be quite intrusive since we'll need to save the rangedelete too, iterate until we are done with this user key or stripe, then emit the SET or SETWITHDEL, and then the saved RANGEDEL and then resume.
Given the rarity that we will have a SET coincident with the start key of a RANGEDEL, we could be conservative and set skip=false if we are stopping because of sameStripeNonSkippable. This will cause this to be emitted as SET, then RANGEDEL, and then the other keys in the same stripe. So the compaction would possibly keep an extra key-value pair. Hmm, not sure when this extra key-value pair will eventually get deleted. Maybe a better approach is that if we encounter a sameStripeNonSkippable, be conservative and emit a SETWITHDEL now because that is always safe.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 6 files reviewed, 2 unresolved discussions (waiting on @sumeerbhola)


compaction_iter.go, line 465 at r1 (raw file):

Previously, sumeerbhola wrote…

I think we need to loop until we get a sameStripeNonSkippable or newStripe because we could have DEL => SET => SET'. Say we are at SET' at the start of this method, then step to SET as the next record and emit SET. Which would be incorrect since the skip=true will cause the DEL to also be skipped.

Also, we have a slightly bigger code-structuring problem here in that we don't want to stop if we see a rangedelete which causes sameStripeNonSkippable to be true, since there may be a DEL underneath that which we need to know about now in order to return SETWITHDEL. Note that the current code will leave skip=true and continue skipping after returning the rangedelete. The following comment in the code explains this

 `			// Although, note that `skip` may already be true before reaching here
			// due to an earlier key in the stripe. Then it is fine to leave it set
			// to true, as the earlier key must have had a higher sequence number.

Changing the code structure is going to be quite intrusive since we'll need to save the rangedelete too, iterate until we are done with this user key or stripe, then emit the SET or SETWITHDEL, and then the saved RANGEDEL and then resume.
Given the rarity that we will have a SET coincident with the start key of a RANGEDEL, we could be conservative and set skip=false if we are stopping because of sameStripeNonSkippable. This will cause this to be emitted as SET, then RANGEDEL, and then the other keys in the same stripe. So the compaction would possibly keep an extra key-value pair. Hmm, not sure when this extra key-value pair will eventually get deleted. Maybe a better approach is that if we encounter a sameStripeNonSkippable, be conservative and emit a SETWITHDEL now because that is always safe.

I pushed a new commit that should address the DEL -> SET -> SET' case. I also took the conservative approach that you mentioned, where we emit SETWITHDEL when we encounter a non skippable entry. I dropped in a TODO to optimize this, following a re-think of how to structure the code to make this simpler.


compaction_iter.go, line 465 at r2 (raw file):

			if t == sameStripeNonSkippable {
				// This is a key we cannot skip. We can safely emit a SETWITHDEL
				// here, though it is suboptimal due to ... FIXME

FIXME here is to pad out the explanation, paraphrasing your comment below.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 6 files reviewed, 2 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


compaction_iter.go, line 465 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

I pushed a new commit that should address the DEL -> SET -> SET' case. I also took the conservative approach that you mentioned, where we emit SETWITHDEL when we encounter a non skippable entry. I dropped in a TODO to optimize this, following a re-think of how to structure the code to make this simpler.

Logic looks correct to me.

@nicktrav nicktrav force-pushed the nickt.set-with-delete branch 2 times, most recently from 6f88c10 to 54f499b Compare September 10, 2021 19:48
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 12 files reviewed, 2 unresolved discussions (waiting on @sumeerbhola)


compaction_iter.go, line 465 at r1 (raw file):

Previously, sumeerbhola wrote…

Logic looks correct to me.

There's something subtle causing the metamorphic tests to fail.

In the diff I see records whose values have extra bytes. For example:

===== DIFF =====
_meta/210910-170740.824/{standard-000,standard-001}
@@ -5140,11 +5140,11 @@
 iter46.Prev("") // [false] <nil> #5140
 db.Delete("tgdi") // <nil> #5141
 snap36 = db.NewSnapshot() #5142
 iter43.Next("") // [false] <nil> #5143
 iter43.Prev("szgtdorsgku") // [invalid] <nil> #5144
-iter48.Prev("") // [true,"zmxbgc","nnipdhfl"] <nil> #5145
+iter48.Prev("") // [true,"zmxbgc","\x00nipdhfl"] <nil> #5145

I've been able to narrow this down to the interaction with the cache, and how it is freeing memory when we run with -tags invariants. When running without invariants, I do not see the issue.

In terms of the new code in this PR, it looks like the issue is somewhere in setNext(). Reverting the switch statement to bypass setNext(), I also see no issue:

		case InternalKeyKindSet, InternalKeyKindSetWithDelete:
			//i.setNext()
			i.saveKey()
			i.value = i.iterValue
			i.valid = true
			i.skip = true
			i.maybeZeroSeqnum(i.curSnapshotIdx)
			return &i.key, i.value

Specifically where in setNext() the problem is I'm not too sure, so could use some help in debugging further.

Running the metamorphic tests at SHA 54f499be (current tip of branch) with seed 1631305910565978000 and -tags invariants should produce the test failure I'm seeing.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 12 files reviewed, 2 unresolved discussions (waiting on @sumeerbhola)


compaction_iter.go, line 465 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

There's something subtle causing the metamorphic tests to fail.

In the diff I see records whose values have extra bytes. For example:

===== DIFF =====
_meta/210910-170740.824/{standard-000,standard-001}
@@ -5140,11 +5140,11 @@
 iter46.Prev("") // [false] <nil> #5140
 db.Delete("tgdi") // <nil> #5141
 snap36 = db.NewSnapshot() #5142
 iter43.Next("") // [false] <nil> #5143
 iter43.Prev("szgtdorsgku") // [invalid] <nil> #5144
-iter48.Prev("") // [true,"zmxbgc","nnipdhfl"] <nil> #5145
+iter48.Prev("") // [true,"zmxbgc","\x00nipdhfl"] <nil> #5145

I've been able to narrow this down to the interaction with the cache, and how it is freeing memory when we run with -tags invariants. When running without invariants, I do not see the issue.

In terms of the new code in this PR, it looks like the issue is somewhere in setNext(). Reverting the switch statement to bypass setNext(), I also see no issue:

		case InternalKeyKindSet, InternalKeyKindSetWithDelete:
			//i.setNext()
			i.saveKey()
			i.value = i.iterValue
			i.valid = true
			i.skip = true
			i.maybeZeroSeqnum(i.curSnapshotIdx)
			return &i.key, i.value

Specifically where in setNext() the problem is I'm not too sure, so could use some help in debugging further.

Running the metamorphic tests at SHA 54f499be (current tip of branch) with seed 1631305910565978000 and -tags invariants should produce the test failure I'm seeing.

I was able to come up with a small reproducer in a unit test: https://gist.github.com/nicktrav/f6fec5ff35c03f61cfabcf9fb0440d06

$ go test -tags invariants -run TestManualCompaction -count 1 ./...
--- FAIL: TestManualCompaction (0.00s)
    compaction_test.go:1151:
        testdata/manual_compaction_use_after_free_repro:39: first
        next
        next
        expected:
        a:baz
        z:bar
        .

        found:
        a:baz
        z:
        .
FAIL

It looks like there's a use-after-free bug when we iterate past the end-key of the final file in the lowest level in the compaction, when the end-key is a SET (as we are now iterating forward to look for a DELETE). When we iterate past the end-key, the sstable.singleLevelIterator frees the memory from the block cache. This is the data for the value that we still have a reference to in compaction_iter.go:L447.

Note that we only see this when the cache size is small, which is what I've observed from the metamorphic tests.

I also had to un-comment out some dead code in cache/value_invariants.go to zero out the memory when it is freed. Without this the test passes, but I believe this is only due to the fact that this memory is probably still live, and we can point to it.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 12 files reviewed, 2 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


compaction_iter.go, line 465 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

I was able to come up with a small reproducer in a unit test: https://gist.github.com/nicktrav/f6fec5ff35c03f61cfabcf9fb0440d06

$ go test -tags invariants -run TestManualCompaction -count 1 ./...
--- FAIL: TestManualCompaction (0.00s)
    compaction_test.go:1151:
        testdata/manual_compaction_use_after_free_repro:39: first
        next
        next
        expected:
        a:baz
        z:bar
        .

        found:
        a:baz
        z:
        .
FAIL

It looks like there's a use-after-free bug when we iterate past the end-key of the final file in the lowest level in the compaction, when the end-key is a SET (as we are now iterating forward to look for a DELETE). When we iterate past the end-key, the sstable.singleLevelIterator frees the memory from the block cache. This is the data for the value that we still have a reference to in compaction_iter.go:L447.

Note that we only see this when the cache size is small, which is what I've observed from the metamorphic tests.

I also had to un-comment out some dead code in cache/value_invariants.go to zero out the memory when it is freed. Without this the test passes, but I believe this is only due to the fact that this memory is probably still live, and we can point to it.

Nice find!
This looks like a serious bug. We are relying on the stability of the value []byte even after stepping to the next entry, but if the next entry is in a different block we will call cache.Handle.Release on the existing block which can free it. I think we got lucky in that compaction_iter.go was only using iterPosNext (and never using iterPosPrev) for single delete and merges:

  • single delete, the iterValue byte slice is empty, so doesn't matter if we free the underlying memory.
  • merge: the iterValue is not used since we are using the output of the merge.

I don't think any of the other internal iterators are relying on this behavior. There is the use iterPosNext/iterPosPrev in iterator.go that needs to be looked at carefully -- based on a quick read I think it is doing the right thing and saving the value in valueBuf. I think similar saving logic in a valueBuf in compaction_iter.go for this setNext would fix this problem.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @sumeerbhola from a discussion.
Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)


compaction_iter.go, line 465 at r1 (raw file):

Previously, sumeerbhola wrote…

Nice find!
This looks like a serious bug. We are relying on the stability of the value []byte even after stepping to the next entry, but if the next entry is in a different block we will call cache.Handle.Release on the existing block which can free it. I think we got lucky in that compaction_iter.go was only using iterPosNext (and never using iterPosPrev) for single delete and merges:

  • single delete, the iterValue byte slice is empty, so doesn't matter if we free the underlying memory.
  • merge: the iterValue is not used since we are using the output of the merge.

I don't think any of the other internal iterators are relying on this behavior. There is the use iterPosNext/iterPosPrev in iterator.go that needs to be looked at carefully -- based on a quick read I think it is doing the right thing and saving the value in valueBuf. I think similar saving logic in a valueBuf in compaction_iter.go for this setNext would fix this problem.

Nice. I took the same approach that is used in iterator.go. Interestingly, the introduction of that code was to fix the reverse version of this bug (seeking backwards across the start of a block in a lower level). See #572. We have existing test coverage in this case, as each time we encounter a SET, we seek forward at least once, and there are a number of test cases that have SETs as the final key in an sstable.

Marking this as resolved, as all tests are now passing 👍

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)


compaction_iter.go, line 465 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Nice. I took the same approach that is used in iterator.go. Interestingly, the introduction of that code was to fix the reverse version of this bug (seeking backwards across the start of a block in a lower level). See #572. We have existing test coverage in this case, as each time we encounter a SET, we seek forward at least once, and there are a number of test cases that have SETs as the final key in an sstable.

Marking this as resolved, as all tests are now passing 👍

If we added another field to Options.private and used that to wrap the InternalIterator normally used by compactionIter in an invalidatingIter, would that cause the bug to trigger easily on the existing datadriven tests? If yes, we should do that. We should generally add non-metamorphic tests for bugs found by the metamorphic test since the former are easier to debug if we later regress.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)


compaction_iter.go, line 465 at r1 (raw file):

Previously, sumeerbhola wrote…

If we added another field to Options.private and used that to wrap the InternalIterator normally used by compactionIter in an invalidatingIter, would that cause the bug to trigger easily on the existing datadriven tests? If yes, we should do that. We should generally add non-metamorphic tests for bugs found by the metamorphic test since the former are easier to debug if we later regress.

Good call. I tested just now on master with an invalidatingIter and there are some test failures. I'll dig into these and put up a separate patch, along with test case(s) to verify.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have confirmed that this fixes the failing metamorphic test from #1254, with the problematic SET -> ... -> DEL -> SET -> SINGLEDEL sequence in the history. I also ran the meta tests for ~1 hour without failure.

#1254 will most likely land after this change lands, but I just wanted to point out that the lack of test failures increases the confidence that we're a) resolving the underlying with SET / SINGLEDEL, and b) not introducing any additional correctness bugs.

Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumeerbhola - looks like I spoke too soon. I encountered a metamorphic test that fails, when a SETWITHDEL is emitted twice at the same (zero) sequence number, which fails the level checker with an error along the lines of:

out of order keys cpqjje#0,SETWITHDEL >= cpqjje#0,SETWITHDEL in L6: fileNum=000036

Here's a minimal reproducer:

define
a.SET.3:c
a.RANGEDEL.2:z
a.SET.2:b
a.DEL.1:
----

iter allow-zero-seqnum=true
first
next
next
next
----
a#0,18:c
a#2,15:z
a#0,18:b
.

I think this is that "coincident" case we were talking about, that should be rare. However, it's not clear what the expected behavior should be?

There are also some similar existing data-driven tests that have a RANGEDEL and SET at the same sequence number, that output both the RANGEDEL and SET from the compaction iterator (example).

Reviewable status: 4 of 12 files reviewed, all discussions resolved (waiting on @sumeerbhola)

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r5.
Reviewable status: 5 of 12 files reviewed, 2 unresolved discussions (waiting on @nicktrav)


compaction_iter.go, line 460 at r5 (raw file):

	// i.iter-owned value buffer.
	i.valueBuf = append(i.valueBuf[:0], i.iterValue...)
	i.value = i.valueBuf

I think we should do i.value = i.iterValue here, since we may return in the following if-block.
After that if-block we should do this valueBuf stuff since then we know we are going to iterate forward.


compaction_iter.go, line 506 at r5 (raw file):

				// returns a sameStripeNonSkippable when it encounters a
				// RANGEDEL.
				// TODO(travers): optimize to handle the RANGEDEL case.

we should set i.skip=true since we do want points (not rangedel) with the same user key to be skipped after the rangedel is returned.
That should fix the zero seqnum problem.

@nicktrav nicktrav force-pushed the nickt.set-with-delete branch 3 times, most recently from df90a87 to 0a4ab3d Compare September 14, 2021 18:46
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @sumeerbhola from a discussion.
Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)


compaction_iter.go, line 460 at r5 (raw file):

Previously, sumeerbhola wrote…

I think we should do i.value = i.iterValue here, since we may return in the following if-block.
After that if-block we should do this valueBuf stuff since then we know we are going to iterate forward.

Done.


compaction_iter.go, line 506 at r5 (raw file):

Previously, sumeerbhola wrote…

we should set i.skip=true since we do want points (not rangedel) with the same user key to be skipped after the rangedel is returned.
That should fix the zero seqnum problem.

Done. The code reads slightly confusingly, as we're setting skip within a conditional branch intended for non-skippable keys. I added a comment here.

Also added some data-driven tests to exercise this a little more. r5/r6 have these changes.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- minor suggestions on code comments.

Reminder that the writing of SETWITHDEL needs to be gated by a new format major version.

Reviewed 6 of 6 files at r3, 2 of 2 files at r6, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


compaction_iter.go, line 489 at r6 (raw file):

				// DEL will eventually be elided in a subsequent compaction. The
				// cost for ensuring correctness is that this entry is kept
				// around for an additional compaction cycle.

... is that the DEL is kept around for additional compaction cycle(s).


compaction_iter.go, line 505 at r6 (raw file):

				// returns a sameStripeNonSkippable when it encounters a
				// RANGEDEL.
				// TODO(travers): optimize to handle the RANGEDEL case.

nit: can you add a qualification to the TODO that says:
optimize ... case if it turns out to be a performance problem.

I very much doubt we will ever need to do anything here given that the rangedel start key has to coincide with this point key, so this is going to be rare.


compaction_iter.go, line 512 at r6 (raw file):

				// if we mark the key to be skipped, subsequent loops will still
				// ensure that the keys are retained. However, keys *beyond* the
				// this key are indeed truly skippable.

I think this comment could be simplified
// By setting i.skip=true, we are saying that after the non-skippable key is emitted (which is likely a RANGEDEL), the remaining point keys that share the same user key as this saved key should be skipped.


testdata/compaction_iter, line 1408 at r6 (raw file):

a#3,18:3
b#1,1:c
.

nice test cases


testdata/singledel_manual_compaction, line 2 at r6 (raw file):

# This is not actually a manual compaction test, and simply uses manual
# compaction to demonstrate single delete semantics with used with

when used with ...

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder that the writing of SETWITHDEL needs to be gated by a new format major version.

Ah, thanks. I'll add this to my TODO list.

Dismissed @sumeerbhola from 5 discussions.
Reviewable status: 10 of 12 files reviewed, all discussions resolved (waiting on @sumeerbhola)


compaction_iter.go, line 489 at r6 (raw file):

Previously, sumeerbhola wrote…

... is that the DEL is kept around for additional compaction cycle(s).

Done.


compaction_iter.go, line 505 at r6 (raw file):

Previously, sumeerbhola wrote…

nit: can you add a qualification to the TODO that says:
optimize ... case if it turns out to be a performance problem.

I very much doubt we will ever need to do anything here given that the rangedel start key has to coincide with this point key, so this is going to be rare.

Done.


compaction_iter.go, line 512 at r6 (raw file):

Previously, sumeerbhola wrote…

I think this comment could be simplified
// By setting i.skip=true, we are saying that after the non-skippable key is emitted (which is likely a RANGEDEL), the remaining point keys that share the same user key as this saved key should be skipped.

Done.


testdata/compaction_iter, line 1408 at r6 (raw file):

Previously, sumeerbhola wrote…

nice test cases

👍


testdata/singledel_manual_compaction, line 2 at r6 (raw file):

Previously, sumeerbhola wrote…

when used with ...

Done.

@nicktrav nicktrav changed the title [WIP] internal/base: add SETWITHDEL key kind internal/base: add SETWITHDEL key kind Sep 14, 2021
@nicktrav nicktrav marked this pull request as ready for review September 14, 2021 20:59
@nicktrav nicktrav requested review from petermattis and a team September 14, 2021 21:00
@nicktrav
Copy link
Contributor Author

Opening this up for wider review.

The one bug remaining item is to add the version gate. That shouldn't affect the review of the remainder of the PR.

There are also some failing tests due to the introduction of the new key kind that I need to dig into (likely to regenerate some test fixtures).

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 12 files reviewed, 1 unresolved discussion (waiting on @petermattis and @sumeerbhola)


internal/base/internal.go, line 53 at r9 (raw file):

	// searching for any kind of internal key formed by a certain user key and
	// seqNum.
	InternalKeyKindMax InternalKeyKind = 19

@sumeerbhola - does this part look ok to you? I bumped the max key kind to 19. This causes some fixture tests to fail.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 13 files reviewed, 1 unresolved discussion (waiting on @petermattis and @sumeerbhola)


internal/base/internal_test.go, line 40 at r10 (raw file):

		"foo",
		"foo\x08\x07\x06\x05\x04\x03\x02",
		"foo\x14\x07\x06\x05\x04\x03\x02\x01",

Bumped this from 18 -> 20, so that it is no longer valid.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbowens - r11 contains the logic to switch on the format major version. Would be great if you could give this part of the PR a look.

@sumeerbhola - I also forked some of the data driven tests like we talked about, to ensure we're testing against new and old versions of the DB. I've reverted the testdata files I modified for this PR to the original state, and instead made copies that contain the modifications that are specific to the new DB format version.

The branching / switching logic clutters things a little. If you have another way you were thinking of, let me know, and I can revisit.

Question: what default version of the DB do we want on the Pebble master / release branches? It looks like by default we'll run in the most compatible mode.

Reviewable status: 5 of 24 files reviewed, 1 unresolved discussion (waiting on @petermattis and @sumeerbhola)

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r9.
Reviewable status: 6 of 24 files reviewed, 1 unresolved discussion (waiting on @nicktrav, @petermattis, and @sumeerbhola)


internal/base/internal.go, line 53 at r9 (raw file):

Previously, nicktrav (Nick Travers) wrote…

@sumeerbhola - does this part look ok to you? I bumped the max key kind to 19. This causes some fixture tests to fail.

I think this needs to be 18, because the comment here says it is equal to the largest used key kind. Which means the other place I suggested using an array needs to use bool[InternalKeyKindMax+1].

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Only minor comments.

Reviewed 1 of 1 files at r10, 16 of 17 files at r11, all commit messages.
Reviewable status: 23 of 24 files reviewed, 10 unresolved discussions (waiting on @nicktrav and @petermattis)


-- commits, line 10 at r11:
key


-- commits, line 35 at r11:
nit: missing period
(same for bullets below)


compaction_iter.go, line 517 at r11 (raw file):

	// We can only proceed if the format major version is high enough that it
	// support SetWithDelete. If it does not support the format version, we fall

supports


compaction_iter.go, line 522 at r11 (raw file):

		i.skip = true
		return
	}

Can you hoist this before the valueBuf logic.


compaction_iter_test.go, line 86 at r11 (raw file):

	fileFunc := func(formatVersion FormatMajorVersion) string {
		if formatVersion < FormatSetWithDelete {
			return "testdata/compaction_iter"

Maybe reviewable is confused but looks like these existing testdata files like compaction_iter, manual_compaction, single_del_manual_compaction have been removed.


compaction_iter_test.go, line 223 at r11 (raw file):

	for _, formatVersion := range formatVersions {
		t.Run(fmt.Sprintf("version-%s", formatVersion), func(t *testing.T) {
			runTest(formatVersion)

I think we should be passing t to runTest and not using the t that is in the context of that closure.


compaction_test.go, line 1130 at r11 (raw file):

		}
		opts.private.disableAutomaticCompactions = true
		opts.testingRandomized()

how about keeping this randomization of version but changing it to opts.testingRandomized(minVersion, maxVersion) which represents the range [minVersion, maxVersion). Then change the testCases struct to have the min and max versions instead of the specific version. That way we get better coverage.


compaction_test.go, line 1328 at r11 (raw file):

	for _, tc := range testCases {
		t.Run(fmt.Sprintf("version-%s", tc.formatVersion), func(t *testing.T) {
			runTest(tc.testData, tc.formatVersion)

same comment about passing t as a parameter.


format_major_version.go, line 66 at r11 (raw file):

	FormatVersioned
	// FormatSetWithDelete is a format major version that introduces a new key
	// kind, base.InternalKeyKindSetWithDelete. Previous Pebble version will be

how about the following phrasing, since by definition they are not aware of this new key kind.
... versions will be unable to open this database.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 17 files at r11.
Dismissed @sumeerbhola from 8 discussions.
Reviewable status: 15 of 25 files reviewed, 1 unresolved discussion (waiting on @nicktrav, @petermattis, and @sumeerbhola)


-- commits, line 10 at r11:

Previously, sumeerbhola wrote…

key

Done.


compaction_iter.go, line 517 at r11 (raw file):

Previously, sumeerbhola wrote…

supports

Done.


compaction_iter.go, line 522 at r11 (raw file):

Previously, sumeerbhola wrote…

Can you hoist this before the valueBuf logic.

Done. Merged the check with the other return-early check.


compaction_iter_test.go, line 86 at r11 (raw file):

Previously, sumeerbhola wrote…

Maybe reviewable is confused but looks like these existing testdata files like compaction_iter, manual_compaction, single_del_manual_compaction have been removed.

Good catch. That was a silly on my part. Fixed.


compaction_iter_test.go, line 223 at r11 (raw file):

Previously, sumeerbhola wrote…

I think we should be passing t to runTest and not using the t that is in the context of that closure.

Fixed.


compaction_test.go, line 1130 at r11 (raw file):

Previously, sumeerbhola wrote…

how about keeping this randomization of version but changing it to opts.testingRandomized(minVersion, maxVersion) which represents the range [minVersion, maxVersion). Then change the testCases struct to have the min and max versions instead of the specific version. That way we get better coverage.

I updated this to pick a random version in a given range. I opted to do the randomization in the test itself, rather than modifying all existing callers of testingRandomized to accept a version. This seemed reasonable, given this is the only test that has the requirement for a range of values.


compaction_test.go, line 1328 at r11 (raw file):

Previously, sumeerbhola wrote…

same comment about passing t as a parameter.

Done.


format_major_version.go, line 66 at r11 (raw file):

Previously, sumeerbhola wrote…

how about the following phrasing, since by definition they are not aware of this new key kind.
... versions will be unable to open this database.

Done.


internal/base/internal.go, line 53 at r9 (raw file):

Previously, sumeerbhola wrote…

I think this needs to be 18, because the comment here says it is equal to the largest used key kind. Which means the other place I suggested using an array needs to use bool[InternalKeyKindMax+1].

Unfortunately doing that breaks things elsewhere. Specifically the internalKeyKindNames slice definition below. As the key kind is used as an index into the slice, we can't re-use.

One way to fix this is to just remove "MAX" from the slice - given it's not part of the file format, we shouldn't need to look it up. Making this change doesn't seem to break anything elsewhere. I've pushed this change.

There is still the test failure of TestFixtureOutput in writer_fixture_test.go that we need to look into. It looks like the existing InternalKeyKindMax has somehow leaked into the sstables that are generated as fixtures, so now that the max value has been incremented, it's breaking the generated output in the test.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 17 files at r11, 9 of 9 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @petermattis and @sumeerbhola)

Currently, SingleDelete is only guaranteed to behave correctly if the
key has been Set _at most once_. Undefined behavior results outside of
this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can
result in non-deterministic, undefined behavior, due to the way in which
early intents are removed and cause the DB to "forget" whether a key has
been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the
intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just
`a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This
incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in cockroachdb#1252.

A solution, outlined in cockroachdb#1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially
  consumed a Delete or SingleDelete in a compaction.
- The existing Set key kind now represents a promise that it has not
  consumed a Delete or SingleDelete. This is what will be written when a
  caller uses `pebble.Writer.Set`.
- A `SET => SINGLEDEL` continues to cause both to be consumed.
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`,
  reflecting the fact that there may be deleted entries under the
  SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic
described above, required for writing out these new keys during a
compaction to improve the determinism of the `SET => SINGLEDEL`
semantics.

This new record type is gated behind a format major version, to ensure
that older DBs continue to use the existing semantics, while newer DBs
will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB
cannot read sstables written by a newer version).

Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
@nicktrav nicktrav merged commit 157aa05 into cockroachdb:master Sep 16, 2021
@nicktrav nicktrav deleted the nickt.set-with-delete branch September 16, 2021 02:09
@nicktrav
Copy link
Contributor Author

TFTR!

Also forgot to note that I did another two hour burn of the meta tests and nothing cropped up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants