internal/base: add SETWITHDEL key kind #1261
Conversation
Reviewed 5 of 6 files at r1.
Reviewable status: 5 of 6 files reviewed, 1 unresolved discussion (waiting on @nicktrav)
compaction_iter.go, line 465 at r1 (raw file):
// SETWITHDEL. This guards against a SINGLEDEL above us removing the
// original SET and erroneously exposing keys underneath it.
if i.iterKey.Kind() == InternalKeyKindDelete {
I think we need to loop until we get a sameStripeNonSkippable or newStripe, because we could have DEL => SET => SET'. Say we are at SET' at the start of this method, then step to SET as the next record and emit SET. That would be incorrect, since skip=true will cause the DEL to also be skipped.
Also, we have a slightly bigger code-structuring problem here, in that we don't want to stop if we see a rangedelete which causes sameStripeNonSkippable to be true, since there may be a DEL underneath it that we need to know about now in order to return SETWITHDEL. Note that the current code will leave skip=true and continue skipping after returning the rangedelete. The following comment in the code explains this:

// Although, note that `skip` may already be true before reaching here
// due to an earlier key in the stripe. Then it is fine to leave it set
// to true, as the earlier key must have had a higher sequence number.

Changing the code structure is going to be quite intrusive, since we'll need to save the rangedelete too, iterate until we are done with this user key or stripe, then emit the SET or SETWITHDEL, then the saved RANGEDEL, and then resume.
Given the rarity of a SET coincident with the start key of a RANGEDEL, we could be conservative and set skip=false if we are stopping because of sameStripeNonSkippable. This would cause the entry to be emitted as SET, then the RANGEDEL, and then the other keys in the same stripe. So the compaction would possibly keep an extra key-value pair. Hmm, not sure when this extra key-value pair would eventually get deleted. Maybe a better approach is that if we encounter a sameStripeNonSkippable, be conservative and emit a SETWITHDEL now, because that is always safe.
Force-pushed 9d26f58 to cd5ee70.
Reviewable status: 4 of 6 files reviewed, 2 unresolved discussions (waiting on @sumeerbhola)
compaction_iter.go, line 465 at r1 (raw file):
Previously, sumeerbhola wrote…
I pushed a new commit that should address the DEL -> SET -> SET' case. I also took the conservative approach that you mentioned, where we emit SETWITHDEL when we encounter a non skippable entry. I dropped in a TODO to optimize this, following a re-think of how to structure the code to make this simpler.
compaction_iter.go, line 465 at r2 (raw file):
if t == sameStripeNonSkippable {
    // This is a key we cannot skip. We can safely emit a SETWITHDEL
    // here, though it is suboptimal due to ... FIXME
FIXME here is to pad out the explanation, paraphrasing your comment below.
Reviewable status: 4 of 6 files reviewed, 2 unresolved discussions (waiting on @nicktrav and @sumeerbhola)
compaction_iter.go, line 465 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Logic looks correct to me.
Force-pushed 6f88c10 to 54f499b.
Reviewable status: 4 of 12 files reviewed, 2 unresolved discussions (waiting on @sumeerbhola)
compaction_iter.go, line 465 at r1 (raw file):
Previously, sumeerbhola wrote…
Logic looks correct to me.
There's something subtle causing the metamorphic tests to fail.
In the diff I see records whose values have extra bytes. For example:
===== DIFF =====
_meta/210910-170740.824/{standard-000,standard-001}
@@ -5140,11 +5140,11 @@
iter46.Prev("") // [false] <nil> #5140
db.Delete("tgdi") // <nil> #5141
snap36 = db.NewSnapshot() #5142
iter43.Next("") // [false] <nil> #5143
iter43.Prev("szgtdorsgku") // [invalid] <nil> #5144
-iter48.Prev("") // [true,"zmxbgc","nnipdhfl"] <nil> #5145
+iter48.Prev("") // [true,"zmxbgc","\x00nipdhfl"] <nil> #5145
I've been able to narrow this down to the interaction with the cache, and how it is freeing memory when we run with -tags invariants. When running without invariants, I do not see the issue.
In terms of the new code in this PR, it looks like the issue is somewhere in setNext(). Reverting the switch statement to bypass setNext(), I also see no issue:
case InternalKeyKindSet, InternalKeyKindSetWithDelete:
    //i.setNext()
    i.saveKey()
    i.value = i.iterValue
    i.valid = true
    i.skip = true
    i.maybeZeroSeqnum(i.curSnapshotIdx)
    return &i.key, i.value
Specifically where in setNext() the problem is I'm not too sure, so could use some help in debugging further.
Running the metamorphic tests at SHA 54f499be (current tip of branch) with seed 1631305910565978000 and -tags invariants should produce the test failure I'm seeing.
Reviewable status: 4 of 12 files reviewed, 2 unresolved discussions (waiting on @sumeerbhola)
compaction_iter.go, line 465 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I was able to come up with a small reproducer in a unit test: https://gist.github.com/nicktrav/f6fec5ff35c03f61cfabcf9fb0440d06
$ go test -tags invariants -run TestManualCompaction -count 1 ./...
--- FAIL: TestManualCompaction (0.00s)
compaction_test.go:1151:
testdata/manual_compaction_use_after_free_repro:39: first
next
next
expected:
a:baz
z:bar
.
found:
a:baz
z:
.
FAIL
It looks like there's a use-after-free bug when we iterate past the end-key of the final file in the lowest level in the compaction, when the end-key is a SET (as we are now iterating forward to look for a DELETE). When we iterate past the end-key, the sstable.singleLevelIterator frees the memory from the block cache. This is the data for the value that we still have a reference to in compaction_iter.go:L447.
Note that we only see this when the cache size is small, which is what I've observed from the metamorphic tests.
I also had to uncomment some dead code in cache/value_invariants.go to zero out the memory when it is freed. Without this the test passes, but I believe this is only due to the fact that this memory is probably still live, and we can point to it.
Reviewable status: 4 of 12 files reviewed, 2 unresolved discussions (waiting on @nicktrav and @sumeerbhola)
compaction_iter.go, line 465 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Nice find!
This looks like a serious bug. We are relying on the stability of the value []byte even after stepping to the next entry, but if the next entry is in a different block we will call cache.Handle.Release on the existing block, which can free it. I think we got lucky in that compaction_iter.go was only using iterPosNext (and never using iterPosPrev) for single deletes and merges:
- single delete: the iterValue byte slice is empty, so it doesn't matter if we free the underlying memory.
- merge: the iterValue is not used, since we are using the output of the merge.
I don't think any of the other internal iterators are relying on this behavior. There is the use of iterPosNext/iterPosPrev in iterator.go that needs to be looked at carefully -- based on a quick read I think it is doing the right thing and saving the value in valueBuf. I think similar saving logic in a valueBuf in compaction_iter.go for this setNext would fix this problem.
Force-pushed 54f499b to dc04f9d.
Dismissed @sumeerbhola from a discussion.
Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)
compaction_iter.go, line 465 at r1 (raw file):
Previously, sumeerbhola wrote…
Nice. I took the same approach that is used in iterator.go. Interestingly, the introduction of that code was to fix the reverse version of this bug (seeking backwards across the start of a block in a lower level). See #572. We have existing test coverage of this case, as each time we encounter a SET, we seek forward at least once, and there are a number of test cases that have SETs as the final key in an sstable.
Marking this as resolved, as all tests are now passing 👍
Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)
compaction_iter.go, line 465 at r1 (raw file):
Previously, nicktrav (Nick Travers) wrote…
If we added another field to Options.private and used that to wrap the InternalIterator normally used by compactionIter in an invalidatingIter, would that cause the bug to trigger easily on the existing datadriven tests? If yes, we should do that. We should generally add non-metamorphic tests for bugs found by the metamorphic test, since the former are easier to debug if we later regress.
Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)
compaction_iter.go, line 465 at r1 (raw file):
Previously, sumeerbhola wrote…
Good call. I tested just now on master with an invalidatingIter and there are some test failures. I'll dig into these and put up a separate patch, along with test case(s) to verify.
I have confirmed that this fixes the failing metamorphic test from #1254, with the problematic SET -> ... -> DEL -> SET -> SINGLEDEL sequence in the history. I also ran the meta tests for ~1 hour without failure.
#1254 will most likely land after this change lands, but I just wanted to point out that the lack of test failures increases the confidence that we're a) resolving the underlying issue with SET / SINGLEDEL, and b) not introducing any additional correctness bugs.
Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)
Force-pushed dc04f9d to c1cda4e.
@sumeerbhola - looks like I spoke too soon. I encountered a metamorphic test that fails, when a SETWITHDEL is emitted twice at the same (zero) sequence number, which fails the level checker with an error along the lines of:
out of order keys cpqjje#0,SETWITHDEL >= cpqjje#0,SETWITHDEL in L6: fileNum=000036
Here's a minimal reproducer:
define
a.SET.3:c
a.RANGEDEL.2:z
a.SET.2:b
a.DEL.1:
----
iter allow-zero-seqnum=true
first
next
next
next
----
a#0,18:c
a#2,15:z
a#0,18:b
.
I think this is that "coincident" case we were talking about, which should be rare. However, it's not clear what the expected behavior should be.
There are also some similar existing data-driven tests that have a RANGEDEL and SET at the same sequence number, that output both the RANGEDEL and SET from the compaction iterator (example).
Reviewable status: 4 of 12 files reviewed, all discussions resolved (waiting on @sumeerbhola)
Reviewed 1 of 1 files at r5.
Reviewable status: 5 of 12 files reviewed, 2 unresolved discussions (waiting on @nicktrav)
compaction_iter.go, line 460 at r5 (raw file):
// i.iter-owned value buffer.
i.valueBuf = append(i.valueBuf[:0], i.iterValue...)
i.value = i.valueBuf
I think we should do i.value = i.iterValue here, since we may return in the following if-block. After that if-block we should do this valueBuf stuff, since then we know we are going to iterate forward.
compaction_iter.go, line 506 at r5 (raw file):
// returns a sameStripeNonSkippable when it encounters a
// RANGEDEL.
// TODO(travers): optimize to handle the RANGEDEL case.
we should set i.skip=true, since we do want points (not rangedels) with the same user key to be skipped after the rangedel is returned. That should fix the zero seqnum problem.
Force-pushed df90a87 to 0a4ab3d.
Dismissed @sumeerbhola from a discussion.
Reviewable status: 4 of 12 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)
compaction_iter.go, line 460 at r5 (raw file):
Previously, sumeerbhola wrote…
Done.
compaction_iter.go, line 506 at r5 (raw file):
Previously, sumeerbhola wrote…
Done. The code reads slightly confusingly, as we're setting skip within a conditional branch intended for non-skippable keys. I added a comment here.
Also added some data-driven tests to exercise this a little more. r5/r6 have these changes.
Looks good -- minor suggestions on code comments.
Reminder that the writing of SETWITHDEL needs to be gated by a new format major version.
Reviewed 6 of 6 files at r3, 2 of 2 files at r6, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @nicktrav and @sumeerbhola)
compaction_iter.go, line 489 at r6 (raw file):
// DEL will eventually be elided in a subsequent compaction. The
// cost for ensuring correctness is that this entry is kept
// around for an additional compaction cycle.
... is that the DEL is kept around for additional compaction cycle(s).
compaction_iter.go, line 505 at r6 (raw file):
// returns a sameStripeNonSkippable when it encounters a
// RANGEDEL.
// TODO(travers): optimize to handle the RANGEDEL case.
nit: can you add a qualification to the TODO that says:
optimize ... case if it turns out to be a performance problem.
I very much doubt we will ever need to do anything here given that the rangedel start key has to coincide with this point key, so this is going to be rare.
compaction_iter.go, line 512 at r6 (raw file):
// if we mark the key to be skipped, subsequent loops will still
// ensure that the keys are retained. However, keys *beyond* this
// key are indeed truly skippable.
I think this comment could be simplified
// By setting i.skip=true, we are saying that after the non-skippable key is emitted (which is likely a RANGEDEL), the remaining point keys that share the same user key as this saved key should be skipped.
testdata/compaction_iter, line 1408 at r6 (raw file):
a#3,18:3
b#1,1:c
.
nice test cases
testdata/singledel_manual_compaction, line 2 at r6 (raw file):
# This is not actually a manual compaction test, and simply uses manual
# compaction to demonstrate single delete semantics with used with
when used with ...
Force-pushed 0a4ab3d to 1a15100.
Reminder that the writing of SETWITHDEL needs to be gated by a new format major version.
Ah, thanks. I'll add this to my TODO list.
Dismissed @sumeerbhola from 5 discussions.
Reviewable status: 10 of 12 files reviewed, all discussions resolved (waiting on @sumeerbhola)
compaction_iter.go, line 489 at r6 (raw file):
Previously, sumeerbhola wrote…
... is that the DEL is kept around for additional compaction cycle(s).
Done.
compaction_iter.go, line 505 at r6 (raw file):
Previously, sumeerbhola wrote…
Done.
compaction_iter.go, line 512 at r6 (raw file):
Previously, sumeerbhola wrote…
Done.
testdata/compaction_iter, line 1408 at r6 (raw file):
Previously, sumeerbhola wrote…
nice test cases
👍
testdata/singledel_manual_compaction, line 2 at r6 (raw file):
Previously, sumeerbhola wrote…
when used with ...
Done.
Force-pushed 1a15100 to e90cfca.
Opening this up for wider review. The one remaining item is to add the version gate. That shouldn't affect the review of the remainder of the PR. There are also some failing tests due to the introduction of the new key kind that I need to dig into (likely a need to regenerate some test fixtures).
Force-pushed e90cfca to a754ba7.
Reviewable status: 9 of 12 files reviewed, 1 unresolved discussion (waiting on @petermattis and @sumeerbhola)
internal/base/internal.go, line 53 at r9 (raw file):
// searching for any kind of internal key formed by a certain user key and // seqNum. InternalKeyKindMax InternalKeyKind = 19
@sumeerbhola - does this part look ok to you? I bumped the max key kind to 19. This causes some fixture tests to fail.
Force-pushed a754ba7 to 2fb8de6.
Reviewable status: 9 of 13 files reviewed, 1 unresolved discussion (waiting on @petermattis and @sumeerbhola)
internal/base/internal_test.go, line 40 at r10 (raw file):
"foo", "foo\x08\x07\x06\x05\x04\x03\x02", "foo\x14\x07\x06\x05\x04\x03\x02\x01",
Bumped this from 18 -> 20, so that it is no longer valid.
Force-pushed 2fb8de6 to d8ee6a9.
@jbowens - r11 contains the logic to switch on the format major version. Would be great if you could give this part of the PR a look.
@sumeerbhola - I also forked some of the data driven tests like we talked about, to ensure we're testing against new and old versions of the DB. I've reverted the testdata files I modified for this PR to the original state, and instead made copies that contain the modifications that are specific to the new DB format version.
The branching / switching logic clutters things a little. If you have another way you were thinking of, let me know, and I can revisit.
Question: what default version of the DB do we want on the Pebble master / release branches? It looks like by default we'll run in the most compatible mode.
Reviewable status: 5 of 24 files reviewed, 1 unresolved discussion (waiting on @petermattis and @sumeerbhola)
Reviewed 1 of 1 files at r9.
Reviewable status: 6 of 24 files reviewed, 1 unresolved discussion (waiting on @nicktrav, @petermattis, and @sumeerbhola)
internal/base/internal.go, line 53 at r9 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I think this needs to be 18, because the comment here says it is equal to the largest used key kind. Which means the other place I suggested using an array needs to use [InternalKeyKindMax+1]bool.
Looks good. Only minor comments.
Reviewed 1 of 1 files at r10, 16 of 17 files at r11, all commit messages.
Reviewable status: 23 of 24 files reviewed, 10 unresolved discussions (waiting on @nicktrav and @petermattis)
-- commits, line 10 at r11:
key
-- commits, line 35 at r11:
nit: missing period
(same for bullets below)
compaction_iter.go, line 517 at r11 (raw file):
// We can only proceed if the format major version is high enough that it
// support SetWithDelete. If it does not support the format version, we fall
supports
compaction_iter.go, line 522 at r11 (raw file):
    i.skip = true
    return
}
Can you hoist this before the valueBuf logic.
compaction_iter_test.go, line 86 at r11 (raw file):
fileFunc := func(formatVersion FormatMajorVersion) string {
    if formatVersion < FormatSetWithDelete {
        return "testdata/compaction_iter"
Maybe reviewable is confused but looks like these existing testdata files like compaction_iter, manual_compaction, single_del_manual_compaction have been removed.
compaction_iter_test.go, line 223 at r11 (raw file):
for _, formatVersion := range formatVersions {
    t.Run(fmt.Sprintf("version-%s", formatVersion), func(t *testing.T) {
        runTest(formatVersion)
I think we should be passing t to runTest and not using the t that is in the context of that closure.
compaction_test.go, line 1130 at r11 (raw file):
}
opts.private.disableAutomaticCompactions = true
opts.testingRandomized()
how about keeping this randomization of version but changing it to opts.testingRandomized(minVersion, maxVersion) which represents the range [minVersion, maxVersion). Then change the testCases struct to have the min and max versions instead of the specific version. That way we get better coverage.
compaction_test.go, line 1328 at r11 (raw file):
for _, tc := range testCases {
    t.Run(fmt.Sprintf("version-%s", tc.formatVersion), func(t *testing.T) {
        runTest(tc.testData, tc.formatVersion)
same comment about passing t as a parameter.
format_major_version.go, line 66 at r11 (raw file):
FormatVersioned
// FormatSetWithDelete is a format major version that introduces a new key
// kind, base.InternalKeyKindSetWithDelete. Previous Pebble version will be
how about the following phrasing, since by definition they are not aware of this new key kind.
... versions will be unable to open this database.
Force-pushed d8ee6a9 to e1e7255.
Reviewed 1 of 17 files at r11.
Dismissed @sumeerbhola from 8 discussions.
Reviewable status: 15 of 25 files reviewed, 1 unresolved discussion (waiting on @nicktrav, @petermattis, and @sumeerbhola)
Previously, sumeerbhola wrote…
key
Done.
compaction_iter.go, line 517 at r11 (raw file):
Previously, sumeerbhola wrote…
supports
Done.
compaction_iter.go, line 522 at r11 (raw file):
Previously, sumeerbhola wrote…
Done. Merged the check with the other return-early check.
compaction_iter_test.go, line 86 at r11 (raw file):
Previously, sumeerbhola wrote…
Maybe reviewable is confused but looks like these existing testdata files like compaction_iter, manual_compaction, single_del_manual_compaction have been removed.
Good catch. That was silly on my part. Fixed.
compaction_iter_test.go, line 223 at r11 (raw file):
Previously, sumeerbhola wrote…
Fixed.
compaction_test.go, line 1130 at r11 (raw file):
Previously, sumeerbhola wrote…
how about keeping this randomization of version but changing it to opts.testingRandomized(minVersion, maxVersion) which represents the range [minVersion, maxVersion). Then change the testCases struct to have the min and max versions instead of the specific version. That way we get better coverage.
I updated this to pick a random version in a given range. I opted to do the randomization in the test itself, rather than modifying all existing callers of testingRandomized to accept a version. This seemed reasonable, given this is the only test that has the requirement for a range of values.
compaction_test.go, line 1328 at r11 (raw file):
Previously, sumeerbhola wrote…
same comment about passing `t` as a parameter.
Done.
format_major_version.go, line 66 at r11 (raw file):
Previously, sumeerbhola wrote…
how about the following phrasing, since by definition they are not aware of this new key kind.
... versions will be unable to open this database.
Done.
internal/base/internal.go, line 53 at r9 (raw file):
Previously, sumeerbhola wrote…
I think this needs to be 18, because the comment here says it is equal to the largest used key kind. Which means the other place I suggested using an array needs to use `bool[InternalKeyKindMax+1]`.
Unfortunately, doing that breaks things elsewhere. Specifically, the `internalKeyKindNames` slice definition below: as the key kind is used as an index into the slice, we can't re-use it.
One way to fix this is to just remove "MAX" from the slice - given it's not part of the file format, we shouldn't need to look it up. Making this change doesn't seem to break anything elsewhere. I've pushed this change.
There is still the test failure of `TestFixtureOutput` in `writer_fixture_test.go` that we need to look into. It looks like the existing `InternalKeyKindMax` has somehow leaked into the sstables that are generated as fixtures, so now that the max value has been incremented, it's breaking the generated output in the test.
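The "remove MAX from the names slice" fix discussed above can be sketched as follows. These are illustrative definitions, not Pebble's real ones: the key kind indexes into a names array, and keeping the MAX sentinel out of the array means bumping the largest kind never requires a placeholder name entry:

```go
package main

import "fmt"

// kind is a sketch of an internal key kind (not Pebble's actual enum).
type kind uint8

const (
	kindDel kind = iota
	kindSet
	kindSetWithDel
	// kindMax aliases the largest kind in use rather than being its own
	// value, so it does not need (or get) an entry in kindNames.
	kindMax = kindSetWithDel
)

// kindNames is indexed directly by kind, so its length must cover every
// kind that can appear in a file - hence kindMax+1.
var kindNames = [kindMax + 1]string{
	kindDel:        "DEL",
	kindSet:        "SET",
	kindSetWithDel: "SETWITHDEL",
}

func main() {
	fmt.Println(kindNames[kindSetWithDel])
}
```

With this layout, adding a new key kind means appending a constant, re-pointing the MAX alias, and adding one name entry; nothing indexes past the array.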
Force-pushed from e1e7255 to a31338b
Reviewed 1 of 17 files at r11, 9 of 9 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @petermattis and @sumeerbhola)
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set _at most once_. Undefined behavior results outside of this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a key has been Set at most once, violating the key assumption for SingleDeletes. Consider the following sequence (where `=>` implies "happens-before"):

```
a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4
```

If the middle two operations meet in a single compaction, the intermediate sequence would become:

```
a.SET.1 => a.SET.3 => a.SINGLEDEL.4
```

A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in cockroachdb#1252.

A solution, outlined in cockroachdb#1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction.
- The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set`.
- A `SET => SINGLEDEL` continues to cause both to be consumed.
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics.

This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions. This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version).

Addresses cockroachdb#1255, cockroachdb/cockroach#69891.
Force-pushed from a31338b to d27f1d7
TFTR! Also forgot to note that I did another two-hour burn of the meta tests and nothing cropped up.
Currently, SingleDelete is only guaranteed to behave correctly if the key has been Set at most once. Undefined behavior results outside of this requirement.

cockroachdb/cockroach#69891 identified a sequence of operations that can result in non-deterministic, undefined behavior, due to the way in which early intents are removed and cause the DB to "forget" whether a key has been Set at most once, violating the key assumption for SingleDeletes.

Consider the following sequence (where `=>` implies "happens-before"): `a.SET.1 => a.(DEL|SINGLEDEL).2 => a.SET.3 => a.SINGLEDEL.4`.

If the middle two operations meet in a single compaction, the intermediate sequence would become `a.SET.1 => a.SET.3 => a.SINGLEDEL.4`.

A second compaction of this intermediate sequence would result in just `a.SET.1`, as `a.SET.3` and `a.SINGLEDEL.4` will be elided. This incorrectly "resurrects" `a.SET.1`.

This undefined behavior was demonstrated in #1252.

A solution, outlined in #1255, works as follows:

- A new key kind, SetWithDelete, represents a Set that has potentially consumed a Delete or SingleDelete in a compaction.
- The existing Set key kind now represents a promise that it has not consumed a Delete or SingleDelete. This is what will be written when a caller uses `pebble.Writer.Set`.
- A `SET => SINGLEDEL` continues to cause both to be consumed.
- A `SETWITHDEL => SINGLEDEL` is "upgraded" into a regular `DEL`, reflecting the fact that there may be deleted entries under the SetWithDelete that should not be resurrected.

This patch introduces the new key kind and implements the logic described above, required for writing out these new keys during a compaction to improve the determinism of the `SET => SINGLEDEL` semantics.

This new record type is gated behind a format major version, to ensure that older DBs continue to use the existing semantics, while newer DBs will make use of the new semantics during compactions.

This change is one-way (i.e. once migrated, an older version of the DB cannot read sstables written by a newer version).

Addresses #1255, cockroachdb/cockroach#69891.

TODOs:
- Enhance meta tests to exercise SETWITHDEL (deferred to WIP - meta: add sequence generator; add SET -> DEL -> SINGLEDEL sequence #1254)
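The SINGLEDEL resolution rules described in this PR can be sketched as a small decision function. This is a standalone illustration under assumed names; Pebble's real logic lives inside its compaction iterator and handles stripes, snapshots, and range deletions that this sketch ignores:

```go
package main

import "fmt"

// keyKind is a sketch of an internal key kind; the values are
// illustrative, not Pebble's actual on-disk enum values.
type keyKind int

const (
	kindDelete keyKind = iota
	kindSingleDelete
	kindSet
	kindSetWithDel
)

// resolveSingleDelete models what happens when a SINGLEDEL meets the
// next record for the same user key, per the rules in this PR:
//   - SINGLEDEL over SET: both are consumed, nothing is emitted.
//   - SINGLEDEL over SETWITHDEL: upgraded to a regular DEL, since the
//     SETWITHDEL may hide older entries that must not be resurrected.
//   - Otherwise the SINGLEDEL is emitted unchanged.
// It returns the record to emit and whether anything is emitted.
func resolveSingleDelete(next keyKind) (emit keyKind, ok bool) {
	switch next {
	case kindSet:
		return 0, false // both consumed
	case kindSetWithDel:
		return kindDelete, true // upgraded to DEL
	default:
		return kindSingleDelete, true
	}
}

func main() {
	if _, ok := resolveSingleDelete(kindSet); !ok {
		fmt.Println("SINGLEDEL over SET: both consumed")
	}
	if emit, ok := resolveSingleDelete(kindSetWithDel); ok && emit == kindDelete {
		fmt.Println("SINGLEDEL over SETWITHDEL: upgraded to DEL")
	}
}
```

The upgrade-to-DEL branch is the heart of the fix: a plain DEL descends through older versions, so a key the SetWithDelete was covering can no longer be resurrected by a later compaction.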