Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: use SetOptions() when reusing Pebble iterators #79291

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Apr 3, 2022

Note to reviewers: this is for the mvcc-range-tombstones feature branch.


storage: remove MVCCIterator.SetUpperBound and tweak ClearIterRange

This patch removes the MVCCIterator.SetUpperBound() method, which was
only used by MVCCIterator.ClearIterRange().

ClearIterRange now also creates the iterator itself, instead of having
it passed in. This ensures the iterator is created with appropriate
bounds and options, which becomes even more important with the
introduction of range keys. If further flexibility is needed, we can
consider passing an IterOptions struct instead of a key span.

Release note: None

storage: handle multiple iterators in pebbleReadOnly and pebbleBatch

Previously, pebbleReadOnly and pebbleBatch only supported a single
concurrent iterator of a given type. Attempts to create an additional
iterator would panic with "iterator already in use".

This is particularly problematic because a subsequent change will make
more iterators reusable (specifically TBI iterators), such that e.g.
MVCCIncrementalIterator will be creating two concurrent reusable
iterators and trigger this panic.

This patch supports multiple iterators by creating new cloned iterators
if the existing iterator is in use. This preserves the pinned engine
state for the cloned iterators too.

Release note: None

storage: use SetOptions() when reusing Pebble iterators

Iterator reuse now relies on Pebble.SetOptions() to configure the
reused Pebble iterator. This allows a wider range of iterators to be
reused, since previously only the bounds could be changed on existing
iterators.

Release note: None

Jira issue: CRDB-14701

@erikgrinaker erikgrinaker self-assigned this Apr 3, 2022
@erikgrinaker erikgrinaker requested review from a team as code owners April 3, 2022 15:43
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker changed the title storage: use SetOptions() when reusing Pebble storage: use SetOptions() when reusing Pebble iterators Apr 3, 2022
@erikgrinaker
Copy link
Contributor Author

@jbowens Note the removed pebbleIterator.setBounds() optimization that avoided calling Iterator.SetBounds() if the bounds did not change. I think we'll need to move this kind of optimization down into SetOptions() -- some options changes may not need the iterator to be repositioned or invalidated, but only Pebble can control that since we're not able to change individual options.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-setoptions branch 2 times, most recently from de70b32 to fb4c0b0 Compare April 4, 2022 09:13
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from 34e6ab0 to 4798ae2 Compare April 4, 2022 13:43
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-setoptions branch 3 times, most recently from a797bea to 0289491 Compare April 4, 2022 19:48
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Apr 4, 2022

@jbowens I'm getting a test failure here (TestEngineTimeBound) that I think is caused by SetOptions() not replacing TableFilter. I'm essentially creating an iterator with a TableFilter set, and then cloning this iterator and calling SetOptions() on it with a new TableFilter. But it seems like the old TableFilter closure gets called on the cloned iterator. Does that check out?

Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a pass and :lgtm:. Would be good for either Jackson or Sumeer to give it a thumbs-up too.

Reviewed 16 of 16 files at r1, 3 of 3 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @sumeerbhola)

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbowens Note the removed pebbleIterator.setBounds() optimization that avoided calling Iterator.SetBounds() if the bounds did not change. I think we'll need to move this kind of optimization down into SetOptions() -- some options changes may not need the iterator to be repositioned or invalidated, but only Pebble can control that since we're not able to change individual options.

All changes require the Iterator to be repositioned (including changes to masking, key types, bounds). I don't think it's possible to move this optimization down into Pebble. Check out this comment: https://github.com/cockroachdb/pebble/blob/c8e7d4149ada9f4a06d8ae44b04b25618019bc4e/iterator.go#L1672-L1681

@jbowens I'm getting a test failure here (TestEngineTimeBound) that I think is caused by SetOptions() not replacing TableFilter. I'm essentially creating an iterator with a TableFilter set, and then cloning this iterator and calling SetOptions() on it with a new TableFilter. But it seems like the old TableFilter closure gets called on the cloned iterator. Does that check out?

Sorry, this is my mistake. I overlooked this. Filed (cockroachdb/pebble#1621) I'm not sure what we can do for the TableFilter since it's just a closure, and we cannot determine whether or not the closure changed. We can (and should) detect whether the block property filters changed. Now that we have block property filters in 22.1, do we need to use TableFilter? Block property filters are strictly more powerful and also allow filtering out whole tables. Maybe we can't because sstables were created in 21.2 that still exist in 22.2 would not be subject to filtering, since they wouldn't have block property annotations.

Reviewed 1 of 16 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911 and @sumeerbhola)

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes require the Iterator to be repositioned (including changes to masking, key types, bounds). I don't think it's possible to move this optimization down into Pebble. Check out this comment: https://github.com/cockroachdb/pebble/blob/c8e7d4149ada9f4a06d8ae44b04b25618019bc4e/iterator.go#L1672-L1681

I see, thanks. I'll run some preliminary benchmarks to see what effect this has, and what can be done about it.

Sorry, this is my mistake. I overlooked this. Filed (cockroachdb/pebble#1621) I'm not sure what we can do for the TableFilter since it's just a closure, and we cannot determine whether or not the closure changed.

Why do we need to compare? Can't we unconditionally replace the existing table filter? If the concern is tearing down the iterator stacks then I suppose a nil check would suffice? In the majority of cases this will be nil.

Now that we have block property filters in 22.1, do we need to use TableFilter? Block property filters are strictly more powerful and also allow filtering out whole tables. Maybe we can't because sstables were created in 21.2 that still exist in 22.2 would not be subject to filtering, since they wouldn't have block property annotations.

I suppose the only option here would be a migration that rewrites the properties. I guess it's too late in the game to get that into 22.1? Seems unfortunate to have to maintain different code paths in 22.2 for this.

Worst case, we can just prevent reuse of time-bound iterators -- that's what we do today anyway.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911 and @sumeerbhola)

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Love the code deletion. I'm not sure how important the set bounds optimization is, but we should make sure to circle back on that (thanks for recording the TODO).

If the concern is tearing down the iterator stacks then I suppose a nil check would suffice? In the majority of cases this will be nil.

Sounds good. SetOptions will now [just merged in Pebble] replace the iterator stack if filters are non-nil before or after the SetOptions call.

I suppose the only option here would be a migration that rewrites the properties. I guess it's too late in the game to get that into 22.1? Seems unfortunate to have to maintain different code paths in 22.2 for this.

Yeah, a migration would also be intense. Block property filters are new in 22.1, so this migration would effectively need to rewrite every sstable in the store. I don't think we'd want to block 22.1 cluster version finalization on that, in which case we'd need to support the different code path in 22.2 anyways. We could do the blocking migration in 22.2 with the expectation that most of the sstables within a store should've been written by 22.1. A quick 21.2->22.1->22.2 upgrade sequence would still need to rewrite almost the entirety of its stores though.

Reviewed 3 of 16 files at r1, 1 of 3 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @aliher1911 and @sumeerbhola)

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones branch from 4798ae2 to 599fe1b Compare April 7, 2022 09:19
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-setoptions branch from 0289491 to fc07f22 Compare April 7, 2022 09:57
This patch removes the `MVCCIterator.SetUpperBound()` method, which was
only used by `MVCCIterator.ClearIterRange()`.

`ClearIterRange` now also creates the iterator itself, instead of having
it passed in. This ensures the iterator is created with appropriate
bounds and options, which becomes even more important with the
introduction of range keys. If further flexibility is needed, we can
consider passing an `IterOptions` struct instead of a key span.

Release note: None
Previously, `pebbleReadOnly` and `pebbleBatch` only supported a single
concurrent iterator of a given type. Attempts to create an additional
iterator would panic with "iterator already in use".

This is particularly problematic because a subsequent change will make
more iterators reusable (specifically TBI iterators), such that e.g.
`MVCCIncrementalIterator` will be creating two concurrent reusable
iterators and trigger this panic.

This patch supports multiple iterators by creating new cloned iterators
if the existing iterator is in use. This preserves the pinned engine
state for the cloned iterators too.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-setoptions branch from fc07f22 to 827766a Compare April 7, 2022 11:16
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the concern is tearing down the iterator stacks then I suppose a nil check would suffice? In the majority of cases this will be nil.

Sounds good. SetOptions will now [just merged in Pebble] replace the iterator stack if filters are non-nil before or after the SetOptions call.

Awesome, thanks for the quick turnaround! Tests pass now, but I'm going to run a few benchmarks before merging.

Yeah, a migration would also be intense. Block property filters are new in 22.1, so this migration would effectively need to rewrite every sstable in the store. I don't think we'd want to block 22.1 cluster version finalization on that, in which case we'd need to support the different code path in 22.2 anyways. We could do the blocking migration in 22.2 with the expectation that most of the sstables within a store should've been written by 22.1. A quick 21.2->22.1->22.2 upgrade sequence would still need to rewrite almost the entirety of its stores though.

Good point, probably better to wait for 22.2 (or later). This will work for now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aliher1911, @jbowens, @nicktrav, and @sumeerbhola)

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to run a few benchmarks before merging.

We're taking an ~8% hit on MVCC gets, using a batch which clones the batch iterator on every get. This is almost entirely due to the removed setOptions() optimization. It's probably more than we can get away with merging to master, so I'll see if I can claw it back somehow.

name                                        old time/op    new time/op    delta
MVCCGet_Pebble/versions=1/valueSize=8-24      2.60µs ± 1%    2.83µs ± 1%  +8.91%  (p=0.000 n=9+10)
MVCCGet_Pebble/versions=10/valueSize=8-24     3.90µs ± 0%    4.23µs ± 1%  +8.46%  (p=0.000 n=7+9)
MVCCGet_Pebble/versions=100/valueSize=8-24    10.2µs ± 4%    10.5µs ± 2%  +3.33%  (p=0.003 n=10+10)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aliher1911, @jbowens, @nicktrav, and @sumeerbhola)

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-setoptions branch from 827766a to f6068b7 Compare April 7, 2022 19:53
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extended the setBounds() check to cover all options, which didn't turn out too bad. Got the performance diff down to about 2%:

name                                        old time/op    new time/op    delta
MVCCGet_Pebble/versions=1/valueSize=8-24      2.59µs ± 1%    2.63µs ± 1%  +1.52%  (p=0.000 n=18+20)
MVCCGet_Pebble/versions=10/valueSize=8-24     3.91µs ± 1%    4.00µs ± 1%  +2.42%  (p=0.000 n=18+19)
MVCCGet_Pebble/versions=100/valueSize=8-24    9.77µs ± 5%    9.95µs ± 5%  +1.78%  (p=0.012 n=19+19)

I think that's acceptable -- this is in the tens of nanoseconds, so I don't think it will show up on end-to-end benchmarks. We'll have to do another optimization pass once MVCC range tombstones are fully implemented anyway.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aliher1911, @jbowens, @nicktrav, and @sumeerbhola)

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

:LGTM:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, @nicktrav, and @sumeerbhola)

Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the latest set of changes - :lgtm:

Reviewed 3 of 8 files at r4, 1 of 3 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @sumeerbhola)


pkg/storage/pebble_iterator.go, line 149 at r6 (raw file):

	// We don't generate a pebble.Options for comparison, because we want to reuse
	// the byte slices of the existing p.options for any key encoding.
	optsChanged := p.options.OnlyReadGuaranteedDurable != (durability == GuaranteedDurability) ||

Is there a risk of introducing "brittleness" here in that we add some new options and they aren't taken into account here? Given this is just a performance optimization rather than a correctness issue, it's probably fine, just wanted to mention it.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-setoptions branch from f6068b7 to b6ac05e Compare April 7, 2022 21:18
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closed the performance gap by rearranging the predicates.

name                                        old time/op    new time/op    delta
MVCCGet_Pebble/versions=1/valueSize=8-24      2.59µs ± 1%    2.60µs ± 1%  +0.42%  (p=0.013 n=18+19)
MVCCGet_Pebble/versions=10/valueSize=8-24     3.91µs ± 1%    3.91µs ± 1%    ~     (p=0.597 n=18+20)
MVCCGet_Pebble/versions=100/valueSize=8-24    9.77µs ± 5%    9.41µs ± 4%  -3.77%  (p=0.000 n=19+20)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aliher1911, @nicktrav, and @sumeerbhola)


pkg/storage/pebble_iterator.go, line 149 at r6 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Is there a risk of introducing "brittleness" here in that we add some new options and they aren't taken into account here? Given this is just a performance optimization rather than a correctness issue, it's probably fine, just wanted to mention it.

Yes, that's a very real risk here. And it's true that this is "just" a performance optimization, but forgetting to handle a new option here can cause that option to not take effect if we take the fast path.

However, every option that's passed down to a Pebble iterator must have logic in this function to translate from a CRDB IterOptions to a Pebble IterOptions, so the hope is that someone adding logic for an option here will see this condition and update it. That's probably sufficient?

Iterator reuse now relies on `Pebble.SetOptions()` to configure the
reused Pebble iterator. This allows a wider range of iterators to be
reused, since previously only the bounds could be changed on existing
iterators.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-setoptions branch from b6ac05e to b5e4c15 Compare April 7, 2022 21:42
Copy link
Collaborator

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @aliher1911, @erikgrinaker, and @sumeerbhola)


pkg/storage/pebble_iterator.go, line 149 at r6 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yes, that's a very real risk here. And it's true that this is "just" a performance optimization, but forgetting to handle a new option here can cause that option to not take effect if we take the fast path.

However, every option that's passed down to a Pebble iterator must have logic in this function to translate from a CRDB IterOptions to a Pebble IterOptions, so the hope is that someone adding logic for an option here will see this condition and update it. That's probably sufficient?

👍

@erikgrinaker erikgrinaker merged commit 854b26f into cockroachdb:mvcc-range-tombstones Apr 7, 2022
@erikgrinaker erikgrinaker deleted the mvcc-range-tombstones-setoptions branch April 10, 2022 10:43
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


-- commits, line 25 at r5:
Can you elaborate on this statement about MVCCIncrementalIterator? Won't we continue to have one tbi that sees both point and range keys and another regular iter that also sees both kinds of keys?

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


-- commits, line 25 at r5:

Previously, sumeerbhola wrote…

Can you elaborate on this statement about MVCCIncrementalIterator? Won't we continue to have one tbi that sees both point and range keys and another regular iter that also sees both kinds of keys?

Yes, that's right. There will be no changes to MVCCIncrementalIterator in that regard.

This comment refers to the subsequent SetOptions() commit in this PR. Previously, TBIs were not reusable because we had no way to change the TableFilter on an existing iterator. But now that we have SetOptions() we can reuse iterators for TBIs.

However, pebbleBatch and pebbleReadOnly did not support creating multiple concurrent reusable iterators, and would panic with "iterator already in use" because of this check here:

https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble_batch.go#L231-L233

MVCCIncrementalIterator needs to create two concurrent iterators (one regular and one TBI). This used to work as a side-effect of TBIs not being reusable, since it bypassed that check in pebbleBatch and created an entirely new iterator. But now that TBIs are reusable, it triggers that assertion.

This patch removes this artificial restriction of only allowing a single concurrent iterator, by cloning a new iterator if the primary one is already in use.

craig bot pushed a commit that referenced this pull request May 16, 2022
80991: backupccl: Cosmetic changes to backup_processor.go to improve clarity. r=benbardin a=benbardin

Release note: None

Renames a few things, and moves the sstsink into its own file.

81141: ui: add TTL metrics dashboard r=rafiss,matthewtodd a=otan

Release note: None

example:
<img width="981" alt="image" src="https://user-images.githubusercontent.com/3646147/167403258-c3d546ab-a6af-4088-a090-87cd0ed8bbe7.png">


81194: storage: remove `Engine.SetUpperBound` r=jbowens a=erikgrinaker

Partial resubmit of #79291, which was merged into the `mvcc-range-tombstones` branch.

---

**storage: make `Engine.ClearIterRange` construct iterator internally**

`Engine.ClearIterRange` now creates the iterator itself, instead of
having it passed in. This ensures the iterator is created with
appropriate bounds and options, which becomes even more important with
the introduction of range keys. It will also allow removal of
`Engine.SetUpperBound` in a subsequent commit.

If further flexibility is needed, we can consider passing an
`IterOptions` struct with appropriate bounds instead of a key span.

Release note: None

**storage: remove `Engine.SetUpperBound`**

This patch removes the `Engine.SetUpperBound` method, as it is no longer
in use.

Release note: None

81195: storage: allow multiple iterators in `pebbleReadOnly` and `pebbleBatch` r=jbowens a=erikgrinaker

Partial resubmit of #79291, which was merged into the mvcc-range-tombstones branch.

---

Previously, `pebbleReadOnly` and `pebbleBatch` only supported a single
concurrent iterator of a given type, as this iterator was reused between
`NewMVCCIterator` calls. Attempts to create an additional iterator would
panic with "iterator already in use".

Time-bound iterators were excluded from this reuse, and always created
from scratch, so they sidestepped this limitation. However, we will soon
make these iterators reusable too via a new `SetOptions()` method in
Pebble. When TBIs also become reusable, this will cause
`MVCCIncrementalIterator` to panic with "iterator already in use",
because it always creates two concurrent iterators: one regular and one
time-bound.

This patch therefore ensures all `Reader` implementations support
multiple concurrent iterators, by cloning the existing cached iterator
if it is already in use. This also preserves the pinned engine state for
the new iterator.

Release note: None

81295: outliers: introduce a composite "any" detector r=matthewtodd a=matthewtodd

This gives us a place to hang the upcoming histogram-based detector
coming in #79451.

Release note: None

Co-authored-by: Ben Bardin <bardin@cockroachlabs.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants