
Add new compaction abstraction, simulator, and implementation. #5234

Closed
hlinnaka wants to merge 1 commit

Conversation

@hlinnaka (Contributor) commented Sep 7, 2023

This consists of three parts:

  1. A refactoring and new contract for implementing and testing compaction.

The logic is now in a separate crate, with no dependency on the 'pageserver' crate. It defines an interface that the real pageserver must implement, in order to call the compaction algorithm. The interface models things like delta and image layers, but just the parts that the compaction algorithm needs to make decisions. That makes it easier to unit test the algorithm and experiment with different implementations.
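
For illustration, here is a hedged sketch of what such an interface might look like. The names below are assumptions for exposition (the review discussion mentions a CompactionJobExecutor trait and get_layers, create_delta and create_image methods); the exact API in the crate may differ.

    // Hypothetical sketch of the compaction interface; the real trait
    // and method names in the compaction crate may differ.
    use std::ops::Range;

    pub type Key = u64; // stand-in for the pageserver's key type
    pub type Lsn = u64; // stand-in for the pageserver's LSN type

    /// A layer, as far as the compaction algorithm cares: just its
    /// "shape" (key range, LSN range, size), not its contents.
    pub trait CompactionLayer {
        fn key_range(&self) -> Range<Key>;
        fn lsn_range(&self) -> Range<Lsn>;
        fn file_size(&self) -> u64;
        fn is_delta(&self) -> bool;
    }

    /// Implemented by the real pageserver (and by the simulator), so
    /// that the algorithm can list and create layers without any
    /// dependency on the 'pageserver' crate.
    pub trait CompactionJobExecutor {
        type Layer: CompactionLayer;

        /// List the layers overlapping the given key and LSN ranges.
        fn get_layers(&self, key: &Range<Key>, lsn: &Range<Lsn>) -> Vec<Self::Layer>;

        /// Create a new delta layer covering the given ranges.
        fn create_delta(&mut self, key: &Range<Key>, lsn: &Range<Lsn>);

        /// Create a new image layer at the given LSN.
        fn create_image(&mut self, key: &Range<Key>, lsn: Lsn);

        /// Delete a layer once no pending operations depend on it.
        fn delete_layer(&mut self, layer: &Self::Layer);
    }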

I did not convert the current code to the new abstraction, however. When the compaction algorithm is set to "Legacy", we just use the old code. It might be worthwhile to convert the old code to the new abstraction, so that we can compare the behavior of the new algorithm against the old one, using the same simulated cases. If we do that, we have to be careful that the converted code really is equivalent to the old one.

This includes only trivial changes to the main pageserver code. All the new code is behind a tenant config option. So this should be pretty safe to merge, even if the new implementation is buggy, as long as we don't enable it.

  2. A new compaction algorithm, implemented using the new abstraction.

The new algorithm is tiered compaction. It is inspired by the PoC at PR #4539, although I did not use that code directly, as I needed the new implementation to fit the new abstraction. The algorithm here is less advanced; I did not implement partial image layers, for example. I wanted to keep it simple on purpose, so that as we add bells and whistles, we can see the effects using the included simulator.

One difference from #4539 and your typical LSM tree implementations is how we keep track of the LSM tree levels. This PR doesn't have a permanent concept of a level, tier or sorted run at all. There are just delta and image layers. However, when compaction starts, we look at the layers that exist, and arrange them into levels, depending on their shapes. That is ephemeral: when the compaction finishes, we forget that information. This allows the new algorithm to work without any extra bookkeeping. That makes it easier to transition from the old algorithm to the new one, and back again.
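
To make the ephemeral-levels idea concrete, here is a rough, hypothetical sketch of grouping existing layers into levels by shape at the start of a compaction run, building on the interface sketch above; the PR's real identify_level logic is more careful, and the names here are assumptions.

    // Hypothetical sketch: arrange existing layers into ephemeral
    // levels by their LSN ranges. The real logic is more involved.
    pub struct Level<L> {
        pub layers: Vec<L>,
    }

    pub fn identify_levels<L: CompactionLayer>(mut layers: Vec<L>) -> Vec<Level<L>> {
        // Sort newest first, then greedily group layers whose LSN
        // ranges line up into the same level (the same sorted run).
        layers.sort_by_key(|l| std::cmp::Reverse(l.lsn_range().end));
        let mut levels: Vec<Level<L>> = Vec::new();
        for layer in layers {
            let starts_new_level = match levels.last() {
                Some(level) => level.layers[0].lsn_range() != layer.lsn_range(),
                None => true,
            };
            if starts_new_level {
                levels.push(Level { layers: Vec::new() });
            }
            levels.last_mut().unwrap().layers.push(layer);
        }
        // The grouping is ephemeral: it is forgotten when compaction
        // finishes, so no persistent level/tier bookkeeping is kept.
        levels
    }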

There is just a new tenant config option to choose the compaction algorithm. The default is "Legacy", meaning the current algorithm in 'main'. If you set it to "Tiered", the new algorithm is used.
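
As a hypothetical sketch of the shape of that knob (the real type and variant names in the tenant config code may differ):

    // Hypothetical sketch of the new tenant config option; the actual
    // representation in the pageserver config may differ.
    #[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
    pub enum CompactionAlgorithm {
        /// The current algorithm in 'main' (the default).
        #[default]
        Legacy,
        /// The new tiered compaction algorithm from this PR.
        Tiered,
    }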

  3. A simulator, which implements the new abstraction.

The simulator can be used to analyze write and storage amplification, without running a test with the full pageserver. It can also draw an SVG animation of the simulation, to visualize how layers are created and deleted.

To run the simulator:

./target/debug/compaction-simulator run-suite
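
To give a feel for how the simulator side can implement the same interface, here is a minimal, hypothetical sketch (reusing the interface sketch above); the real simulator in pageserver/compaction/src/simulator.rs tracks much more and also produces the SVG animation.

    // Hypothetical simulated layer: no real data, only its shape.
    #[derive(Clone)]
    pub struct SimLayer {
        pub key_range: Range<Key>,
        pub lsn_range: Range<Lsn>,
        pub file_size: u64,
        pub is_delta: bool,
    }

    impl CompactionLayer for SimLayer {
        fn key_range(&self) -> Range<Key> { self.key_range.clone() }
        fn lsn_range(&self) -> Range<Lsn> { self.lsn_range.clone() }
        fn file_size(&self) -> u64 { self.file_size }
        fn is_delta(&self) -> bool { self.is_delta }
    }

    // The simulator can accumulate bytes written by compaction and
    // report write amplification: total bytes written (ingest plus
    // compaction) per byte of user data ingested.
    pub struct SimStats {
        pub user_bytes: u64,
        pub compaction_bytes: u64,
    }

    impl SimStats {
        pub fn write_amplification(&self) -> f64 {
            (self.user_bytes + self.compaction_bytes) as f64 / self.user_bytes as f64
        }
    }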

@hlinnaka hlinnaka requested a review from koivunej September 7, 2023 14:19
@hlinnaka hlinnaka requested review from a team as code owners September 7, 2023 14:19
@hlinnaka hlinnaka requested review from save-buffer and removed request for a team September 7, 2023 14:19
@hlinnaka (Contributor, Author) commented Sep 7, 2023

Here's a screen capture of what the simulator output looks like

simplescreenrecorder-2023-09-07_16.48.35.mp4

@github-actions bot commented Sep 7, 2023

2358 tests run: 2243 passed, 0 failed, 115 skipped (full report)


Code coverage (full report)

  • functions: 53.3% (8908 of 16701 functions)
  • lines: 79.4% (51442 of 64804 lines)

The comment gets automatically updated with the latest test results
42d7299 at 2023-11-07T09:29:32.418Z :recycle:

@hlinnaka hlinnaka force-pushed the compaction-simulator-tiered branch from a7603ad to edf146c Compare September 8, 2023 00:39
@jcsp (Collaborator) commented Sep 8, 2023

I really like this.

Thinking forward, pluggable compaction will play well with a couple of other places where I'm thinking about adding "special" compactions in the future:

  • Sharding splits, where we might want to break up the keyspace a particular way when splitting up workloads
  • When we hibernate idle timelines, it might make sense to compact small databases down to a single compressed object for efficiency in long term storage.

Other thoughts from reading this:

  • The existence of the Timeline::compact_tiered method suggests that maybe the CompactionJobExecutor trait should expose a little bit more information so that compact_tiered::compact_tiered can do the whole job, rather than having the very early steps in compaction live in Timeline. Maybe the compaction interface needs a concept of a prepare phase and an execute phase? This might become clearer when porting existing compaction to the new interface, if both impls end up with the same L0 depth check at the start.
  • Ideally, compaction impls should have their own unit tests, maybe using the simulator's recorded outputs will be handy for creating regression cases for that. We should make sure that the interface we're defining is sufficient to enable that testing (it may well already be so, it's not always obvious until we actually write the tests).
  • The simulator is good for thinking about algorithmic complexity. I'm also keen to get a handle on the empirical performance aspect, so hopefully we can do a similar thing in future that runs with real layer types doing real I/O. It could make a really nice basis for "this workload should do no more than N I/Os" type regression tests.

@hlinnaka (Contributor, Author) commented Sep 8, 2023

Thinking forward, pluggable compaction will play well with a couple of other places I'm thinking about adding "special" compactions in future:

  • Sharding splits, where we might want to break up the keyspace a particular way when splitting up workloads

  • When we hibernate idle timelines, it might make sense to compact small databases down to a single compressed object for efficiency in long term storage.

Makes sense

Other thoughts from reading this:

  • The existence of the Timeline::compact_tiered method suggests that maybe the CompactionJobExecutor trait should expose a little bit more information so that compact_tiered::compact_tiered can do the whole job, rather than having the very early steps in compaction live in Timeline.

Hmm, yeah, I see what you're saying. There isn't much in Timeline::compact_tiered, though, and I think the steps there would apply to any compaction algorithm that uses the new interface. So perhaps it needs to be renamed to Timeline::compact_with_new_interface, or simply Timeline::compact.

Maybe the compaction interface needs a concept of a prepare phase and an execute phase? This might become clearer when porting existing compaction to the new interface, if both impls end up with the same L0 depth check at the start.

The L0 depth check is really just an optimization, to avoid calling the actual compaction when we know there's nothing to do. The compaction algorithm itself would reach the same conclusion and do nothing. It might be a premature optimization, but we need to check the L0 layers anyway to find the top of the tree, i.e. the LSN of the newest L0. Or we could track that more explicitly.
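
A minimal sketch of the kind of early-exit check being discussed, using the hypothetical types sketched in the PR description above (the real check lives in Timeline in the pageserver):

    // Hypothetical early-exit: skip compaction when there are too few
    // L0 layers to be worth it. The algorithm itself would reach the
    // same "nothing to do" conclusion; this just avoids invoking it.
    // Returns the LSN of the newest L0 (the "top of the tree"), which
    // we have to compute anyway, or None if there is nothing to do.
    pub fn l0_check(l0_layers: &[SimLayer], threshold: usize) -> Option<Lsn> {
        if l0_layers.len() < threshold {
            return None;
        }
        l0_layers.iter().map(|l| l.lsn_range.end).max()
    }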

  • Ideally, compaction impls should have their own unit tests, maybe using the simulator's recorded outputs will be handy for creating regression cases for that. We should make sure that the interface we're defining is sufficient to enable that testing (it may well already be so, it's not always obvious until we actually write the tests).

+1. I didn't include any unit tests for the algorithm here, but yeah we should have them.

  • The simulator is good for thinking about algorithmic complexity, I'm also keen to get a handle on the empirical performance aspect, so hopefully we can do a similar thing in future that runs with real layers types doing real I/O. It could make a really nice regression test for imposing a "this workload should do more than N I/Os" type regression tests.

+1. @bojanserafimov had similar thoughts at #4411. Some thoughts on this:

  • It's pretty straightforward to extract the keys, LSNs and record lengths from existing layers. There would be no real database data in such a dump, so we could extract such dumps from real databases and play with them pretty freely (see the sketch after this list).

  • We could also extract such a dump from the original WAL in S3. The original WAL needs some processing to turn it into key+LSN+length format, because one PostgreSQL WAL record can become multiple records in the storage, and the mapping from relfilenode, block number etc. to the storage Key is a little complicated. But it would be possible to write such a tool.

  • The compaction algorithm also needs the "KeySpace", i.e. the information of which keys exist at a given LSN. That information is stored in some special key-value pairs, and it would need some extra work to reconstruct it from the WAL, or to extract it from existing layer files. But it's also doable.
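
As a sketch of the dump format these bullets describe (hypothetical names; only keys, LSNs, and lengths, no page contents):

    // Hypothetical record in a compaction-workload dump: enough to
    // replay layer creation in the simulator, with no real page data.
    #[derive(Clone, Copy, Debug)]
    pub struct DumpRecord {
        pub key: Key, // storage key (derived from relfilenode, block number, etc.)
        pub lsn: Lsn, // LSN of the record
        pub len: u32, // length of the record in the storage format
    }

Replaying such a dump against the simulator would mean ingesting each record in LSN order, periodically running compaction, and reporting the resulting write and storage amplification.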

@hlinnaka hlinnaka force-pushed the compaction-simulator-tiered branch 3 times, most recently from 1f6dffe to 963dd68 Compare September 11, 2023 15:24
@hlinnaka (Contributor, Author) commented Sep 11, 2023

Other thoughts from reading this:

  • The existence of the Timeline::compact_tiered method suggests that maybe the CompactionJobExecutor trait should expose a little bit more information so that compact_tiered::compact_tiered can do the whole job, rather than having the very early steps in compaction live in Timeline.

Hmm, yeah, I see what you're saying. There isn't much in Timeline::compact_tiered, though, and I think the steps there would apply to any compaction algorithm that uses the new interface. So perhaps it needs to be renamed to Timeline::compact_with_new_interface, or simply Timeline::compact.

I did some renaming and added a few comments to address this: e52cc06

Maybe the compaction interface needs a concept of a prepare phase and an execute phase? This might become clearer when porting existing compaction to the new interface, if both impls end up with the same L0 depth check at the start.

The L0 depth check is really just an optimization, to avoid calling the actual compaction when we know there's nothing to do. The compaction algorithm itself would reach the same conclusion and do nothing. It might be a premature optimization, but we need to check the L0 layers anyway to find the top of the tree, i.e. the LSN of the newest L0. Or we could track that more explicitly.

Added a comment about that too.

  • Ideally, compaction impls should have their own unit tests, maybe using the simulator's recorded outputs will be handy for creating regression cases for that. We should make sure that the interface we're defining is sufficient to enable that testing (it may well already be so, it's not always obvious until we actually write the tests).

+1. I didn't include any unit tests for the algorithm here, but yeah we should have them.

Added some unit tests for the identify_level function with hand-crafted sets of input layers. That was pretty straightforward. We could use more unit tests for the various subroutines of compact_level.
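
For illustration, a test in that style might look like the hedged sketch below, reusing the hypothetical identify_levels and SimLayer sketches from earlier in this thread; the real tests in the PR use the crate's actual types.

    #[cfg(test)]
    mod tests {
        use super::*;

        fn layer(keys: Range<Key>, lsns: Range<Lsn>) -> SimLayer {
            SimLayer {
                key_range: keys,
                lsn_range: lsns,
                file_size: 1024,
                is_delta: true,
            }
        }

        #[test]
        fn two_l0s_with_same_lsn_range_form_one_level() {
            // Two hand-crafted layers with the same LSN range should be
            // grouped into a single ephemeral level.
            let layers = vec![layer(0..100, 10..20), layer(100..200, 10..20)];
            let levels = identify_levels(layers);
            assert_eq!(levels.len(), 1);
            assert_eq!(levels[0].layers.len(), 2);
        }
    }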

@hlinnaka hlinnaka force-pushed the compaction-simulator-tiered branch 3 times, most recently from d8ef68c to 8da43c3 Compare September 13, 2023 06:36
@hlinnaka hlinnaka force-pushed the compaction-simulator-tiered branch from 8da43c3 to ee232ae Compare September 19, 2023 08:03
@hlinnaka (Contributor, Author) commented:

I'd like to move this forward. For review, I think there are two criteria:

  1. Is this safe to merge now?

    The new algorithm is added behind a new tenant config option, and is disabled by default. It should have no effect unless you explicitly enable it. Please review the small changes to existing code to make sure that really is the case.

  2. Is this the general direction we want to go?

    The new tiered implementation isn't perfect by any means; there's a lot more we could and should do. There are a bunch of TODO and FIXME comments about those things. At least some of them need to be addressed, and more testing needs to be performed, before we change the default. But I'd like to get this framework committed first; those things can be addressed in follow-up PRs.

    I got some positive feedback from @jcsp and @bojanserafimov already. Any objections, any competing designs?

Please review, and indicate in your review comment which angle (or both) you reviewed it from.

@hlinnaka hlinnaka force-pushed the compaction-simulator-tiered branch from fc501a6 to 41025a0 Compare September 19, 2023 09:35
@bojanserafimov (Contributor) left a comment:


+1 on the direction (the intent of separating the logic from everything else so it can be simulated and tested)

The API itself doesn't look like the one we'll stick with (get_layers looks inefficient), but it's easy to add new methods to the API until we have something efficient. If you start addressing things like this now, the PR will explode in size, so it's fine to leave it.

I haven't reviewed whether this is safe to merge, but there's no way to give you a half-review, so I'll accept it and let you get that piece of feedback from the storage team. Let's verify that it's equivalent to the current implementation in both behavior and performance.

/// key, we still have too many records to fit in the target file size. We need to
/// split in the LSN dimension too in that case.
///
/// TODO: The code to avoid this problem has not been implemented yet! So the
Member commented:

You probably want to merge this with the TODO unaddressed, but when do we want to address this? Before or after we activate the new compaction?

Contributor (Author) commented:

Yes, this needs to be addressed before we can switch to the new compaction. Otherwise, we can end up with a layer file that's too large to be uploaded to S3 without multipart upload, so the upload will fail, and the timeline will be stuck trying to upload it. We had that problem before...

@koivunej (Member) commented Nov 10, 2023:

Semi-related: I must've accidentally removed the size check with the new layer impl, as I cannot see this PR removing it either. Nope, at least the warning is still in place: https://github.com/neondatabase/neon/blob/b7f45204a28f69e344e74d653b68c7011bc4a6da/pageserver/src/tenant/timeline.rs#L3485-L3497C14

Member commented:

Tracked in #7243 now

@hlinnaka hlinnaka force-pushed the compaction-simulator-tiered branch 2 times, most recently from 1e6c86d to 51ca2ee Compare September 26, 2023 07:48
@hlinnaka hlinnaka requested a review from arpad-m September 26, 2023 07:49
@arpad-m (Member) left a comment:

Another partial review, didn't look at all of it.

@koivunej (Member) left a comment:

I'm blocking this until my ongoing layer work is done; I hope to complete that before my vacation next week (2023-10-12 onwards).

@hlinnaka hlinnaka force-pushed the compaction-simulator-tiered branch from 51ca2ee to 0506f70 Compare November 6, 2023 14:17
@hlinnaka (Contributor, Author) commented Nov 6, 2023

Rebased this, fixing conflicts. The conflicts were mostly from PR #4938, but they weren't too hard to resolve.

@hlinnaka hlinnaka force-pushed the compaction-simulator-tiered branch from 0506f70 to 42d7299 Compare November 7, 2023 08:50
Comment on lines +9 to +11
//! rules. Especially if there are cases like interrupted, half-finished
//! compactions, or highly skewed data distributions that have let us "skip"
//! some levels. It's not critical to classify all cases correctly; at worst we
Member commented:

There are no more half-finished compactions now, after the work in #5172, unless a compaction can somehow complete only half-way, up to RemoteTimelineClient::schedule_compaction_update (at least I did not pick up on that yet).

I cannot make a suggestion here because it would require re-flowing the comment :)

@koivunej (Member) commented Nov 10, 2023:

Well, actually there is a case of producing some layers and then running into an error later: #4749.

I think fixing it will require keeping layers as "delete on drop tempfiles" (not the new Layer) until they have all been renamed to their final paths (in two phases) and are ready to be inserted into the LayerMap.

Would the above implementation conflict with anything in this PR? It does not seem like to me.

Comment on lines +169 to +181
match load_future.as_mut().poll(cx) {
    Poll::Ready(Ok(entries)) => {
        this.load_future.set(None);
        *this.heap.peek_mut().unwrap() =
            LazyLoadLayer::Loaded(VecDeque::from(entries));
    }
    Poll::Ready(Err(e)) => {
        return Poll::Ready(Some(Err(e)));
    }
    Poll::Pending => {
        return Poll::Pending;
    }
}
Member commented:

Suggested change:

match ready!(load_future.as_mut().poll(cx)) {
    Ok(entries) => {
        this.load_future.set(None);
        *this.heap.peek_mut().unwrap() =
            LazyLoadLayer::Loaded(VecDeque::from(entries));
    }
    Err(e) => {
        return Poll::Ready(Some(Err(e)));
    }
}

ready! as in https://doc.rust-lang.org/stable/std/task/macro.ready.html

Member commented:

Addressed in #6830.

match top.deref_mut() {
    LazyLoadLayer::Unloaded(ref mut l) => {
        let fut = l.load_keys(this.ctx);
        this.load_future.set(Some(Box::pin(fut)));
Member commented:

Suggested change:

this.load_future.set(Some(fut));

It comes from an async trait, so surely it's already Box::pin? cargo check passes.

Member commented:

Addressed in #6830.

Comment on lines +52 to +54
/// NB: This is a pretty expensive operation. In the real pageserver
/// implementation, it downloads the layer, and keeps it resident
/// until the DeltaLayer is dropped.
Member commented:

I don't think there is any alternative to this; the kmerge will at minimum require all overlapping layers to be resident. But once a layer is exhausted and popped off, it can be evicted, if no other accesses have upgraded it in the meantime. For L0s this means the same as it has so far.

(I have not yet worked out whether the meaning of L0 changes with this new strategy; it probably does not.)

Comment on lines +265 to +269
// It can happen if compaction is interrupted after writing some
// layers but not all, and we are compacting the range again.
// The calculations in the algorithm assume that there are no
// duplicates, so the math on targeted file size is likely off,
// and we will create smaller files than expected.
Member commented:

Flagging another #5172-related comment.

/// all the create_image() or create_delta() calls that deletion of this
/// layer depends on have finished. But if the implementor has extra lazy
/// background tasks, like uploading the index json file to remote storage,
/// it is the implemenation's responsibility to track those.
Member commented:

Suggested change:

/// it is the implementation's responsibility to track those.

Member commented:

Addressed in #6830.

@koivunej koivunej dismissed their stale review November 10, 2023 18:29

The new layer implementation is in place, and it seems to support this work as well. I posted some quick comments and tried to flag those relating to now-outdated parts.

arpad-m added a commit that referenced this pull request Feb 27, 2024
Rebased version of #5234, part of #6768

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
@arpad-m (Member) commented Feb 29, 2024

Closing as #6830 has been merged.
