Add new compaction abstraction, simulator, and implementation. #5234
Conversation
Here's a screen capture of what the simulator output looks like: simplescreenrecorder-2023-09-07_16.48.35.mp4
2358 tests run: 2243 passed, 0 failed, 115 skipped (full report)
Code coverage (full report)
The comment gets automatically updated with the latest test results: 42d7299 at 2023-11-07T09:29:32.418Z :recycle:
Force-pushed from a7603ad to edf146c
I really like this. Thinking ahead, pluggable compaction will play well with a couple of other places where I'm thinking about adding "special" compactions in the future:
Other thoughts from reading this:
Makes sense
Hmm, yeah, I see what you're saying. There isn't much in
The L0 depth check is really just an optimization, to avoid calling the actual compaction when we know there's nothing to do. The compaction algorithm itself would reach the same conclusion and do nothing. It might be a premature optimization, but we nevertheless need to check the L0 layers anyway to find the top of the tree, i.e. the LSN of the newest L0. Or we could track that more explicitly.
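For illustration, the "top of the tree" lookup described here is just a max over the L0 end-LSNs. A minimal sketch with illustrative stand-in types, not the actual pageserver ones:

type Lsn = u64;

// Stand-in for an L0 delta layer's metadata.
struct L0Layer {
    lsn_end: Lsn,
}

/// The LSN of the newest L0, i.e. the top of the tree. Returns None if
/// there are no L0 layers, in which case there is nothing to compact.
fn newest_l0_lsn(l0_layers: &[L0Layer]) -> Option<Lsn> {
    l0_layers.iter().map(|l| l.lsn_end).max()
}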
+1. I didn't include any unit tests for the algorithm here, but yeah we should have them.
+1. @bojanserafimov had similar thoughts at #4411. Some thoughts on this:
Force-pushed from 1f6dffe to 963dd68
I did some renaming and added a few comments to address this: e52cc06
Added a comment about that too.
Added some unit tests for the algorithm.
Force-pushed from d8ef68c to 8da43c3
Force-pushed from 8da43c3 to ee232ae
I'd like to move this forward. For review, I think there are two criteria:
Please review, and indicate in your review comment which angle - or both - you reviewed it from.
Force-pushed from fc501a6 to 41025a0
+1 on the direction (the intent of separating the logic from everything else so it can be simulated and tested)
The API itself doesn't look like the one we'll stick with (get_layers looks inefficient) but it's easy to add new methods to the API until we have something efficient. If you start addressing things like this now, the PR will explode in size, so it's fine to leave it.
I haven't reviewed whether this is safe to merge, but there's no way to give you a half-review, so I'll accept it and let you get that piece of feedback from the storage team. Let's verify that it's equivalent to the current implementation, both in behavior and performance.
/// key, we still have too many records to fit in the target file size. We need to
/// split in the LSN dimension too in that case.
///
/// TODO: The code to avoid this problem has not been implemented yet! So the
You probably want to merge this with the TODO unaddressed, but when do we want to address this? Before or after we activate the new compaction?
Yes, this needs to be addressed before we can switch to the new compaction. Otherwise, we can end up with a layer file that's too large to be uploaded to S3 without multipart upload, so the upload will fail, and the timeline will be stuck trying to upload it. We had that problem before...
Semi-related: I must've accidentally removed the size check with the new layer impl, as I cannot see this PR removing it either. Nope, at least the warning is still in place: https://github.com/neondatabase/neon/blob/b7f45204a28f69e344e74d653b68c7011bc4a6da/pageserver/src/tenant/timeline.rs#L3485-L3497C14
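For context, the linked warning is essentially a size guard on freshly written layers. A minimal sketch of that kind of check, assuming the 5 GiB S3 single-part upload limit; the constant and names are illustrative, not the actual pageserver code:

use std::path::Path;

// S3 rejects single-request (non-multipart) uploads above 5 GiB.
const MAX_SINGLE_PART_UPLOAD: u64 = 5 * 1024 * 1024 * 1024;

/// Warn when a newly written layer file is too large to upload in one
/// request; such a layer would get the timeline stuck as described above.
fn warn_if_oversized(path: &Path, size_bytes: u64) {
    if size_bytes > MAX_SINGLE_PART_UPLOAD {
        eprintln!(
            "layer file {} is {size_bytes} bytes, larger than the single-part upload limit",
            path.display()
        );
    }
}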
Tracked in #7243 now
Force-pushed from 1e6c86d to 51ca2ee
Another partial review, didn't look at all of it.
I'm blocking this until my ongoing layer work is done; I'm hoping to complete that before my vacation next week (2023-10-12 onwards).
Force-pushed from 51ca2ee to 0506f70
Rebased this, fixing conflicts. The conflicts were mostly from PR #4938, but they weren't too hard to resolve.
Force-pushed from 0506f70 to 42d7299
//! rules. Especially if there are cases like interrupted, half-finished
//! compactions, or highly skewed data distributions that have let us "skip"
//! some levels. It's not critical to classify all cases correctly; at worst we
There are no more half-finished compactions after the work of #5172, unless a compaction can somehow be completed half-way, up to RemoteTimelineClient::schedule_compaction_update (at least I did not pick up on this yet). Cannot suggest an edit here because of the re-flowing :)
Well, actually there is a case of producing some layers and then running into an error later: #4749.
I think fixing it will require keeping layers as "delete on drop tempfiles" (not the new Layer) until they have all been renamed to their final paths (in two phases) and are ready to be inserted into the LayerMap.
Would the above implementation conflict with anything in this PR? It does not seem so to me.
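A minimal sketch of the "delete on drop tempfile" idea, assuming a simple RAII wrapper; the type and method names are illustrative, not actual neon code:

use std::io;
use std::path::{Path, PathBuf};

/// A layer file that is deleted on drop unless it has been persisted,
/// i.e. renamed to its final path. Illustrative only.
struct TempLayerFile {
    path: PathBuf,
    persisted: bool,
}

impl TempLayerFile {
    fn new(path: PathBuf) -> Self {
        TempLayerFile { path, persisted: false }
    }

    /// Rename to the final path; after this, dropping the wrapper no
    /// longer deletes the file.
    fn persist(mut self, final_path: &Path) -> io::Result<()> {
        std::fs::rename(&self.path, final_path)?;
        self.persisted = true;
        Ok(())
    }
}

impl Drop for TempLayerFile {
    fn drop(&mut self) {
        if !self.persisted {
            // Best-effort cleanup of the half-written layer.
            let _ = std::fs::remove_file(&self.path);
        }
    }
}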
match load_future.as_mut().poll(cx) {
    Poll::Ready(Ok(entries)) => {
        this.load_future.set(None);
        *this.heap.peek_mut().unwrap() =
            LazyLoadLayer::Loaded(VecDeque::from(entries));
    }
    Poll::Ready(Err(e)) => {
        return Poll::Ready(Some(Err(e)));
    }
    Poll::Pending => {
        return Poll::Pending;
    }
}
Suggested change:
match ready!(load_future.as_mut().poll(cx)) {
    Ok(entries) => {
        this.load_future.set(None);
        *this.heap.peek_mut().unwrap() =
            LazyLoadLayer::Loaded(VecDeque::from(entries));
    }
    Err(e) => {
        return Poll::Ready(Some(Err(e)));
    }
}
ready as in https://doc.rust-lang.org/stable/std/task/macro.ready.html
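For illustration, ready! just early-returns on Pending, which is what lets the nested match above collapse into a single match on the Result. A minimal standalone sketch:

use std::task::{ready, Poll};

// `ready!(e)` yields the inner value of Poll::Ready(v), or returns
// Poll::Pending from the enclosing function if `e` is Pending.
fn demo(p: Poll<Result<u32, ()>>) -> Poll<Option<Result<u32, ()>>> {
    match ready!(p) {
        Ok(v) => Poll::Ready(Some(Ok(v))),
        Err(e) => Poll::Ready(Some(Err(e))),
    }
}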
Addressed in #6830.
match top.deref_mut() {
    LazyLoadLayer::Unloaded(ref mut l) => {
        let fut = l.load_keys(this.ctx);
        this.load_future.set(Some(Box::pin(fut)));
Suggested change:
this.load_future.set(Some(fut));
It comes from async trait, so surely it's already Box::pin? cargo checks out ok.
Addressed in #6830.
/// NB: This is a pretty expensive operation. In the real pageserver
/// implementation, it downloads the layer, and keeps it resident
/// until the DeltaLayer is dropped.
I don't think there is any alternative to this; the kmerge will at minimum require all overlapping layers to be resident, but once a layer is exhausted and popped off, it can be evicted, if no other accesses have upgraded it in the meantime. For L0s this means the same as it has so far.
(I did not yet catch on whether the L0 meaning changes with this new strategy; it probably does not.)
// It can happen if compaction is interrupted after writing some
// layers but not all, and we are compacting the range again.
// The calculations in the algorithm assume that there are no
// duplicates, so the math on targeted file size is likely off,
// and we will create smaller files than expected.
Flagging another #5172 related comment.
/// all the create_image() or create_delta() calls that deletion of this
/// layer depends on have finished. But if the implementor has extra lazy
/// background tasks, like uploading the index json file to remote storage,
/// it is the implemenation's responsibility to track those.
Suggested change:
/// it is the implementation's responsibility to track those.
Addressed in #6830.
The new layer implementation is in place, and it seems it supports this work as well. I posted some quick comments and tried to flag the ones relating to now-outdated parts.
Rebased version of #5234, part of #6768. This consists of the same three parts described below. To run the simulator:
cargo run --bin compaction-simulator run-suite
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Closing as #6830 has been merged.
This consists of three parts:
1. A refactoring and new contract for implementing and testing compaction.
The logic is now in a separate crate, with no dependency on the 'pageserver' crate. It defines an interface that the real pageserver must implement in order to call the compaction algorithm. The interface models things like delta and image layers, but just the parts that the compaction algorithm needs to make decisions. That makes it easier to unit test the algorithm and experiment with different implementations.
I did not convert the current code to the new abstraction, however. When the compaction algorithm is set to "Legacy", we just use the old code. It might be worthwhile to convert the old code to the new abstraction, so that we can compare the behavior of the new algorithm against the old one, using the same simulated cases. If we do that, we have to be careful that the converted code really is equivalent to the old.
This includes only trivial changes to the main pageserver code. All the new code is behind a tenant config option. So this should be pretty safe to merge, even if the new implementation is buggy, as long as we don't enable it.
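To make the contract concrete, here is a minimal sketch of the kind of interface this describes: the compaction crate sees layers only through traits that the real pageserver (or the simulator) implements. Apart from get_layers, which is mentioned in the review above, the names are illustrative, not the PR's actual API:

use std::ops::Range;

// Illustrative stand-ins; the real key and LSN types live elsewhere.
type Key = u64;
type Lsn = u64;

/// The parts of a delta or image layer that the compaction algorithm
/// needs in order to make decisions.
trait CompactionLayer {
    fn key_range(&self) -> Range<Key>;
    fn lsn_range(&self) -> Range<Lsn>;
    fn file_size(&self) -> u64;
    /// True for a delta layer, false for an image layer.
    fn is_delta(&self) -> bool;
}

/// Implemented by the real pageserver and by the simulator.
trait CompactionExecutor {
    type Layer: CompactionLayer;

    /// Return the layers overlapping the given key/LSN rectangle.
    fn get_layers(&mut self, keys: Range<Key>, lsns: Range<Lsn>) -> Vec<Self::Layer>;
}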
2. A new compaction algorithm, implemented using the new abstraction.
The new algorithm is tiered compaction. It is inspired by the PoC at PR #4539, although I did not use that code directly, as I needed the new implementation to fit the new abstraction. The algorithm here is less advanced; I did not implement partial image layers, for example. I wanted to keep it simple on purpose, so that as we add bells and whistles, we can see the effects using the included simulator.
One difference to #4539 and your typical LSM tree implementations is how we keep track of the LSM tree levels. This PR doesn't have a permanent concept of a level, tier, or sorted run at all. There are just delta and image layers. However, when compaction starts, we look at the layers that exist and arrange them into levels, depending on their shapes. That is ephemeral: when the compaction finishes, we forget that information. This allows the new algorithm to work without any extra bookkeeping. That makes it easier to transition from the old algorithm to the new one, and back again.
There is just a new tenant config option to choose the compaction algorithm. The default is "Legacy", meaning the current algorithm in 'main'. If you set it to "Tiered", the new algorithm is used.
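A sketch of the shape of that switch, assuming a plain enum; the actual config field and variant spellings in the PR may differ:

/// Selects which compaction implementation a tenant uses.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum CompactionAlgorithm {
    #[default]
    Legacy, // the current algorithm in 'main'
    Tiered, // the new algorithm added by this PR
}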
3. A simulator, which implements the new abstraction.
The simulator can be used to analyze write and storage amplification without running a test with the full pageserver. It can also draw an SVG animation of the simulation, to visualize how layers are created and deleted.
To run the simulator:
./target/debug/compaction-simulator run-suite