reimpl Layer, remove remote layer, trait Layer, trait PersistentLayer #4938
Conversation
2340 tests run: 2225 passed, 0 failed, 115 skipped (full report)
Flaky tests (2): Postgres 16, Postgres 15
Code coverage (full report)
This comment gets automatically updated with the latest test results: 6778c54 at 2023-10-26T09:09:59.921Z ♻️
Summarizing discussion with @jcsp and @LizardWizzard regarding "gc soft-deletes" (private Slack link). Related: #4326
Force-pushed 6307a36 to 6fa0be8 (compare)
Force-pushed cb77f57 to b54a9b3 (compare)
Force-pushed b54a9b3 to 9a96033 (compare)
I will be rebasing and collecting the "inspired" work to the top to split it off, but any comments are of course welcome, especially if they relate to …
Force-pushed 9a96033 to 1b3a04e (compare)
Restores the #4937 work relating to the ability to use `ResidentDeltaLayer` (an Arc wrapper) in #4938 for the `ValueRef`s, by removing the borrow from `ValueRef` and providing it from an upper layer. This should have no functional changes; most importantly, `main` will continue to use the borrowed `DeltaLayerInner`. It might be that I can change #4938 to be like this. If so, I'll gladly rip out the `Ref` and move the borrow back, but I first want to look at the current test failures.
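A minimal, hypothetical sketch of the refactor described above: instead of `ValueRef<'a>` borrowing the reader (`&'a DeltaLayerInner`), it stores only a position, and the reader is supplied by the caller ("an upper layer"). All names and shapes here are illustrative stand-ins, not the actual pageserver types; the point is only that the same `ValueRef` then works with both a borrowed reader and an `Arc`-wrapped one.

```rust
// Stand-in for the on-disk delta layer reader.
struct DeltaLayerInner {
    values: Vec<String>,
}

/// Before: `struct ValueRef<'a> { reader: &'a DeltaLayerInner, pos: usize }`.
/// After: no borrow; just the position into the layer.
struct ValueRef {
    pos: usize,
}

impl ValueRef {
    /// The reader is provided from the outside, so the same `ValueRef` works
    /// both with a borrowed `DeltaLayerInner` (as `main` continues to do) and
    /// with an `Arc`-wrapped reader (as eviction-capable code needs).
    fn load<'a>(&self, reader: &'a DeltaLayerInner) -> Option<&'a str> {
        reader.values.get(self.pos).map(|s| s.as_str())
    }
}

fn main() {
    let inner = DeltaLayerInner {
        values: vec!["a".to_owned(), "b".to_owned()],
    };
    let vref = ValueRef { pos: 1 };

    // Borrowed use, unchanged behavior on `main`.
    assert_eq!(vref.load(&inner), Some("b"));

    // Arc-wrapped use; deref coercion hands the same `&DeltaLayerInner` in.
    let shared = std::sync::Arc::new(inner);
    assert_eq!(vref.load(&shared), Some("b"));
    println!("ok");
}
```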
I will have to change these as I change the remote_timeline_client API in #4938. So a bit of cleanup, handling my comments which were just resolved during initial review. Cleanup:
- use unwrap in tests instead of mixed `?` and `unwrap`
- use `Handle` instead of `&'static Reactor` to make the RemoteTimelineClient more natural
- use arrays in tests
- use plain `#[tokio::test]`
Force-pushed b4a8642 to 374a537 (compare)
All of the test failures should now be linked to metrics. Upon closer look, no: it's related to layermap init and how I added "faster access" to the internals for eviction; fixing. This still leaves the metric problems. (Test report)
Force-pushed 374a537 to cc76a53 (compare)
Fixing this is rather involved; I think we need a new queue operation, `wait_for_completion_and_stop`, which we would call from `pageserver/src/tenant/timeline.rs` (line 995 at 85f4514). If anything is even needed, because of #4960. I think I'll just leave it not done, and we'll need to remember this if some flakiness arises.
It was originally in `Layer::finish_creating`; however, that would have created a new metrics bug in case `create_image_layers` or `compact_level0_phase1` failed not on the first layer but on any following one. This solution does lose some speed at which metrics are updated, but that is as it has been. Sadly, no test for this yet; I don't see how else to get any insight into what will happen in staging.
Force-pushed 370af00 to 36ba3ba (compare)
Adding these FIXMEs:
- Intend to change these with #5331.
- We have other VirtualFile-related FIXMEs. During this PR, I think VirtualFile learned how to support this use case as well, so it might be something interesting to look into.
- This is not trivial to implement because the DeltaEntry reads are now "async". I was thinking of an "async" feeder into a spawn_blocking task; this might be good. The latter is about needing to fix par_fsync.
- These are more misc observations. The structuring is not great with regards to how new layers are added.
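A std-only sketch of the "feeder" idea floated above: a producer (a plain thread standing in for async code) streams `DeltaEntry`-like items over a bounded channel to a single blocking consumer (standing in for a `spawn_blocking` task) that does the synchronous work. The `DeltaEntry` shape and the numbers are assumptions for illustration only, not the pageserver's types.

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative stand-in for a delta layer entry.
struct DeltaEntry {
    key: u64,
    value: Vec<u8>,
}

fn main() {
    // Bounded channel: the feeder gets backpressure if the writer falls behind.
    let (tx, rx) = mpsc::sync_channel::<DeltaEntry>(16);

    // Blocking consumer: the would-be spawn_blocking task doing sync writes.
    let writer = thread::spawn(move || {
        let mut written = 0usize;
        for entry in rx {
            // Synchronous write of the entry would go here (elided).
            written += entry.value.len();
            let _ = entry.key;
        }
        written
    });

    // "Async" feeder: reads entries (elided) and feeds them over the channel.
    for key in 0..10u64 {
        tx.send(DeltaEntry { key, value: vec![0u8; 8] }).unwrap();
    }
    drop(tx); // close the channel so the consumer's loop ends

    let written = writer.join().unwrap();
    assert_eq!(written, 80);
    println!("wrote {written} bytes");
}
```

With tokio, the consumer side would be `tokio::task::spawn_blocking` draining a `tokio::sync::mpsc` receiver, but the shape of the hand-off is the same.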
This was the plan, but I forgot it from #4938.
#5649 added the concept of dangling layers, which #4938 uses but only partially. I forgot to change `schedule_compaction_update` to not schedule deletions, to uphold "you have a layer, you can read it". With the now-remembered fix, I don't think these checks should ever fail except for a mistake I already made. These changes might be useful for protecting future changes, even though the Layer carrying the generation AND the `schedule_(gc|compaction)_update` require strong Arcs. The rationale for keeping the `#[cfg(feature = "testing")]` is that it worsens any leak situation which might come up.
With the layer implementation as done in #4938, it is possible via cancellation to cause two concurrent downloads on the same path, due to how `RemoteTimelineClient::download_remote_layer` handles tempfiles. Thread the init semaphore through the spawned download task to make this impossible.
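A minimal, std-only sketch of the "init semaphore" idea: at most one task may be materializing the file for a given path at a time, so a second caller that races (e.g. after the first caller was cancelled) waits on the same permit instead of starting a second download into the same tempfile path. Names are illustrative; the real code threads a tokio semaphore through the spawned download task rather than using a Mutex/Condvar pair.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// A one-permit "init semaphore": only one holder at a time.
struct InitSemaphore {
    busy: Mutex<bool>,
    cv: Condvar,
}

impl InitSemaphore {
    fn new() -> Self {
        Self { busy: Mutex::new(false), cv: Condvar::new() }
    }

    /// Run `f` while holding the single init permit for this layer path.
    fn with_permit<T>(&self, f: impl FnOnce() -> T) -> T {
        let mut busy = self.busy.lock().unwrap();
        while *busy {
            busy = self.cv.wait(busy).unwrap();
        }
        *busy = true;
        drop(busy);

        let result = f();

        *self.busy.lock().unwrap() = false;
        self.cv.notify_one();
        result
    }
}

fn main() {
    let sem = Arc::new(InitSemaphore::new());
    let counter = Arc::new(Mutex::new(0u32)); // downloads currently in flight
    let max_seen = Arc::new(Mutex::new(0u32)); // peak concurrency observed

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let (sem, counter, max_seen) =
                (sem.clone(), counter.clone(), max_seen.clone());
            std::thread::spawn(move || {
                sem.with_permit(|| {
                    // "Download": only one thread may be in here at a time.
                    let mut c = counter.lock().unwrap();
                    *c += 1;
                    let mut m = max_seen.lock().unwrap();
                    if *c > *m {
                        *m = *c;
                    }
                    drop(m);
                    drop(c);
                    std::thread::sleep(std::time::Duration::from_millis(5));
                    *counter.lock().unwrap() -= 1;
                })
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Never more than one concurrent "download" on the same path.
    assert_eq!(*max_seen.lock().unwrap(), 1);
    println!("max concurrent downloads: {}", *max_seen.lock().unwrap());
}
```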
Some of the log messages were lost with #4938. This PR adds some of them back, most notably:
- starting an on-demand download
- successful completion of an on-demand download
- ability to see when there were many waiters for the layer download
- "unexpectedly on-demand downloading ..." is now `info!`
Additionally, some rare events which should never happen are logged as errors.
Quest: #4745. Follow-up to #4938.
- add locks for compaction and gc, so we don't have multiple executions at the same time in tests
- remove layer_removal_cs
- remove waiting for uploads in eviction/gc/compaction; #4938 will keep the file resident until upload completes
Co-authored-by: Christian Schwarz <christian@neon.tech>
Implement a new `struct Layer` abstraction which manages downloadedness internally, requiring no LayerMap locking or rewriting to download or evict, providing the property "you have a layer, you can read it". The new `struct Layer` provides the ability to keep the file resident via a RAII structure for new layers which still need to be uploaded. The previous solution solved this with `RemoteTimelineClient::wait_completion`, which led to bugs like #5639. Evicting, or the final local deletion after garbage collection, is done using the Arc'd value's `Drop`.

With a single `struct Layer`, the closed but open-ended `trait Layer`, `trait PersistentLayer`, and `struct RemoteLayer` are removed, following the observation that compaction could be simplified by simply not using any of the traits in between: #4839.

The new `struct Layer` is a preliminary to removing `Timeline::layer_removal_cs`, documented in #4745.

Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058, #5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split-offs: #5057, #5134.