crash-consistent layer map through index_part.json #5198
Merged
Conversation
koivunej force-pushed the no_more_noop_tests branch from e5ae301 to 25ecce6 on September 5, 2023 07:52
2298 tests run: 2182 passed, 0 failed, 116 skipped (full report). Code coverage (full report).
The comment gets automatically updated with the latest test results: 2307cf1 at 2023-10-17T08:59:59.618Z.
koivunej force-pushed the no_more_noop_tests branch from 25ecce6 to e9cfebd on September 5, 2023 10:46
koivunej force-pushed the no_more_noop_tests branch 3 times, most recently from 453486c to 689c6d0 on September 6, 2023 18:14
jcsp reviewed Sep 7, 2023
jcsp reviewed Sep 7, 2023
koivunej force-pushed the no_more_noop_tests branch from 37b5c23 to 9d054b2 on September 7, 2023 13:11
This was referenced Sep 7, 2023
jcsp reviewed Sep 8, 2023
koivunej added a commit that referenced this pull request on Sep 8, 2023:

Remote storage cleanup, split from #5198:
- pageserver, extensions, and safekeepers now have their separate remote storage
- RemoteStorageKind has the configuration code
- S3Storage has the cleanup code
- with MOCK_S3, pageserver, extensions, and safekeepers use different buckets
- with LOCAL_FS, `repo_dir / "local_fs_remote_storage" / $user` is used as the path, where $user is `pageserver` or `safekeeper`
- no more `NeonEnvBuilder.enable_xxx_remote_storage`, but one `enable_{pageserver,extensions,safekeeper}_remote_storage` per component

Should not have any real changes. These will allow us to default to `LOCAL_FS` for the pageserver in the next PR, remove `RemoteStorageKind.NOOP`, and work towards #5172.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
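For illustration, a minimal sketch of how a test might opt into the split remote storage after this change. The per-component method naming follows the commit message above; the import path and exact fixture signatures are assumptions:

```python
# Hypothetical usage; the module path and fixture signatures are assumptions.
# Only the per-component method naming comes from the commit message above.
from fixtures.remote_storage import RemoteStorageKind  # assumed import path


def test_with_pageserver_remote_storage(neon_env_builder):
    # One enable_* method per component replaces the old
    # NeonEnvBuilder.enable_xxx_remote_storage. With LOCAL_FS, data lands
    # under repo_dir / "local_fs_remote_storage" / "pageserver".
    neon_env_builder.enable_pageserver_remote_storage(RemoteStorageKind.LOCAL_FS)
    env = neon_env_builder.init_start()
    # ... exercise the pageserver with remote storage enabled ...
```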
koivunej force-pushed the no_more_noop_tests branch from 9d054b2 to 518c278 on September 8, 2023 12:17
koivunej force-pushed the no_more_noop_tests branch from 518c278 to 474bbf3 on September 8, 2023 17:36
koivunej added a commit that referenced this pull request on Sep 11, 2023:

Assorted flakiness fixes from #5198; these might not be flaky on `main`. Migrate some tests from `neon_simple_env` to plain `neon_env_builder`, and use `initial_tenant` to make the flakiness easier to understand. (Did not understand the flakiness of `test_timeline_create_break_after_uninit_mark`.)

`test_download_remote_layers_api` is flaky because we have no atomic "wait for WAL, checkpoint, wait for upload, and do not receive any more WAL". The `test_tenant_size` fixes are just boilerplate that should have always existed: we should wait for the tenant to be active. Similarly for `test_timeline_delete`. `test_timeline_size_post_checkpoint` often fails for me by reading zero from the metrics; give it a few attempts, as sketched below.
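A minimal sketch of that "give it a few attempts" retry, with a hypothetical zero-argument `read_metric` callable standing in for the actual metric query:

```python
import time


def read_nonzero_metric(read_metric, attempts=5, delay=0.5):
    # Retry a flaky metric read until it returns a nonzero value;
    # read_metric is a hypothetical callable returning a number.
    value = 0
    for _ in range(attempts):
        value = read_metric()
        if value > 0:
            return value
        time.sleep(delay)
    raise AssertionError(f"metric stayed at {value} after {attempts} attempts")
```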
koivunej force-pushed the no_more_noop_tests branch 2 times, most recently from 69f6b62 to 10091a4 on September 11, 2023 08:52
koivunej force-pushed the no_more_noop_tests branch 4 times, most recently from ec54dc2 to 7fd1234 on September 28, 2023 08:23
jcsp approved these changes Oct 13, 2023
problame reviewed Oct 13, 2023
problame reviewed Oct 13, 2023
problame reviewed Oct 13, 2023
jcsp reviewed Oct 13, 2023
Pushed a commit to update the test docstrings.
The build failure came from a suggested change in a review comment; fixed it.
Merged main to resolve a conflict.
koivunej added a commit that referenced this pull request on Oct 25, 2023:

- finally add an `#[instrument]` to `Timeline::create_image_layers`, making it easier to see that something is happening because we are creating image layers
- format some macro context code
- add a warning not to create new validation functions, per "parse, don't validate"

Split off from #5198.
problame added a commit that referenced this pull request on Mar 1, 2024:

The `writer.finish()` methods already fsync the inode, using `VirtualFile::sync_all()`. All that the callers need to do is fsync their directory, i.e., the timeline directory (see the sketch after this commit message). Note that there's a call in the new compaction code that is apparently dead at runtime, so I couldn't fix up any fsyncs there ([link](https://github.com/neondatabase/neon/blob/502b69b33bbd4ad1b0647e921a9c665249a2cd62/pageserver/src/tenant/timeline/compaction.rs#L204-L211)). In the grand scheme of things, layer durability probably doesn't matter anymore, because remote storage is authoritative at all times as of #5198. But let's not break the discipline in this commit.

Part of #6663.
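The caller-side half of that discipline, fsyncing the containing directory so the finished file's directory entry is durable, is the standard POSIX pattern. A Python sketch of the pattern (the commit itself does this in Rust):

```python
import os


def fsync_dir(path: str) -> None:
    # fsync on a directory fd persists its entries, e.g. the name of a
    # freshly finished layer file; the file's own contents were already
    # synced by the writer (VirtualFile::sync_all in the commit above).
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```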
This was referenced Mar 1, 2024
problame added a commit that referenced this pull request on Mar 4, 2024:

The `writer.finish()` methods already fsync the inode, using `VirtualFile::sync_all()`. All that the callers need to do is fsync their directory, i.e., the timeline directory. Note that there's a call in the new compaction code that is apparently dead at runtime, so I couldn't fix up any fsyncs there ([link](https://github.com/neondatabase/neon/blob/502b69b33bbd4ad1b0647e921a9c665249a2cd62/pageserver/src/tenant/timeline/compaction.rs#L204-L211)). Note that layer durability still matters somewhat, even after #5198, which made remote storage authoritative. We do have the layer file length as an indicator, but no checksums of the layer file contents. So a series of overwrites without fsyncs in between, plus a subsequent crash, could leave us in a state where the file length matches but the contents are garbage.

Part of #6663.
This was referenced May 17, 2024
problame added a commit that referenced this pull request on Jun 4, 2024:

Fixes #7790 (duplicating most of the issue description here for posterity).

# Background

From the time before the always-authoritative `index_part.json`, we had to handle duplicate layers. See the RFC for an illustration of how duplicate layers could happen: https://github.com/neondatabase/neon/blob/a8e6d259cb49d1bf156dfc2215b92c04d1e8a08f/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md?plain=1#L41-L50

As of #5198, we should not be exposed to that problem anymore.

# Problem 1

We still have
1. [code in Pageserver](https://github.com/neondatabase/neon/blob/82960b2175211c0f666b91b5258c5e2253a245c7/pageserver/src/tenant/timeline.rs#L4502-L4521) that handles duplicate layers
2. [tests in the test suite](https://github.com/neondatabase/neon/blob/d9dcbffac37ccd3331ec9adcd12fd20ce0ea31aa/test_runner/regress/test_duplicate_layers.py#L15) that demonstrate the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce a crash that could legitimately happen in production. What it does instead is return early with an `Ok()`, so that the code in Pageserver that handles duplicate layers (item 1) actually gets exercised. That "return early" would be a bug in the routine if it happened in production. So the tests in the test suite are tests for their own sake, but don't actually regress-test any production behavior.

# Problem 2

Further, if production code _did_ create a duplicate layer (it nowadays doesn't!), the code in Pageserver that handles the condition (item 1 above) is too little and too late:
* the code handles it by discarding the newer `struct Layer`; that's good.
* however, on disk, we have already overwritten the old layer file with the new one.
* the fact that we do it atomically doesn't matter, because ...
* if the new layer file is not bit-identical, then we have a cache coherency problem:
  * the PS PageCache block cache caches the old bit pattern
  * blob_io offsets are stored in variables, based on the pre-overwrite bit pattern / offsets
  * => reading from the new file based on these offsets might yield different data than before

# Solution

- Remove the test suite code pertaining to Problem 1.
- Move & rename the test suite code that actually tests the RFC-27 crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late (Problem 1).
- Use `RENAME_NOREPLACE` to prevent overwriting the file via rename during `.finish()`, and bail with an error if it happens (Problem 2); see the sketch after this commit message.
  - This bailing prevents the caller from even trying to insert into the layer map, as they don't even get a `struct Layer` at hand.
- Add `abort`s in the place where we hold the layer map lock and check for duplicates (Problem 2).
  - Note again, we can't reach there, because we bail from `.finish()` much earlier in the code.
- Share the logic to clean up after a failed `.finish()` between image layers and delta layers (drive-by cleanup).
  - This exposed that the test `image_layer_rewrite` was overwriting layer files in place. Fix the test.

# Future Work

This PR adds a new failure scenario that was previously "papered over" by the overwriting of layers:
1. Start a compaction that will produce 3 layers: A, B, C.
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out and sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep; another iteration starts and attempts to write layer A again. But the `.finish()` **fails because A already exists on disk**.

The failure in step 6 is new with this PR, and it **causes the compaction to get stuck**. Before, it would silently overwrite the file and "successfully" complete the second iteration. The mitigation for this is to `/reset` the tenant.
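For reference, the no-clobber rename that `.finish()` now relies on corresponds to Linux `renameat2(2)` with the `RENAME_NOREPLACE` flag. The commit implements this in Rust; the ctypes sketch below only illustrates the syscall-level behaviour:

```python
import ctypes
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)
RENAME_NOREPLACE = 1  # flag value from <linux/fs.h>
AT_FDCWD = -100       # "current working directory" dirfd from <fcntl.h>


def rename_noreplace(src: str, dst: str) -> None:
    # Atomically rename src to dst, but fail with EEXIST instead of
    # silently overwriting an existing dst (unlike plain rename(2)).
    rc = libc.renameat2(
        AT_FDCWD, src.encode(), AT_FDCWD, dst.encode(), RENAME_NOREPLACE
    )
    if rc != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno), dst)
```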
Fixes #5172.