
crash-consistent layer map through index_part.json #5198

Merged: 36 commits from no_more_noop_tests into main on Oct 17, 2023

Conversation

@koivunej (Member) commented on Sep 5, 2023

Fixes #5172 as it:

  • removes reconciliation with the remote index_part.json and accepts the remote index_part.json as the source of truth, deleting any local progress that is not yet reflected in remote storage (sketched below)
  • prefers the remote metadata over the local metadata

Additionally:

  • cleans up tests that used a single LOCAL_FS parametrization
  • adds a test case for a branched (non-bootstrapped), local-only timeline remaining available after restart
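
Roughly, the load-time behavior described above looks like the following sketch (illustrative only; `RemoteIndexPart`, `reconcile_on_load`, and the field layout are made-up names, not the actual pageserver API):

```rust
use std::collections::HashSet;
use std::path::Path;

// Hypothetical sketch: on timeline load, the remote index_part.json wins.
// Any local layer file not listed in the remote index is un-uploaded local
// progress from before the restart and gets deleted.
struct RemoteIndexPart {
    layers: HashSet<String>, // layer file names listed in remote index_part.json
    metadata: Vec<u8>,       // serialized timeline metadata stored remotely
}

fn reconcile_on_load(
    remote: &RemoteIndexPart,
    local_layer_names: &[String],
    timeline_dir: &Path,
) -> std::io::Result<Vec<u8>> {
    for name in local_layer_names {
        if !remote.layers.contains(name) {
            // Local-only layer: not reflected in remote storage, drop it.
            std::fs::remove_file(timeline_dir.join(name))?;
        }
    }
    // Prefer the remote metadata over whatever was stored locally.
    Ok(remote.metadata.clone())
}
```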

@koivunej changed the title from "No more noop tests" to "test: No more RemoteStorageKind.NOOP" on Sep 5, 2023
github-actions bot commented on Sep 5, 2023

2298 tests run: 2182 passed, 0 failed, 116 skipped (full report)


Code coverage (full report)

  • functions: 52.9% (8263 of 15614 functions)
  • lines: 80.5% (48166 of 59830 lines)

This comment is automatically updated with the latest test results; last updated for 2307cf1 at 2023-10-17T08:59:59.618Z.

@koivunej force-pushed the no_more_noop_tests branch 3 times, most recently from 453486c to 689c6d0 on September 6, 2023 18:14

Review threads on test_runner/fixtures/neon_fixtures.py and test_runner/fixtures/remote_storage.py were marked outdated and resolved.

koivunej added a commit that referenced this pull request Sep 8, 2023
Remote storage cleanup split from #5198:
- pageserver, extensions, and safekeepers now have their own separate remote
storage
- RemoteStorageKind has the configuration code
- S3Storage has the cleanup code
- with MOCK_S3, pageserver, extensions, and safekeepers use different
buckets
- with LOCAL_FS, `repo_dir / "local_fs_remote_storage" / $user` is used
as the path, where $user is `pageserver` or `safekeeper`
- no more `NeonEnvBuilder.enable_xxx_remote_storage`, but separate
`enable_{pageserver,extensions,safekeeper}_remote_storage` methods

Should not contain any functional changes. These will allow us to default to
`LOCAL_FS` for the pageserver in the next PR, remove
`RemoteStorageKind.NOOP`, and work towards #5172.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
koivunej added a commit that referenced this pull request Sep 11, 2023
Assorted flakiness fixes from #5198; these might not be flaky on `main`.

Migrate some tests from neon_simple_env to neon_env_builder and use
initial_tenant, to make the flakiness easier to understand. (Did not
understand the flakiness of
`test_timeline_create_break_after_uninit_mark`.)

`test_download_remote_layers_api` is flaky because we have no atomic
"wait for WAL, checkpoint, wait for upload and do not receive any more
WAL".

The `test_tenant_size` fixes are just boilerplate which should have always
existed; we should wait for the tenant to become active. Similarly for
`test_timeline_delete`.

`test_timeline_size_post_checkpoint` often fails for me by reading
zero from metrics. Give it a few attempts.
@koivunej force-pushed the no_more_noop_tests branch 2 times, most recently from 69f6b62 to 10091a4 on September 11, 2023 08:52
@koivunej force-pushed the no_more_noop_tests branch 4 times, most recently from ec54dc2 to 7fd1234 on September 28, 2023 08:23
@jcsp (Collaborator) commented on Oct 16, 2023

Pushed a commit to update the test docstrings.

@jcsp (Collaborator) commented on Oct 16, 2023

The build failure came from a suggested change in a review comment; it is fixed now.

@jcsp enabled auto-merge (squash) on October 16, 2023 17:07
@jcsp (Collaborator) commented on Oct 17, 2023

Merged main to resolve a conflict.

@jcsp merged commit 9e14493 into main on Oct 17, 2023
34 checks passed
@jcsp deleted the no_more_noop_tests branch on October 17, 2023 09:04
koivunej added a commit that referenced this pull request Oct 25, 2023
- finally add an `#[instrument]` to Timeline::create_image_layers,
making it easier to see that something is happening because we are creating
image layers (illustrated below)
- format some macro context code
- add a warning not to create new validation functions, in the spirit of
"parse, don't validate"

Split off from #5198.
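
For illustration only, a minimal sketch of what a tracing `#[instrument]` attribute on such a method looks like (the struct and field below are stand-ins, not the real pageserver types):

```rust
use tracing::instrument;

struct Timeline {
    timeline_id: String, // stand-in for the real TimelineId type
}

impl Timeline {
    // Opens a tracing span per call, so any log lines emitted while image
    // layers are being created are attributable to this operation.
    #[instrument(skip_all, fields(timeline_id = %self.timeline_id))]
    fn create_image_layers(&self) {
        tracing::info!("creating image layers");
    }
}
```
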
problame added a commit that referenced this pull request Mar 1, 2024
The `writer.finish()` methods already fsync the inode,
using `VirtualFile::sync_all()`.

All that the callers need to do is fsync their directory,
i.e., the timeline directory (see the sketch below).

Note that there's a call in the new compaction code that
is apparently dead at runtime, so I couldn't fix up
any fsyncs there [Link](https://github.com/neondatabase/neon/blob/502b69b33bbd4ad1b0647e921a9c665249a2cd62/pageserver/src/tenant/timeline/compaction.rs#L204-L211).

In the grand scheme of things, layer durability probably doesn't
matter anymore because the remote storage is authoritative at all times
as of #5198. But let's not break the discipline in this commit.

part of #6663
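
A minimal sketch of that directory-fsync step, using plain std rather than the pageserver's VirtualFile wrapper:

```rust
use std::fs::File;
use std::io;
use std::path::Path;

// After a layer file has been written and fsynced (writer.finish()), make the
// new directory entry durable by fsyncing the containing (timeline) directory.
fn fsync_timeline_dir(timeline_dir: &Path) -> io::Result<()> {
    // Opening a directory read-only and calling sync_all() issues fsync(2)
    // on the directory inode.
    File::open(timeline_dir)?.sync_all()
}
```
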
problame added a commit that referenced this pull request Mar 4, 2024
The `writer.finish()` methods already fsync the inode, using
`VirtualFile::sync_all()`.

All that the callers need to do is fsync their directory, i.e., the
timeline directory.

Note that there's a call in the new compaction code that is apparently
dead at runtime, so I couldn't fix up any fsyncs there
[Link](https://github.com/neondatabase/neon/blob/502b69b33bbd4ad1b0647e921a9c665249a2cd62/pageserver/src/tenant/timeline/compaction.rs#L204-L211).

Note that layer durability still matters somewhat, even after #5198,
which made remote storage authoritative.
We do have the layer file length as an indicator, but no checksums of
the layer file contents.
So a series of overwrites without fsyncs in between, plus a
subsequent crash, could leave us in a state where the file
length matches but the contents are garbage.

part of #6663
problame added a commit that referenced this pull request Jun 4, 2024
Fixes #7790 (duplicating most
of the issue description here for posterity).

# Background

From the time before always-authoritative `index_part.json`, we had to
handle duplicate layers. See the RFC for an illustration of how
duplicate layers could happen:
https://github.com/neondatabase/neon/blob/a8e6d259cb49d1bf156dfc2215b92c04d1e8a08f/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md?plain=1#L41-L50

As of #5198, we should not be exposed to that problem anymore.

# Problem 1

We still have
1. [code in
Pageserver](https://github.com/neondatabase/neon/blob/82960b2175211c0f666b91b5258c5e2253a245c7/pageserver/src/tenant/timeline.rs#L4502-L4521)
that handles duplicate layers
2. [tests in the test
suite](https://github.com/neondatabase/neon/blob/d9dcbffac37ccd3331ec9adcd12fd20ce0ea31aa/test_runner/regress/test_duplicate_layers.py#L15)
that demonstrate the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce
a crash that could legitimately happen in production.
What it does instead is return early with an `Ok()`, so that the code
in Pageserver that handles duplicate layers (item 1) actually gets
exercised.

That "return early" would be a bug in the routine if it happened in
production.
So, the tests in the test suite are tests for their own sake, but don't
serve to actually regress-test any production behavior.

# Problem 2

Further, if production code _did_ (it nowadays doesn't!) create a
duplicate layer, the code in Pageserver that handles the condition (item
1 above) is too little and too late:

* the code handles it by discarding the newer `struct Layer`; that's
good.
* however, on disk, we have already overwritten the old layer file with
the new one
* the fact that we do it atomically doesn't matter because ...
* if the new layer file is not bit-identical, then we have a cache
coherency problem
  * the PS PageCache block cache caches the old bit pattern
  * blob_io offsets stored in variables are based on the pre-overwrite
bit pattern / offsets
  * => reading from the new file based on these offsets might yield
different data than before
 
# Solution

- Remove the test suite code pertaining to Problem 1
- Move & rename test suite code that actually tests RFC-27
crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late
(Problem 1)
- Use `RENAME_NOREPLACE` to prevent overwriting an existing file during the
rename in `.finish()`, and bail with an error if that happens (Problem 2);
see the sketch after this list
  - This bailing prevents the caller from even trying to insert into the
layer map, as they don't even get a `struct Layer` at hand.
- Add `abort`s in the place where we have the layer map lock and check
for duplicates (Problem 2)
  - Note again, we can't reach there because we bail from `.finish()` much
earlier in the code.
- Share the logic to clean up after failed `.finish()` between image
layers and delta layers (drive-by cleanup)
- This exposed that test `image_layer_rewrite` was overwriting layer
files in place. Fix the test.
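
A minimal sketch of the `RENAME_NOREPLACE` idea (a hypothetical helper, not the actual `.finish()` code; it assumes the libc crate's Linux `renameat2` binding):

```rust
use std::ffi::CString;
use std::io;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

// Move the finished temp file into its final place, refusing to replace an
// existing layer file. If the target already exists, the kernel returns
// EEXIST, which the caller surfaces as an error instead of silently
// overwriting the old layer file.
fn rename_noreplace(temp: &Path, final_path: &Path) -> io::Result<()> {
    let old = CString::new(temp.as_os_str().as_bytes())?;
    let new = CString::new(final_path.as_os_str().as_bytes())?;
    let rc = unsafe {
        libc::renameat2(
            libc::AT_FDCWD,
            old.as_ptr(),
            libc::AT_FDCWD,
            new.as_ptr(),
            libc::RENAME_NOREPLACE,
        )
    };
    if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}
```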

# Future Work

This PR adds a new failure scenario that was previously "papered over"
by the overwriting of layers:
1. Start a compaction that will produce 3 layers: A, B, C
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out, sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep, another iteration starts, it attempts to
write Layer A again. But the `.finish()` **fails because A already
exists on disk**.

The failure in step 6 is new with this PR, and it **causes the
compaction to get stuck**.
Before, it would silently overwrite the file and "successfully" complete
the second iteration.

The mitigation for this is to `/reset` the tenant.
Successfully merging this pull request may close these issues.

Epic: crash-consistent layer map through index_part.json