
crash-consistent layer map through index_part.json #5198

Merged: 36 commits from no_more_noop_tests into main on Oct 17, 2023

Conversation

@koivunej (Member) commented on Sep 5, 2023

Fixes #5172 as it:

  • removes reconciliation with the remote index_part.json and accepts the remote index_part.json as the source of truth, deleting any local progress that is not yet reflected in remote storage (sketched below)
  • prefers the remote metadata over the local metadata

Additionally:

  • cleans up tests that used a single LOCAL_FS parametrization
  • adds a test case for a branched (non-bootstrapped), local-only timeline remaining available after restart
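
Roughly, the load-time behavior described above looks like the following sketch (illustrative only; `RemoteIndexPart`, `reconcile_on_load`, and the field layout are made-up names, not the actual pageserver API):

```rust
use std::collections::HashSet;
use std::path::Path;

// Hypothetical sketch: on timeline load, the remote index_part.json wins.
// Any local layer file not listed in the remote index is un-uploaded local
// progress from before the restart and gets deleted.
struct RemoteIndexPart {
    layers: HashSet<String>, // layer file names listed in remote index_part.json
    metadata: Vec<u8>,       // serialized timeline metadata stored remotely
}

fn reconcile_on_load(
    remote: &RemoteIndexPart,
    local_layer_names: &[String],
    timeline_dir: &Path,
) -> std::io::Result<Vec<u8>> {
    for name in local_layer_names {
        if !remote.layers.contains(name) {
            // Local-only layer: not reflected in remote storage, drop it.
            std::fs::remove_file(timeline_dir.join(name))?;
        }
    }
    // Prefer the remote metadata over whatever was stored locally.
    Ok(remote.metadata.clone())
}
```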

@koivunej changed the title from "No more noop tests" to "test: No more RemoteStorageKind.NOOP" on Sep 5, 2023
github-actions bot commented on Sep 5, 2023

2298 tests run: 2182 passed, 0 failed, 116 skipped (full report)


Code coverage (full report)

  • functions: 52.9% (8263 of 15614 functions)
  • lines: 80.5% (48166 of 59830 lines)

This comment is automatically updated with the latest test results; last updated for 2307cf1 at 2023-10-17T08:59:59.618Z.

@koivunej force-pushed the no_more_noop_tests branch 3 times, most recently from 453486c to 689c6d0 on September 6, 2023 18:14

Review threads on test_runner/fixtures/neon_fixtures.py and test_runner/fixtures/remote_storage.py were marked outdated and resolved.

koivunej added a commit that referenced this pull request Sep 8, 2023
Remote storage cleanup split from #5198:
- pageserver, extensions, and safekeepers now have their own separate remote
storage
- RemoteStorageKind has the configuration code
- S3Storage has the cleanup code
- with MOCK_S3, pageserver, extensions, and safekeepers use different
buckets
- with LOCAL_FS, `repo_dir / "local_fs_remote_storage" / $user` is used
as the path, where $user is `pageserver` or `safekeeper`
- no more `NeonEnvBuilder.enable_xxx_remote_storage`, but separate
`enable_{pageserver,extensions,safekeeper}_remote_storage` methods

Should not contain any functional changes. These will allow us to default to
`LOCAL_FS` for the pageserver in the next PR, remove
`RemoteStorageKind.NOOP`, and work towards #5172.

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
koivunej added a commit that referenced this pull request Sep 11, 2023
Assorted flakiness fixes from #5198; these might not be flaky on `main`.

Migrate some tests from neon_simple_env to neon_env_builder and use
initial_tenant, to make the flakiness easier to understand. (Did not
understand the flakiness of
`test_timeline_create_break_after_uninit_mark`.)

`test_download_remote_layers_api` is flaky because we have no atomic
"wait for WAL, checkpoint, wait for upload and do not receive any more
WAL".

The `test_tenant_size` fixes are just boilerplate which should have always
existed; we should wait for the tenant to become active. Similarly for
`test_timeline_delete`.

`test_timeline_size_post_checkpoint` often fails for me by reading
zero from metrics. Give it a few attempts.
@koivunej force-pushed the no_more_noop_tests branch 2 times, most recently from 69f6b62 to 10091a4 on September 11, 2023 08:52
@koivunej force-pushed the no_more_noop_tests branch 4 times, most recently from ec54dc2 to 7fd1234 on September 28, 2023 08:23
@jcsp (Collaborator) commented on Oct 16, 2023

Pushed a commit to update the test docstrings.

@jcsp (Collaborator) commented on Oct 16, 2023

The build failure came from a suggested change in a review comment; it is fixed now.

@jcsp enabled auto-merge (squash) on October 16, 2023 17:07
@jcsp (Collaborator) commented on Oct 17, 2023

Merged main to resolve a conflict.

@jcsp merged commit 9e14493 into main on Oct 17, 2023
34 checks passed
@jcsp deleted the no_more_noop_tests branch on October 17, 2023 09:04
koivunej added a commit that referenced this pull request Oct 25, 2023
- finally add an `#[instrument]` to Timeline::create_image_layers,
making it easier to see that something is happening because we are creating
image layers (illustrated below)
- format some macro context code
- add a warning not to create new validation functions, in the spirit of
"parse, don't validate"

Split off from #5198.
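
For illustration only, a minimal sketch of what a tracing `#[instrument]` attribute on such a method looks like (the struct and field below are stand-ins, not the real pageserver types):

```rust
use tracing::instrument;

struct Timeline {
    timeline_id: String, // stand-in for the real TimelineId type
}

impl Timeline {
    // Opens a tracing span per call, so any log lines emitted while image
    // layers are being created are attributable to this operation.
    #[instrument(skip_all, fields(timeline_id = %self.timeline_id))]
    fn create_image_layers(&self) {
        tracing::info!("creating image layers");
    }
}
```
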
problame added a commit that referenced this pull request Mar 1, 2024
The `writer.finish()` methods already fsync the inode,
using `VirtualFile::sync_all()`.

All that the callers need to do is fsync their directory,
i.e., the timeline directory (see the sketch below).

Note that there's a call in the new compaction code that
is apparently dead at runtime, so I couldn't fix up
any fsyncs there [Link](https://github.com/neondatabase/neon/blob/502b69b33bbd4ad1b0647e921a9c665249a2cd62/pageserver/src/tenant/timeline/compaction.rs#L204-L211).

In the grand scheme of things, layer durability probably doesn't
matter anymore because the remote storage is authoritative at all times
as of #5198. But let's not break the discipline in this commit.

part of #6663
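
A minimal sketch of that directory-fsync step, using plain std rather than the pageserver's VirtualFile wrapper:

```rust
use std::fs::File;
use std::io;
use std::path::Path;

// After a layer file has been written and fsynced (writer.finish()), make the
// new directory entry durable by fsyncing the containing (timeline) directory.
fn fsync_timeline_dir(timeline_dir: &Path) -> io::Result<()> {
    // Opening a directory read-only and calling sync_all() issues fsync(2)
    // on the directory inode.
    File::open(timeline_dir)?.sync_all()
}
```
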
problame added a commit that referenced this pull request Mar 4, 2024
The `writer.finish()` methods already fsync the inode, using
`VirtualFile::sync_all()`.

All that the callers need to do is fsync their directory, i.e., the
timeline directory.

Note that there's a call in the new compaction code that is apparently
dead at runtime, so I couldn't fix up any fsyncs there
[Link](https://github.com/neondatabase/neon/blob/502b69b33bbd4ad1b0647e921a9c665249a2cd62/pageserver/src/tenant/timeline/compaction.rs#L204-L211).

Note that layer durability still matters somewhat, even after #5198,
which made remote storage authoritative.
We do have the layer file length as an indicator, but no checksums of
the layer file contents.
So a series of overwrites without fsyncs in between, plus a
subsequent crash, could leave us in a state where the file
length matches but the contents are garbage.

part of #6663
problame added a commit that referenced this pull request Jun 4, 2024
Fixes #7790 (duplicating most
of the issue description here for posterity).

# Background

From the time before always-authoritative `index_part.json`, we had to
handle duplicate layers. See the RFC for an illustration of how
duplicate layers could happen:
https://github.com/neondatabase/neon/blob/a8e6d259cb49d1bf156dfc2215b92c04d1e8a08f/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md?plain=1#L41-L50

As of #5198, we should not be exposed to that problem anymore.

# Problem 1

We still have
1. [code in
Pageserver](https://github.com/neondatabase/neon/blob/82960b2175211c0f666b91b5258c5e2253a245c7/pageserver/src/tenant/timeline.rs#L4502-L4521)
that handles duplicate layers
2. [tests in the test
suite](https://github.com/neondatabase/neon/blob/d9dcbffac37ccd3331ec9adcd12fd20ce0ea31aa/test_runner/regress/test_duplicate_layers.py#L15)
that demonstrate the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce
a crash that could legitimately happen in production.
What it does instead is return early with an `Ok()`, so that the code
in Pageserver that handles duplicate layers (item 1) actually gets
exercised.

That "return early" would be a bug in the routine if it happened in
production.
So, the tests in the test suite are tests for their own sake, but don't
serve to actually regress-test any production behavior.

# Problem 2

Further, if production code _did_ (it nowadays doesn't!) create a
duplicate layer, the code in Pageserver that handles the condition (item
1 above) is too little and too late:

* the code handles it by discarding the newer `struct Layer`; that's
good.
* however, on disk, we have already overwritten the old layer file with
the new one
* the fact that we do it atomically doesn't matter because ...
* if the new layer file is not bit-identical, then we have a cache
coherency problem
  * the PS PageCache block cache caches the old bit pattern
  * blob_io offsets stored in variables are based on the pre-overwrite
bit pattern / offsets
  * => reading from the new file based on these offsets might yield
different data than before
 
# Solution

- Remove the test suite code pertaining to Problem 1
- Move & rename test suite code that actually tests RFC-27
crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late
(Problem 1)
- Use `RENAME_NOREPLACE` to prevent overwriting an existing file during the
rename in `.finish()`, and bail with an error if that happens (Problem 2);
see the sketch after this list
  - This bailing prevents the caller from even trying to insert into the
layer map, as they don't even get a `struct Layer` at hand.
- Add `abort`s in the place where we have the layer map lock and check
for duplicates (Problem 2)
  - Note again, we can't reach there because we bail from `.finish()` much
earlier in the code.
- Share the logic to clean up after failed `.finish()` between image
layers and delta layers (drive-by cleanup)
- This exposed that test `image_layer_rewrite` was overwriting layer
files in place. Fix the test.
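
A minimal sketch of the `RENAME_NOREPLACE` idea (a hypothetical helper, not the actual `.finish()` code; it assumes the libc crate's Linux `renameat2` binding):

```rust
use std::ffi::CString;
use std::io;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

// Move the finished temp file into its final place, refusing to replace an
// existing layer file. If the target already exists, the kernel returns
// EEXIST, which the caller surfaces as an error instead of silently
// overwriting the old layer file.
fn rename_noreplace(temp: &Path, final_path: &Path) -> io::Result<()> {
    let old = CString::new(temp.as_os_str().as_bytes())?;
    let new = CString::new(final_path.as_os_str().as_bytes())?;
    let rc = unsafe {
        libc::renameat2(
            libc::AT_FDCWD,
            old.as_ptr(),
            libc::AT_FDCWD,
            new.as_ptr(),
            libc::RENAME_NOREPLACE,
        )
    };
    if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}
```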

# Future Work

This PR adds a new failure scenario that was previously "papered over"
by the overwriting of layers:
1. Start a compaction that will produce 3 layers: A, B, C
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out, sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep, another iteration starts, it attempts to
write Layer A again. But the `.finish()` **fails because A already
exists on disk**.

The failure in step 6 is new with this PR, and it **causes the
compaction to get stuck**.
Before, it would silently overwrite the file and "successfully" complete
the second iteration.

The mitigation for this is to `/reset` the tenant.
Successfully merging this pull request may close these issues.

Epic: crash-consistent layer map through index_part.json