Handmade Release 2024-05-13 #7735
Merged
Conversation
This is used by safekeeper tests.
Once all the computes in production have restarted, we can remove protocol version 1 altogether. See issue #6211.
(…7594)

## Problem
In testing of the earlier fix for OOMs under heavy write load (#7218), we saw that the limit on ephemeral layer size wasn't being reliably enforced. That was diagnosed as being due to overwhelmed compaction loops: most tenants were waiting on the semaphore for background tasks, and thereby not running the function that proactively rolls layers frequently enough. Related: #6939

## Summary of changes
- Create a new per-tenant background loop for "ingest housekeeping", which invokes maybe_freeze_ephemeral_layer() without taking the background task semaphore.
- Downgrade to DEBUG a log line in maybe_freeze_ephemeral_layer that had been INFO, but turns out to be pretty common in the field.

There's some discussion on the issue (#6939 (comment)) about alternatives for calling maybe_freeze_ephemeral_layer periodically without it getting stuck behind compaction. A whole task just for this feels like a big hammer, but we may in the future find other pieces of lightweight housekeeping that we want to do here too.

Why is it okay to call maybe_freeze_ephemeral_layer outside of the background tasks semaphore?
- This is the same work we would do anyway if we received writes from the safekeeper, just done a bit sooner.
- The period of the new task is generously jittered (+/- 5%), so when the ephemeral layer size tips over the threshold, we shouldn't see an excessively aggressive thundering herd of layer freezes (and only layers larger than the mean layer size will be frozen); see the jitter sketch after this description.
- All that said, this is an imperfect approach that relies on having a generous amount of RAM to dip into when we need to freeze somewhat urgently. It would be nice in the future to also block compaction/GC when we recognize resource stress and need to do other work (like layer freezing) to reduce memory footprint.
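The jitter mechanism is simple enough to sketch. Below is a minimal illustration of a ±5% jittered wakeup period as described above; the function name, the 10-second base interval, and the use of the `rand` crate are assumptions for illustration, not the actual pageserver code.

```rust
use rand::Rng;
use std::time::Duration;

/// Scale the base housekeeping period by a random factor in [0.95, 1.05],
/// so tenants crossing the ephemeral-layer threshold at the same time do
/// not all freeze layers in the same instant (hypothetical helper).
fn jittered_period(base: Duration) -> Duration {
    let factor: f64 = rand::thread_rng().gen_range(0.95..=1.05);
    base.mul_f64(factor)
}

fn main() {
    let base = Duration::from_secs(10); // assumed base interval
    // The ingest-housekeeping loop would sleep this long between calls to
    // maybe_freeze_ephemeral_layer(), outside the background-task semaphore.
    println!("next wakeup in {:?}", jittered_period(base));
}
```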
## Problem
The current `tenant_slots` metric becomes less useful once we have lots of secondaries, because we can't tell how many tenants are really attached (without doing a sum() on some other metric).

## Summary of changes
- Add a `mode` label to this metric.
- Update the metric with `slot_added` and `slot_removed` helpers that are called at all the places we mutate the tenants map (see the sketch below).
- Add a debug assertion at shutdown that checks the metrics add up to the right number, as a cheap way of validating that we're calling the metric hooks in all the right places.
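As a rough illustration of the helper pattern (not the actual pageserver code), here is a minimal sketch using the `prometheus` crate; the metric name and the `mode` label come from the PR text, while the label values and registration details are assumptions.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

// `tenant_slots` with a `mode` label, per the description above.
static TENANT_SLOTS: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!("tenant_slots", "Tenant slots by mode", &["mode"])
        .expect("failed to register metric")
});

// Helpers called wherever the tenants map is mutated.
fn slot_added(mode: &str) {
    TENANT_SLOTS.with_label_values(&[mode]).inc();
}

fn slot_removed(mode: &str) {
    TENANT_SLOTS.with_label_values(&[mode]).dec();
}

fn main() {
    slot_added("attached"); // assumed label values
    slot_added("secondary");
    slot_removed("secondary");
}
```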
s/temporary/temporarily --------- Co-authored-by: Barry Grenon <barry_grenon@yahoo.ca>
We don't actually use it. refs #7555
#7585 introduced a test case for deletions while synthetic size is being calculated. The test has a race against deletion, but we only accepted one outcome. Fix it to accept 404 as well, as we cannot control from outside which outcome happens. Evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7456/8970595458/index.html#/testresult/32a5b2f8c4094bdb
Changes parameters to fix the flakiness of `test_ts_of_lsn_api`. The test is already only mildly flaky; with this change, it is even less so. cc #5768
## Problem
There is allegedly a bug where connecting a subscriber before WAL has been downloaded from the safekeeper produces an error.

## Summary of changes
Adds support for pausing safekeepers from sending WAL to computes, then creates a compute and attaches a subscriber while it's in this paused state. Fails to reproduce the issue, but probably a good test to have. --------- Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
## Problem
Usually, the connection itself is quite fast (below 10ms for p999: https://neonprod.grafana.net/goto/aOyn8vYIg?orgId=1). It doesn't make sense to wait a long time for the lock: if it takes a long time to acquire, something has probably gone wrong. We also spawn a lot of retries, but they are not very helpful (0 means the connection succeeded; 1 most probably means a re-request of the compute node address: https://neonprod.grafana.net/goto/J_8VQvLIR?orgId=1). Let's try to keep a small number of retries.
## Problem
After a shard split of a large existing tenant, child tenants can end up with oversized historic layers indefinitely, if those layers are prevented from being GC'd by branchpoints.

This PR is followed by #7531. Related issue: #7504

## Summary of changes
- Add a new compaction phase `compact_shard_ancestors`, which identifies layers that are no longer needed after a shard split (see the sketch below).
- Add a Timeline->LayerMap code path called `rewrite_layers`, which is currently only used to drop layers, but will later be used to rewrite them as well in #7531.
- Add a new test that compacts after a split and checks that something is deleted.

Note that this doesn't have much impact on a tenant's resident size (since unused layers would end up evicted anyway), but it:
- Makes index_part.json much smaller.
- Makes the system easier to reason about: instead of tenants that effectively say "my physical size is 4TiB but don't worry, I'll never actually download it", tenants report the real physical size of what they might download.

Why do we remove these layers in compaction rather than during the split? Because we have existing split tenants that need cleaning up. We can add it to the split operation in the future as an optimization.
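To make the filtering idea concrete, here is a heavily simplified sketch of the decision `compact_shard_ancestors` has to make: an inherited pre-split layer can be dropped if none of its keys belong to this child shard. The key-to-shard mapping, types, and names below are stand-ins for illustration, not the real striped hashing.

```rust
#[derive(Clone, Copy, PartialEq)]
struct ShardNumber(u8);

// Placeholder for the real key-to-shard hash used by shard splitting.
fn shard_of(key: u64, shard_count: u8) -> ShardNumber {
    ShardNumber((key % shard_count as u64) as u8)
}

// A layer inherited from the parent shard is still needed only if at least
// one of its keys maps to this child shard.
fn layer_is_needed(keys: &[u64], this_shard: ShardNumber, shard_count: u8) -> bool {
    keys.iter().any(|&k| shard_of(k, shard_count) == this_shard)
}

fn main() {
    let this_shard = ShardNumber(1);
    let inherited_layer_keys = [0u64, 2, 4, 6]; // all map to shard 0 of 2
    if !layer_is_needed(&inherited_layer_keys, this_shard, 2) {
        println!("layer can be dropped via rewrite_layers");
    }
}
```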
## Problem
Timelines cannot be deleted if they have children. In many production cases, a branch has been created off the main branch, effectively becoming a "new main". This feature makes it possible to detach a timeline from its ancestor by copying all of the data before the branchpoint into the detached timeline, and by reparenting all of the ancestor's earlier branches onto the detached timeline.

## Summary of changes
- The previously added copy_lsn_prefix functionality is used.
- RemoteTimelineClient learns to adopt layers by copying them from another timeline.
- LayerManager adds support for adding adopted layers.
- `timeline::Timeline::{prepare_to_detach,complete_detaching}_from_ancestor` and `timeline::detach_ancestor` are added.
- HTTP PUT handler.

Cc: #6994 Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
Ref https://neondb.slack.com/archives/C036U0GRMRB/p1688122168477729

## Summary of changes
- Add the sha from the postgres repo into the Postgres version string (via `--with-extra-version`).
- Add a test that the Postgres version matches the expected one.
- Remove the build-time hard check and allow only the related tests to fail.
(…7637)

The `test_forward_compatibility` test runs the old production binaries, but is supposed to always run the latest neon_local binary. Commit 6acbee2 likely broke that by accident: in that commit, `from_repo_dir` was introduced and runs an `init_start()` before `test_forward_compatibility` gets a chance to patch up the `neon_local_binpath`.
RemoteTimelineClient has a lot of mandatory cloning. By using a single way of creating IndexPart out of UploadQueueInitialized, we can simplify things and also avoid cloning the latest files for each `index_part.json` upload (the contents will still be cloned). A sketch of the single construction point follows.
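A minimal sketch of what a single construction point could look like, with stand-in types; the real `IndexPart` and `UploadQueueInitialized` live in the pageserver crate and have many more fields.

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the real pageserver types.
#[derive(Clone)]
#[allow(dead_code)]
struct LayerFileMetadata {
    file_size: u64,
}

struct UploadQueueInitialized {
    latest_files: HashMap<String, LayerFileMetadata>,
    disk_consistent_lsn: u64,
}

struct IndexPart {
    layer_metadata: HashMap<String, LayerFileMetadata>,
    disk_consistent_lsn: u64,
}

impl IndexPart {
    /// The single construction point described above: build the contents of
    /// the next `index_part.json` upload from the initialized queue. The
    /// contents are still cloned, but only in this one place.
    fn from_upload_queue(q: &UploadQueueInitialized) -> Self {
        IndexPart {
            layer_metadata: q.latest_files.clone(),
            disk_consistent_lsn: q.disk_consistent_lsn,
        }
    }
}

fn main() {
    let q = UploadQueueInitialized {
        latest_files: HashMap::new(),
        disk_consistent_lsn: 0,
    };
    let _index = IndexPart::from_upload_queue(&q);
}
```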
Preceding PR #7613 reduced the usage of `--pageserver-config-override`. This PR builds on top of that work and fully removes `neon_local --pageserver-config-override`.

Tests that need a non-default `pageserver.toml` control it using two options:
1. Specify `NeonEnvBuilder.pageserver_config_override` before `NeonEnvBuilder.init_start()`. This uses a new `neon_local init --pageserver-config` flag.
2. After `init_start()`: `env.pageserver.stop()` + `NeonPageserver.edit_config_toml()` + `env.pageserver.start()`.

A few test cases were using `env.pageserver.start(overrides=("--pageserver-config-override...",))`. I changed them to use one of the options above.

Future Work
-----------
The `neon_local init --pageserver-config` flag still uses `pageserver --config-override` under the hood. In the future, neon_local should just write the `pageserver.toml` directly. The `NeonEnvBuilder.pageserver_config_override` field should be renamed to `pageserver_initial_config`. Let's save this churn for a separate refactor commit.
This pull request adds the new basebackup read path + aux file write path. In the regression test, all logical replication tests are run with both aux_file_v2=false and aux_file_v2=true. Also fixed the vectored get code path to correctly return a missing key error when called from the unified sequential get code path. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
In #7531, we would like to be able to rewrite layers safely. One option is to make `Layer` able to rewrite files in place safely (e.g. by blocking evictions/deletions for an old Layer while a new one is created), but that's relatively fragile. It's more robust in general if we simply never overwrite the same local file: we can do that by putting the generation number in the filename.

## Summary of changes
- Add `local_layer_path` (counterpart to `remote_layer_path`) and convert all locations that manually constructed a local layer path by joining LayerFileName to the timeline path; a sketch follows below.
- In the layer upload path, construct remote paths with `remote_layer_path` rather than trying to build them out of a local path.
- During startup, carry the full path to layer files through `init::reconcile`, and pass it into `Layer::for_resident`.
- Add a test to make sure we handle upgrades properly.
- Comment out the generation part of `local_layer_path`, since we need to maintain forward compatibility for one release. A tiny followup PR will enable it afterwards.

We could make this a bit simpler if we bulk renamed existing layers on startup instead of carrying literal paths through init, but that is operationally risky on existing servers with millions of layer files. We can always do a renaming change in the future if it becomes annoying, but for the moment it's nice to have a structure that lets us change local path names again easily.

We should rename `LayerFileName` to `LayerName` or some such, to make it more obvious that it's not a literal filename: this was already a bit confusing where that type is used in remote paths. That will be a followup, to avoid polluting this PR's diff.
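A sketch of the idea behind `local_layer_path`, including the commented-out generation suffix mentioned above; the path layout, parameter types, and suffix format are simplified assumptions, not the real code.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical counterpart to `remote_layer_path`: build the local path
/// for a layer file. The generation suffix is present but commented out,
/// mirroring the forward-compatibility note above.
fn local_layer_path(timeline_dir: &Path, layer_file_name: &str, _generation: u32) -> PathBuf {
    // Once the compatibility window has passed, a suffix like the one below
    // would make every rewrite produce a fresh local file instead of
    // overwriting the old one:
    // let file = format!("{layer_file_name}-{:08x}", _generation);
    let file = layer_file_name.to_string();
    timeline_dir.join(file)
}

fn main() {
    let dir = PathBuf::from("/data/tenants/t1/timelines/tl1"); // example path
    println!("{}", local_layer_path(&dir, "example_layer_file_name", 3).display());
}
```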
## Problem
Some HTTP client connections can stay open for quite a long time.

## Summary of changes
When there are too many HTTP client connections, pick a random connection and gracefully cancel it.
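A minimal sketch of the graceful-cancel policy, under the assumption that each connection task carries a `CancellationToken`; the actual proxy wiring may differ.

```rust
use rand::seq::SliceRandom;
use tokio_util::sync::CancellationToken;

/// When the number of open HTTP client connections exceeds the limit, pick
/// one at random and cancel it; the connection task is assumed to observe
/// its token and drain gracefully.
fn maybe_evict(connections: &[CancellationToken], limit: usize) {
    if connections.len() > limit {
        if let Some(victim) = connections.choose(&mut rand::thread_rng()) {
            victim.cancel();
        }
    }
}

fn main() {
    let conns: Vec<CancellationToken> = (0..5).map(|_| CancellationToken::new()).collect();
    maybe_evict(&conns, 3);
}
```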
A couple of lines moved further down in main(), and one case of using Option<&str> instead of Option<&String>.
This commit is intentionally designed to have as small a diff as possible. To that end, the basic idea is that each distinct "chunk" of the previous main() has been wrapped in its own function, with the return values from each function being passed directly into the next. The structure of main() is now visible from its contents: a handful of calls to smaller functions. There's a lot of other work that could (should?) be done beyond this, but that's more opinionated, and this should be a solid start. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
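Schematically, the refactor pattern looks like the following; all names here are illustrative, not the real functions.

```rust
// Each "chunk" of the old main() becomes a function whose return value
// feeds the next, so main() reads as a pipeline.
struct Cli;
struct Config;
struct Runtime;

fn parse_cli() -> Cli { Cli }
fn load_config(_cli: &Cli) -> Config { Config }
fn start_runtime(_config: &Config) -> Runtime { Runtime }
fn run(_rt: Runtime) {}

fn main() {
    let cli = parse_cli();
    let config = load_config(&cli);
    let runtime = start_runtime(&config);
    run(runtime);
}
```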
Part of applying the changes from #7600. This piece *technically* can change the semantics because now the context guard is held before process_cli, but... the difference is likely quite small. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Introduced by refactorings from #7577. See an example check-macos-build failure here: https://github.com/neondatabase/neon/actions/runs/8992211409/job/24701531264
These log lines were repeated, and `test_detached_receives_flushes_while_being_detached` had an incomplete definition. Example failure: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7531/8989511410/index.html#suites/a1c2be32556270764423c495fad75d47/992897d3a3369210
…-init --config-override` (#7638) This does to `neon_local` what neondatabase/infra#1322 does to our production deployment. After both are merged, there are no users of `pageserver --init` / `pageserver --config-override` left, and we can remove those flags eventually.
## Problem
If a permit cannot be acquired to connect to compute, the cache is invalidated. This had the observed effect of sending more traffic to ProxyWakeCompute on cplane.

## Summary of changes
Make sure that permit-acquire failures are marked as "should not invalidate cache".
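A sketch of the classification idea with assumed names; the real proxy error types differ.

```rust
#[derive(Debug)]
enum ConnectError {
    PermitAcquireTimeout, // hypothetical variant for the failure above
    ComputeUnreachable,
}

impl ConnectError {
    fn should_invalidate_cache(&self) -> bool {
        // A permit timeout says nothing about the cached compute address
        // being stale, so keep the entry and avoid extra ProxyWakeCompute
        // calls to cplane.
        !matches!(self, ConnectError::PermitAcquireTimeout)
    }
}

fn main() {
    assert!(!ConnectError::PermitAcquireTimeout.should_invalidate_cache());
    assert!(ConnectError::ComputeUnreachable.should_invalidate_cache());
}
```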
We didn't check permission in the `/v1/failpoints` endpoint, which meant that anyone with a per-tenant token could modify failpoints. This commit fixes that.
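Conceptually, the fix adds a scope check at the top of the handler. A simplified sketch with stand-in types; the real handler uses the pageserver's auth claims.

```rust
enum Scope {
    Global,
    Tenant(String),
}

// Failpoint configuration affects the whole pageserver, so it must require
// a global token scope, not a per-tenant one.
fn check_permission(claims: &Scope) -> Result<(), String> {
    match claims {
        Scope::Global => Ok(()),
        Scope::Tenant(_) => Err("per-tenant token may not configure failpoints".into()),
    }
}

fn failpoints_handler(claims: &Scope) -> Result<(), String> {
    check_permission(claims)?; // previously missing, per the commit message
    // ... apply the requested failpoint configuration ...
    Ok(())
}

fn main() {
    assert!(failpoints_handler(&Scope::Global).is_ok());
    assert!(failpoints_handler(&Scope::Tenant("t1".into())).is_err());
}
```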
We had accidentally left two endpoints for `tenant`: `/synthetic_size` and `/size`. `/size` had the more extensive description but has returned 404 since the renaming. Remove `/size` in favor of the working endpoint and describe the `text/html` output.
In addition to layer names, expand the input vocabulary to recognize lines of the form:

${kind}:${lsn}

where:
- kind is `gc_cutoff` or `branch`
- lsn is accepted in Lsn display format (x/y) or hex (as used in layer names)

gc_cutoff and branch have different colors. A parsing sketch follows.
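A sketch of the extended parser; `Lsn` is represented as a plain u64 and error handling is simplified, so this is an illustration of the input grammar rather than the tool's actual code.

```rust
/// Parse an annotation line of the form `${kind}:${lsn}`, where kind is
/// `gc_cutoff` or `branch` and lsn is either "hi/lo" (Lsn display format,
/// hex components) or plain hex as used in layer names.
fn parse_annotation(line: &str) -> Option<(&str, u64)> {
    let (kind, lsn_str) = line.split_once(':')?;
    if kind != "gc_cutoff" && kind != "branch" {
        return None;
    }
    let lsn = if let Some((hi, lo)) = lsn_str.split_once('/') {
        (u64::from_str_radix(hi, 16).ok()? << 32) | u64::from_str_radix(lo, 16).ok()?
    } else {
        u64::from_str_radix(lsn_str, 16).ok()?
    };
    Some((kind, lsn))
}

fn main() {
    assert_eq!(parse_annotation("gc_cutoff:0/16B9188"), Some(("gc_cutoff", 0x16B9188)));
    assert!(parse_annotation("unknown:0/1").is_none());
}
```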
When switching the repository to nextest in f28bdb6, we had not noticed that it doesn't yet support running doctests. Run the doctests before the other tests.
The old test was based on the immutable `target_file_size` that was a parameter to the function. It makes no sense to go further once `current_level_target_height` has reached `u64::MAX`, as LSNs are u64-typed. In practice, we should only run into this if there is a bug, as the practical LSN range usually ends much earlier. Testing on `target_file_size` makes less sense; it basically implements an invocation mode that turns off the looping and only runs one iteration of it. @hlinnaka agrees that `current_level_target_height` is better here. A sketch of the revised bound follows. Part of #7554
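A sketch of the revised loop bound; the fanout constant and the loop body are assumptions for illustration, not the real compaction code.

```rust
fn main() {
    // Assumed growth factor per level; the real code derives its step differently.
    let fanout: u64 = 10;
    let mut current_level_target_height: u64 = 1024;
    loop {
        // ... one iteration of compaction would run at this height ...
        if current_level_target_height > u64::MAX / fanout {
            // The next step would exceed u64::MAX; since LSNs are u64, there
            // is nothing left to cover. Practically only reachable on a bug.
            break;
        }
        current_level_target_height *= fanout;
    }
    println!("stopped at height {current_level_target_height}");
}
```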
Split checkpoint_stats into two separate metrics: checkpoints_req and checkpoints_timed. Fixes commit 21e1a49 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>
The first implementation (#7456) did not include `index_part.json` changes, in an attempt to keep the amount of change down. This PR tracks the historic reparentings and the earlier detach in `index_part.json`:
- `index_part.json` receives a new field `lineage: Lineage` (sketched below)
- `Lineage` is queried through RemoteTimelineClient during basebackup, creating `PREV LSN: none` for the invalid prev record lsn, just as it would have been created for a newly created timeline
- as `struct IndexPart` grew, it is now boxed in places

Cc: #6994
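A sketch of what the new field could look like with serde, using assumed field names inside `Lineage`; the real definition lives in the pageserver crate.

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical shape of the lineage record described above.
#[derive(Serialize, Deserialize, Default)]
struct Lineage {
    /// Timeline IDs reparented onto this timeline during detach operations.
    reparenting_history: Vec<String>,
    /// The (timeline_id, lsn) this timeline was detached from, if any.
    original_ancestor: Option<(String, u64)>,
}

#[derive(Serialize, Deserialize)]
struct IndexPart {
    // Defaulted so that old index_part.json files still deserialize.
    #[serde(default)]
    lineage: Lineage,
    // ... layer metadata, disk_consistent_lsn, etc.
}

fn main() {
    let json = r#"{"lineage":{"reparenting_history":[],"original_ancestor":null}}"#;
    let part: IndexPart = serde_json::from_str(json).unwrap();
    let _ = part.lineage;
}
```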
## Problem "John pointed out that the switch to protocol version 2 made test_gc_aggressive test flaky: #7692. I tracked it down, and that is indeed an issue. Conditions for hitting the issue: The problem occurs in the primary GC horizon is set to a very low value, e.g. 0. If the primary is actively writing WAL, and GC runs in the pageserver at the same time that the primary sends a GetPage request, it's possible that the GC advances the GC horizon past the GetPage request's LSN. I'm working on a fix here: #7708." - Heikki ## Summary of changes Use protocol version 1 as default.
VladLazar requested review from hlinnaka, arssher, khanova and mattpodraza, and removed the request for a team (May 13, 2024 13:24).
arssher approved these changes (May 13, 2024).
hlinnaka approved these changes (May 13, 2024).
3060 tests run: 2928 passed, 0 failed, 132 skipped (full report). Code coverage* (full report).
* collected from Rust tests only. The comment gets automatically updated with the latest test results.
1010431 at 2024-05-13T14:11:56.998Z
Reviewed on behalf of storage, but cannot approve since I opened the PR.
Handmade Release on 2024-05-13.
Based on #7713 and cherry-picked #7727.