Storage release 2024-09-07 #8959

vipvap · 2024-09-07T11:18:12Z

Storage release 2024-09-07

Please merge this Pull Request using 'Create a merge commit' button

## Problem This is a followup to #8783 - The old blocking ensure_attached function had been retained to handle the case where a shard had a None generation_pageserver, but this wasn't really necessary. - There was a subtle `.1` in the code where a struct would have been clearer Closes #8819 ## Summary of changes - Add ShardGenerationState to represent the results of peek_generation - Instead of calling ensure_attached when a tenant has a non-attached shard, check the shard's policy and return 409 if it isn't Attached, else return 503 if the shard's policy is attached but it hasn't been reconciled yet (i.e. has a None generation_pageserver)

…#8680) Implement the timeline specific `archival_config` endpoint also in the storage controller. It's mostly a copy-paste of the detach handler: the task is the same: do the same operation on all shards. Part of #8088.

part of #8002

… as well (#8891) Found in https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8885/10665614629/index.html#suites/0fbaeb107ef328d03993d44a1fb15690/ea10ba1c140fba1d

…8885) Before this PR, we would classify layer summary block reads as "Unknown" content kind. <img width="1267" alt="image" src="https://github.com/user-attachments/assets/508af034-5c2a-4c89-80db-2899967b337c">

## Problem Metrics event idempotency keys differ across S3 and Vector. The events should be identical. Resolves #8605. ## Summary of changes Pre-generate the idempotency keys and pass the same set into both metrics sinks. Co-authored-by: John Spray <john@neon.tech>

## Problem Neon local set-up does not inject an az id in `metadata.json`. See real change in #8852. ## Summary of changes We piggyback on the existing `availability_zone` pageserver configuration in order to avoid making neon local even more complex.

It is used in many places, let's reduce number of ? on construction results.

Endpoint implementation sends msg to manager requesting to do the reset. Manager stops current partial backup upload task if it exists and performs the reset. Also slightly tweak eviction condition: all full segments before flush_lsn must be uploaded (and committed) and there must be only one segment left on disk (partial). This allows to evict timelines which started not on the first segment and didn't fill the whole segment (previous condition wasn't good because last_removed_segno was 0). ref #8759

Make all them snake case.

wal_storage.rs already checks this, but since this is a quite legit scenario check it at safekeeper.rs (consensus level) as well. ref #8212 This is a take 2; previous PR #8640 had been reverted because interplay with another change broke test_last_log_term_switch.

…8621) ## Problem Currently, DatadirModification keeps a key-indexed map of all pending writes, even though we (almost) never need to read back dirty pages for anything other than metadata pages (e.g. relation sizes). Related: #6345 ## Summary of changes - commit() modifications before ingesting database creation wal records, so that they are guaranteed to be able to get() everything they need directly from the underlying Timeline. - Split dirty pages in DatadirModification into pending_metadata_pages and pending_data_pages. The data ones don't need to be in a key-addressable format, so they just go in a Vec instead. - Special case handling of zero-page writes in DatadirModification, putting them in a map which is flushed on the end of a WAL record. This handles the case where during ingest, we might first write a zero page, and then ingest a postgres write to that page. We used to do this via the key-indexed map of writes, but in this PR we change the data page write path to not bother indexing these by key. My least favorite thing about this PR is that I needed to change the DatadirModification interface to add the on_record_end call. This is not very invasive because there's really only one place we use it, but it changes the object's behaviour from being clearly an aggregation of many records to having some per-record state. I could avoid this by implicitly doing the work when someone calls set_lsn or commit -- I'm open to opinions on whether that's cleaner or dirtier. ## Performance There may be some efficiency improvement here, but the primary motivation is to enable an earlier stage of ingest to operate without access to a Timeline. The `pending_data_pages` part is the "fast path" bulk write data that can in principle be generated without a Timeline, in parallel with other ingest batches, and ultimately on the safekeeper. `test_bulk_insert` on AX102 shows approximately the same results as in the previous PR #8591: ``` ------------------------------ Benchmark results ------------------------------- test_bulk_insert[neon-release-pg16].insert: 23.577 s test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB test_bulk_insert[neon-release-pg16].peak_mem: 637 MB test_bulk_insert[neon-release-pg16].size: 0 MB test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB test_bulk_insert[neon-release-pg16].wal_recovery: 18.264 s test_bulk_insert[neon-release-pg16].compaction: 0.052 s ```

Set the field to optional, otherwise there will be decode errors when newer version of the storage controller receives the JSON from older version of the pageservers. Signed-off-by: Alex Chi Z <chi@neon.tech>

ref #8872 ## Summary of changes We saw stuck storage scrubber in staging caused by infinite retries. I believe here we should use `min` instead of `max` to avoid getting minutes or hours of retry backoff. Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem Each test might wait for up to 5s in order to HB the pageserver. ## Summary of changes Make the heartbeat interval configurable and use a really tight one for neon local => startup quicker

We currently do not record safekeepers in the storage controller database. We want to migrate timelines across safekeepers eventually, so start recording the safekeepers on deploy. Cc: #8698

…#8910) Commit cfa45ff (PR #8860) updated the vendor/postgres submodules, but didn't use the same commit SHAs that were pushed as the corresponding REL_*_STABLE_neon branches in the postgres repository. The contents were the same, but the REL_*_STABLE_neon branches pointed to squashed versions of the commits, whereas the SHAs used in the submodules referred to the pre-squash revisions. Note: The vendor/postgres-v14 submodule still doesn't match with the tip of REL_14_STABLE_neon branch, because there has been one more commit on that branch since then. That's another confusion which we should fix, but let's do that separately. This commit doesn't change the code that gets built in any way, only changes the submodule references to point to the correct SHAs in the REL_*_STABLE_neon branch histories, rather than some detached commits.

## Problem The initial implementation of the validate API treats the in-memory generations as authoritative. - This is true when only one storage controller is running, but if a rogue controller was running that hadn't been shut down properly, and some pageserver requests were routed to that bad controller, it could incorrectly return valid=true for stale generations. - The generation in the main in-memory map gets out of date while a live migration is in flight, and if the origin location for the migration tries to do some deletions even though it is in AttachedStale (for example because it had already started compaction), these might be wrongly validated + executed. ## Summary of changes - Continue to do the in-memory check: if this returns valid=false it is sufficient to reject requests. - When valid=true, do an additional read from the database to confirm the generation is fresh. - Revise behavior for validation on missing shards: this used to always return valid=true as a convenience for deletions and shard splits, so that pageservers weren't prevented from completing any enqueued deletions for these shards after they're gone. However, this becomes unsafe when we consider split brain scenarios. We could reinstate this in future if we wanted to store some tombstones for deleted shards. - Update test_scrubber_physical_gc to cope with the behavioral change: they must now explicitly flush the deletion queue before splits, to avoid tripping up on deletions that are enqueued at the time of the split (these tests assert "scrubber deletes nothing", which check fails if the split leaves behind some remote objects that are legitimately GC'able) - Add `test_storage_controller_validate_during_migration`, which uses failpoints to create a situation where incorrect generation validation during a live migration could result in a corruption The rate of validate calls for tenants is pretty low: it happens as a consequence deletions from GC and compaction, which are both concurrency-limited on the pageserver side.

There was a confusion on the REL_14_STABLE_neon branch. PR neondatabase/postgres#471 was merged ot the branch, but the corresponding PRs on the other REL_15_STABLE_neon and REL_16_STABLE_neon branches were not merged. Also, the submodule reference in the neon repository was never updated, so even though the REL_14_STABLE_neon branch contained the commit, it was never used. That PR neondatabase/postgres#471 was a few bricks shy of a load (no tests, some differences between the different branches), so to get us to a good state, revert that change from the REL_14_STABLE_neon branch. This PR in the neon repository updates the submodule reference past two commites on the REL_14_STABLE_neon branch: first the commit from PR neondatabase/postgres#471, and immediately after that the revert of the same commit. This brings us back to square one, but now the submodule reference matches the tip of the REL_14_STABLE_neon branch again.

Part of #8623 ## Summary of changes It seems that we have tenants with aux policy set to v1 but don't have any aux files in the storage. It is still safe to force migrate them without notifying the customers. This patch adds more details to the warning to identify the cases where we have to reach out to the users before retiring aux v1. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

) ## Problem A tenant may ingest a lot of data between being drained for node restart and being moved back in the fill phase. This is expensive and causes the fill to stall. ## Summary of changes We make a tactical change to reduce secondary warm-up time for migrations in fills.

This reduces the per-request timeout to 10sec while keeping the total retry duration at 1min. Relates: neondatabase/cloud#15944

…7656) This PR simplifies the pageserver configuration parsing as follows: * introduce the `pageserver_api::config::ConfigToml` type * implement `Default` for `ConfigToml` * use serde derive to do the brain-dead leg-work of processing the toml document * use `serde(default)` to fill in default values * in `pageserver` crate: * use `toml_edit` to deserialize the pageserver.toml string into a `ConfigToml` * `PageServerConfig::parse_and_validate` then * consumes the `ConfigToml` * destructures it exhaustively into its constituent fields * constructs the `PageServerConfig` The rules are: * in `ConfigToml`, use `deny_unknown_fields` everywhere * static default values go in `pageserver_api` * if there cannot be a static default value (e.g. which default IO engine to use, because it depends on the runtime), make the field in `ConfigToml` an `Option` * if runtime-augmentation of a value is needed, do that in `parse_and_validate` * a good example is `virtual_file_io_engine` or `l0_flush`, both of which need to execute code to determine the effective value in `PageServerConf` The benefits: * massive amount of brain-dead repetitive code can be deleted * "unused variable" compile-time errors when removing a config value, due to the exhaustive destructuring in `parse_and_validate` * compile-time errors guide you when adding a new config field Drawbacks: * serde derive is sometimes a bit too magical * `deny_unknown_fields` is easy to miss Future Work / Benefits: * make `neon_local` use `pageserver_api` to construct `ConfigToml` and write it to `pageserver.toml` * This provides more type safety / coompile-time errors than the current approach. ### Refs Fixes #3682 ### Future Work * `remote_storage` deser doesn't reject unknown fields #8915 * clean up `libs/pageserver_api/src/config.rs` further * break up into multiple files, at least for tenant config * move `models` as appropriate / refine distinction between config and API models / be explicit about when it's the same * use `pub(crate)` visibility on `mod defaults` to detect stale values

Sometimes, the benchmarks fail to start up pageserver in 10s without any obvious reason. Benchmarks run sequentially on otherwise idle runners. Try running `sync(2)` after each bench to force a cleaner slate. Implement this via: - SYNC_AFTER_EACH_TEST environment variable enabled autouse fixture - autouse fixture seems to be outermost fixture, so it works as expected - set SYNC_AFTER_EACH_TEST=true for benchmarks in build_and_test workflow Evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10678984691/index.html#suites/5008d72a1ba3c0d618a030a938fc035c/1210266507534c0f/ --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>

There is a bug in `yielding_loop` that causes it to never yield. ## Summary of changes Fixed the bug. `i + 1 % interval == 0` will always evaluate to `i + 1 == 0` which is false ([Playground](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=68e6ca393a02113cb7720115c2842e75)). This function is called in 2 places [here](https://github.com/neondatabase/neon/blob/99fa1c36004d710c65a47ffefaf66b4b5c6b4ce1/pageserver/src/tenant/secondary/scheduler.rs#L389) and [here](https://github.com/neondatabase/neon/blob/99fa1c36004d710c65a47ffefaf66b4b5c6b4ce1/pageserver/src/tenant/secondary/heatmap_uploader.rs#L152) with `interval == 1000` in both cases. This may change the performance of the system since now we are yielding to tokio. Also, this may expose undefined behavior since it is now possible for tasks to be moved between threads/whatever tokio does to tasks. However, this was the intention of the author of the code.

## Problem Building on MacOS failed due to missing m4. Although a window was popping up claiming to install m4, this was not helping. ## Summary of changes Add instructions to install m4 using brew and link it (thanks to Folke for helping).

## Problem #8852 introduced a new nullable column for the `nodes` table: `availability_zone_id` ## Summary of changes * Make neon local and the test suite always provide an az id * Make the az id field in the ps registration request mandatory * Migrate the column to non-nullable and adjust in memory state accordingly * Remove the code that was used to populate the az id for pre-existing nodes

Audited ``` diff -r -u ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/svg_fmt-0.4.{2,3} ``` fixes #7763

Addresses the clippy lints of the beta 1.82 toolchain. The `too_long_first_doc_paragraph` lint complained a lot and was addressed separately: #8941

Currently using gc blocking and unblocking with storage controller managed pageservers is painful. Implement the API on storage controller. Fixes: #8893

Used `vars` for new storing non-sensitive information, changed dev secrets to vars as well but didn't cleanup any secrets. neondatabase/cloud#16925 --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>

## Problem ``` error: unused import: `anyhow::Context` --> libs/utils/src/crashsafe.rs:8:5 | 8 | use anyhow::Context; | ^^^^^^^^^^^^^^^ | = note: `-D unused-imports` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_imports)]` error: unused variable: `fd` --> libs/utils/src/crashsafe.rs:209:15 | 209 | pub fn syncfs(fd: impl AsRawFd) -> anyhow::Result<()> { | ^^ help: if this is intentional, prefix it with an underscore: `_fd` | = note: `-D unused-variables` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_variables)]` ``` ## Summary of changes - Fix rust warnings on macOS

github-actions · 2024-09-07T12:00:51Z

3857 tests run: 3742 passed, 1 failed, 114 skipped (full report)

Failures on Postgres 16

test_scrubber_physical_gc[4]: release-arm64

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_scrubber_physical_gc[release-pg16-4]"

Flaky tests (2)

Postgres 16

test_ondemand_wal_download_in_replication_slot_funcs: release-x86-64

Postgres 15

test_subscriber_restart: release-x86-64

Test coverage report is not available

_{The comment gets automatically updated with the latest test results
7d7d1f3 at 2024-09-07T12:00:51.054Z :recycle:}

problame · 2024-09-07T12:01:52Z

Test failure test_scrubber_physical_gc[4] is known flaky test issue

Failures in test_scrubber_physical_gc #8928

problame · 2024-09-07T12:05:03Z

test-logical-replication is failing with

+ echo '/tmp/test_output/compatibility_snapshot_pg16/ does not exist'
/tmp/test_output/compatibility_snapshot_pg16/ does not exist

https://github.com/neondatabase/neon/actions/runs/10750974836/job/29817739387

Same for test-postgres-client-libs

https://github.com/neondatabase/neon/actions/runs/10750974836/job/29817739458

=> asking #cicd since I believe this issue should have been fixed in the past week

danieltprice · 2024-09-11T16:03:06Z

Reviewed for changelog

jcsp and others added 30 commits September 2, 2024 11:36

bottommost-compaction: remove dead code / rectify cfg!()s (#8884)

15e90cc

part of #8002

fixup(#8839): test_forward_compatibility needs to allow lag warning…

bf0531d

… as well (#8891) Found in https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8885/10665614629/index.html#suites/0fbaeb107ef328d03993d44a1fb15690/ea10ba1c140fba1d

safekeeper: add remote_path to Timeline

3916810

It is used in many places, let's reduce number of ? on construction results.

safekeeper: more consistent task naming.

83dd7f5

Make all them snake case.

build(deps): bump flask-cors from 4.0.1 to 5.0.0 (#8899)

1a874a3

fix(pageserver): is_archived should be optional (#8902)

3d9001d

Set the field to optional, otherwise there will be decode errors when newer version of the storage controller receives the JSON from older version of the pageservers. Signed-off-by: Alex Chi Z <chi@neon.tech>

storcon: make hb interval an argument and speed up tests (#8880)

75310fe

## Problem Each test might wait for up to 5s in order to HB the pageserver. ## Summary of changes Make the heartbeat interval configurable and use a really tight one for neon local => startup quicker

storcon: boilerplate to upsert safekeeper records on deploy (#8879)

7a1397c

We currently do not record safekeepers in the storage controller database. We want to migrate timelines across safekeepers eventually, so start recording the safekeepers on deploy. Cc: #8698

build(deps): bump cryptography from 42.0.4 to 43.0.1 (#8908)

3f43823

proxy: don't let one timeout eat entire retry budget (#8924)

6dfbf49

This reduces the per-request timeout to 10sec while keeping the total retry duration at 1min. Relates: neondatabase/cloud#15944

update svg_fmt to 0.4.3 (#8930)

cf11c8a

Audited ``` diff -r -u ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/svg_fmt-0.4.{2,3} ``` fixes #7763

arpad-m and others added 4 commits September 6, 2024 21:05

Address 1.82 clippy lints (#8944)

fa3fc73

Addresses the clippy lints of the beta 1.82 toolchain. The `too_long_first_doc_paragraph` lint complained a lot and was addressed separately: #8941

feat(storcon): forward gc blocking and unblocking (#8956)

3dbd34a

Currently using gc blocking and unblocking with storage controller managed pageservers is painful. Implement the API on storage controller. Fixes: #8893

push images to prod ACR (#8940)

16c200d

Used `vars` for new storing non-sensitive information, changed dev secrets to vars as well but didn't cleanup any secrets. neondatabase/cloud#16925 --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>

vipvap requested review from a team as code owners September 7, 2024 11:18

vipvap requested review from tristan957, skyzh, cloneable and mattpodraza and removed request for a team September 7, 2024 11:18

problame changed the title ~~Storage & Compute release 2024-09-07~~ Storage release 2024-09-07 Sep 7, 2024

problame requested review from problame and removed request for tristan957, skyzh, cloneable and mattpodraza September 7, 2024 11:25

problame approved these changes Sep 7, 2024

View reviewed changes

problame merged commit 2f0b3e7 into release Sep 7, 2024
147 of 152 checks passed

problame deleted the rc/2024-09-07 branch September 7, 2024 13:09

problame mentioned this pull request Sep 9, 2024

Storage & Compute release 2024-09-09 #8966

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Storage release 2024-09-07 #8959

Storage release 2024-09-07 #8959

vipvap commented Sep 7, 2024 •

edited by problame

Loading

github-actions bot commented Sep 7, 2024

Postgres 16

Postgres 15

problame commented Sep 7, 2024

problame commented Sep 7, 2024 •

edited

Loading

danieltprice commented Sep 11, 2024

Storage release 2024-09-07 #8959

Storage release 2024-09-07 #8959

Conversation

vipvap commented Sep 7, 2024 • edited by problame Loading