Release 2023-08-15 #4990

github-actions · 2023-08-15T10:05:02Z

No description provided.

…sses (#4902)

## Problem

The pageserver<->safekeeper protocol uses error messages to indicate end of stream. pageserver already logs these at INFO level, but the inner error message includes the word "ERROR", which interferes with log searching.

Example:

```
walreceiver connection handling ended: db error: ERROR: ending streaming to Some("pageserver") at 0/4031CA8
```

The inner DbError has a severity of ERROR so DbError's Display implementation includes that ERROR, even though we are actually logging the error at INFO level.

## Summary of changes

Introduce an explicit WalReceiverError type, and in its From<> for postgres errors, apply the logic from ExpectedError, for expected errors, and a new condition for successes.

The new output looks like:

```
walreceiver connection handling ended: Successful completion: ending streaming to Some("pageserver") at 0/154E9C0, receiver is caughtup and there is no computes
```
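The idea can be sketched in Python, although the actual change is a Rust `From<>` implementation. Everything here is hypothetical: the `DbError` stand-in, its field names, and `describe_end_of_stream` are made up for illustration, and keying the success case on SQLSTATE `00000` (the standard "successful completion" code) is an assumption drawn from the new log text, not from the patch itself.

```python
from dataclasses import dataclass
from typing import Tuple

# SQLSTATE class 00 is "successful completion" in the SQL standard.
SUCCESSFUL_COMPLETION = "00000"


@dataclass
class DbError:
    """Hypothetical stand-in for a Postgres error received over the wire."""
    severity: str  # e.g. "ERROR"
    sqlstate: str  # five-character SQLSTATE code
    message: str


def describe_end_of_stream(err: DbError) -> Tuple[str, str]:
    """Return (log level, text) for a walreceiver connection that ended."""
    if err.sqlstate == SUCCESSFUL_COMPLETION:
        # Normal end of stream: omit the severity word so that a log search
        # for "ERROR" does not match this line.
        return "INFO", f"Successful completion: {err.message}"
    # Anything else keeps the severity, since it may be a real failure.
    return "ERROR", f"db error: {err.severity}: {err.message}"
```

The point of the sketch is only that the severity word is dropped when the inner error actually signals success, so the rendered line matches the level it is logged at.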

…4921) Fixes access to s3 buckets that use IAM roles for service accounts access control method --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>

## Problem

Currently, finding out how long pageserver startup took requires inspecting logs.

## Summary of changes

`pageserver_startup_duration_ms` metric is added, with label `phase` for different phases of startup. These are broken down by phase, where the phases correspond to the existing wait points in the code:

- Start of doing I/O
- When tenant load is done
- When initial size calculation is done
- When background jobs start
- Then "complete" when everything is done.

`pageserver_startup_is_loading` is a 0/1 gauge that indicates whether we are in the initial load of tenants.

`pageserver_tenant_activation_seconds` is a histogram of time in seconds taken to activate a tenant.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>

## Problem The safekeeper advertises the same address specified in `--listen-pg`, which is problematic when the listening address is different from the address that the pageserver can use to connect to the safekeeper. ## Summary of changes Add a new optional flag called `--advertise-pg` for the address to be advertised. If this flag is not specified, the behavior is the same as before.
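The fallback behaviour can be illustrated with a small Python sketch. The safekeeper itself is not written in Python; only the two flag names come from the description above, while `parse_args`, the program name, and the default listen address are hypothetical.

```python
import argparse
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse the two address flags and apply the fallback described above."""
    parser = argparse.ArgumentParser(prog="safekeeper-sketch")
    # The default address is an assumption for this sketch only.
    parser.add_argument("--listen-pg", default="127.0.0.1:5454",
                        help="address to listen on for Postgres-protocol connections")
    parser.add_argument("--advertise-pg", default=None,
                        help="address to advertise to peers (optional)")
    args = parser.parse_args(argv)
    # If --advertise-pg is not given, behave as before: advertise the
    # listen address.
    if args.advertise_pg is None:
        args.advertise_pg = args.listen_pg
    return args
```

With only `--listen-pg 0.0.0.0:5454`, the advertised address equals the listen address; adding `--advertise-pg` overrides only what is advertised, not what is bound.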

## Problem When an endpoint is shutting down, it can take a few seconds. Currently, when starting a new compute, this causes an "endpoint is in transition" error. We need to add delays before retrying to ensure that we allow time for the endpoint to shut down properly. ## Summary of changes Adds a delay before retrying in auth. `connect_to_compute` already has this delay.

This reverts commit 682dfb3.

## Problem

The current test history format is a bit inconvenient:

- It stores all test results in one row, so all queries should include subqueries which expand the test
- It includes duplicated test results if the rerun is triggered manually for one of the test jobs (for example, if we rerun `debug-pg14`, then the report will include duplicates for other build types/postgres versions)
- It doesn't have a reference to run_id, which we use to create a link to allure report

Here's the proposed new format:

```
id BIGSERIAL PRIMARY KEY,
parent_suite TEXT NOT NULL,
suite TEXT NOT NULL,
name TEXT NOT NULL,
status TEXT NOT NULL,
started_at TIMESTAMPTZ NOT NULL,
stopped_at TIMESTAMPTZ NOT NULL,
duration INT NOT NULL,
flaky BOOLEAN NOT NULL,
build_type TEXT NOT NULL,
pg_version INT NOT NULL,
run_id BIGINT NOT NULL,
run_attempt INT NOT NULL,
reference TEXT NOT NULL,
revision CHAR(40) NOT NULL,
raw JSONB COMPRESSION lz4 NOT NULL,
```

## Summary of changes

- Misc allure changes:
  - Update allure to 2.23.1
  - Delete files from previous runs in HTML report (by using `sync --delete` instead of `mv`)
- Use `test-cases/*.json` instead of `suites.json`; using this directory allows us to catch all reruns.
- Until we have migrated `scripts/flaky_tests.py` and `scripts/benchmark_durations.py`, store test results in 2 formats (in 2 different databases).

…4930) ## Problem One might wonder why the code here doesn't use `TimelineId` or `TenantId`. I originally had a refactor to use them, but then discarded it, because converting to strings each time there is a read or write is wasteful. ## Summary of changes We add some docs explaining why neither `TimelineId` nor `TenantId` is used here.

In the quest to solve #4745 by moving the download/evictedness to be an internally mutable factor of a Layer and getting rid of `trait PersistentLayer` (at least for prod usage) and `layer_removal_cs`, we present some misc cleanups. --------- Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>

Don't panic if a library or extension is not found in remote extension storage or the download has failed. Instead, log the error and proceed: if the file is not present locally either, postgres will fail with a postgres error. If it is a shared_preload_library, postgres won't start, because of the bad config. Otherwise, it will just fail to run the SQL function/command that needs the library. Also, don't try to download extensions if remote storage is not configured.

## Problem

For some tests, we override the default timeout (300s / 5m) with larger values like 600s / 10m or even 1800s / 30m, even if it's not required.

I've collected some statistics (for the last 60 days) for test durations:

| test                              | max (s) | p99 (s) | p50 (s) | count |
|-----------------------------------|---------|---------|---------|-------|
| test_hot_standby                  | 9       | 2       | 2       | 5319  |
| test_import_from_vanilla          | 16      | 9       | 6       | 5692  |
| test_import_from_pageserver_small | 37      | 7       | 5       | 5719  |
| test_pg_regress                   | 101     | 73      | 44      | 5642  |
| test_isolation                    | 65      | 56      | 39      | 5692  |

A couple of tests that I left with a custom 600s / 10m timeout:

| test                              | max (s) | p99 (s) | p50 (s) | count |
|-----------------------------------|---------|---------|---------|-------|
| test_gc_cutoff                    | 456     | 224     | 109     | 5694  |
| test_pageserver_chaos             | 528     | 267     | 121     | 5712  |

## Summary of changes

- Remove `@pytest.mark.timeout` annotation from several tests

…ayer (#4937) On the quest of #4745, these are more related to the task at hand, but still small. In addition to $subject, allow `ValueRef<ResidentDeltaLayer>`.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>

…sion (#4945)

The test mutates a shared directory which does not work with multiple concurrent tests. It is being fixed, so this should be a very temporary band-aid. Cc: #4949.

## Problem

The `DiskBtreeReader::visit` function calls `read_blk` internally, and while #4863 converted the API of `visit` to async, the internal function is still recursive. So, analogously to #4838, we turn the recursive function into an iterative one.

## Summary of changes

First, we prepare the change by moving the for loop outside of the case switch, so that we only have one loop that calls recursion. Then, we switch from using recursion to an approach where we store the search path inside the tree on a stack on the heap.

The caller of the `visit` function can control when the search over the B-Tree ends, by returning `false` from the closure. This is often used to find only one specific entry (by always returning `false`), but it is also used to iterate over all entries of the B-tree (by always returning `true`), or to look for ranges (mostly in tests, but `get_value_reconstruct_data` also has such a use).

Each stack entry contains two things: the block number (aka the block's offset), and a children iterator. The children iterator is constructed depending on the search direction, and with the results of a binary search over the node's children list. It is the only thing that survives a spilling/push to the stack; everything else is reconstructed. In other words, each stack spill will, if the search is still ongoing, cause an entire re-parsing of the node. Theoretically, this would be a linear overhead in the number of leaves the search visits. However, one needs to note:

* the workloads that look for a specific entry only ever visit one leaf, so this is mostly about workloads that visit larger ranges, including ones that visit the entire B-tree.
* the requests first hit the page cache, so often the cost is just in terms of node deserialization.
* for nodes that only have leaf nodes as children, no spilling to the stack-on-heap happens (outside of the initial request where the iterator is `None`).

In other words, for balanced trees, the spilling overhead is $\Theta\left(\frac{n}{b^2}\right)$, where `b` is the branching factor and `n` is the number of nodes in the tree. The B-Trees in the current implementation have a branching factor of roughly `PAGE_SZ/L` where `PAGE_SZ` is 8192, and `L` is `DELTA_KEY_SIZE = 26` or `KEY_SIZE = 18` in production code. For `L = 26` this gives `b ≈ 315` and `b² ≈ 99000`, so we estimate that we'd be re-loading an inner node for every 99000 leaves in the B-tree in the worst case.

Due to the points above, I'd say that not fully caching the inner nodes with inner children is reasonable, especially as we also want to be fast for the "find one specific entry" workloads, where the stack content is never accessed: any action to make the spilling computationally more complex would contribute to wasted cycles here, even if these workloads "only" spill one node for each depth level of the B-tree (which is practically always a low single-digit number; Kleppmann points out on page 81 that for branching factor 500, a four-level B-tree with 4 KB pages can store 250 TB of data).

Disclaimer: this is all reasoning I did in my head; I have not confirmed it with any benchmarks or data.

Builds on top of #4863, part of #4743
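The stack-based traversal described above can be sketched in Python. This is not the `DiskBtreeReader` code: the `Node` class, the `blocks` dictionary standing in for `read_blk`, and this `visit` signature are hypothetical, and the sketch only walks forward over the whole tree (no binary search for a start key, and no special case for inner nodes whose children are all leaves).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical in-memory node; the real nodes are parsed from blocks."""
    is_leaf: bool
    keys: List[int] = field(default_factory=list)      # leaf: entry keys
    values: List[str] = field(default_factory=list)    # leaf: entry values
    children: List[int] = field(default_factory=list)  # inner: child block numbers


def visit(blocks: Dict[int, Node], root: int,
          visitor: Callable[[int, str], bool]) -> bool:
    """Visit entries in key order without recursion.

    Returns False if the visitor stopped the search early, True otherwise.
    """
    # Each stack entry is (block number, iterator over remaining children).
    # None means the node has not been looked at yet.
    stack: List[Tuple[int, Optional[Iterator[int]]]] = [(root, None)]
    while stack:
        blknum, child_iter = stack.pop()
        node = blocks[blknum]  # stands in for read_blk plus re-parsing the node
        if node.is_leaf:
            for key, value in zip(node.keys, node.values):
                if not visitor(key, value):
                    return False  # the caller asked to stop
            continue
        if child_iter is None:
            child_iter = iter(node.children)
        child = next(child_iter, None)
        if child is not None:
            # Spill the parent with its remaining children, then descend.
            stack.append((blknum, child_iter))
            stack.append((child, None))
    return True
```

As in the description, only the block number and the children iterator survive a push to the stack; the node itself is looked up again when the entry is popped.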

`pg_regress` is flaky: #559 Consolidated `CHECKPOINT` into `check_restored_datadir_content` and added a wait for `wait_for_last_flush_lsn`. Some recently introduced flakiness was fixed with #4948. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>

Introduced in #4886, did not consider that tests with real_s3 could sometimes go over the limit. Do not fail tests because of that.

## Summary of changes For context see https://github.com/neondatabase/neon/blob/main/docs/rfcs/022-pageserver-delete-from-s3.md Create Flow to delete tenant's data from pageserver. The approach heavily mimics previously implemented timeline deletion implemented mostly in #4384 and followed up in #4552 For remaining deletion related issues consult with deletion project here: https://github.com/orgs/neondatabase/projects/33 resolves #4250 resolves #3889 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>

@kelvich

## Problem The 1MB response limit is very small. ## Summary of changes This data is not yet tracked, so we shouldn't raise the limit too high yet. But as discussed with @kelvich and @conradludgate, this PR lifts it to 10MB, and also adds details of the limit to the error response.
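A minimal Python sketch of a size check whose error names the limit is shown below. The proxy is not written in Python; `MAX_RESPONSE_SIZE`, `ResponseTooLarge`, `check_response_size`, and the wording of the error are hypothetical, and reading "10MB" as 10 × 1024 × 1024 bytes is an assumption.

```python
# Assumption: "10MB" is interpreted as 10 MiB for this sketch.
MAX_RESPONSE_SIZE = 10 * 1024 * 1024


class ResponseTooLarge(Exception):
    """Raised when a serialized response exceeds the configured limit."""


def check_response_size(body: bytes, limit: int = MAX_RESPONSE_SIZE) -> bytes:
    """Return the body unchanged, or raise an error that states the limit."""
    if len(body) > limit:
        # Including the limit in the message tells the user what to aim for.
        raise ResponseTooLarge(
            f"response is too large: {len(body)} bytes (max is {limit} bytes)"
        )
    return body
```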

## Problem Mysterious network issues ## Summary of changes Log a lot more about HTTP/DNS in hopes of detecting more of the network errors

Found this log on staging: ``` 2023-08-10T17:42:58.573790Z INFO handling interactive connection from client protocol="ws" ``` We seem to be losing websocket span in spawn, this patch fixes it.

Remove redundant `wait_while` in tests. It had only one usage. Use `wait_tenant_status404`. Related: #4855 (comment)

Fix multiline logs on websocket errors and always print sql-over-http errors sent to the user.

## Problem In some places, the lock on `InMemoryLayerInner` is only taken to obtain `end_lsn`. This is not needed, however, if we move `end_lsn` to `InMemoryLayer` instead. ## Summary of changes Make `end_lsn` a member of `InMemoryLayer`, and do less locking of `InMemoryLayerInner`. `end_lsn` is changed from `Option<Lsn>` into an `OnceLock<Lsn>`. Thanks to this change, we no longer need to lock in three functions. Part of #4743. Suggested in #4905 (comment).

Patches a bug in vm-builder where it did not include enough parameters in the query string. These parameters are `host=localhost port=5432`. These parameters were not necessary for the monitor because the `pq` go postgres driver included them by default.

It allows the term leader to ensure it pulls data from the correct term. Its absence wasn't very problematic due to CRC checks, but let's be strict. walproposer still doesn't use it as we're going to remove recovery completely from it.

This code was mostly copied from walsender.c and the idea was to keep it similar to walsender.c, so that we can easily copy-paste future upstream changes to walsender.c to walproposer_utils.c, too. But right now I see that deleting it doesn't break anything, so it's better to remove unused parts.

…mpaction (#4971) ## Problem Currently, image generation reads delta layers before writing out subsequent image layers, which updates the access time of the delta layers and effectively puts them at the back of the queue for eviction. This is the opposite of what we want, because after a delta layer is covered by a later image layer, it's likely that subsequent reads of latest data will hit the image rather than the delta layer, so the delta layer should be quite a good candidate for eviction. ## Summary of changes `RequestContext` gets a new `ATimeBehavior` field, and a `RequestContextBuilder` helper so that we can optionally add the new field without growing `RequestContext::new` every time we add something like this. Request context is passed into the `record_access` function, and the access time is not updated if `ATimeBehavior::Skip` is set. The compaction background task constructs its request context with this skip policy. Closes: #4969
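The mechanism can be sketched in Python as follows. Only the names `RequestContext`, `RequestContextBuilder`, `ATimeBehavior`, `record_access`, and the `Skip` policy come from the description above; the `UPDATE` variant name, the `task_kind` field, the `Layer` class, and all signatures are hypothetical.

```python
import enum
import time
from dataclasses import dataclass
from typing import Optional


class ATimeBehavior(enum.Enum):
    UPDATE = "update"  # assumed name for the default policy
    SKIP = "skip"      # do not touch the access time


@dataclass(frozen=True)
class RequestContext:
    task_kind: str  # hypothetical field, for illustration only
    atime_behavior: ATimeBehavior = ATimeBehavior.UPDATE


class RequestContextBuilder:
    """Adds optional fields without growing the RequestContext constructor."""

    def __init__(self, task_kind: str) -> None:
        self._task_kind = task_kind
        self._atime_behavior = ATimeBehavior.UPDATE

    def atime_behavior(self, behavior: ATimeBehavior) -> "RequestContextBuilder":
        self._atime_behavior = behavior
        return self

    def build(self) -> RequestContext:
        return RequestContext(self._task_kind, self._atime_behavior)


class Layer:
    """Hypothetical layer that tracks only its last access time."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.last_access: Optional[float] = None

    def record_access(self, ctx: RequestContext) -> None:
        if ctx.atime_behavior is ATimeBehavior.SKIP:
            return  # e.g. compaction reads should not delay eviction
        self.last_access = time.monotonic()
```

A compaction-style task would build its context with `RequestContextBuilder("compaction").atime_behavior(ATimeBehavior.SKIP).build()`, so reading a delta layer while generating images leaves that layer's position in the eviction order unchanged.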

People sometimes ask about this. https://community.neon.tech/t/is-the-neon-local-binary-in-any-of-the-official-docker-images/360/2

Originated from test failure where we got SlowDown error from s3. The patch generalizes `download_retry` to not be download specific. Resulting `retry` function is moved to utils crate. `download_retries` is now a thin wrapper around this `retry` function. To ensure that all needed retries are in place test code now uses `test_remote_failures=1` setting. Ref https://neondb.slack.com/archives/C059ZC138NR/p1691743624353009
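A generic retry helper of this kind could look like the Python sketch below. The real `retry` lives in a Rust utils crate and its signature is not shown here, so the parameters (`is_permanent`, `max_attempts`, `base_delay_s`, `sleep`) and the exponential backoff are assumptions; `download_retry` is written as a thin wrapper only to mirror the description, and treating `FileNotFoundError` as permanent is purely illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(op: Callable[[], T],
          is_permanent: Callable[[Exception], bool],
          max_attempts: int = 3,
          base_delay_s: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op until it succeeds, hits a permanent error, or runs out of attempts."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return op()
        except Exception as err:
            if is_permanent(err) or attempt >= max_attempts:
                raise  # give up and surface the last error
            # Back off exponentially before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))


def download_retry(op: Callable[[], T],
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Thin wrapper: a missing object is treated as permanent (an assumption)."""
    return retry(op, is_permanent=lambda e: isinstance(e, FileNotFoundError),
                 sleep=sleep)
```

An operation that fails once with a transient error and then succeeds, which is roughly what a `test_remote_failures=1` style setting exercises, completes on the second attempt.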

## Problem It's nice if `single query : single response :: batch query : batch response`. But at present, in the single case we send `{ query: '', params: [] }` and get back a single `{ rows: [], ... }` object, while in the batch case we send an array of `{ query: '', params: [] }` objects and get back not an array of `{ rows: [], ... }` objects but a `{ results: [ { rows: [] , ... }, { rows: [] , ... }, ... ] }` object instead. ## Summary of changes With this change, the batch query body becomes `{ queries: [{ query: '', params: [] }, ... ] }`, which restores a consistent relationship between the request and response bodies.
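The request and response shapes can be written out in Python to make the symmetry concrete. The key names (`query`, `params`, `queries`, `results`, `rows`) come from the description above; the helper functions are hypothetical and exist only for illustration.

```python
import json
from typing import Any, Dict, List, Sequence, Tuple


def single_body(query: str, params: Sequence[Any]) -> Dict[str, Any]:
    """Body of a single query request."""
    return {"query": query, "params": list(params)}


def batch_body(queries: Sequence[Tuple[str, Sequence[Any]]]) -> Dict[str, Any]:
    """Body of a batch request: an object wrapping the list, not a bare array."""
    return {"queries": [single_body(q, p) for q, p in queries]}


def pair_results(request_body: Dict[str, Any],
                 response_body: Dict[str, Any]) -> List[Tuple[str, Any]]:
    """Match each query in `queries` with its entry in `results`, by position."""
    queries = request_body["queries"]
    results = response_body["results"]
    if len(queries) != len(results):
        raise ValueError("expected one result per query")
    return [(q["query"], r["rows"]) for q, r in zip(queries, results)]


if __name__ == "__main__":
    body = batch_body([("select 1", []), ("select $1", [2])])
    print(json.dumps(body))
```

With this shape, a `{ queries: [...] }` request lines up with a `{ results: [...] }` response in the same way that a single `{ query, params }` request lines up with a single `{ rows, ... }` response.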

## Problem

The `BlockCursor::read_blob` and `BlockCursor::read_blob_into_buf` functions are calling `read_blk` internally, so if we want to make that function async fn, they need to be async themselves.

## Summary of changes

* We first turn `ValueRef::load` into an async fn.
* Then, we switch the `RwLock` implementation in `InMemoryLayer` to use the one from `tokio`.
* Last, we convert the `read_blob` and `read_blob_into_buf` functions into async fn.

In three instances we use `Handle::block_on`:

* one use is in compaction code, which currently isn't async. We put the entire loop into an `async` block to prevent the potentially hot loop from doing cross-thread operations.
* one use is in dumping code for `DeltaLayer`. The "proper" way to address this would be to enable the visit function to take async closures, but then we'd need to be generic over async vs non-async, which [isn't supported by rust right now](https://blog.rust-lang.org/inside-rust/2022/07/27/keyword-generics.html). The other alternative would be to do a first pass where we cache the data into memory, and only then to dump it.
* the third use is in writing code, inside a loop that copies from one file to another. It is synchronous and we'd like to keep it that way (for now?).

Part of #4743

## Problem The `BlockReader` trait is not ready to be asyncified, as associated types are not supported by asyncification strategies like via the `async_trait` macro, or via adopting enums. ## Summary of changes Remove the `BlockLease` associated type from the `BlockReader` trait and turn it into an enum instead, bearing the same name. The enum has two variants, one of which is gated by `#[cfg(test)]`. Therefore, outside of test settings, the enum has zero overhead over just having the `PageReadGuard`. Using the enum allows us to impl `BlockReader` without needing the page cache. Part of #4743

#4942 left old metrics in place for migration purposes. It was noticed that from new metrics the total number of deleted objects was forgotten, add it. While reviewing, it was noticed that the delete_object could just be delete_objects of one. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>

github-actions · 2023-08-15T11:55:43Z

1588 tests run: 1510 passed, 0 failed, 78 skipped (full report)

Flaky tests (1)

Postgres 15

test_crafted_wal_end[last_wal_record_xlog_switch_ends_on_page_boundary]: debug

The comment gets automatically updated with the latest test results.
b9de9d7 at 2023-08-15T11:55:42.652Z

jcsp and others added 30 commits August 8, 2023 12:35

Use WebIdentityTokenCredentialsProvider to access remote extensions (#…

6d17d6c

…4921) Fixes access to s3 buckets that use IAM roles for service accounts access control method --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>

Revert "bump vm-builder to v0.15.0-alpha1 (#4895)" (#4931)

4592538

This reverts commit 682dfb3.

bump vm-builder version to v0.15.0-alpha1 (#4934)

6661f4f

fix pkglibdir path for remote extensions

da128a5

refactor: make {Delta,Image}LayerInners usable without {Delta,Image}L…

c8aed10

…ayer (#4937) On the quest of #4745, these are more related to the task at hand, but still small. In addition to $subject, allow `ValueRef<ResidentDeltaLayer>`.

Measure compute-pageserver latency (#4901)

94ad920

Co-authored-by: Joonas Koivunen <joonas@neon.tech>

if control file already exists ignore the remote version of the exten…

e157b16

…sion (#4945)

test: mark test_download_extensions.py skipped for now (#4948)

db48f7e

The test mutates a shared directory which does not work with multiple concurrent tests. It is being fixed, so this should be a very temporary band-aid. Cc: #4949.

test: allow slow shutdown warning (#4953)

71f9d9e

Introduced in #4886, did not consider that tests with real_s3 could sometimes go over the limit. Do not fail tests because of that.

delete already present files from library index (#4955)

17aea78

proxy: http logging to 11 (#4950)

25c66dc

## Problem Mysterious network issues ## Summary of changes Log a lot more about HTTP/DNS in hopes of detecting more of the network errors

Grant BypassRLS to new neon_superuser roles (#4935)

3a71cf3

proxy: propagate ws span (#4966)

73d7a9b

Found this log on staging: ``` 2023-08-10T17:42:58.573790Z INFO handling interactive connection from client protocol="ws" ``` We seem to be losing websocket span in spawn, this patch fixes it.

tests: remove redundant wait_while (#4952)

d39fd66

Remove redundant `wait_while` in tests. It had only one usage. Use `wait_tenant_status404`. Related: #4855 (comment)

proxy: improve http logs (#4976)

3a6b99f

Fix multiline logs on websocket errors and always print sql-over-http errors sent to the user.

arssher and others added 10 commits August 12, 2023 12:20

Add neon_local to docker image.

49c57c0

People sometimes ask about this. https://community.neon.tech/t/is-the-neon-local-binary-in-any-of-the-official-docker-images/360/2

Update Postgres to v15.4 and v14.9 (#4965)

ef4a76c

github-actions bot requested review from a team as code owners August 15, 2023 10:05

github-actions bot requested review from awestover, arssher, nikitakalyanov and problame and removed request for a team August 15, 2023 10:05

shanyp approved these changes Aug 15, 2023


Merge branch 'release' into releases/2023-08-15

b9de9d7

shanyp merged commit aee1bf9 into release Aug 15, 2023

shanyp deleted the releases/2023-08-15 branch August 15, 2023 12:34
