pageserver: verify correct handling of VM and FSM pages #9914

erikgrinaker · 2024-11-27T15:55:54Z

In #9855, we found that VM pages were handled suboptimally and potentially incorrectly.

The VM (visibility map) and FSM (free space map) are stored as forks of the main relation, and their pages are sharded and managed just like regular relation pages. However, updates to the VM and FSM pages may happen differently from regular pages -- they are often updated implicitly on writes to the main relation.

For VM pages specifically, we found in #9855 that ClearVmBits writes were sent to all shards, regardless of whether the shards were responsible for VM pages. These writes would succeed on shards that had VM pages and also on shard 0, but fail and be ignored on other shards (potentially causing significant performance overhead due to relsize cache misses).

In #9895, we fix ClearVmBits such that it's only applied on the shard responsible for the relevant VM pages. However, we need to verify that this is correct, and add appropriate test coverage. We also need to clean up any old partial writes due to misrouting of ClearVmBits (in particular to shard 0). The behavior following #9895 relies on a few assumptions:

The compute (vacuum) always explicitly writes VM pages via the WAL before any ClearVmBits on those pages.
The VM shard(s) have a view of the VM relation size that covers all local VM pages.
We never attempt to access (incomplete) VM pages on shards other than the owning one (including 0).

See also #9855 (comment).

We should also verify that FSM pages are handled similarly and correctly.

The text was updated successfully, but these errors were encountered:

# Problem VM (visibility map) pages are stored and managed as any regular relation page, in the VM fork of the main relation. They are also sharded like other pages. Regular WAL writes to the VM pages (typically performed by vacuum) are routed to the correct shard as usual. However, VM pages are also updated via `ClearVmBits` metadata records emitted when main relation pages are updated. These metadata records were sent to all shards, like other metadata records. This had the following effects: * On shards responsible for VM pages, the `ClearVmBits` applies as expected. * On shard 0, which knows about the VM relation and its size but doesn't necessarily have any VM pages, the `ClearVmBits` writes may have been applied without also having applied the explicit WAL writes to VM pages. * If VM pages are spread across multiple shards (unlikely with 256MB stripe size), all shards may have applied `ClearVmBits` if the pages fall within their local view of the relation size, even for pages they do not own. * On other shards, this caused a relation size cache miss and a DbDir and RelDir lookup before dropping the `ClearVmBits`. With many relations, this could cause significant CPU overhead. This is not believed to be a correctness problem, but this will be verified in #9914. Resolves #9855. # Changes Route `ClearVmBits` metadata records only to the shards responsible for the VM pages. Verification of the current VM handling and cleanup of incomplete VM pages on shard 0 (and potentially elsewhere) is left as follow-up work.

erikgrinaker · 2024-11-28T12:47:49Z

The pg_visibility extension is useful for debugging the visibility map. The following will verify that it's correct (i.e. matches the actual heap pages):

select pg_check_visible(table);
select pg_check_frozen(table);

This was verified by discarding ClearVmBits operations, which was detected by pg_check_visible().

Note that VM pages will often be cached in the compute, so to verify that storage has the correct view the compute must be restarted to clear its caches.

erikgrinaker · 2024-11-28T14:41:30Z

The following produced 386 faulty VM pages when run with ClearVmBits disabled:

createdb -p 55432 pgbench
./pg_install/v16/bin/pgbench -p 55432 -i -s 100 pgbench
./pg_install/v16/bin/pgbench -p 55432 -c 64 -T 60 -j 16 -P 1 pgbench
cargo neon endpoint stop main
cargo neon endpoint start main
psql -p 55432 pgbench -c "select count(*) from pg_check_visible('pgbench_accounts')"

I'll do five runs across 8 shards with stripe size 16 MB for 1 minute each, before and after #9895, to verify that there is no correctness issue.

erikgrinaker · 2024-11-28T15:25:54Z

Tests didn't show any VM page corruption.

I've spent the day reading and debugging code, running tests, and talking to Heikki (who wrote the visibility map), and I'm reasonably confident that we don't have a correctness problem. We likely have some stray keys to clean up though, split out #9927 for this.

I'll leave this issue open to consider adding some more assertions, checks, or tests around this.

Build the `pg_visibility` extension for use with `neon_local`. This is useful to inspect the visibility map for debugging. Touches #9914.

# Problem VM (visibility map) pages are stored and managed as any regular relation page, in the VM fork of the main relation. They are also sharded like other pages. Regular WAL writes to the VM pages (typically performed by vacuum) are routed to the correct shard as usual. However, VM pages are also updated via `ClearVmBits` metadata records emitted when main relation pages are updated. These metadata records were sent to all shards, like other metadata records. This had the following effects: * On shards responsible for VM pages, the `ClearVmBits` applies as expected. * On shard 0, which knows about the VM relation and its size but doesn't necessarily have any VM pages, the `ClearVmBits` writes may have been applied without also having applied the explicit WAL writes to VM pages. * If VM pages are spread across multiple shards (unlikely with 256MB stripe size), all shards may have applied `ClearVmBits` if the pages fall within their local view of the relation size, even for pages they do not own. * On other shards, this caused a relation size cache miss and a DbDir and RelDir lookup before dropping the `ClearVmBits`. With many relations, this could cause significant CPU overhead. This is not believed to be a correctness problem, but this will be verified in #9914. Resolves #9855. # Changes Route `ClearVmBits` metadata records only to the shards responsible for the VM pages. Verification of the current VM handling and cleanup of incomplete VM pages on shard 0 (and potentially elsewhere) is left as follow-up work.

erikgrinaker · 2024-11-29T15:54:45Z

Closing this out, the following should address any gaps:

Build the `pg_visibility` extension for use with `neon_local`. This is useful to inspect the visibility map for debugging. Touches #9914.

erikgrinaker · 2024-12-05T16:49:55Z

We found some potentially incorrect FSM key handling over in #10027.

We've seen cases where stray keys end up on the wrong shard. This shouldn't happen. Add debug assertions to prevent this. In release builds, we should be lenient in order to handle changing key ownership policies. Touches #9914.

Verifies that visibility map pages are correctly maintained across shards. Touches #9914.

erikgrinaker added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Nov 27, 2024

erikgrinaker self-assigned this Nov 27, 2024

This was referenced Nov 27, 2024

pageserver: slow get_rel_exists() during WAL ingestion with many relations #9855

Closed

pageserver: only apply ClearVmBits on relevant shards #9895

Merged

erikgrinaker mentioned this issue Nov 28, 2024

Makefile: build pg_visibility #9922

Merged

erikgrinaker mentioned this issue Nov 28, 2024

pageserver: clean up stray VM pages #9927

Closed

github-merge-queue bot pushed a commit that referenced this issue Nov 28, 2024

Makefile: build pg_visibility (#9922)

70780e3

Build the `pg_visibility` extension for use with `neon_local`. This is useful to inspect the visibility map for debugging. Touches #9914.

This was referenced Nov 29, 2024

test_runner: add visibility map test #9940

Merged

pageserver: assert that keys belong to shard #9943

Merged

erikgrinaker closed this as completed Nov 29, 2024

awarus pushed a commit that referenced this issue Dec 5, 2024

Makefile: build pg_visibility (#9922)

2fa461b

Build the `pg_visibility` extension for use with `neon_local`. This is useful to inspect the visibility map for debugging. Touches #9914.

github-merge-queue bot pushed a commit that referenced this issue Dec 10, 2024

test_runner: add visibility map test (#9940)

ad472bd

Verifies that visibility map pages are correctly maintained across shards. Touches #9914.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pageserver: verify correct handling of VM and FSM pages #9914

pageserver: verify correct handling of VM and FSM pages #9914

erikgrinaker commented Nov 27, 2024 •

edited

Loading

erikgrinaker commented Nov 28, 2024

erikgrinaker commented Nov 28, 2024 •

edited

Loading

erikgrinaker commented Nov 28, 2024

erikgrinaker commented Nov 29, 2024

erikgrinaker commented Dec 5, 2024

pageserver: verify correct handling of VM and FSM pages #9914

pageserver: verify correct handling of VM and FSM pages #9914

Comments

erikgrinaker commented Nov 27, 2024 • edited Loading

erikgrinaker commented Nov 28, 2024

erikgrinaker commented Nov 28, 2024 • edited Loading

erikgrinaker commented Nov 28, 2024

erikgrinaker commented Nov 29, 2024

erikgrinaker commented Dec 5, 2024

erikgrinaker commented Nov 27, 2024 •

edited

Loading

erikgrinaker commented Nov 28, 2024 •

edited

Loading