pageserver: verify correct handling of VM and FSM pages #9914
# Problem

VM (visibility map) pages are stored and managed like any regular relation page, in the VM fork of the main relation, and they are sharded like other pages. Regular WAL writes to VM pages (typically performed by vacuum) are routed to the correct shard as usual.

However, VM pages are also updated via `ClearVmBits` metadata records emitted when main relation pages are updated. These metadata records were sent to all shards, like other metadata records. This had the following effects:

* On shards responsible for VM pages, the `ClearVmBits` applies as expected.
* On shard 0, which knows about the VM relation and its size but doesn't necessarily have any VM pages, the `ClearVmBits` writes may have been applied without the explicit WAL writes to VM pages also having been applied.
* If VM pages are spread across multiple shards (unlikely with a 256 MB stripe size), all shards may have applied `ClearVmBits` if the pages fell within their local view of the relation size, even for pages they do not own.
* On other shards, this caused a relation size cache miss and a DbDir and RelDir lookup before the `ClearVmBits` was dropped. With many relations, this could cause significant CPU overhead.

This is not believed to be a correctness problem, but this will be verified in #9914.

Resolves #9855.

# Changes

Route `ClearVmBits` metadata records only to the shards responsible for the VM pages.

Verification of the current VM handling and cleanup of incomplete VM pages on shard 0 (and potentially elsewhere) is left as follow-up work.
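As a rough illustration of this routing rule, here is a minimal Rust sketch. It is not the actual pageserver code: `ShardSlice`, `owns_block`, and the stripe arithmetic are simplified stand-ins, and the real shard mapping also hashes the relation identity. It only shows the decision of which shard applies a `ClearVmBits` record for a given heap block.

```rust
// Sketch only: ShardSlice and owns_block are simplified stand-ins for the
// pageserver's real sharding types, which also hash the relation identity
// into the placement decision.

const BLCKSZ: u32 = 8192;
const VM_PAGE_HEADER: u32 = 24; // standard PostgreSQL page header size
const BITS_PER_HEAPBLOCK: u32 = 2; // all-visible + all-frozen bits
// Heap blocks covered by one VM page: (8192 - 24) * 8 / 2 = 32672,
// i.e. roughly 255 MB of heap per VM page.
const HEAPBLOCKS_PER_VM_PAGE: u32 = (BLCKSZ - VM_PAGE_HEADER) * 8 / BITS_PER_HEAPBLOCK;

/// A simplified view of striped page ownership across shards.
struct ShardSlice {
    shard_index: u32,
    shard_count: u32,
    stripe_size: u32, // stripe size in pages, e.g. 256 MB / 8 KB = 32768
}

impl ShardSlice {
    /// True if this shard owns the given block number.
    fn owns_block(&self, blkno: u32) -> bool {
        (blkno / self.stripe_size) % self.shard_count == self.shard_index
    }

    /// The spirit of the change: a ClearVmBits record for a heap block is
    /// applied only by the shard that owns the VM page covering that block,
    /// rather than being broadcast to every shard.
    fn should_apply_clear_vm_bits(&self, heap_blkno: u32) -> bool {
        let vm_blkno = heap_blkno / HEAPBLOCKS_PER_VM_PAGE;
        self.owns_block(vm_blkno)
    }
}

fn main() {
    // 16 MB stripes (2048 pages), 8 shards.
    let shard = ShardSlice { shard_index: 3, shard_count: 8, stripe_size: 2048 };
    // Heap block 40_000_000 lies in VM block 40_000_000 / 32672 = 1224.
    println!("{}", shard.should_apply_clear_vm_bits(40_000_000));
}
```

Since one VM page covers roughly 255 MB of heap, a relation's VM fork rarely spans shard boundaries at the default 256 MB stripe size, which matches the note above that multi-shard VM pages are unlikely.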
Note that VM pages will often be cached in the compute, so to verify that storage has the correct view, the compute must be restarted to discard its caches.
The following produced 386 faulty VM pages when run with:

```bash
createdb -p 55432 pgbench
./pg_install/v16/bin/pgbench -p 55432 -i -s 100 pgbench
./pg_install/v16/bin/pgbench -p 55432 -c 64 -T 60 -j 16 -P 1 pgbench
cargo neon endpoint stop main
cargo neon endpoint start main
psql -p 55432 pgbench -c "select count(*) from pg_check_visible('pgbench_accounts')"
```

I'll do five runs across 8 shards with a 16 MB stripe size for 1 minute each, before and after #9895, to verify that there is no correctness issue.
Tests didn't show any VM page corruption. I've spent the day reading and debugging code, running tests, and talking to Heikki (who wrote the visibility map), and I'm reasonably confident that we don't have a correctness problem. We likely have some stray keys to clean up though; I've split out #9927 for this. I'll leave this issue open to consider adding more assertions, checks, or tests around this.
Closing this out, the following should address any gaps:
Build the `pg_visibility` extension for use with `neon_local`. This is useful to inspect the visibility map for debugging. Touches #9914.
We found some potentially incorrect FSM key handling over in #10027.
We've seen cases where stray keys end up on the wrong shard. This shouldn't happen. Add debug assertions to prevent this. In release builds, we should be lenient in order to handle changing key ownership policies. Touches #9914.
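A minimal sketch of this "assert in debug builds, tolerate in release builds" pattern is shown below. `Key`, `ShardIdentity`, `is_key_local`, and `ingest` are illustrative stand-ins, not the actual pageserver types.

```rust
/// Minimal stand-ins for illustration; the real pageserver has richer
/// key and shard-identity types.
#[derive(Debug, Clone, Copy)]
struct Key(u64);

struct ShardIdentity {
    shard_index: u64,
    shard_count: u64,
}

impl ShardIdentity {
    /// Hypothetical ownership check (the real mapping is stripe/hash based).
    fn is_key_local(&self, key: &Key) -> bool {
        key.0 % self.shard_count == self.shard_index
    }
}

/// Fail loudly in debug builds when a key lands on a shard that does not own
/// it, but only log and drop it in release builds, so that changes to the key
/// ownership policy do not crash production pageservers.
fn ingest(shard: &ShardIdentity, key: Key) -> bool {
    debug_assert!(
        shard.is_key_local(&key),
        "stray key {key:?} ingested on shard {} of {}",
        shard.shard_index,
        shard.shard_count
    );
    if !shard.is_key_local(&key) {
        eprintln!("warning: dropping stray key {key:?}");
        return false;
    }
    true
}

fn main() {
    let shard = ShardIdentity { shard_index: 1, shard_count: 8 };
    assert!(ingest(&shard, Key(9))); // 9 % 8 == 1: owned by this shard
}
```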
Verifies that visibility map pages are correctly maintained across shards. Touches #9914.
In #9855, we found that VM pages were handled suboptimally and potentially incorrectly.
The VM (visibility map) and FSM (free space map) are stored as forks of the main relation, and their pages are sharded and managed just like regular relation pages. However, updates to the VM and FSM pages may happen differently from regular pages -- they are often updated implicitly on writes to the main relation.
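To make the implicit update concrete, here is a small Rust sketch (illustrative, not Neon code) of PostgreSQL's visibility-map layout: each heap block has 2 bits in the VM fork, so clearing a heap page's all-visible bit touches a specific VM page, byte, and bit that may live on a different shard than the heap page itself.

```rust
const BLCKSZ: u64 = 8192;
const PAGE_HEADER_SIZE: u64 = 24;
const BITS_PER_HEAPBLOCK: u64 = 2; // all-visible + all-frozen
const HEAPBLOCKS_PER_BYTE: u64 = 8 / BITS_PER_HEAPBLOCK; // 4
const MAPSIZE: u64 = BLCKSZ - PAGE_HEADER_SIZE; // usable VM bytes per page
const HEAPBLOCKS_PER_VM_PAGE: u64 = MAPSIZE * HEAPBLOCKS_PER_BYTE; // 32672

/// For a given heap block, locate the VM page, byte, and bit offset that an
/// update to that heap block may clear.
fn vm_location(heap_blkno: u64) -> (u64, u64, u64) {
    let vm_blkno = heap_blkno / HEAPBLOCKS_PER_VM_PAGE;
    let byte_in_page = (heap_blkno % HEAPBLOCKS_PER_VM_PAGE) / HEAPBLOCKS_PER_BYTE;
    let bit_offset = (heap_blkno % HEAPBLOCKS_PER_BYTE) * BITS_PER_HEAPBLOCK;
    (vm_blkno, byte_in_page, bit_offset)
}

fn main() {
    // One VM page covers 32672 heap blocks, roughly 255 MB of heap, which is
    // why VM pages rarely span multiple shards with a 256 MB stripe size.
    let (vm_blkno, byte, bit) = vm_location(100_000);
    println!("heap block 100000 -> VM block {vm_blkno}, byte {byte}, bit {bit}");
    // The FSM fork is similar in spirit (roughly one byte per heap block at
    // the leaves of a small tree) and is likewise updated on heap writes.
}
```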
For VM pages specifically, we found in #9855 that `ClearVmBits` writes were sent to all shards, regardless of whether the shards were responsible for VM pages. These writes would succeed on shards that had VM pages and also on shard 0, but fail and be ignored on other shards (potentially causing significant performance overhead due to relsize cache misses).

In #9895, we fix `ClearVmBits` such that it's only applied on the shard responsible for the relevant VM pages. However, we need to verify that this is correct, and add appropriate test coverage. We also need to clean up any old partial writes due to misrouting of `ClearVmBits`
(in particular to shard 0).

The behavior following #9895 relies on a few assumptions: in particular, that all VM page updates reach the owning shard either as explicit WAL writes to the VM fork or as `ClearVmBits` records, and that shards which do not own a VM page can safely drop `ClearVmBits` on those pages. See also #9855 (comment).
We should also verify that FSM pages are handled similarly and correctly.