pageserver: verify correct handling of VM and FSM pages #9914

Closed
erikgrinaker opened this issue Nov 27, 2024 · 5 comments
Assignees: erikgrinaker
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug)

erikgrinaker commented Nov 27, 2024

In #9855, we found that VM pages were handled suboptimally and potentially incorrectly.

The VM (visibility map) and FSM (free space map) are stored as forks of the main relation, and their pages are sharded and managed just like regular relation pages. However, updates to the VM and FSM pages may happen differently from regular pages -- they are often updated implicitly on writes to the main relation.
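
For intuition, the mapping between a heap block and the VM block that holds its visibility bits looks roughly like the following sketch, assuming 8 KB pages and 2 VM bits per heap page as in PostgreSQL's visibilitymap.c (illustrative only, not Neon's actual code):

```rust
// Sketch only: constants follow PostgreSQL's visibility map layout.
const BLCKSZ: u32 = 8192; // page size
const PAGE_HEADER_SIZE: u32 = 24; // MAXALIGN(SizeOfPageHeaderData)
const BITS_PER_HEAPBLOCK: u32 = 2; // all-visible + all-frozen
const HEAPBLOCKS_PER_VM_PAGE: u32 = (BLCKSZ - PAGE_HEADER_SIZE) * 8 / BITS_PER_HEAPBLOCK; // 32672

/// VM block (in the VM fork) that carries the visibility bits for `heap_blkno`.
fn vm_block_for_heap_block(heap_blkno: u32) -> u32 {
    heap_blkno / HEAPBLOCKS_PER_VM_PAGE
}

fn main() {
    // A write to heap block 100_000 implicitly touches VM block 3,
    // which may be owned by a different shard than the heap block itself.
    println!("{}", vm_block_for_heap_block(100_000));
}
```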

For VM pages specifically, we found in #9855 that ClearVmBits writes were sent to all shards, regardless of whether the shards were responsible for VM pages. These writes would succeed on shards that had VM pages and also on shard 0, but fail and be ignored on other shards (potentially causing significant performance overhead due to relsize cache misses).

In #9895, we fix ClearVmBits such that it's only applied on the shard responsible for the relevant VM pages. However, we need to verify that this is correct, and add appropriate test coverage. We also need to clean up any old partial writes due to misrouting of ClearVmBits (in particular to shard 0). The behavior following #9895 relies on a few assumptions:

  • The compute (vacuum) always explicitly writes VM pages via the WAL before any ClearVmBits on those pages.
  • The VM shard(s) have a view of the VM relation size that covers all local VM pages.
  • We never attempt to access (incomplete) VM pages on shards other than the owning one (including 0).
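
The receiver-side guard implied by these assumptions can be sketched as follows (hypothetical names and parameters, not the pageserver's actual API):

```rust
// Sketch only: whether a shard should apply a ClearVmBits-style record
// to a VM page, given the assumptions listed above.
fn should_apply_clear_vm_bits(
    shard_owns_vm_block: bool,   // assumption 3: only the owning shard applies it
    vm_size_blocks: Option<u32>, // shard-local view of the VM fork size, if known
    vm_blkno: u32,
) -> bool {
    match vm_size_blocks {
        // Assumptions 1 and 2: vacuum has already written this VM page
        // explicitly, so the locally known VM size covers it. Otherwise,
        // clearing bits here would materialize an incomplete page.
        Some(size) => shard_owns_vm_block && vm_blkno < size,
        None => false, // VM relation size unknown on this shard: drop the record
    }
}
```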

See also #9855 (comment).

We should also verify that FSM pages are handled similarly and correctly.

erikgrinaker added the c/storage/pageserver and t/bug labels Nov 27, 2024
erikgrinaker self-assigned this Nov 27, 2024
github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
# Problem

VM (visibility map) pages are stored and managed as any regular relation
page, in the VM fork of the main relation. They are also sharded like
other pages. Regular WAL writes to the VM pages (typically performed by
vacuum) are routed to the correct shard as usual. However, VM pages are
also updated via `ClearVmBits` metadata records emitted when main
relation pages are updated. These metadata records were sent to all
shards, like other metadata records. This had the following effects:

* On shards responsible for VM pages, the `ClearVmBits` applies as
expected.

* On shard 0, which knows about the VM relation and its size but doesn't
necessarily have any VM pages, the `ClearVmBits` writes may have been
applied without also having applied the explicit WAL writes to VM pages.

* If VM pages are spread across multiple shards (unlikely with 256MB
stripe size), all shards may have applied `ClearVmBits` if the pages
fall within their local view of the relation size, even for pages they
do not own.

* On other shards, this caused a relation size cache miss and a DbDir
and RelDir lookup before dropping the `ClearVmBits`. With many
relations, this could cause significant CPU overhead.

This is not believed to be a correctness problem, but this will be
verified in #9914.

Resolves #9855.

# Changes

Route `ClearVmBits` metadata records only to the shards responsible for
the VM pages.

Verification of the current VM handling and cleanup of incomplete VM
pages on shard 0 (and potentially elsewhere) is left as follow-up work.
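
For illustration, the routing decision amounts to something like the sketch below; the `ShardIdentity` type and the round-robin striping are assumptions standing in for the pageserver's actual shard placement, and `vm_blkno` is the VM block derived from the modified heap block:

```rust
// Sketch only: pick the shard(s) that should receive a ClearVmBits-style
// record, instead of broadcasting it to every shard.
struct ShardIdentity {
    shard_number: u32,
    shard_count: u32,
    stripe_size_blocks: u32,
}

impl ShardIdentity {
    /// Illustrative round-robin striping: blocks are assigned to shards in
    /// `stripe_size_blocks`-sized runs.
    fn owns_block(&self, blkno: u32) -> bool {
        (blkno / self.stripe_size_blocks) % self.shard_count == self.shard_number
    }
}

/// Shards responsible for the VM block `vm_blkno`.
fn shards_for_clear_vm_bits(shards: &[ShardIdentity], vm_blkno: u32) -> Vec<u32> {
    shards
        .iter()
        .filter(|shard| shard.owns_block(vm_blkno))
        .map(|shard| shard.shard_number)
        .collect()
}
```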
erikgrinaker commented:

The pg_visibility extension is useful for debugging the visibility map. The following will verify that it's correct (i.e. matches the actual heap pages):

select pg_check_visible('table');
select pg_check_frozen('table');

This was verified by deliberately discarding ClearVmBits operations, which pg_check_visible() then detected.

Note that VM pages will often be cached in the compute, so to verify that storage has the correct view, the compute must be restarted to clear its caches.

erikgrinaker commented Nov 28, 2024

The following produced 386 faulty VM pages when run with ClearVmBits disabled:

createdb -p 55432 pgbench
./pg_install/v16/bin/pgbench -p 55432 -i -s 100 pgbench
./pg_install/v16/bin/pgbench -p 55432 -c 64 -T 60 -j 16 -P 1 pgbench
cargo neon endpoint stop main
cargo neon endpoint start main
psql -p 55432 pgbench -c "select count(*) from pg_check_visible('pgbench_accounts')"

I'll do five runs across 8 shards with stripe size 16 MB for 1 minute each, before and after #9895, to verify that there is no correctness issue.

erikgrinaker commented:

Tests didn't show any VM page corruption.

I've spent the day reading and debugging code, running tests, and talking to Heikki (who wrote the visibility map), and I'm reasonably confident that we don't have a correctness problem. We likely have some stray keys to clean up, though; I've split out #9927 for that.

I'll leave this issue open to consider adding some more assertions, checks, or tests around this.

github-merge-queue bot pushed a commit that referenced this issue Nov 28, 2024
Build the `pg_visibility` extension for use with `neon_local`. This is
useful to inspect the visibility map for debugging.

Touches #9914.
erikgrinaker added a commit that referenced this issue Nov 29, 2024
erikgrinaker commented:

Closing this out, the following should address any gaps:

awarus pushed a commit that referenced this issue Dec 5, 2024
erikgrinaker commented:

We found some potentially incorrect FSM key handling over in #10027.

github-merge-queue bot pushed a commit that referenced this issue Dec 6, 2024
We've seen cases where stray keys end up on the wrong shard. This
shouldn't happen. Add debug assertions to prevent this. In release
builds, we should be lenient in order to handle changing key ownership
policies.

Touches #9914.
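
A minimal sketch of what such an assertion could look like (hypothetical types and names; `debug_assert!` fires only in debug builds, which gives the release-build leniency described above):

```rust
// Sketch only: validate key ownership at ingest time.
fn put_value(shard_owns_key: bool, key: u64, value: &[u8]) {
    // Loud in debug builds, a no-op in release builds, so that changing
    // key ownership policies does not break ingestion in production.
    debug_assert!(
        shard_owns_key,
        "stray key {key:#x} ({} bytes) ingested on a non-owning shard",
        value.len()
    );
    // ... store (key, value) as usual ...
    let _ = (key, value);
}
```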
github-merge-queue bot pushed a commit that referenced this issue Dec 10, 2024
Verifies that visibility map pages are correctly maintained across
shards.

Touches #9914.