
remove materialized page cache #8105

Merged: 6 commits into main from problame/remove-materialized-page-cache on Jun 20, 2024

Conversation

@problame (Contributor) commented Jun 18, 2024

part of Epic #7386

Motivation

The materialized page cache adds complexity to the code base, which increases the maintenance burden and risk for subtle and hard to reproduce bugs such as #8050.

Further, the best hit rate that we currently achieve in production is ca 1% of materialized page cache lookups for task_kind=PageRequestHandler. Other task kinds have hit rates <0.2%.

Last, caching page images in Pageserver rewards under-sized caches in Computes because reading from Pageserver's materialized page cache over the network is often sufficiently fast (low hundreds of microseconds). Such Computes should upscale their local caches to fit their working set, rather than repeatedly requesting the same page from Pageserver.

Some more discussion and context in internal thread https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459

Changes

This PR removes the materialized page cache code & metrics.

The infrastructure for different key kinds in PageCache is left in place, even though the "Immutable" key kind is the only remaining one.
This can be further simplified in a future commit.

Some tests started failing because their total runtime was dependent on high materialized page cache hit rates. This PR makes them fixed-runtime or raises their pytest timeouts (a sketch of both approaches follows the list):

  • test_local_file_cache_unlink
  • test_physical_replication
  • test_pg_regress
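
For illustration, here is a minimal pytest-style sketch of both approaches. It is not the actual test code: the test name, the db_cursor fixture, the iteration count, and the 900s value are hypothetical, and the timeout mark assumes the pytest-timeout plugin is in use.

```python
import pytest

# Hypothetical example, not one of the tests touched by this PR.
# Fixed runtime: do a fixed amount of work instead of "as much work as fits
# into N seconds", so the test duration no longer depends on cache hit rates.
FIXED_ITERATIONS = 10_000

@pytest.mark.timeout(900)  # raised per-test timeout (pytest-timeout plugin)
def test_page_churn_fixed_runtime(db_cursor):  # db_cursor: placeholder fixture
    for i in range(FIXED_ITERATIONS):
        db_cursor.execute("UPDATE t SET v = v + 1 WHERE id = %s", (i % 1000,))
```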

Performance

I focussed on ensuring that this PR will not result in a performance regression in prod.

  • getpage requests: our production metrics have shown the materialized page cache to be irrelevant (low hit rate). Also, Pageserver is the wrong place to cache page images; that should happen in compute.
  • ingest (task_kind=WalReceiverConnectionHandler): prod metrics show a 0% hit rate, so removing it will not be a regression.
  • get_lsn_by_timestamp: important API for branch creation, used by the control plane. The clog pages that this code uses are not materialized-page-cached because they're not 8k. No risk of introducing a regression here.

We will watch the various nightly benchmarks closely for more results before shipping to prod.

github-actions bot commented Jun 18, 2024

3222 tests run: 3105 passed, 0 failed, 117 skipped (full report)


Flaky tests (1)

Postgres 14

Code coverage* (full report)

  • functions: 32.4% (6839 of 21090 functions)
  • lines: 50.0% (53311 of 106578 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
45d7669 at 2024-06-19T11:51:13.122Z

@problame (Contributor, Author)

The failures are due to timeouts; I will push changes that fix the tests to have fixed runtime and/or raise their timeouts.

The rationale is that these are single-tenant tests that were previously fast because they had excellent materialized page cache hit rates. As stated in the PR description, that's something we want to discourage: real computes should increase their shared_buffers/LFC. For the tests, we'd rather keep hitting the Pageservers to exercise the code than increase their cache sizes.

@problame (Contributor, Author) commented Jun 19, 2024

The pg_regress failure in release mode is a bit concerning though (see regress.diffs artifact). It's not due to timeout afaict https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8105/9573053274/index.html#suites/158be07438eb5188d40b466b6acfaeb3/bfde9639f5187a1/

diff -U3 /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/portals.out /tmp/test_output/test_pg_regress[release-pg15-None]/regress/results/portals.out
--- /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/portals.out	2024-06-18 22:45:27.681138276 +0000
+++ /tmp/test_output/test_pg_regress[release-pg15-None]/regress/results/portals.out	2024-06-18 22:50:45.694064880 +0000
@@ -34,8 +34,8 @@
 FETCH 2 in foo2;
  unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
 ---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
-    8800 |       0 |   0 |    0 |   0 |      0 |       0 |      800 |         800 |      3800 |     8800 |   0 |    1 | MAAAAA   | AAAAAA   | AAAAxx
-    1891 |       1 |   1 |    3 |   1 |     11 |      91 |      891 |        1891 |      1891 |     1891 | 182 |  183 | TUAAAA   | BAAAAA   | HHHHxx
+    9437 |    9744 |   1 |    1 |   7 |     17 |      37 |      437 |        1437 |      4437 |     9437 |  74 |   75 | ZYAAAA   | UKOAAA   | AAAAxx
+    1821 |    9745 |   1 |    1 |   1 |      1 |      21 |      821 |        1821 |      1821 |     1821 |  42 |   43 | BSAAAA   | VKOAAA   | HHHHxx
 (2 rows)
...

Note this PR is on top of #8050, and further, the bug fixed by #8050 can't happen with the materialized page cache. So, is this a different issue?

@problame marked this pull request as ready for review June 19, 2024 12:00
@problame requested a review from a team as a code owner June 19, 2024 12:00
@problame requested a review from jcsp June 19, 2024 12:00
@jcsp (Collaborator) left a comment

LGTM once this is understood:

The pg_regress failure in release mode is a bit concerning though (see regress.diffs artifact). It's not due to timeout afaict

Let's be thoughtful about when we merge: maybe immediately after next week's release branch is cut, so that we get the full week in staging?

@problame (Contributor, Author) commented Jun 19, 2024

Let's be thoughtful about when we merge: maybe immediately after next week's release branch is cut, so that we get the full week in staging?

I would rather merge today & do an early deploy to pre-prod so we get multiple prodlike cloudbench run results. (Reverting by the end of the week or on the weekend is less disruptive than reverting on Monday.)

@jcsp (Collaborator) commented Jun 19, 2024

I would rather merge today & do an early deploy to pre-prod so we get multiple prodlike cloudbench run results. (Reverting by the end of the week or on the weekend is less disruptive than reverting on Monday.)

Works for me.

@problame (Contributor, Author)

Trying to repro the pg_regress test failure in

stacked atop this PR

@hlinnaka (Contributor)

The pg_regress failure in release mode is a bit concerning though (see regress.diffs artifact). It's not due to timeout afaict https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8105/9573053274/index.html#suites/158be07438eb5188d40b466b6acfaeb3/bfde9639f5187a1/

At a quick glance, those are caused by differences in result set ordering. If you do something like SELECT * FROM foo with no ORDER BY, the order of the results is not well-defined. The Postgres regression test suite contains a lot of queries like that: they are sensitive to row ordering and in theory at the whim of implementation details, but in practice they are stable and don't usually cause any trouble.

The 'portals' test that failed here is more sensitive than most.
See https://www.postgresql.org/message-id/flat/1246440.1693883808%40sss.pgh.pa.us for upstream discussion on this topic.
If you run it with a very small shared_buffers setting, the "synchronized scans" feature can kick in and cause the result set ordering to change. I think that's what happened here.

If it's a one-off thing and doesn't repeat, I'd say ignore it. If it happens again, we can try e.g. setting synchronize_seqscans=off for that test, as in the sketch below.
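
A minimal sketch of that suggestion, using psycopg2 directly rather than the actual regression harness; the DSN and database name are placeholders:

```python
import psycopg2

# If the ordering flake repeats, disable synchronized sequential scans for the
# database the regression tests run against, so sequential scans always start
# at block 0 and the ordering of ORDER BY-less results stays stable.
conn = psycopg2.connect("dbname=regression")  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("ALTER DATABASE regression SET synchronize_seqscans = off")
conn.close()
```

A session-level `SET synchronize_seqscans = off` at the top of the affected test script would have the same effect for that session only.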

@problame (Contributor, Author)

So, why are the tests timing out with pytest.mark.repeat()? (See the illustrative sketch after this list.)

  • we use pytest-xdist to run 16 tests in parallel
  • the pg_regress test, especially in debug mode where we do the validation, is much more CPU intensive than your typical test
  • so when we run only the pg_regress test, and 16 of them in parallel, we become CPU-bottlenecked
  • this results in longer test runtimes
  • that's why we hit the timeout
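
Purely illustrative sketch of that setup (the test name and values are stand-ins): the marks assume the pytest-repeat and pytest-timeout plugins, and the parallelism comes from invoking pytest with `-n 16` via pytest-xdist.

```python
import pytest

# 16 instances of a CPU-heavy test scheduled at once: with `pytest -n 16`
# (pytest-xdist) they all compete for the same cores, each instance runs
# longer, and the per-test timeout can fire even though nothing is hung.
@pytest.mark.repeat(16)    # pytest-repeat: run this test 16 times
@pytest.mark.timeout(600)  # pytest-timeout: wall-clock limit per instance
def test_pg_regress_like_workload():
    ...  # stand-in for the expensive pg_regress run
```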

@problame (Contributor, Author)

I'm merging this, as I am convinced that the pg_regress failure we observed isn't due to changes in this PR (see #8105 (comment)).

I'm going to deploy to pre-prod post-merge so we get 4 benchmark runs before next week's release.

@problame merged commit 7940163 into main Jun 20, 2024
65 checks passed
@problame deleted the problame/remove-materialized-page-cache branch June 20, 2024 09:56
@problame (Contributor, Author)

The post-merge commit failed two test_runner.performance benchmarks (Allure):

test_heavy_write_workload[neon_off-release-pg14-github-actions-selfhosted-10-5-5]
test_basebackup_with_high_slru_count[release-pg14-github-actions-selfhosted-sequential-10-13-30]

However, two commits earlier, these benchmarks also failed (Allure)

So, I'm moving on with the plan to deploy to pre-prod.

@knizhnik (Contributor)

Let's make a bet :)
How long will it take the storage team to reincarnate the materialized page cache?
My guess is one year ;)

Why do I think so? Three main arguments:

  1. Page reconstruction is still quite an expensive operation. No matter how much we optimize the walredo process (maybe even rewriting the redo handlers in Rust and throwing away the walredo process altogether), it will still be quite expensive compared with taking a page from a cache. And caching is the most straightforward, natural, simple, and efficient way to reduce the cost of expensive operations.
  2. There is quite a lot of memory on PS, and not many other things it can be used for (except the OS file cache).
  3. The alternative to caching materialized pages is forced generation of image layers, but that causes significant write amplification and may affect overall performance. Partial image layers could be some kind of compromise.

As far as I understand, the main arguments against the materialized cache were:

  1. Low hit rate
  2. Trouble with the implementation and contention.

The first can be explained by a cache that is too small and by its being flushed by other requests (I am not sure whether we now have a separate cache for materialized pages, as I suggested a long time ago).
Concerning 2), we should definitely thank Rust for that: it manages to make the simplest things non-trivial.

To be honest, I see several arguments for removing this cache:

  1. We assume that compute will keep frequently accessed pages in shared buffers or the LFC. But that doesn't work across node restarts or for RO replicas.
  2. If a page is rarely updated, it may be more efficient to create an image layer to materialize it. If a page is frequently updated, the materialized cache will be frequently invalidated, so maintaining it will slow down writes but not speed up reads.
  3. Page reconstruction cost can be negligible compared with the compute<->Pageserver round trip. But that is not true for prefetch.

conradludgate pushed a commit that referenced this pull request Jun 27, 2024 (the commit message repeats the PR description above).
@problame self-assigned this Jun 27, 2024
jcsp added a commit that referenced this pull request Jul 8, 2024
## Problem

Debug-mode runs of test_pg_regress are rather slow since
#8105, and occasionally exceed
their 600s timeout.

## Summary of changes

- Use 8MiB layer files, avoiding large ephemeral layers

On a hetzner AX102, this takes the runtime from 230s to 190s, which will hopefully be enough to get the runtime on github runners more reliably below the 600s timeout.

This has the side benefit of exercising more of the pageserver stack
(including compaction) under a workload that exercises a more diverse
set of postgres functionality than most of our tests.
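
For context, a hedged sketch of what an 8 MiB layer-file cap can look like as pageserver tenant configuration; the knob name and the way a test would pass it are assumptions, not necessarily the commit's exact change.

```python
# Assumed tenant-config knob (illustrative, not the commit's exact mechanism):
# a small checkpoint_distance makes the pageserver freeze and flush ephemeral
# (in-memory) layers early, so layer files stay around 8 MiB instead of
# growing large during the pg_regress run.
TENANT_CONF = {
    "checkpoint_distance": str(8 * 1024 * 1024),  # ~8 MiB
}
```
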
skyzh pushed a commit that referenced this pull request Jul 15, 2024 (same commit message as the one above).