
remove materialized page cache #8105

Merged: 6 commits into main from problame/remove-materialized-page-cache on Jun 20, 2024

Conversation

@problame (Contributor) commented Jun 18, 2024

part of Epic #7386

Motivation

The materialized page cache adds complexity to the code base, which increases the maintenance burden and risk for subtle and hard to reproduce bugs such as #8050.

Further, the best hit rate that we currently achieve in production is ca 1% of materialized page cache lookups for task_kind=PageRequestHandler. Other task kinds have hit rates <0.2%.

Last, caching page images in Pageserver rewards under-sized caches in Computes because reading from Pageserver's materialized page cache over the network is often sufficiently fast (low hundreds of microseconds). Such Computes should upscale their local caches to fit their working set, rather than repeatedly requesting the same page from Pageserver.

Some more discussion and context in internal thread https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459

Changes

This PR removes the materialized page cache code & metrics.

The infrastructure for different key kinds in PageCache is left in place, even though the "Immutable" key kind is the only remaining one.
This can be further simplified in a future commit.

Some tests started failing because their total runtime was dependent on high materialized page cache hit rates. This PR makes them fixed-runtime or raises their pytest timeouts (a sketch of both approaches follows the list):

  • test_local_file_cache_unlink
  • test_physical_replication
  • test_pg_regress
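
For illustration, here is a minimal pytest-style sketch of both approaches. It is not the actual test code: the test name, the db_cursor fixture, the iteration count, and the 900s value are hypothetical, and the timeout mark assumes the pytest-timeout plugin is in use.

```python
import pytest

# Hypothetical example, not one of the tests touched by this PR.
# Fixed runtime: do a fixed amount of work instead of "as much work as fits
# into N seconds", so the test duration no longer depends on cache hit rates.
FIXED_ITERATIONS = 10_000

@pytest.mark.timeout(900)  # raised per-test timeout (pytest-timeout plugin)
def test_page_churn_fixed_runtime(db_cursor):  # db_cursor: placeholder fixture
    for i in range(FIXED_ITERATIONS):
        db_cursor.execute("UPDATE t SET v = v + 1 WHERE id = %s", (i % 1000,))
```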

Performance

I focussed on ensuring that this PR will not result in a performance regression in prod.

  • getpage requests: our production metrics have shown the materialized page cache to be irrelevant (low hit rate). Also, Pageserver is the wrong place to cache page images; that should happen in compute.
  • ingest (task_kind=WalReceiverConnectionHandler): prod metrics show a 0% hit rate, so removing it will not be a regression.
  • get_lsn_by_timestamp: important API for branch creation, used by the control plane. The clog pages that this code uses are not materialized-page-cached because they're not 8k. No risk of introducing a regression here.

We will watch the various nightly benchmarks closely for more results before shipping to prod.

github-actions bot commented Jun 18, 2024

3222 tests run: 3105 passed, 0 failed, 117 skipped (full report)


Flaky tests (1)

Postgres 14

Code coverage* (full report)

  • functions: 32.4% (6839 of 21090 functions)
  • lines: 50.0% (53311 of 106578 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
45d7669 at 2024-06-19T11:51:13.122Z

@problame (Contributor, Author)

The failures are due to timeouts; I will push changes that fix the tests to have fixed runtime and/or raise their timeouts.

The rationale is that these are single-tenant tests that were previously fast because they had excellent materialized page cache hit rates. As stated in the PR description, that's something we want to discourage: real computes should increase their shared_buffers/LFC. For the tests, we'd rather keep hitting the Pageservers to exercise the code than increase their cache sizes.

@problame (Contributor, Author) commented Jun 19, 2024

The pg_regress failure in release mode is a bit concerning though (see regress.diffs artifact). It's not due to timeout afaict https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8105/9573053274/index.html#suites/158be07438eb5188d40b466b6acfaeb3/bfde9639f5187a1/

diff -U3 /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/portals.out /tmp/test_output/test_pg_regress[release-pg15-None]/regress/results/portals.out
--- /__w/neon/neon/vendor/postgres-v15/src/test/regress/expected/portals.out	2024-06-18 22:45:27.681138276 +0000
+++ /tmp/test_output/test_pg_regress[release-pg15-None]/regress/results/portals.out	2024-06-18 22:50:45.694064880 +0000
@@ -34,8 +34,8 @@
 FETCH 2 in foo2;
  unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
 ---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
-    8800 |       0 |   0 |    0 |   0 |      0 |       0 |      800 |         800 |      3800 |     8800 |   0 |    1 | MAAAAA   | AAAAAA   | AAAAxx
-    1891 |       1 |   1 |    3 |   1 |     11 |      91 |      891 |        1891 |      1891 |     1891 | 182 |  183 | TUAAAA   | BAAAAA   | HHHHxx
+    9437 |    9744 |   1 |    1 |   7 |     17 |      37 |      437 |        1437 |      4437 |     9437 |  74 |   75 | ZYAAAA   | UKOAAA   | AAAAxx
+    1821 |    9745 |   1 |    1 |   1 |      1 |      21 |      821 |        1821 |      1821 |     1821 |  42 |   43 | BSAAAA   | VKOAAA   | HHHHxx
 (2 rows)
...

Note this PR is on top of #8050, and further, the bug fixed by #8050 can't happen with the materialized page cache. So, is this a different issue?

@problame marked this pull request as ready for review June 19, 2024 12:00
@problame requested a review from a team as a code owner June 19, 2024 12:00
@problame requested a review from jcsp June 19, 2024 12:00
@jcsp (Collaborator) left a comment

LGTM once this is understood:

The pg_regress failure in release mode is a bit concerning though (see regress.diffs artifact). It's not due to timeout afaict

Let's be thoughtful about when we merge: maybe immediately after next week's release branch is cut, so that we get the full week in staging?

@problame (Contributor, Author) commented Jun 19, 2024

Let's be thoughtful about when we merge: maybe immediately after next week's release branch is cut, so that we get the full week in staging?

I would rather merge today & do an early deploy to pre-prod so we get multiple prodlike cloudbench run results. (Reverting by the end of the week or on the weekend is less disruptive than reverting on Monday.)

@jcsp (Collaborator) commented Jun 19, 2024

I would rather merge today & do an early deploy to pre-prod so we get multiple prodlike cloudbench run results. (Reverting by the end of the week or on the weekend is less disruptive than reverting on Monday.)

Works for me.

@problame (Contributor, Author)

Trying to repro the pg_regress test failure in

stacked atop this PR

@hlinnaka (Contributor)

The pg_regress failure in release mode is a bit concerning though (see regress.diffs artifact). It's not due to timeout afaict https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8105/9573053274/index.html#suites/158be07438eb5188d40b466b6acfaeb3/bfde9639f5187a1/

At a quick glance, those are caused by differences in result set ordering. If you do something like SELECT * FROM foo with no ORDER BY, the order of the results is not well-defined. The Postgres regression test suite contains a lot of queries like that: they are sensitive to row ordering and in theory at the whim of implementation details, but in practice they are stable and don't usually cause any trouble.

The 'portals' test that failed here is more sensitive than most.
See https://www.postgresql.org/message-id/flat/1246440.1693883808%40sss.pgh.pa.us for upstream discussion on this topic.
If you run it with a very small shared_buffers setting, the "synchronized scans" feature can kick in and cause the result set ordering to change. I think that's what happened here.

If it's a one-off thing and doesn't repeat, I'd say ignore it. If it happens again, we can try e.g. setting synchronize_seqscans=off for that test, as in the sketch below.
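
A minimal sketch of that suggestion, using psycopg2 directly rather than the actual regression harness; the DSN and database name are placeholders:

```python
import psycopg2

# If the ordering flake repeats, disable synchronized sequential scans for the
# database the regression tests run against, so sequential scans always start
# at block 0 and the ordering of ORDER BY-less results stays stable.
conn = psycopg2.connect("dbname=regression")  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("ALTER DATABASE regression SET synchronize_seqscans = off")
conn.close()
```

A session-level `SET synchronize_seqscans = off` at the top of the affected test script would have the same effect for that session only.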

@problame (Contributor, Author)

So, why are the tests timing out with pytest.mark.repeat()? (See the illustrative sketch after this list.)

  • we use pytest-xdist to run 16 tests in parallel
  • the pg_regress test, especially in debug mode where we do the validation, is much more CPU intensive than your typical test
  • so when we run only the pg_regress test, and 16 of them in parallel, we become CPU-bottlenecked
  • this results in longer test runtimes
  • that's why we hit the timeout
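
Purely illustrative sketch of that setup (the test name and values are stand-ins): the marks assume the pytest-repeat and pytest-timeout plugins, and the parallelism comes from invoking pytest with `-n 16` via pytest-xdist.

```python
import pytest

# 16 instances of a CPU-heavy test scheduled at once: with `pytest -n 16`
# (pytest-xdist) they all compete for the same cores, each instance runs
# longer, and the per-test timeout can fire even though nothing is hung.
@pytest.mark.repeat(16)    # pytest-repeat: run this test 16 times
@pytest.mark.timeout(600)  # pytest-timeout: wall-clock limit per instance
def test_pg_regress_like_workload():
    ...  # stand-in for the expensive pg_regress run
```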

@problame (Contributor, Author)

I'm merging this, as I am convinced that the pg_regress failure we observed isn't due to changes in this PR (see #8105 (comment)).

I'm going to deploy to pre-prod post-merge so we get 4 benchmark runs before next week's release.

@problame merged commit 7940163 into main Jun 20, 2024
65 checks passed
@problame deleted the problame/remove-materialized-page-cache branch June 20, 2024 09:56
@problame (Contributor, Author)

The post-merge commit failed two test_runner.performance benchmarks (Allure):

test_heavy_write_workload[neon_off-release-pg14-github-actions-selfhosted-10-5-5]
test_basebackup_with_high_slru_count[release-pg14-github-actions-selfhosted-sequential-10-13-30]

However, two commits earlier, these benchmarks also failed (Allure)

So, I'm moving on with the plan to deploy to pre-prod.

@knizhnik (Contributor)

Let's make a bet :)
How long will it take the storage team to reincarnate the materialized page cache?
My guess is one year ;)

Why do I think so? Three main arguments:

  1. Page reconstruction is still quite an expensive operation. No matter how much we optimize the walredo process (maybe even rewriting the redo handlers in Rust and throwing away the walredo process altogether), it will still be quite expensive compared with taking a page from a cache. And caching is the most straightforward, natural, simple, and efficient way to reduce the cost of expensive operations.
  2. There is quite a lot of memory on PS, and not many other things it can be used for (except the OS file cache).
  3. The alternative to caching materialized pages is forced generation of image layers, but that causes significant write amplification and may affect overall performance. Partial image layers could be some kind of compromise.

As far as I understand, the main arguments against the materialized cache were:

  1. Low hit rate
  2. Trouble with the implementation and contention.

The first can be explained by a cache that is too small and by its being flushed by other requests (I am not sure whether we now have a separate cache for materialized pages, as I suggested a long time ago).
Concerning 2), we should definitely thank Rust for that: it manages to make the simplest things non-trivial.

To be honest, I see several arguments for removing this cache:

  1. We assume that compute will keep frequently accessed pages in shared buffers or the LFC. But that doesn't work across node restarts or for RO replicas.
  2. If a page is rarely updated, it may be more efficient to create an image layer to materialize it. If a page is frequently updated, the materialized cache will be frequently invalidated, so maintaining it will slow down writes but not speed up reads.
  3. Page reconstruction cost can be negligible compared with the compute<->Pageserver round trip. But that is not true for prefetch.

conradludgate pushed a commit that referenced this pull request Jun 27, 2024 (the commit message repeats the PR description above).
@problame self-assigned this Jun 27, 2024
jcsp added a commit that referenced this pull request Jul 8, 2024
## Problem

Debug-mode runs of test_pg_regress are rather slow since
#8105, and occasionally exceed
their 600s timeout.

## Summary of changes

- Use 8MiB layer files, avoiding large ephemeral layers

On a hetzner AX102, this takes the runtime from 230s to 190s, which will hopefully be enough to get the runtime on github runners more reliably below the 600s timeout.

This has the side benefit of exercising more of the pageserver stack
(including compaction) under a workload that exercises a more diverse
set of postgres functionality than most of our tests.
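
For context, a hedged sketch of what an 8 MiB layer-file cap can look like as pageserver tenant configuration; the knob name and the way a test would pass it are assumptions, not necessarily the commit's exact change.

```python
# Assumed tenant-config knob (illustrative, not the commit's exact mechanism):
# a small checkpoint_distance makes the pageserver freeze and flush ephemeral
# (in-memory) layers early, so layer files stay around 8 MiB instead of
# growing large during the pg_regress run.
TENANT_CONF = {
    "checkpoint_distance": str(8 * 1024 * 1024),  # ~8 MiB
}
```
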
skyzh pushed a commit that referenced this pull request Jul 15, 2024 (same commit message as the one above).