remove materialized page cache #8105
Conversation
3222 tests run: 3105 passed, 0 failed, 117 skipped (full report)

Code coverage* (full report)
* collected from Rust tests only

The comment gets automatically updated with the latest test results: 45d7669 at 2024-06-19T11:51:13.122Z :recycle:
The failures are due to timeouts; I will push changes that fix the tests to have a fixed runtime and/or raise their timeouts. The rationale: these are single-tenant tests that were previously fast because they had excellent materialized page cache hit rates. As stated in the PR description, that's something we want to discourage; real computes should increase their shared_buffers/LFC. For the tests, we'd rather still hit the Pageservers to exercise the code than increase their cache sizes.
Note this PR is on top of #8050; further, the bug fixed by #8050 can't happen with the materialized page cache removed. So, is this a different issue?
LGTM once this is understood:

- The pg_regress failure in release mode is a bit concerning, though (see the regress.diffs artifact). It's not due to a timeout, AFAICT.
- Let's be thoughtful about when we merge: maybe immediately after next week's release branch is cut, so that we get the full week in staging?
I would rather merge today & do an early deploy to pre-prod so we get multiple prod-like cloudbench run results. (Reverting by end of week or on the weekend is less disruptive than reverting on Monday.)
Works for me.
Trying to repro the pg_regress test failure, stacked atop this PR.
At a quick glance, those are caused by differences in result set ordering: if a query lacks an explicit ORDER BY, the row order can legitimately vary between runs. The 'portals' test that failed here is more sensitive to that than most. If it's a one-off and doesn't repeat, I'd say ignore it. If it happens again, we can try e.g. setting …
So, why are the tests timing out with …?
I'm merging this, as I am convinced that the pg_regress failure we observed isn't due to changes in this PR: #8105 (comment). I'm going to deploy to pre-prod post-merge so we get 4 benchmark runs before next week's release.
The post-merge commit failed two benchmarks. However, two commits earlier, these benchmarks also failed (Allure). So, I'm moving on with the plan to deploy to pre-prod.
Let's make a bet :) Why do I think so? Three main arguments: …

As far as I understand, the main arguments against the materialised cache were: …

The first can be explained by a too-small cache size and its entries being evicted by other requests (I am not sure whether we now have a separate cache for materialised pages, as I suggested a long time ago). To be honest, I see several arguments for removing this cache: …
Part of Epic #7386.

# Motivation

The materialized page cache adds complexity to the code base, which increases the maintenance burden and the risk of subtle, hard-to-reproduce bugs such as #8050.

Further, the best hit rate we currently achieve in production is ca. 1% of materialized page cache lookups for `task_kind=PageRequestHandler`. Other task kinds have hit rates <0.2%.

Last, caching page images in Pageserver rewards under-sized caches in Computes, because reading from Pageserver's materialized page cache over the network is often sufficiently fast (low hundreds of microseconds). Such Computes should upscale their local caches to fit their working set, rather than repeatedly requesting the same page from Pageserver.

Some more discussion and context in internal thread https://neondb.slack.com/archives/C033RQ5SPDH/p1718714037708459

# Changes

This PR removes the materialized page cache code & metrics. The infrastructure for different key kinds in `PageCache` is left in place, even though the "Immutable" key kind is the only remaining one. This can be further simplified in a future commit (a sketch of the remaining shape follows this description).

Some tests started failing because their total runtime was dependent on high materialized page cache hit rates. This PR makes them fixed-runtime or raises their pytest timeouts:

* test_local_file_cache_unlink
* test_physical_replication
* test_pg_regress

# Performance

I focussed on ensuring that this PR will not result in a performance regression in prod.

* **getpage** requests: our production metrics have shown the materialized page cache to be irrelevant (low hit rate). Also, Pageserver is the wrong place to cache page images; that should happen in Compute.
* **ingest** (`task_kind=WalReceiverConnectionHandler`): prod metrics show a 0% hit rate, so removing the cache will not be a regression.
* **get_lsn_by_timestamp**: an important API for branch creation, used by the control plane. The CLOG pages this code uses are not materialized-page-cached because they're not 8k. No risk of introducing a regression here.

We will watch the various nightly benchmarks closely for more results before shipping to prod.
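For illustration, here is a minimal, hypothetical sketch of what the remaining key-kind infrastructure amounts to once the "Immutable" kind is the only one left. The names (`CacheKey`, `PageCache`, `read_immutable`) and the `HashMap` storage are illustrative stand-ins, not the actual pageserver types, which use a fixed number of slots plus eviction:

```rust
use std::collections::HashMap;

// With the "Materialized" kind removed, the key space collapses to a
// single variant; the enum indirection could later be flattened into a
// plain struct key.
#[derive(Clone, PartialEq, Eq, Hash)]
enum CacheKey {
    /// Immutable page contents, identified by file id and block number.
    Immutable { file_id: u64, blkno: u32 },
    // Materialized { .. } -- removed by this PR
}

struct PageCache {
    // Simplified: an unbounded map stands in for fixed slots + eviction.
    slots: HashMap<CacheKey, Vec<u8>>,
}

impl PageCache {
    fn new() -> Self {
        Self { slots: HashMap::new() }
    }

    /// Look up an immutable page; on a miss, the caller reads from disk
    /// and inserts the result.
    fn read_immutable(&self, file_id: u64, blkno: u32) -> Option<&[u8]> {
        self.slots
            .get(&CacheKey::Immutable { file_id, blkno })
            .map(|page| page.as_slice())
    }

    fn insert_immutable(&mut self, file_id: u64, blkno: u32, page: Vec<u8>) {
        self.slots.insert(CacheKey::Immutable { file_id, blkno }, page);
    }
}

fn main() {
    let mut cache = PageCache::new();
    cache.insert_immutable(1, 0, vec![0u8; 8192]);
    assert!(cache.read_immutable(1, 0).is_some());
    assert!(cache.read_immutable(1, 1).is_none());
}
```

Flattening `CacheKey` into a plain `(file_id, blkno)` struct would be the "further simplification in a future commit" that the description mentions.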
## Problem

Debug-mode runs of test_pg_regress are rather slow since #8105, and occasionally exceed their 600s timeout.

## Summary of changes

- Use 8MiB layer files, avoiding large ephemeral layers (see the sketch below)

On a Hetzner AX102, this takes the runtime from 230s to 190s, which hopefully will be enough to get the runtime on GitHub runners more reliably below the 600s timeout. It has the side benefit of exercising more of the pageserver stack (including compaction) under a workload that exercises a more diverse set of postgres functionality than most of our tests.
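To make the effect concrete, here is a hedged, self-contained sketch of the idea behind the 8MiB cap; `EphemeralLayer` and `CHECKPOINT_DISTANCE` are illustrative names, not the actual pageserver implementation:

```rust
// Sketch: ingested WAL accumulates in an in-memory ("ephemeral") layer;
// capping that layer at 8 MiB forces frequent, cheap flushes to small
// layer files instead of one large, slow write at the end of the test.
const CHECKPOINT_DISTANCE: u64 = 8 * 1024 * 1024; // 8 MiB cap per layer

struct EphemeralLayer {
    bytes_written: u64,
}

impl EphemeralLayer {
    /// Account for an ingested record; returns true when the layer has
    /// reached the cap and should be frozen and flushed to disk.
    fn record(&mut self, record_len: u64) -> bool {
        self.bytes_written += record_len;
        self.bytes_written >= CHECKPOINT_DISTANCE
    }
}

fn main() {
    let mut layer = EphemeralLayer { bytes_written: 0 };
    let mut flushes = 0;
    // Ingest 32 MiB in 8 KiB records: with an 8 MiB cap we expect four
    // small layer files rather than one 32 MiB ephemeral layer.
    for _ in 0..4096 {
        if layer.record(8192) {
            flushes += 1;
            layer.bytes_written = 0; // start a fresh layer
        }
    }
    assert_eq!(flushes, 4);
}
```

Each flush produces a small layer file that later compaction can pick up, which is why the smaller cap also ends up exercising more of the pageserver stack.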