-
Notifications
You must be signed in to change notification settings - Fork 457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Propagate apply_lsn from SK to PS to prevent GC from collecting objects which may be still requested by replica #6357
Conversation
2406 tests run: 2290 passed, 0 failed, 116 skipped (full report)Code coverage (full report)
The comment gets automatically updated with the latest test results
32f4a56 at 2024-02-07T07:45:12.191Z :recycle: |
High level comments:
|
pageserver/src/tenant/timeline.rs
Outdated
@@ -1358,6 +1360,7 @@ impl Timeline { | |||
|
|||
compaction_lock: tokio::sync::Mutex::default(), | |||
gc_lock: tokio::sync::Mutex::default(), | |||
standby_flush_lsn: AtomicLsn::new(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using Lsn::INVALID would make it more obvious that thise is the same as the default value in SkTimelineInfo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AtomicLsn
constructor accepts u64 parameter. And AtomicLsn::new(LSN::INVALID.0)
looks IMHO really strange
} | ||
} | ||
if reply.flush_lsn != Lsn::INVALID { | ||
if reply_agg.flush_lsn != Lsn::INVALID { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since INVALID is defined as zero, wouldn't it be equivalent to just always do an Lsn::min here, rather than checking for INVALID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately it doesn't work as 0 is is minimum u64 number, flush_lsb will never be assigned larger value without this check. Alternative is to use Lsn::MAX
as default - in this case we can avoid this check. But it may cause problems in other places which expect Lsn::INVALID
and not Lsn::MAX
.
So I preserved the same logic as already used for PS feedback.
It was my original concern, why I didn't want replicas to hold GC at PS. So the question can it be active replica which has huge lag with master? I do not see that it is realistic scenario.
As I wrote above only active replicas are taken into account.
It is not so early to somehow slowdown replica to cause replication lag. Certainly we can stop replica. But in this case it will not taken in account. And once it is restarted, it should catch up quite soon, unless there are a lot of changes it has to apply. But it means that test should take long enough time to collect this WAL. |
Can you explain more? When a replica ceases to be active, how does the
You may need some test-specific way to pause a replica, like a SIGSTOP, so that it remains logically active but delays advancing the offset. |
So workflow is the following:
So if there is no active replicas than
I will think more about it. The problem is that replica is receiving and replaying WAL is Postgres core code. I do not want to change it just to make it possible to create test. In principle we can add such "throttling" in WAL page filtering (done in our extension). But adding extra GUC and throttling code just for testing is IMHO overkill. |
e7ea8bb
to
4242f79
Compare
Test I added - without this PR it reproduces "tried to request a page version that was garbage collected" error |
secondary_lsn = secondary.safe_psql_scalar( | ||
"SELECT pg_last_wal_replay_lsn()", log_query=False | ||
) | ||
balance = secondary.safe_psql_scalar("select sum(abalance) from pgbench_accounts") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no assertion in this test: it looks like it will pass as long as pgbench runs. I guess this is where you need to add something that makes the replica slow, and then some check that GC doesn't advance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assertion is not needed - if requested page can not be reconstructed because requested LSN is beyand GC cutoff, then query is terminated with error and test is failed without any assertions.
It happens without this PR and still happen with 14/15 versions.
I am investigating it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay - can you please add a comment to the test that explains what it is testing, and how it would fail if something is wrong.
64f23d5
to
8de53a6
Compare
171793a
to
9bee043
Compare
1568c14
to
16e196f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave it a look-through, don't have any issues. Seems that we're just piping the horizon through from compute to pageserver, and that's the majority of the change.
6007fad
to
f6c6808
Compare
High level notes:
Mostly note to myself: haven't checked yet if hs feedback is processed at compute at all, last time I checked it wasn't. |
I agree. I didn't think much about upgrade issues.
It seems to be not a bug, but feature. We do not want offline replicas to somehow suspend GC. If replica is restarted, then it is restarted at most recent LSN.
I do not understand why do we need timestamp here. There is currently hardcoded constant:
which limits reported replica lag:
May be it will be better to replace it with PS parameter, but I do not understand why do we need to do something more than such check.
I also do not understand what can cause such large replication lag. Most likely it means some error. Hopefully it is already fixed. But from the other side - this
I have checked it - it I proceeded. Please notice that |
@@ -62,7 +62,7 @@ typedef struct | |||
typedef struct | |||
{ | |||
NeonMessageTag tag; | |||
bool latest; /* if true, request latest page version */ | |||
XLogRecPtr horizon; /* uppe boundary for page LSN */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saying here about 0 special case would be good, or add reference to neon_get_horizon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed comment because I do not consider 0 as special case any more.
get_page
now specifies interval of LSN. 0 is valid value of interval low boundary.
It is not somehow specially treated at PS.
} | ||
let last_record_lsn = timeline.get_last_record_lsn(); | ||
let request_horizon = if horizon == Lsn::INVALID { | ||
lsn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This removes comment about special lsn == 0 case. While we don't connect pageserver directly to compute anymore, I believe this hack is still valid when we get initial basebackup, so worth leaving comment, adapting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above
// anyway) | ||
} | ||
let last_record_lsn = timeline.get_last_record_lsn(); | ||
let request_horizon = if horizon == Lsn::INVALID { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite unobvious, can't we do that at compute side (set lsn == request_lsn is that latter is 0)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was specially done for conveniece in tests: there are a lot of tests which originally used latest=false
.
Passing copy of lsn to horizon in all this tests seems to be very inconvenient.
So I prefer to add comment here rather than require to pass copy of LSN.
And it seems to be quite logical:
- Lsn::MAX stands for latest version
- any valid LSN specified upper boundary for LSN
- Lsn::INVALID (0) move upper boundary down to the lower boundary=lsn
I have made thew following changes after discussion with @arssher :
|
28d4020
to
32f4a56
Compare
I don't understand why all the compute <-> pageserver protocol changes are needed. Can you explain? (I remember we talked about that in a 1:1 call last week, but I cannot remember it now, and it would be good to have written down in comments anyway) |
I briefly explain it in the comment in pagestore_smgr.c:
The problem itself is more detailed explained in Slack: So I assume that two cases are clear:
In both cases we do no need range and just "latest" flag. Originally I thought that the problem is related only with GC. And if GC is disabled, then there is no problem with hot-standby replicas. But looks like it is not true: So to handle case 3 we need top pas range of LSN: low boundary specifies estimated LSN of last page update (so that we do make PS to wait for applying most recent updates) and higher boundary specifies current apply position of replica, to prevent PS from sending too young pages. With such range determining start LSN position for lookup in layer map becomes even simpler than now with "latest" flag - it is just |
Ok, so the protocol changes are completely unrelated to the topic of this PR, propagating apply_lsn from SK to PS. Please split that off to a separate PR. |
|
@knizhnik, et al, thanks for your work on this issue. |
I see some changes in protocol. While deploying we will have older compute images talking to newer pageservers and safekeepers. Would that be a problem? @petuhovskiy can you please review that one? (ideas on on how to split that up to a safer series of patches and how to increase test coverage are welcome) |
also #6718 has some part of protocol changes as well. @skyzh when talking about last compute image rollout you've mentioned explicit protocol versioning for compute <> pageserver protocol to avoid assuming that each release has some breaking (bat backward-compatible) change. Should we add protocol version right here? And in a follow up PR we can add some tests to check for api breakage and the relax compute image selection rules in a control plane. |
Protocol changes (sending LSN range) were extracted from this ticket to #6718 |
ok, got it. Then @petuhovskiy or @arssher can you please take a look on #6718 instead? |
Took a quick look at #6718, it's about compute<->ps protocol change and I don't have expertise on that. But I'll take a look at this PR after it will get rebased. |
Also note that the current prod runs a release of compute node from 3 weeks ago. Better to wait it to catch up with our latest release before adding more things to the compute so that it's easier to roll back in case something goes wrong. |
We definitely need to unblock/unpin Compute first, and see if it's stable. Only then we can merge this in, and release it. @knizhnik After moving the protocol changes into #6718, are there any other breaking changes in this PR which require Compute restart? This PR is touching 30+ code files. Is it possible to break it down into smaller patches, which can be rolled out in 2 or 3 steps? In steps which build on each other. |
Replaced with #7368 |
Problem
Lagged replica can request objects which are beyond PiTR boundary ands so collected by GC
#6211
Summary of changes
Take in account
standby_flush_lsn
when calculationgc_cutoff
Checklist before requesting a review
Checklist before merging