
Epic: stabilize physical replication #6211

Open
16 of 20 tasks
vadim2404 opened this issue Dec 21, 2023 · 97 comments
Assignees
Labels
c/compute Component: compute, excluding postgres itself t/bug Issue Type: Bug t/Epic Issue type: Epic

Comments

@vadim2404
Contributor

vadim2404 commented Dec 21, 2023

Summary

The original issue we hit was:

page server returned error: tried to request a page version that was garbage collected. requested at C/1E923DE0 gc cutoff C/23B3DF00

but the scope quickly grew. This is the Epic to track the main physical replication work.

Tasks

  1. c/compute t/bug
    knizhnik
  2. c/compute
    tristan957
  3. hlinnaka
  4. c/compute t/bug
  5. knizhnik
  6. c/compute t/bug

Follow-ups:

Related Epics:

@vadim2404 vadim2404 added t/bug Issue Type: Bug c/compute Component: compute, excluding postgres itself labels Dec 21, 2023
@knizhnik
Contributor

I am thinking now about how this can be done.

  • The replica receives WAL from the safekeeper.
  • The master compute knows nothing about the presence of a replica: there is no replication slot at the master.
  • A replica can be arbitrarily lagged, suspended, etc. It may not contact the SK or the PS for an arbitrarily long time.
  • There are also no replication slots at the SK, so the SK has no knowledge of all existing replicas and their WAL positions.

So what can we do?

  1. We can create a replication slot at the master. This slot would be persisted using the AUX_KEY mechanism (right now it works only for logical slots, but that can be changed), and by applying this WAL record the PS would know the replica's position. It is not clear who would advance this slot if replication is performed from the SK. In principle, the SK could send this position in some feedback message to the PS, but that looks pretty ugly.
  2. The SK could explicitly notify the PS about the current positions of all replicas. It is not obvious how to report this position to the PS, which currently just receives a WAL stream from the SK. Should it be a special message in the SK<->PS protocol? Or should the SK generate a WAL record with the replica position (it is not clear which LSN such a record should be assigned so that it can be included in the stream of existing WAL records)? As mentioned above, the SK has no information about all replicas, so the lack of such a message doesn't mean that there is no replica with some old LSN.
  3. The replica could notify the PS itself (by means of some special message). The problem is that the replica can be offline and not send any requests to the PS.
  4. In addition to PITR, we could also have a max_replica_lag parameter: if a replica exceeds this value, it is disabled (see the sketch below).
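
A rough sketch of option 4, assuming a hypothetical max_replica_lag setting on the pageserver side (all names and types below are invented for illustration; this is not the actual GC code):

```rust
/// Hypothetical LSN wrapper; the real type lives elsewhere in the codebase.
#[derive(Clone, Copy)]
struct Lsn(u64);

/// Hold the GC cutoff back to a known replica's apply position, but never
/// by more than `max_replica_lag` bytes of WAL.
fn gc_cutoff_with_replica(
    pitr_cutoff: Lsn,               // cutoff computed from the PITR window
    last_record_lsn: Lsn,           // most recent WAL known to the pageserver
    replica_apply_lsn: Option<Lsn>, // replica position, if known at all
    max_replica_lag: u64,           // how much WAL lag we are willing to tolerate
) -> Lsn {
    match replica_apply_lsn {
        // No known replica: GC is limited only by the PITR window.
        None => pitr_cutoff,
        Some(apply) => {
            let lag = last_record_lsn.0.saturating_sub(apply.0);
            if lag > max_replica_lag {
                // The replica fell too far behind: stop protecting it; it
                // will have to be restarted from a fresh basebackup.
                pitr_cutoff
            } else {
                // Keep everything the replica may still request.
                Lsn(pitr_cutoff.0.min(apply.0))
            }
        }
    }
}
```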

@kelvich
Contributor

kelvich commented Jan 2, 2024

So basically we need to delay PITR for some amount of time for lagging replicas when they are enabled.

The replica could notify the PS itself (by means of some special message). The problem is that the replica can be offline and not send any requests to the PS.

That could be done with a time lease. The replica sends a message every 10 minutes; when the pageserver doesn't receive 3 messages in a row, it considers the replica to be disabled.
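
A minimal sketch of such a lease on the pageserver side (the interval and miss count come from the comment above; the struct and method names are invented):

```rust
use std::time::{Duration, Instant};

/// Hypothetical lease state the pageserver could keep per standby.
struct ReplicaLease {
    last_heard: Instant,
    ping_interval: Duration, // the replica promises to ping this often
    missed_allowed: u32,     // how many missed pings we tolerate
}

impl ReplicaLease {
    fn new() -> Self {
        ReplicaLease {
            last_heard: Instant::now(),
            ping_interval: Duration::from_secs(10 * 60), // every 10 minutes
            missed_allowed: 3,                           // 3 misses in a row
        }
    }

    /// Called whenever a lease/feedback message arrives from the replica.
    fn on_message(&mut self) {
        self.last_heard = Instant::now();
    }

    /// If the replica has been silent for 3 intervals in a row, treat it
    /// as disabled and stop holding back GC on its behalf.
    fn is_expired(&self) -> bool {
        self.last_heard.elapsed() > self.ping_interval * self.missed_allowed
    }
}
```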

The SK could explicitly notify the PS about the current positions of all replicas. It is not obvious how to report this position to the PS, which currently just receives a WAL stream from the SK. Should it be a special message in the SK<->PS protocol? Or should the SK generate a WAL record with the replica position (it is not clear which LSN such a record should be assigned so that it can be included in the stream of existing WAL records)? As mentioned above, the SK has no information about all replicas, so the lack of such a message doesn't mean that there is no replica with some old LSN.

Won't the usual feedback message help? IIRC we already have it for backpressure, and the pageserver also knows those LSNs via the storage broker.

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

PITR is enforced at the PS, and information about the replica flush/apply position is available only at the SK. The problem is that the PS can be connected to one SK1, and the replica to some other SK2. The only components which know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment when GC is performed by the PS. And involving the broker in the process of garbage collection on the PS seems to be overkill. Certainly the SKs could somehow interact with each other or through the WAL proposer, but that also seems too complicated and fragile.

@kelvich
Contributor

kelvich commented Jan 2, 2024

PITR is enforced at the PS, and information about the replica flush/apply position is available only at the SK. The problem is that the PS can be connected to one SK1, and the replica to some other SK2. The only components which know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment when GC is performed by the PS. And involving the broker in the process of garbage collection on the PS seems to be overkill. Certainly the SKs could somehow interact with each other or through the WAL proposer, but that also seems too complicated and fragile.

Through the broker, the pageserver has information about LSNs on all safekeepers. That is how the pageserver decides which one to connect to. So a safekeeper can advertise the min feedback LSN out of all replicas connected to it (if any).

Also, most likely, we should use information from the broker when deciding which safekeeper to connect to on the replica. @arssher what do you think?
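
A sketch of what "advertise the min feedback LSN out of all replicas connected to it" could mean on the safekeeper side (purely illustrative; the real safekeeper and broker structures are different):

```rust
/// Hypothetical per-connection state on a safekeeper: the apply position
/// reported by each connected subscriber via standby feedback.
struct WalSenderState {
    is_standby: bool,        // true for replicas, false for the pageserver
    reported_apply_lsn: u64, // 0 until the first feedback message arrives
}

/// The value a safekeeper could publish to the storage broker so that the
/// pageserver knows how far behind any of *its* replicas still is.
/// Returns None when no standby is connected, i.e. nothing to protect.
fn min_standby_apply_lsn(senders: &[WalSenderState]) -> Option<u64> {
    senders
        .iter()
        .filter(|s| s.is_standby && s.reported_apply_lsn != 0)
        .map(|s| s.reported_apply_lsn)
        .min()
}
```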

@arssher
Contributor

arssher commented Jan 2, 2024

Through the broker, the pageserver has information about LSNs on all safekeepers. That is how the pageserver decides which one to connect to. So a safekeeper can advertise the min feedback LSN out of all replicas connected to it (if any).

Yes, this seems to be the easiest way.

Also, most likely, we should use information from the broker when deciding which safekeeper to connect to on the replica. @arssher what do you think?

Not necessarily. A replica here is different from the pageserver because it costs something, so we're OK with keeping the standby -> safekeeper connection open all the time as long as the standby is alive, which means the standby can be the initiator of the connection. So what we currently do is just wire all safekeepers into primary_conninfo; if one is down, libpq will try another, etc. If the set of safekeepers changes we need to update the setting, but this is not hard (though it is not automated yet).
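
For illustration, "wire all safekeepers into primary_conninfo" could amount to something like this (a hypothetical helper, not the actual compute_ctl code; host names and port are placeholders):

```rust
/// Build a multi-host primary_conninfo value from the current set of
/// safekeepers; libpq tries the listed hosts in order until one connects.
fn build_primary_conninfo(safekeepers: &[&str], port: u16) -> String {
    let hosts = safekeepers.join(",");
    // libpq accepts either a single port or one port per host.
    let ports = vec![port.to_string(); safekeepers.len()].join(",");
    format!("host={hosts} port={ports}")
}

fn main() {
    let conninfo = build_primary_conninfo(&["sk-1.local", "sk-2.local", "sk-3.local"], 6500);
    // primary_conninfo = 'host=sk-1.local,sk-2.local,sk-3.local port=6500,6500,6500'
    println!("primary_conninfo = '{conninfo}'");
}
```

If the set of safekeepers changes, the value has to be regenerated and the config reloaded, which matches the "not hard, but not automated yet" caveat above.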

With the pageserver we can't do the same, because we don't want to keep live connections for all existing attached timelines, and the safekeeper learns about new data first, so it should be the initiator of the connection. Using the broker gives another advantage: the pageserver can have an active connection and, at the same time, up-to-date info about other safekeepers' positions, so it can make a better choice of where to connect in complicated scenarios, e.g. when the connection to the current SK is alive but very slow for whatever reason. Similar, though less powerful, heuristics can be implemented without broker data (e.g. restart the connection if no new data arrives within some period).

Also, using the broker on the standby would likely be quite non-trivial because it is gRPC, and I'm not even sure a C gRPC library exists. So it looks like significant work without much gain.

@arssher
Contributor

arssher commented Jan 2, 2024

On a related note, I'm also very suspicious that the original issue is caused by this -- "doubt that replica lags for 7 days" -- me too. Looking at metrics to understand the standby position would be very useful, but pg_last_wal_replay_lsn is likely not collected :(

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

OK, so to summarize all of the above:

  1. Information about the replica apply position can be obtained by the PS from the broker (it is still not quite clear to me how frequently this information is updated).
  2. The problem is most likely not caused by replication lag, but by some bug in tracking VM updates either on the compute side or on the PS side. Since the problem is reproduced only on the replica, it is most likely a bug in the compute, particularly in performing redo in the compute. The PS doesn't know whether a get_page request comes from the master or a replica, so the problem is unlikely to be there. But there is one important difference: the master issues get_page requests with the latest option (takes the latest LSN), while the replica uses latest=false (sketched below).
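
The difference in item 2, sketched with an invented request type (the real pageserver protocol messages look different):

```rust
/// Invented illustration of how primary and replica page requests differ.
struct GetPageRequest {
    rel: u32,     // relation identifier (simplified)
    blkno: u32,   // block number
    lsn: u64,     // request LSN
    latest: bool, // "give me the latest version at or after `lsn`"
}

/// The primary asks for the newest page version (latest = true), so the
/// pageserver always reconstructs the page up to the last record.
fn primary_request(rel: u32, blkno: u32, last_written_lsn: u64) -> GetPageRequest {
    GetPageRequest { rel, blkno, lsn: last_written_lsn, latest: true }
}

/// A replica asks for the page exactly as of its replay position
/// (latest = false); if that LSN falls behind the GC horizon, the
/// pageserver answers with the "garbage collected" error from this epic.
fn replica_request(rel: u32, blkno: u32, replay_lsn: u64) -> GetPageRequest {
    GetPageRequest { rel, blkno, lsn: replay_lsn, latest: false }
}
```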

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

One of the problems with getting information about the replica position from the broker is that it is available only as long as the replica is connected to one of the SKs. If it is suspended, this information is not available. As far as I understand, only the control plane has information about all replicas. But it is not desirable to:

  • involve the control plane in the GC process
  • block GC until all replicas are online
  • remember the current state of all replicas in some shared storage

@vadim2404
Contributor Author

Under investigation (most probably slips to next week).

@arssher
Contributor

arssher commented Jan 2, 2024

One of the problems with getting information about the replica position from the broker is that it is available only as long as the replica is connected to one of the SKs.

Yes, but as Stas wrote somewhere, it's mostly OK to keep the data only as long as the replica is around. A newly seeded replica shouldn't lag significantly. Well, there is probably also the standby pinned to an LSN, but that can be addressed separately.

@knizhnik
Contributor

knizhnik commented Jan 2, 2024

A newly seeded replica shouldn't lag significantly.

My concern is that a replica can be suspended because of inactivity.
I wonder how we protect a replica from scale-to-zero now (if there are no active requests to the replica).

@vadim2404
Contributor Author

Recently, @arssher turned off suspension for computes which have logical replication subscribers.
a41c412

@knizhnik, you can adjust this part for RO endpoints. In compute_ctl, the compute type (R/W or R/O) is known.

@vadim2404
Contributor Author

@knizhnik to check why the replica needs to download the WAL.

@kelvich
Contributor

kelvich commented Jan 9, 2024

My concern is that a replica can be suspended because of inactivity.
Do not suspend a read-only replica if it has applied some WAL within some time interval (e.g. 5 minutes). This can be checked using last_flush_lsn.
Periodically wake up a read-only node to make it possible for it to connect to the master and get updates. The wakeup period should be several times larger than the suspend interval (otherwise it makes no sense to suspend the replica at all). It may also be useful to periodically wake up not only read-only replicas but any other suspended nodes. Such computes would have a chance to perform some bookkeeping work, e.g. autovacuum. I do not think that waking a node up once per hour for 5 minutes can significantly affect cost (for users).

Hm, how did we end up here? A replica should be suspended due to inactivity. A new start will begin at the latest LSN, so I am not sure why replica suspension is relevant.

There are two open questions now:

  • why the replica lags so much; that shouldn't happen, and it is the most pressing issue
  • how we delay GC in the case of a legitimately lagging replica. The approach with the broker sounds reasonable (no replica == no need to hold GC). The control plane doesn't know about replica LSNs and shouldn't know about them.

@knizhnik
Contributor

Sorry, my concerns about read-only replica suspension (when there are no active queries) seem to be irrelevant.
Unlike a "standard" read-only replica in vanilla Postgres, we do not need to replay all WAL when activating a suspended replica. The pageserver just creates a basebackup at the most recent LSN for launching the replica, and I have verified that this is how it works now.

So a lagging replica cannot be caused by replica suspension. Quite the opposite: suspending and restarting a replica should cause it to "catch up" with the master. A large replication lag between master and replica must have some other cause. Actually, I see only two possible reasons:

  1. The replica applies WAL more slowly than the master produces it, for example because the replica runs on a less powerful VM than the master.
  2. There was some error in processing WAL at the replica which stuck replication. It could be related to the problem recently fixed by @arssher (alignment of segments sent to the replica on a page boundary).

Are there links to the projects suffering from this problem? Can we include them in this ticket?

Concerning the approach described above (take information about the replica LSN from the broker and use it to restrict the PITR boundary, to prevent GC from removing layers which may still be accessed by the replica): there are two kinds of LSNs maintained by the SK: the last committed LSN returned in responses to append requests, and the triple of LSNs (write/flush/apply) included in hot-standby feedback and collected by the SK as the minimum over all subscribers (PS and replicas). I wonder if the broker can already provide access to both of these LSNs (see the illustrative sketch below). @arssher?
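
For reference, the two kinds of positions described above, in an illustrative shape (field names are invented; the actual safekeeper feedback types are defined in the Neon source):

```rust
/// Position the safekeeper tracks for the committed WAL stream:
/// the last LSN acknowledged as committed ("commit_lsn").
struct CommitPosition {
    commit_lsn: u64,
}

/// The write/flush/apply triple from standby feedback, collected as a
/// minimum over all subscribers (pageserver and replicas) of this safekeeper.
struct StandbyFeedback {
    write_lsn: u64, // WAL received by the standby
    flush_lsn: u64, // WAL durably flushed by the standby
    apply_lsn: u64, // WAL replayed (visible to queries) on the standby
}
```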

@arssher
Contributor

arssher commented Jan 10, 2024

I wonder if the broker can already provide access to both of these LSNs.

Not everything is published right now, but this is trivial to add; see SafekeeperTimelineInfo.

@vadim2404
Contributor Author

status update: in review

@vadim2404
Contributor Author

to review it with @MMeent

@vadim2404
Contributor Author

@arssher to review the PR

@ItsWadams

Hey All - a customer just asked about this in an email thread with me about pricing. Are there any updates we can provide them?

@vadim2404
Contributor Author

The problem was identified, and @knizhnik is working on fixing it.

But the fix requires time because it affects compute, safekeeper, and pageserver. I suppose in February, we will merge it and ship it.

@YanicNeon

We got a support case about this problem today (ZD #2219)

Keeping an eye on this thread

@acervantes23

@knizhnik what's the latest status on this issue?

@knizhnik
Contributor

knizhnik commented Feb 7, 2024

@knizhnik what's the latest status on this issue?

There is PR #6357 waiting for one more round of review.
There were also some problems with e2e tests: https://neondb.slack.com/archives/C03438W3FLZ/p1706868624273839
which are not yet resolved and where I need some help from somebody familiar with e2e tests.

@knizhnik
Contributor

Yet, later it's going to be switched to fully static ROs

What do you actually mean by "fully static ROs"?
Right now it is possible to start a static replica by means of the CLI, but not through the UI.
It still requires a branch. The main differences from a normal (hot-standby) replica are:

  • they do not have a connection to the primary (safekeeper)
  • they do not need to get information about running xacts at startup
    As far as I understand, this is not possible through the UI either.

These "ephemeral endpoints" or "static replicas" still require a separate Postgres instance (pod/VM) and a separate timeline/task at the PS. In principle, creating a temporary branch for static replicas is not strictly needed: their get_page@lsn requests could be served by the PS from the original timeline. But branch creation allows pinning a particular LSN horizon and protecting that data from GC.
Also, it looks like having an extra tokio task at the PS is cheap, so there is no strong motivation to avoid branch creation.

What IMHO would be really useful is to allow time travel without spawning a separate compute. In that case we could access different time slices in the same Postgres cluster. But it seems to be non-trivial because CLOG and other SLRUs are currently accessed locally, so it is hard to provide versioning for them.

@ololobus
Member

What do you actually mean by "fully static ROs"?

I meant that we will start static computes pinned to a specific LSN. Right now, it's turned off in cplane, so for a branch@LSN compute, we first need to create a temporary branch at this LSN and then start a 'normal' RO on it. IIRC, the problem was races with GC. Once we have leases, we can turn static compute usage in cplane back on.

@ololobus
Member

ololobus commented Jul 16, 2024

This week:

@ololobus
Member

ololobus commented Jul 30, 2024

This week:

Heikki's proposal for RO starts and pageserver GC races -- we can create a new 'ephemeral' branch + static endpoint

@stepashka stepashka changed the title Epic: physical replication Epic: stabilize physical replication Jul 30, 2024
@stepashka
Member

Once the lag metric looks good, please ping the DBaaS team, e.g. on #proj-observability-for-users, about the metrics we can add to the UI 🙏
cc @lpetkov @seymourisdead

@ololobus
Member

ololobus commented Aug 6, 2024

This week:

We can postpone #8484. The most recent case: https://neondb.slack.com/archives/C03H1K0PGKH/p1722631550388579

Side note for #8484: oldestActiveXid wasn't persisted on the pageserver; now it is. Fast shutdown plus availability checks help over time. We also have a clear recovery path -- start/restart the RW. Currently it doesn't look like we need to rush #8484.

@tristan957
Member

I have changed the dashboard to also expose lag in seconds.

@ololobus
Member

ololobus commented Aug 13, 2024

This week:

  • Heikki: finish RO RFC (Add retroactive RFC about physical replication #8546)
  • Tristan: add filters to dashboard with size and lag cutoff
  • Tristan: also filter all logs with ERROR + replication
  • Alexey: propose to Polina to add UI for hot_standby_feedback
  • Alexey: consider adding max_standby_archive_delay / max_standby_streaming_delay to allow list

hlinnaka added a commit that referenced this issue Aug 20, 2024
Protocol version 2 has been the default for a while now, and we no
longer have any computes running in production that used protocol
version 1. This completes the migration by removing support for v1 in
both the pageserver and the compute.

See issue #6211.
@ololobus
Member

ololobus commented Aug 20, 2024

This week:

  • Add new receive/replay metrics (Add compute_receive_lsn metric #8750)
  • Add new panels to per endpoint dashboard
  • Investigate replication-related errors like ERROR: cannot advance replication slot to 0/7B56EA8, minimum is 0/84544E0

hlinnaka added a commit that referenced this issue Aug 27, 2024
Protocol version 2 has been the default for a while now, and we no
longer have any computes running in production that used protocol
version 1. This completes the migration by removing support for v1 in
both the pageserver and the compute.

See issue #6211.
@ololobus ololobus assigned tristan957 and unassigned hlinnaka Sep 17, 2024
@ololobus
Member

ololobus commented Sep 17, 2024

This week:

@ololobus
Member

This week:

@ololobus
Member

This week:

@ololobus
Member

This week:

@ololobus
Member

This week:

@ololobus
Member

ololobus commented Nov 5, 2024

This week:

@knizhnik
Contributor

knizhnik commented Nov 12, 2024

#9457 - all PRs are approved, but not merged yet because of problems with CI (test_aux_not_logged_at_replica is flaky)
#9023 - adds yet another negative test illustrating lock space exhaustion. Still waiting for approval (@hlinnaka, please have a look).
#9553 - merged and included in the next compute release

@ololobus
Member

ololobus commented Nov 19, 2024

This week:

@ololobus
Member

ololobus commented Nov 26, 2024

This week:

@ololobus
Member

ololobus commented Dec 3, 2024

This week:

@ololobus ololobus assigned ololobus and unassigned knizhnik Dec 3, 2024
@ololobus
Member

This week:
