Epic: move outbound logical replication out of Beta #6213
Comments
@arssher, will your fixes help with the first two items? |
I still don't know what that was. Stas tested manually and observed this once. There is a known path by which a slot might be lost, but it is highly unlikely (endpoint killed before the logical message is committed to safekeepers), and Stas's case wasn't like that. This needs more testing and reproduction.
A more accurate description would be 'if a slot is lagging, replication might fail on compute start until the whole tail is downloaded'. We merged a cap on the maximum allowed lag, but to really fix it we need to bring on-demand WAL download from safekeepers to logical walsenders. We recently merged core patch: |
@arssher will you work on on-demand WAL download in walsenders? Is it part of this epic [will it block announcing GA for logical replication]?
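For illustration only, a minimal sketch of how a slot's lag could be computed and compared against a cap. The function names and the cap value are hypothetical and are not Neon's actual implementation; the only thing relied on is PostgreSQL's textual LSN format (two hexadecimal numbers, the high and low 32 bits, separated by a slash).

```python
def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '0/16B3748' into a 64-bit byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def slot_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between the slot's restart LSN and the current WAL position."""
    return max(0, parse_lsn(current_lsn) - parse_lsn(restart_lsn))


def exceeds_cap(current_lsn: str, restart_lsn: str, max_lag_bytes: int) -> bool:
    """True if the slot lags by more than the (hypothetical) allowed cap."""
    return slot_lag_bytes(current_lsn, restart_lsn) > max_lag_bytes


if __name__ == "__main__":
    # Hypothetical values: a slot that is 16 MiB behind, checked against an 8 MiB cap.
    print(slot_lag_bytes("0/2000000", "0/1000000"))
    print(exceeds_cap("0/2000000", "0/1000000", 8 * 1024 * 1024))
```

On a real PostgreSQL instance the two inputs would typically come from `pg_current_wal_lsn()` and the `restart_lsn` column of `pg_replication_slots`.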
|
Renamed to "outbound" logical replication. When this Epic was started, it was "only" logical replication, but now it's two different types of replication. |
Discussion with Stas: improve pageserver performance first |
This week:
|
This week:
|
This week:
|
@tristan957 there is already monitoring on the storage side for aux v2: neon/pageserver/src/metrics.rs, lines 588 to 595 in 6949b45.
There is a total aux size metric per timeline. |
This week: |
This week:
|
This week:
|
This week: |
This week:
|
@tristan957 added tests for LR metrics in compute. PGv17 had broken metrics. Plan: get the list of metrics from AWS. https://github.com/neondatabase/cloud/issues/17261 failed due to ENOSPC; something else is going on there. Nikita is helping with it. |
@tristan957 any progress with this? The original problem was that the test was passing at first, then regressed and started failing with ENOSPC. That was already known at the beginning of September. The plan we discussed was
What was the item we got stuck at? Or am I missing something, and did we get some additional context here? |
I think there was some miscommunication between Stas and myself here. The current problem is that the publisher endpoint will not even start at the moment, which is different from the ENOSPC issue we were previously running into.
Last week, I was talking to Nikita K. about what is going on here because the endpoint was seemingly stuck. He and I came to the conclusion that the compute was failing to retrieve the basebackup from the pageserver due to some AUX file issues. After talking to the storage team, we determined that we should wait for Chi to come back from vacation to get his thoughts. |
This week:
|
This week:
|
This week:
|
This week:
|
|
This week:
|
Docs for LR benchmarks: https://github.com/neondatabase/docs/pull/263 Looks like tests are failing for the PR, so I'll check those out today. Muhammet is working with the Proxy team to fix the new LR benchmarking failures. |
@tristan957 finalize the default for #8619. 300 snap files looks like a small number. Why do we have 10-20k files in prod if the limit is 300? https://neonprod.grafana.net/d/edr9erae8dipse/compute-replication-overview?orgId=1&refresh=1m |
PR to bump default value to 10k: #9896. |
See #9976. It seems like this problem is exclusive to PG 17. #9896 was merged and will be rolled out this week. |
This week:
|
This week:
|
This week:
|
DoD
Logical replication is no longer in beta on Neon, and wal_level = logical can be enabled by default on all projects on the Neon platform.
Tasks & bugs to fix
FOR ALL TABLES in logical replication #6229
Follow-ups (out of scope):
Other related tasks and Epics
https://neondb.slack.com/archives/C04DGM6SMTM/p1703091242312799