Epic: move outbound logical replication out of Beta #6213

Open
18 of 21 tasks
stepashka opened this issue Dec 21, 2023 · 30 comments
Labels: c/compute (Component: compute, excluding postgres itself), t/Epic (Issue type: Epic)

Comments

@stepashka
Member

stepashka commented Dec 21, 2023

DoD

Logical replication is no longer in beta on Neon, and wal_level = logical can be enabled by default on all projects on the Neon platform.

Tasks & bugs to fix

  1. c/compute
    save-buffer
  2. c/compute
    save-buffer
  3. c/compute
    knizhnik
  4. c/compute
    save-buffer
  5. c/compute t/bug
  6. t/feature
    save-buffer
  7. t/bug
  8. c/compute c/storage/pageserver t/bug triaged
  9. c/compute c/storage/safekeeper
  10. knizhnik
  11. tristan957
  12. c/compute external t/bug
    knizhnik

Follow-ups (out of scope):

Other related tasks and Epics

https://neondb.slack.com/archives/C04DGM6SMTM/p1703091242312799

@stepashka stepashka added t/Epic Issue type: Epic c/compute Component: compute, excluding postgres itself labels Dec 21, 2023
@vadim2404
Contributor

@arssher , will your fixes help with first two items?

@arssher
Contributor

arssher commented Jan 2, 2024

slot may disappear on restart (hard to reproduce, only occurred once)

I still don't know what that was. Stas tested manually and observed this once. There is a known path by which a slot might be lost, but it is highly unlikely (the endpoint is killed before the logical message is committed to safekeepers), and Stas's case wasn't like that. This needs more testing and a reproduction.

in one case replication wasn't able to read WAL (hard to reproduce)

A more accurate description would be 'if a slot is lagging, replication might fail on compute start until the whole tail is downloaded'. We merged a cap on the maximum allowed lag, but to really fix it we need to bring on-demand WAL download from safekeepers to logical walsenders. We recently merged the core patch:
#5948
but using it in logical walsenders is a separate step. It shouldn't be hard, but I haven't started on it.
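
For readers who want to reason about what "slot lag" means in bytes, here is a minimal sketch. It only relies on the textual LSN format PostgreSQL prints (two hexadecimal halves separated by a slash). The function names and the `max_lag_bytes` parameter are hypothetical and are not taken from the Neon code base; the actual cap may be implemented differently.

```rust
/// Parse a PostgreSQL LSN in its textual "hi/lo" hexadecimal form
/// (for example "0/16B3748") into a 64-bit WAL byte position.
fn parse_lsn(text: &str) -> Option<u64> {
    let (hi, lo) = text.trim().split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some((u64::from(hi) << 32) | u64::from(lo))
}

/// Bytes of WAL between a slot's restart position and the current flush
/// position. Returns 0 if the slot is already ahead of `flush_lsn`.
fn slot_lag_bytes(restart_lsn: &str, flush_lsn: &str) -> Option<u64> {
    let restart = parse_lsn(restart_lsn)?;
    let flush = parse_lsn(flush_lsn)?;
    Some(flush.saturating_sub(restart))
}

/// Hypothetical policy check: is the slot further behind than the cap?
/// `max_lag_bytes` is an illustrative parameter, not a real setting name.
fn exceeds_lag_cap(restart_lsn: &str, flush_lsn: &str, max_lag_bytes: u64) -> Option<bool> {
    Some(slot_lag_bytes(restart_lsn, flush_lsn)? > max_lag_bytes)
}
```

The two inputs would typically come from `pg_replication_slots.restart_lsn` and `pg_current_wal_flush_lsn()` on the publisher.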

@vadim2404
Contributor

@arssher will you work on on-demand WAL download in walsenders? Is it a part of this epic [will it block announcing GA for logical replication]?

@kelvich

slot may disappear on restart (hard to reproduce, only occurred once)
I think nobody has been able to reproduce it so far. Reasonable question: is it a part of the epic's scope?

@andreasscherbaum andreasscherbaum changed the title Epic: move logical replication out of Beta Epic: move outbound logical replication out of Beta Feb 27, 2024
@andreasscherbaum
Contributor

Renamed to "outbound" logical replication. When this Epic was started, it was "only" logical replication, but now it's two different types of replication.

@andreasscherbaum
Contributor

Discussion with Stas: improve pageserver performance first

@ololobus ololobus assigned save-buffer and unassigned arssher Jul 9, 2024
@ololobus
Member

ololobus commented Jul 9, 2024

This week:

  • Sasha: Finish the aux v2 rollout (last batch tomorrow morning)

@ololobus
Member

ololobus commented Jul 16, 2024

This week:

  • Waiting for the new compute image rollout

@ololobus
Member

This week:

  • Tristan: look at AUX files monitoring and limiting

@skyzh
Member

skyzh commented Aug 19, 2024

@tristan957 there is already monitoring on the storage side for aux v2

static AUX_FILE_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_aux_file_estimated_size",
        "The size of all aux files for a timeline in aux file v2 store.",
        &["tenant_id", "shard_id", "timeline_id"]
    )
    .expect("failed to define a metric")
});

So there is already a total aux size metric per timeline.

@ololobus
Member

ololobus commented Aug 20, 2024

This week:

@ololobus
Member

ololobus commented Aug 27, 2024

This week:

  • Alexey: Expose snap files count metric and add to dashboards
  • Anastasia: look at LR test failures

@ololobus
Member

ololobus commented Sep 3, 2024

This week:

  • Look at LR test failures (Tristan said that he will have a look on Thu)

@ololobus
Member

@ololobus
Member

ololobus commented Sep 24, 2024

This week:

@kelvich
Contributor

kelvich commented Oct 8, 2024

@tristan957 added tests for LR metrics in compute. PGv17 had broken metrics. Plan: get list of metrics from AWS.

https://github.com/neondatabase/cloud/issues/17261 failed due to ENOSPC; something else is going on there. Nikita is helping with it.

@ololobus
Member

https://github.com/neondatabase/cloud/issues/17261 failed due to ENOSPC; something else is going on there. Nikita is helping with it.

@tristan957 any progress with this?

So the original problem was that the test was passing at first, and then regressed and started failing with ENOSPC. That was already known at the beginning of Sept.

The plan we discussed was

  1. Repro this failure on staging (just manually rerun test)
  2. Watch disk usage and figure out what is eating the disk space
  3. Is it valid disk usage? Then we have two options
    3.1. It's something we want to fix -> investigate further and discuss/propose the fix
    3.2. Yes, it's OK -> bump the compute size and add fixed sized compute flag
  4. Start running the test again
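
For step 2, a rough sketch of one way to see what is eating the disk: sum file sizes under each direct child of a directory and sort largest first. This is only an illustration using the Rust standard library; the function names are made up here, it reports apparent file sizes rather than allocated blocks, and on a live compute files can disappear mid-walk and surface as I/O errors.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively sum the sizes of regular files under `path`.
/// Symlinks and special files are skipped and count as 0 bytes.
fn dir_size(path: &Path) -> io::Result<u64> {
    let meta = fs::symlink_metadata(path)?;
    if meta.is_file() {
        return Ok(meta.len());
    }
    if !meta.is_dir() {
        return Ok(0);
    }
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        total += dir_size(&entry?.path())?;
    }
    Ok(total)
}

/// Size of each direct child of `root`, largest first.
fn largest_children(root: &Path) -> io::Result<Vec<(PathBuf, u64)>> {
    let mut sizes = Vec::new();
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        let size = dir_size(&path)?;
        sizes.push((path, size));
    }
    sizes.sort_by(|a, b| b.1.cmp(&a.1));
    Ok(sizes)
}
```

Calling `largest_children` on the data directory at intervals while the test runs should show which subdirectory grows.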

Which item did we get stuck at? Or am I missing something and we got some additional context here?

@tristan957
Member

So the original problem was that the test was passing at first, and then regressed and started failing with ENOSPC. That was already known at the beginning of Sept.

I think there was some miscommunication between Stas and myself here. The current problem is that the publisher endpoint will not even start at the moment, which is different from the ENOSPC issue we were previously running into.

@tristan957 any progress with this?

Last week, I was talking to Nikita K. about what is going on here because the endpoint was seemingly stuck. He and I came to the conclusion that the compute was failing to retrieve the basebackup from the pageserver due to some AUX files issues. After talking to the storage team, we determined that we should wait for Chi to come back from vacation to get his thoughts.

@ololobus
Member

ololobus commented Oct 15, 2024

This week:

@ololobus
Member

ololobus commented Oct 22, 2024

This week:

@ololobus
Member

ololobus commented Oct 29, 2024

This week:

@ololobus
Member

ololobus commented Nov 5, 2024

This week:

@tristan957
Member

tristan957 commented Nov 12, 2024

@ololobus
Member

ololobus commented Nov 19, 2024

@tristan957
Member

tristan957 commented Nov 26, 2024

Docs for LR benchmarks: https://github.com/neondatabase/docs/pull/263
PR for logical snapshots size: #9887

Looks like tests are failing for the PR, so I'll check those out today.

Muhammet is working with the Proxy team to fix the new LR benchmarking failures.

@ololobus
Member

@tristan957 finalize the default for #8619. 300 snap files looks like a small number.

Why do we have 10-20k files in prod if the limit is 300? https://neonprod.grafana.net/d/edr9erae8dipse/compute-replication-overview?orgId=1&refresh=1m
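
To cross-check the dashboard number against a given compute, something like the sketch below could count the snapshot files directly. It assumes the standard PostgreSQL layout, where serialized logical decoding snapshots live in `pg_logical/snapshots` as `*.snap` files; the function names and the `limit` parameter are hypothetical and not the actual setting from #8619.

```rust
use std::ffi::OsStr;
use std::fs;
use std::io;
use std::path::Path;

/// Count regular files with a `.snap` extension directly inside `dir`.
/// Temporary files such as `*.snap.<pid>.tmp` are not counted.
fn count_snap_files(dir: &Path) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let is_snap = entry.path().extension() == Some(OsStr::new("snap"));
        if is_snap && entry.file_type()?.is_file() {
            count += 1;
        }
    }
    Ok(count)
}

/// Hypothetical check: does the directory hold more snap files than `limit`?
fn over_limit(dir: &Path, limit: usize) -> io::Result<bool> {
    Ok(count_snap_files(dir)? > limit)
}
```

Pointing `count_snap_files` at `<PGDATA>/pg_logical/snapshots` on an affected endpoint would show whether the count really exceeds the configured limit or whether the metric is counting something else.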

@tristan957
Member

PR to bump default value to 10k: #9896.

@tristan957
Member

Why do we have 10-20k files in prod if the limit is 300? neonprod.grafana.net/d/edr9erae8dipse/compute-replication-overview?orgId=1&refresh=1m

See #9976. It seems like this problem is exclusive to PG 17.

#9896 was merged and will be rolled out this week.

@ololobus
Member

ololobus commented Dec 3, 2024

@ololobus
Member

This week:

@ololobus
Member

ololobus commented Dec 17, 2024

This week:
