
Increased cpu/gc usage in event persistence workers in 1.91.0rc1 #16190

Closed

tulir opened this issue Aug 26, 2023 · 20 comments
Labels
  • A-Performance: Performance, both client-facing and admin-facing
  • A-Workers: Problems related to running Synapse in Worker Mode (or replication)
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
  • X-Regression: Something broke which worked on a previous release

Comments

@tulir
Member

tulir commented Aug 26, 2023

Description

After updating to 1.91.0rc1, the CPU usage of my event persisters started growing steadily. I can't find any obvious reason for it, but the GC charts are also way up:

[screenshots: CPU and GC charts]

I just restarted the workers and the CPU usage went back down to normal. I'll see tomorrow if it starts climbing again.

Steps to reproduce

  • Update to 1.91.0rc1
  • Observe event persistence worker CPU usage increasing steadily over a few days

Homeserver

maunium.net

Synapse Version

1.91.0rc1

Database

PostgreSQL

Workers

Multiple workers

Platform

Custom docker image

@cremesk
Contributor

cremesk commented Aug 26, 2023

I can also observe the same behavior here.

[screenshots]

@jameskitt616

jameskitt616 commented Aug 28, 2023

[screenshots]

I've got similar behaviour, but I'm currently on 1.90.0. I don't remember exactly when I upgraded to which version over the past few days, but you can clearly see the CPU usage slowly growing.

@fir3drag0n

Same problem here with Docker:

[screenshot]

@tulir
Member Author

tulir commented Aug 28, 2023

@jameskitt616 this started specifically on 1.91.0rc1, so your issue is likely unrelated (it might be #16101 considering the network traffic, but you should use Synapse's Prometheus metrics and Grafana dashboard to confirm).
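
For anyone who hasn't wired those dashboards up yet, here is a minimal sketch of exposing Synapse's metrics to Prometheus, based on the documented metrics listener; the port, bind address and job name are placeholders, and when running workers each worker needs its own metrics listener in its worker config:

# homeserver.yaml: expose a metrics listener (port/address are examples)
enable_metrics: true
listeners:
  - port: 9000
    type: metrics
    bind_addresses: ["127.0.0.1"]

# prometheus.yml: scrape that listener so the Synapse Grafana dashboard has data
scrape_configs:
  - job_name: "synapse"
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["127.0.0.1:9000"]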

@clokep
Member

clokep commented Aug 28, 2023

Any particular jump in the Per-block metrics section of the dashboard or does it all seem to be due to increased GCs? I'm assuming the Python version was not changed during this and the jump was 1.90.0 -> 1.91.0rc1 -- is that correct?

Looking at matrix.org, I'm seeing decreased CPU usage on event persisters starting from the update.

@tulir
Member Author

tulir commented Aug 28, 2023

Python version has been 3.11 for a while now, so there shouldn't be any changes there.

Nothing jumps out in the per-block metrics; there was a spike soon after the upgrade, but it was fairly short:

[screenshots: per-block metrics]

GC from the same period (I restarted the two workers when I filed this issue; the times went back down and then started climbing again):

[screenshots: GC charts]

@MatMaul
Contributor

MatMaul commented Aug 29, 2023

On matrix.org, we can see a change in the GC behavior, but not a degradation with increasing CPU usage like yours.

[screenshot: GC metrics on matrix.org]

It seems like the GC is running less often but with more objects collected on each pass.

Reporters: could you let us know whether you have presence enabled?
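
For anyone checking their configuration, presence is controlled from homeserver.yaml; a minimal sketch of explicitly disabling it (it is enabled by default):

presence:
  enabled: false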

@cremesk
Contributor

cremesk commented Aug 29, 2023

@MatMaul I have presence disabled here and still see the issue
(see my comment above: #16190 (comment))

@DMRobertson
Contributor

DMRobertson commented Aug 29, 2023

Tricky to advise without being able to reproduce it on our end. If anyone is able to, it would be super useful to bisect through Synapse versions to see when this was introduced. From tulir's graphs, it looks like the GC frequency is a good indicator of the performance regression (i.e. it doesn't take days for it to show up).

@cremesk
Contributor

cremesk commented Aug 29, 2023

I saw the misbehaviour right after upgrading from 1.90.0 to 1.91.0rc1. I think the same behaviour will be observed on matrix.org as soon as it upgrades too.

@DMRobertson
Contributor

Sorry, I misspoke: I should have said "bisect through Synapse commits" instead of "versions".

Matrix.org is running v1.91.0rc1, except that its version string has not been updated to reflect this, and per #16190 (comment) we do not see the same behaviour.

@cremesk
Contributor

cremesk commented Aug 29, 2023

Okay, I'm surprised that you can't see the problem on matrix.org; here it clearly started with the version upgrade.

Since I use the pre-built deb package from packages.matrix.org, it is a bit cumbersome for me to debug which commit introduced the issue.

[screenshots]

DMRobertson added the A-Performance, A-Workers, X-Regression and T-Defect labels on Aug 29, 2023
@warioishere

I also saw increasing CPU usage and disk IO, but already since 1.90 stable; I didn't upgrade to 1.91rc.

After I added this to homeserver.yaml I instantly saw a positive impact. I don't know whether this is related to the issue on 1.91rc:

federation:
  # Back off harder from unreachable federation destinations:
  # start retrying after 10 minutes, multiply the interval by 5 on each
  # further failure, and never wait longer than a year between retries.
  destination_min_retry_interval: 10m
  destination_retry_multiplier: 5
  destination_max_retry_interval: 365d

[screenshot: matrix-cpu-usage]

@DMRobertson
Contributor

Other avenues of investigation:

  • is anyone seeing this with opentracing (Jaeger) configured? If we can find an unusually slow request, we might be able to see where that extra CPU time is spent (see the config sketch after this list).
  • is this still happening on develop? I'd be surprised if it has been fixed, but some bugfixes have landed.
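
For the opentracing avenue, a minimal sketch of the documented homeserver.yaml opentracing section, assuming Synapse was installed with its opentracing/Jaeger dependencies and a Jaeger agent is reachable; the whitelisted homeserver name and sampler settings are placeholders:

opentracing:
  enabled: true
  # Only requests involving these homeservers are traced.
  homeserver_whitelist:
    - "example.com"
  jaeger_config:
    sampler:
      type: const
      param: 1
    logging: true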

@jaywink
Member

jaywink commented Aug 31, 2023

We're possibly getting hit by this in EMS as well, for a lot of hosts, and have proceeded with a rollback out of caution. Some examples with internal host IDs:

ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live, showing post-update event send time and event persister CPU rise and eventual rollback to v1.90.0:

[screenshot]

GC for the same host:

[screenshot]

Host c2e1fb80-f9a6-11ec-b22e-319e919febc8-live, which is way less busy but also shows event send times degrading:

[screenshot]

GC has the same trend:

[screenshot]

It does not look like all hosts have suffered in the same way. Some hosts with an event persister have their usual metrics pattern in event send times, CPU and GC.

@jaywink
Member

jaywink commented Aug 31, 2023

I'm also seeing this pattern on some hosts without an event persister, for example 94922e70-d385-11e9-bee2-2b7ad79f31cb-live, increasing GC until restart on 1.90.0:

[screenshot]

Event send time also regressed but CPU did not:

[screenshot]

This Synapse deployment has no workers, and on EMS we've seen quite a few deployments without workers alerting on event send times regressing.

@DMRobertson
Contributor

Matrix.org is running v1.91.0rc1, except that its version string has not been updated to reflect this, and per #16190 (comment) we do not see the same behaviour.

This wasn't the whole truth 😞; we have a working hypothesis as to the cause.

Current thinking is that #16220 will fix this. If we can verify this I think we'll likely put out a bugfix release early next week.

@tulir
Member Author

tulir commented Sep 1, 2023

Cherry-picked that onto maunium.net and it seems to be fixed; the GC frequency of the event persisters has stayed relatively flat at around 0.2 Hz for 2 hours now.
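
For anyone wanting to track the same signal, a sketch of a Prometheus recording rule for GC frequency; it assumes the GC counter exposed by the standard prometheus_client collector (python_gc_collections_total), which may not be exactly the metric your dashboard panel uses:

groups:
  - name: synapse-gc
    rules:
      # Approximate GC passes per second over a 5-minute window.
      - record: instance:python_gc_collections:rate5m
        expr: rate(python_gc_collections_total[5m])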

@tulir
Member Author

tulir commented Sep 2, 2023

[screenshots]

@DMRobertson
Contributor

DMRobertson commented Sep 4, 2023

Thanks for cherry-picking and confirming, @tulir. Let's consider this closed by that change. I will prepare a 1.91.1 patch release now.
