
Increased cpu/gc usage in event persistence workers in 1.91.0rc1 #16190

Closed

tulir opened this issue Aug 26, 2023 · 20 comments
Labels
  • A-Performance: Performance, both client-facing and admin-facing
  • A-Workers: Problems related to running Synapse in Worker Mode (or replication)
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
  • X-Regression: Something broke which worked on a previous release

Comments

@tulir
Member

tulir commented Aug 26, 2023

Description

After updating to 1.91.0rc1, the CPU usage of my event persisters started growing steadily. I can't find any obvious reason for it, but the GC charts are also way up:

[screenshots: CPU and GC charts]

I just restarted the workers and the CPU usage went back down to normal. I'll see tomorrow if it starts climbing again.

Steps to reproduce

  • Update to 1.91.0rc1
  • Observe event persistence worker CPU usage increasing steadily over a few days

Homeserver

maunium.net

Synapse Version

1.91.0rc1

Database

PostgreSQL

Workers

Multiple workers

Platform

Custom docker image

@cremesk
Contributor

cremesk commented Aug 26, 2023

I can also observe the same behavior here.

[screenshots]

@jameskitt616

jameskitt616 commented Aug 28, 2023

[screenshots]

I've got similar behaviour, but I'm currently on 1.90.0. I don't remember exactly when I upgraded to which version over the past few days, but you can clearly see the CPU usage slowly growing.

@fir3drag0n

Same problem here with Docker:

[screenshot]

@tulir
Member Author

tulir commented Aug 28, 2023

@jameskitt616 this started specifically on 1.91.0rc1, so your issue is likely unrelated (it might be #16101 considering the network traffic, but you should use Synapse's Prometheus metrics and Grafana dashboard to confirm).
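
For anyone who hasn't wired those dashboards up yet, here is a minimal sketch of exposing Synapse's metrics to Prometheus, based on the documented metrics listener; the port, bind address and job name are placeholders, and when running workers each worker needs its own metrics listener in its worker config:

# homeserver.yaml: expose a metrics listener (port/address are examples)
enable_metrics: true
listeners:
  - port: 9000
    type: metrics
    bind_addresses: ["127.0.0.1"]

# prometheus.yml: scrape that listener so the Synapse Grafana dashboard has data
scrape_configs:
  - job_name: "synapse"
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["127.0.0.1:9000"]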

@clokep
Member

clokep commented Aug 28, 2023

Any particular jump in the Per-block metrics section of the dashboard or does it all seem to be due to increased GCs? I'm assuming the Python version was not changed during this and the jump was 1.90.0 -> 1.91.0rc1 -- is that correct?

Looking at matrix.org, I'm seeing decreased CPU usage on event persisters starting from the update.

@tulir
Member Author

tulir commented Aug 28, 2023

Python version has been 3.11 for a while now, so there shouldn't be any changes there.

Nothing jumps out in the per-block metrics; there was a spike soon after the upgrade, but it was fairly short:

[screenshots: per-block metrics]

GC from the same period (I restarted the two workers when I filed this issue; the times went back down and then started climbing again):

[screenshots: GC charts]

@MatMaul
Contributor

MatMaul commented Aug 29, 2023

On matrix.org, we can see a change in the GC behavior, but not a degradation with increasing CPU usage like yours.

[screenshot: GC metrics on matrix.org]

It seems like the GC is running less often but with more objects collected on each pass.

Reporters: could you let us know whether you have presence enabled?
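
For anyone checking their configuration, presence is controlled from homeserver.yaml; a minimal sketch of explicitly disabling it (it is enabled by default):

presence:
  enabled: false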

@cremesk
Contributor

cremesk commented Aug 29, 2023

@MatMaul I have presence disabled here and still see the issue
(see my comment above: #16190 (comment))

@DMRobertson
Contributor

DMRobertson commented Aug 29, 2023

Tricky to advise without being able to reproduce it on our end. If anyone is able to, it would be super useful to bisect through Synapse versions to see when this was introduced. From tulir's graphs, it looks like the GC frequency is a good indicator of the performance regression (i.e. it doesn't take days for it to show up).

@cremesk
Contributor

cremesk commented Aug 29, 2023

I saw the misbehaviour right after upgrading from 1.90.0 to 1.91.0rc1. I think the same behaviour will be observed on matrix.org as soon as it upgrades too.

@DMRobertson
Contributor

Sorry, I misspoke: I should have said "bisect through Synapse commits" instead of "versions".

Matrix.org is running v1.91.0rc1, except that its version string has not been updated to reflect this, and per #16190 (comment) we do not see the same behaviour.

@cremesk
Contributor

cremesk commented Aug 29, 2023

Okay, I'm surprised that you can't see the problem on matrix.org; here it clearly started with the version upgrade.

Since I use the pre-built deb package from packages.matrix.org, it is a bit cumbersome for me to debug which commit introduced the issue.

[screenshots]

DMRobertson added the A-Performance, A-Workers, X-Regression and T-Defect labels on Aug 29, 2023
@warioishere

I also saw increasing CPU usage and disk IO, but already since 1.90 stable; I didn't upgrade to 1.91rc.

After I added this to homeserver.yaml I instantly saw a positive impact. I don't know whether this is related to the issue on 1.91rc:

federation:
  # Back off harder from unreachable federation destinations:
  # start retrying after 10 minutes, multiply the interval by 5 on each
  # further failure, and never wait longer than a year between retries.
  destination_min_retry_interval: 10m
  destination_retry_multiplier: 5
  destination_max_retry_interval: 365d

[screenshot: matrix-cpu-usage]

@DMRobertson
Contributor

Other avenues of investigation:

  • is anyone seeing this with opentracing (Jaeger) configured? If we can find an unusually slow request, we might be able to see where that extra CPU time is spent (see the config sketch after this list).
  • is this still happening on develop? I'd be surprised if it has been fixed, but some bugfixes have landed.
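
For the opentracing avenue, a minimal sketch of the documented homeserver.yaml opentracing section, assuming Synapse was installed with its opentracing/Jaeger dependencies and a Jaeger agent is reachable; the whitelisted homeserver name and sampler settings are placeholders:

opentracing:
  enabled: true
  # Only requests involving these homeservers are traced.
  homeserver_whitelist:
    - "example.com"
  jaeger_config:
    sampler:
      type: const
      param: 1
    logging: true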

@jaywink
Member

jaywink commented Aug 31, 2023

We're possibly getting hit by this in EMS as well, for a lot of hosts, and have proceeded with a rollback out of caution. Some examples with internal host IDs:

ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live, showing post-update event send time and event persister CPU rise and eventual rollback to v1.90.0:

[screenshot]

GC for the same host:

[screenshot]

Host c2e1fb80-f9a6-11ec-b22e-319e919febc8-live, which is way less busy but also shows event send times degrading:

[screenshot]

GC has the same trend:

[screenshot]

It does not look like all hosts have suffered in the same way. Some hosts with an event persister have their usual metrics pattern in event send times, CPU and GC.

@jaywink
Member

jaywink commented Aug 31, 2023

I'm also seeing this pattern on some hosts without an event persister, for example 94922e70-d385-11e9-bee2-2b7ad79f31cb-live, increasing GC until restart on 1.90.0:

[screenshot]

Event send time also regressed but CPU did not:

[screenshot]

This Synapse deployment has no workers, and on EMS we've seen quite a few deployments without workers alerting on event send times regressing.

@DMRobertson
Contributor

Matrix.org is running v1.91.0rc1, except that its version string has not been updated to reflect this, and per #16190 (comment) we do not see the same behaviour.

This wasn't the whole truth 😞; we have a working hypothesis as to the cause.

Current thinking is that #16220 will fix this. If we can verify this I think we'll likely put out a bugfix release early next week.

@tulir
Member Author

tulir commented Sep 1, 2023

Cherry-picked that onto maunium.net and it seems to be fixed; the GC frequency of the event persisters has stayed relatively flat at around 0.2 Hz for 2 hours now.
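
For anyone wanting to track the same signal, a sketch of a Prometheus recording rule for GC frequency; it assumes the GC counter exposed by the standard prometheus_client collector (python_gc_collections_total), which may not be exactly the metric your dashboard panel uses:

groups:
  - name: synapse-gc
    rules:
      # Approximate GC passes per second over a 5-minute window.
      - record: instance:python_gc_collections:rate5m
        expr: rate(python_gc_collections_total[5m])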

@tulir
Member Author

tulir commented Sep 2, 2023

[screenshots]

@DMRobertson
Contributor

DMRobertson commented Sep 4, 2023

Thanks for cherry-picking and confirming, @tulir. Let's consider this closed by that change. I will prepare a 1.91.1 patch release now.
