
Performance Regression in 1.63 #13331

Closed
davidmehren opened this issue Jul 19, 2022 · 5 comments · Fixed by #13332

Comments

@davidmehren

Description

After updating to Synapse 1.63.0 today, one of my Synapse instances experiences a noticeable performance regression:
[screenshot: Grafana metrics showing the regression]

According to the metrics (snapshot at https://fsr-ops.cs.tu-dortmund.de/dashboard/snapshot/PyUD0nsC3zYGM3AyCOrwWOOWTbdRsXZo?orgId=0), handle_new_client_event and action_for_event_by_user now consume significantly more CPU and database resources.

#13100 and #13078 seem to have touched these functions for 1.63, so maybe they are the culprits?

Steps to reproduce

  • Update to Synapse 1.63.0
  • Observe worse performance in Grafana

Homeserver

fachschaften.org

Synapse Version

1.63.0

Installation Method

Docker (matrixdotorg/synapse)

Platform

Docker on Ubuntu 20.04 in LXC

Relevant log output

I didn't see anything relevant in the log output (and there is far too much to paste everything); the Grafana snapshot is hopefully more helpful than a huge amount of logs.

Anything else that would be useful to know?

No response

@squahtx
Contributor

squahtx commented Jul 19, 2022

I wonder if the default size of the get_user_in_room_with_profile cache is too small. Does the "Top 10 cache misses" chart look any different before and after the upgrade?

@davidmehren
Author

@squahtx that looks like a good guess!
[screenshot: "Top 10 cache misses" chart after the upgrade]

@squahtx
Contributor

squahtx commented Jul 19, 2022

The size of that cache can be controlled by adding an entry to per_cache_factors in the config. The config can then be reloaded by sending SIGHUP to Synapse.

It sounds like the default size for that cache is too small and we should ship with a larger default. I'd be interested in seeing what a good cache factor for your deployment turns out to be.
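
For reference, a minimal sketch of what such an override could look like in homeserver.yaml (the cache name is the one quoted above, and the 2.0 factor is purely illustrative, not a recommendation):

```yaml
# homeserver.yaml -- illustrative excerpt, not a recommended configuration
caches:
  # Multiplier applied to all cache sizes unless a per-cache override is set.
  global_factor: 1.0
  per_cache_factors:
    # Override just this cache; a factor of 2.0 doubles its default size.
    get_user_in_room_with_profile: 2.0
```

After editing the config, the new factors can be picked up without a restart by sending SIGHUP to the Synapse process; for a Docker deployment, something like `docker kill --signal=HUP <container>` should have the same effect, depending on how the container runs Synapse.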

@davidmehren
Author

I doubled the cache factor from our global default of 1 to 2 and will observe the metrics for a bit. Thanks for the prompt help!

@davidmehren
Author

1.63.1 seems to have fixed the problem; load and event send times are back to normal immediately after upgrading.
