This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Latest Synapse 1.70.0rc1 (2022-10-19) release seems to have regressed /messages response time #14250

Closed
MadLittleMods opened this issue Oct 20, 2022 · 3 comments
Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) A-Performance Performance, both client-facing and admin-facing O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Regression Something broke which worked on a previous release

Comments

@MadLittleMods
Contributor

MadLittleMods commented Oct 20, 2022

I think we rolled out the latest Synapse 1.70.0rc1 (2022-10-19) to matrix.org this morning. Looking at the graphs (the spike on the right), I noticed there seems to be a big uptick in /messages response times.

We should check to see whether this persists tomorrow or is just a blip.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=now-4d&to=now

It looks like we're seeing a major regression in the before backfill processing time (synapse/handlers/federation.py#L197-L503), and some regression in the after backfill processing time (synapse/handlers/federation_event.py#L643-L657).
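To make the phase split concrete, here's a rough sketch of how I think about the timing (this is not the actual Synapse code, which lives in the files linked above; the helpers here are made-up stand-ins that just sleep to simulate work):

```python
import asyncio
import time


# Made-up stand-ins for the real code paths linked above.
async def decide_backfill_points(room_id):           # "before backfill" work
    await asyncio.sleep(0.1)
    return ["$event_a", "$event_b"]


async def send_backfill_request(room_id, points):    # outgoing federation /backfill
    await asyncio.sleep(0.5)
    return [{"event_id": "$pulled_1"}, {"event_id": "$pulled_2"}]


async def process_pulled_events(room_id, events):    # "after backfill" work
    await asyncio.sleep(0.1)


async def timed_messages(room_id):
    t0 = time.monotonic()
    points = await decide_backfill_points(room_id)
    before_backfill = time.monotonic() - t0

    events = await send_backfill_request(room_id, points)

    t1 = time.monotonic()
    await process_pulled_events(room_id, events)
    after_backfill = time.monotonic() - t1

    print(f"before backfill: {before_backfill:.3f}s, after backfill: {after_backfill:.3f}s")


asyncio.run(timed_messages("!room:example.org"))
```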

What changed?

See https://github.com/matrix-org/synapse/blob/develop/CHANGES.md#synapse-1700rc1-2022-10-19

Is it #13816?

One obvious culprit is #13816, but I would expect everything it touches to show up in the after backfill processing timing. And its goal is to save time, since we bail early if we are trying to process an event we know will fail.
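As I understand it, the idea behind that change is roughly an early-bail check like the sketch below (hypothetical names, not the actual Synapse implementation):

```python
# Hypothetical sketch of the "bail early on events we know will fail" idea
# behind #13816 -- not the actual Synapse implementation.

# Event IDs we've previously failed to process, persisted somewhere durable.
known_failed_event_ids = {"$abc123", "$def456"}


def process_event(event):
    # Stand-in for the expensive per-event processing (auth, state res, etc.)
    return event["event_id"]


def process_pulled_events(pulled_events):
    processed = []
    for event in pulled_events:
        if event["event_id"] in known_failed_event_ids:
            # Skip straight away instead of re-doing expensive work that we
            # already know will fail.
            continue
        processed.append(process_event(event))
    return processed
```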

My /messages?dir=b&limit=500 benchmark request to Matrix HQ still takes 24s to complete (expected). The one thing that stood out to me was a new recursively fetching redactions span under _process_pulled_events in the trace.
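(For reference, the benchmark is just a backwards-paginated /messages call timed end-to-end, something like the sketch below; the homeserver URL, room ID, and access token are placeholders:)

```python
# Rough reproduction of the benchmark request -- homeserver URL, room ID and
# access token are placeholders, not real values.
import time

import requests

HOMESERVER = "https://matrix-client.matrix.org"
ROOM_ID = "!someroomid:matrix.org"   # e.g. Matrix HQ
ACCESS_TOKEN = "MDAx..."             # placeholder

start = time.monotonic()
resp = requests.get(
    f"{HOMESERVER}/_matrix/client/v3/rooms/{ROOM_ID}/messages",
    params={"dir": "b", "limit": 500},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=300,
)
elapsed = time.monotonic() - start

print(f"{resp.status_code} in {elapsed:.1f}s, {len(resp.json().get('chunk', []))} events")
```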

Is it #14164?

This one seems suspicious, given the new recursively fetching redactions span I saw under _process_pulled_events in the Jaeger trace; that fetching happens in _get_events_from_db, which was changed in this PR. The PR also removed a cache.

This could account for the after backfill processing time, but it's just smoke at this point; I don't know how it actually performs. We do end up running _process_pulled_event less often, since we only do it for new events.
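To illustrate what "recursively fetching redactions" means here (a sketch of the general idea only, not the code touched by #14164): when an event is loaded, any redaction pointing at it also has to be loaded, and that redaction may itself have been redacted, and so on.

```python
# Illustrative sketch only -- not the code changed in #14164. Shows why
# fetching redactions is recursive: a redaction event can itself be redacted.
events_db = {
    "$msg": {"type": "m.room.message", "redacted_by": "$redaction1"},
    "$redaction1": {"type": "m.room.redaction", "redacted_by": "$redaction2"},
    "$redaction2": {"type": "m.room.redaction", "redacted_by": None},
}


def get_event_with_redactions(event_id, seen=None):
    seen = seen or set()
    event = events_db[event_id]
    redaction_id = event["redacted_by"]
    # Recurse into the redaction chain, guarding against cycles.
    if redaction_id and redaction_id not in seen:
        seen.add(redaction_id)
        event = dict(event, redaction=get_event_with_redactions(redaction_id, seen))
    return event


print(get_event_with_redactions("$msg"))
```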

What else?

What about the before backfill processing time?

Needs investigation.


I wish I had a trace representing one of these long requests. It should be something longer than 180s.

But we only recorded two traces over 120s in the past 2 days: https://jaeger.proxy.matrix.org/search?end=1666298780924000&limit=20&lookback=2d&maxDuration&minDuration=120s&operation=RoomMessageListRestServlet&service=matrix.org%20client_reader&start=1666125980924000

And they seem more related to the outgoing-federation-request for /backfill (RequestSendFailed), which we don't control, and wouldn't account for the before/after backfill processing times.
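(If we want to pull these programmatically rather than through the search UI, something like the following against the Jaeger query API that backs that search page should work; the base URL and whether it's reachable without extra auth are assumptions:)

```python
# Query the Jaeger HTTP API (the same search as the UI link above) for
# long-running /messages traces. Base URL and access are assumptions.
import requests

JAEGER = "https://jaeger.proxy.matrix.org"

resp = requests.get(
    f"{JAEGER}/api/traces",
    params={
        "service": "matrix.org client_reader",
        "operation": "RoomMessageListRestServlet",
        "minDuration": "120s",
        "lookback": "2d",
        "limit": 20,
    },
    timeout=30,
)
for trace in resp.json().get("data", []):
    # Span durations are reported in microseconds; the longest span is the
    # overall request.
    longest = max(span["duration"] for span in trace["spans"])
    print(trace["traceID"], f"{longest / 1_000_000:.1f}s")
```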

@MadLittleMods MadLittleMods added the A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) label Oct 20, 2022
@DMRobertson DMRobertson added A-Performance Performance, both client-facing and admin-facing S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Regression Something broke which worked on a previous release O-Occasional Affects or can be seen by some users regularly or most users rarely labels Oct 20, 2022
@DMRobertson
Contributor

Is this a regression in typical response times, or is it just getting worse in the long tail?

@MadLittleMods
Contributor Author

MadLittleMods commented Oct 20, 2022

Is this a regression in typical response times, or is it just getting worse in the long tail?

It's not very obvious in the overall graphs except in the very top percentiles (>99th percentile).

The problem seems to be concentrated in substantial rooms (100-1000 members) and large rooms (>1000 members), where it shows up at least at the 75th percentile.
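To illustrate why that can happen (made-up numbers, purely illustrative): big-room requests are a small slice of overall traffic, so a regression there barely moves the overall percentiles until you look at the very top, even though it is obvious at the 75th percentile for those rooms.

```python
# Purely illustrative numbers: big-room requests are a small slice of overall
# traffic, so a regression there only moves the very top overall percentiles.
import numpy as np

rng = np.random.default_rng(0)
small_rooms = rng.exponential(0.2, 99_500)        # ~99.5% of requests, fast
big_rooms = rng.exponential(5.0, 500) + 10.0      # ~0.5% of requests, regressed

overall = np.concatenate([small_rooms, big_rooms])
for p in (75, 99, 99.9):
    print(f"overall p{p}: {np.percentile(overall, p):.1f}s")
print(f"big rooms p75: {np.percentile(big_rooms, 75):.1f}s")
```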

@MadLittleMods
Contributor Author

MadLittleMods commented Oct 24, 2022

I'm not seeing this spike continue at all this week. Seems back down to normal.

Perhaps it was a load spike on the matrix.org servers? Nothing obvious on that front that correlates though.

I'll close this for now because it isn't pressing and isn't continuing for us to debug.
