
Release Synapse v1.45.0 #11058

Closed
babolivier opened this issue Oct 12, 2021 · 9 comments
Labels: T-Task (Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks), Z-Release (Issues tracking a Synapse release)

Comments

@babolivier (Contributor)

This issue tracks the progress of the v1.45.0 release of Synapse, which is due to be released on October 19th.

@babolivier babolivier added the Z-Release Issues tracking a Synapse release label Oct 12, 2021
@babolivier (Contributor, Author) commented Oct 12, 2021

We released v1.45.0rc1 earlier today: https://github.com/matrix-org/synapse/releases/tag/v1.45.0rc1

Because of a time constraint on getting the RC out, not all of the release blockers were fixed; see the issues labelled X-Release-Blocker (must be resolved before making a release).

The intention is for @DMRobertson (as the maintainer for this week) to release a v1.45.0rc2 (hopefully) later this week with these fixes, before we release v1.45.0.

@callahad callahad changed the title Synapse v1.45.0 Release Synapse v1.45.0 Oct 12, 2021
@callahad callahad added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Oct 12, 2021
@DMRobertson (Contributor)

#11053 is deployed to matrix.org with 27e6e45

Grafana shows that the user directory worker isn't stuck on CPU:
[Screenshot from 2021-10-13 11-16-34: Grafana CPU usage for the user directory worker]

I don't like the climbing memory usage there before my change. I'm guessing that the background process constantly failing means that something wasn't getting GCed?

Here's a view of the last half an hour where you can see the restart.
[Screenshot from 2021-10-13 11-19-25: the last half hour, showing the restart]

I was hoping to see forward progress recorded in the DB too, but for now it's consistently saying

matrix=> select stream_id from user_directory_stream_pos;
 stream_id  
------------
 2382157728

Which I don't find too reassuring.
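
One rough way to check whether that stream position is advancing at all (a throwaway sketch, not anything from the Synapse repo; it assumes direct access to the same matrix database as the psql session above and the psycopg2 driver) is to poll the table for a few minutes:

```python
# Not Synapse code: a quick script to watch user_directory_stream_pos for
# forward progress. Adjust the DSN for the deployment in question.
import time

import psycopg2

conn = psycopg2.connect("dbname=matrix")
conn.autocommit = True

last = None
for _ in range(10):
    with conn.cursor() as cur:
        cur.execute("SELECT stream_id FROM user_directory_stream_pos;")
        (stream_id,) = cur.fetchone()
    if last is not None and stream_id > last:
        print(f"progress: {last} -> {stream_id}")
    else:
        print(f"no change: stream_id = {stream_id}")
    last = stream_id
    time.sleep(30)

conn.close()
```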

@DMRobertson (Contributor)

However, I can see evidence in Grafana that we're not just processing the same old event over and over again:

[Screenshot from 2021-10-13 11-25-04: Grafana showing calls to add_users_in_public_rooms]

We're calling add_users_in_public_rooms, which we were never doing before. That's promising.

But I would feel better if I saw us making progress here:
[Screenshot from 2021-10-13 11-28-35: the graph where forward progress would be expected]

@DMRobertson (Contributor)

As soon as I asked @erikjohnston to take a look, the issue resolved itself. Here's that graph again:
[Screenshot from 2021-10-13 12-27-48: the same graph, now showing the issue resolved]

We observed:

  • Synapse processes state deltas in batches of 100. It had roughly 200 batches to process during the time range shown here.
  • One batch was particularly CPU-intensive to process. (The metrics are updated only after a batch completes, and are sampled every 30 seconds.)
    • I'm not sure what happened here, and I don't think we have great visibility into it.
    • I have made changes here, e.g. checking appservice senders explicitly (linear in the number of appservices) and checking for support/disabled users explicitly. I wouldn't expect that to cause a 40-minute batch, though!
    • Processing a room that changes from public to private looks to be quadratic in the size of the room (see the sketch after this list).
    • Someone joining a private room is linear in the size of the room.
  • Some of the metrics are only logged when the batch ends, rather than in flight. This causes misleading spiky artifacts in Grafana.
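
To illustrate the two complexity estimates above, here is a hypothetical back-of-the-envelope sketch (not Synapse code; the function names are made up): when a room flips from public to private, the directory effectively needs a "shares a private room" entry for every ordered pair of members, whereas a single join only pairs the newcomer with the existing members.

```python
# Hypothetical illustration of the complexity estimates above; not Synapse code.
def entries_on_public_to_private(members: list[str]) -> list[tuple[str, str]]:
    # Every member must be recorded as sharing a private room with every
    # other member: O(n^2) entries for n members.
    return [(a, b) for a in members for b in members if a != b]


def entries_on_join(new_user: str, members: list[str]) -> list[tuple[str, str]]:
    # A new joiner only needs pairing with the existing members (in both
    # directions): O(n) entries.
    return [(new_user, m) for m in members] + [(m, new_user) for m in members]


members = [f"@user{i}:example.com" for i in range(1000)]
print(len(entries_on_public_to_private(members)))              # 999000
print(len(entries_on_join("@newcomer:example.com", members)))  # 2000
```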

We concluded that all is fine now, even if it is a little concerning.

@babolivier (Contributor, Author)

v1.45.0rc2 was released today: https://github.com/matrix-org/synapse/releases/tag/v1.45.0rc2

It contains the fix for the user directory issue, but not the one for the performance issue. @DMRobertson, what's the plan here? Is the performance issue still a release blocker, and should we release an rc3 tomorrow or on Monday?

@DMRobertson (Contributor)

Is the performance issue still a release blocker [...] ?

I'm not sure. I don't yet understand what the underlying cause of the issue (or issues?) is. I was hoping to talk it over in the backend triage meeting this afternoon.

One person changed their configuration to use workers and they report that the lockups have stopped.
In their graphs from before the config change, @erikjohnston noted that the spikes seemed to be correlated with outbound federation requests. We were also surprised at the appearance of generate_sync_entry_for_groups. (But that doesn't sound like something we would expect to have changed in 1.44.)

The other reported that turning presence off reduced the impact of their mass DNS lookups (#11049 (comment)) by a factor of two.

and we should release an rc3 tomorrow or on Monday?

I think we should if we have a proposed fix.

@DMRobertson (Contributor)

After discussion in the planning meeting, we decided that #11049 was no longer a release blocker. I've asked the reporters to confirm that this was a regression by rolling back. If so, we may ask them to git bisect to figure out what introduced the problem.
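
For reference, such a bisect might look roughly like this (a sketch only; the good/bad tags are assumptions, taking v1.44.0 as the last known-good release and v1.45.0rc1 as the first version showing the lockups, and each step needs the checkout to be installed and exercised against a test homeserver):

```
git clone https://github.com/matrix-org/synapse.git && cd synapse
git bisect start
git bisect bad v1.45.0rc1   # assumed: first version showing the lockups
git bisect good v1.44.0     # assumed: last version known to behave
# install and run this checkout, observe whether the lockups occur, then:
git bisect good             # or: git bisect bad
# repeat until git names the first bad commit
git bisect reset
```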

@DMRobertson (Contributor)

@squahtx (Contributor) commented Oct 20, 2021

We're going to do a 1.45.1 release to roll back #10947 temporarily.
