synapse takes ages to restart #7968
keywords: restart
I'm hoping this will provide some pointers for debugging #7968.
A lot of the problem here is this query:
... which takes six seconds on a relatively idle database.
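One way to pin down which startup queries are slow (the query text itself is elided above) is to wrap execution with a timer and log anything over a threshold. A minimal, illustrative sketch — the `timed_execute` helper is hypothetical, not part of synapse, and sqlite3 stands in for postgres here:

```python
import sqlite3
import time

SLOW_THRESHOLD = 1.0  # seconds; purely illustrative

def timed_execute(cur, sql, params=()):
    """Execute a query and return how long it took, logging it
    when it crosses the threshold."""
    start = time.monotonic()
    cur.execute(sql, params)
    elapsed = time.monotonic() - start
    if elapsed > SLOW_THRESHOLD:
        print(f"slow query ({elapsed:.2f}s): {sql[:80]}")
    return elapsed

# usage against a throwaway in-memory database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
elapsed = timed_execute(cur, "SELECT 1")
```

On postgres, setting `log_min_duration_statement` achieves the same thing server-side without touching application code.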
this one takes 11 seconds:

hopefully the latter will be fixed by #8271.
I've also manually cleared out
it's still pretty slow, so I'm not sure it's fixed.
shutting down also takes ages: that seems to be mostly because python runs a final GC at shutdown, which can take many seconds. Maybe we can skip that GC. (or maybe it improves if we fix #7176)
given we're running zope.interface-4.7.1 on matrix.org, that's sadly unlikely.
Per https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172, we could skip the final GC with an early
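The Instagram post linked above describes exiting the process before the interpreter's teardown collection can run, by calling `os._exit` once application-level cleanup is done. A hedged sketch of the idea — `register_shutdown` and `fast_exit` are hypothetical names, not synapse's API. Note that `os._exit` also skips `atexit` handlers, so anything registered to run at shutdown must be invoked explicitly first:

```python
import os
import sys

_shutdown_hooks = []

def register_shutdown(fn):
    """Record a callback to run before the process exits."""
    _shutdown_hooks.append(fn)

def fast_exit(status=0):
    """Run registered hooks, flush output, then exit via os._exit,
    skipping the interpreter's final GC pass (and atexit handlers)."""
    for fn in _shutdown_hooks:
        fn()
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(status)
```

`gc.freeze()` is the complementary trick from the same post: it moves all currently tracked objects into a permanent generation so later collections have less to scan.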
I think I want to make this a release blocker; it feels like we've had a big regression here in the last couple of weeks. I just tried restarting an idle worker:
42 seconds to restart an idle process is no good.
I've also seen the startup sequence take more than 60 seconds (in addition to schema migrations).
This is now much better thanks to #8447. I still think there's work to be done here (on matrix.org, shutdown takes 12 seconds; startup takes 11), so will leave this open.
I just had another restart which took 83 seconds:
A few thoughts:
apparently the stream generator for
We need to be careful that we still run the functions we registered to run when the reactor stops.
I've made a PR to skip the final GC on shutdown, which ought to shave off a few seconds: #10712
We saw the same 30 second delay after initialising the stream generator for
After looking at postgres logs, two slow queries on startup were identified, accounting for ~30 seconds.
From discussions with @richvdh and @erikjohnston:
Our next step is to check the impact on restart times once these two changes have made it to matrix.org.
Related to rolling restarts: #10793
In today's restart, we saw that the 30 second delay (actually 35 seconds) has been halved down to 18 seconds:
These 3 previously seen queries were responsible for most of those 18 seconds, as expected.
Anecdata: as an end-user, what I witnessed today was 3+ minutes (I think closer to 6-7 min, but I didn't pay that much attention after the 3-minute mark) during which a dialog said the connection was lost. Maybe there are also problems elsewhere in the chain.
The reactor really shouldn't be stalling for 2 minutes, even during shutdown. If it is, that is another bug we should investigate. On a recent restart of
I'd love to know what the bg worker and the main process were doing for those 10 seconds. I still also think we should remove the dependency between main process and workers: there is no need to wait for all the workers to shut down before beginning to stop the main process.
so yes, it looks like the first
I think the very slow response - and a lot of the slow worker startup time - is just because we have tens of synapse processes hammering the database after a restart. Rolling restarts should help considerably here.
At the time I made that observation, #10703 was still an issue, so it may very well have been the cause.
Data point: @erikjohnston and @reivilibre were surprised that the restart to upgrade to 1.47.1 seemed noticeably quicker than expected. |
Deploying 1.59rc2 to matrix.org today took 42 minutes for 141 workers to sequentially shutdown, restart and notify systemd. I make that roughly 18 seconds per worker. |
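The per-worker figure checks out as simple arithmetic:

```python
# 141 workers restarted sequentially in 42 minutes
total_seconds = 42 * 60
workers = 141
per_worker = total_seconds / workers
print(f"{per_worker:.1f} s per worker")  # prints "17.9 s per worker"
```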
Why does it take 70 seconds to initialise the database? It might be some sort of database connection contention but it would be nice if there were some log messages or something.
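If connection contention is the suspect, timing each startup phase and logging it would make the 70 seconds visible in the logs rather than a mystery. A minimal sketch — the `log_duration` helper is hypothetical, not something synapse provides:

```python
import contextlib
import time

@contextlib.contextmanager
def log_duration(label):
    """Log how long the enclosed block takes, so startup phases
    like 'initialise database' show up in the logs."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        print(f"{label} took {elapsed:.1f}s")

# usage: wrap the suspect startup phase
with log_duration("initialise database"):
    time.sleep(0.01)  # stand-in for connecting and preparing the db
```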
what's going on there?