synapse takes ages to restart #7968
keywords: restart
I'm hoping this will provide some pointers for debugging #7968.
A lot of the problem here is this query:
... which takes six seconds on a relatively idle database.
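One way to pin down which startup queries are slow (the query text itself is elided above) is to wrap execution with a timer and log anything over a threshold. A minimal, illustrative sketch — the `timed_execute` helper is hypothetical, not part of synapse, and sqlite3 stands in for postgres here:

```python
import sqlite3
import time

SLOW_THRESHOLD = 1.0  # seconds; purely illustrative

def timed_execute(cur, sql, params=()):
    """Execute a query and return how long it took, logging it
    when it crosses the threshold."""
    start = time.monotonic()
    cur.execute(sql, params)
    elapsed = time.monotonic() - start
    if elapsed > SLOW_THRESHOLD:
        print(f"slow query ({elapsed:.2f}s): {sql[:80]}")
    return elapsed

# usage against a throwaway in-memory database
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
elapsed = timed_execute(cur, "SELECT 1")
```

On postgres, setting `log_min_duration_statement` achieves the same thing server-side without touching application code.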
this one takes 11 seconds:

hopefully the latter will be fixed by #8271.
I've also manually cleared out
it's still pretty slow, so I'm not sure it's fixed.
shutting down also takes ages: that seems to be mostly because python runs a final GC at shutdown, which can take many seconds. Maybe we can skip that GC. (or maybe it improves if we fix #7176)
given we're running zope.interface-4.7.1 on matrix.org, that's sadly unlikely.
Per https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172, we could skip the final GC with an early
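The Instagram post linked above describes exiting the process before the interpreter's teardown collection can run, by calling `os._exit` once application-level cleanup is done. A hedged sketch of the idea — `register_shutdown` and `fast_exit` are hypothetical names, not synapse's API. Note that `os._exit` also skips `atexit` handlers, so anything registered to run at shutdown must be invoked explicitly first:

```python
import os
import sys

_shutdown_hooks = []

def register_shutdown(fn):
    """Record a callback to run before the process exits."""
    _shutdown_hooks.append(fn)

def fast_exit(status=0):
    """Run registered hooks, flush output, then exit via os._exit,
    skipping the interpreter's final GC pass (and atexit handlers)."""
    for fn in _shutdown_hooks:
        fn()
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(status)
```

`gc.freeze()` is the complementary trick from the same post: it moves all currently tracked objects into a permanent generation so later collections have less to scan.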
I think I want to make this a release blocker; it feels like we've had a big regression here in the last couple of weeks. I just tried restarting an idle worker:
42 seconds to restart an idle process is no good.
I've also seen the startup sequence take more than 60 seconds (in addition to schema migrations).
This is now much better thanks to #8447. I still think there's work to be done here (on matrix.org, shutdown takes 12 seconds; startup takes 11), so will leave this open.
I just had another restart which took 83 seconds:
A few thoughts:
apparently the stream generator for
We need to be careful that we still run the functions we registered to run when the reactor stops.
I've made a PR to skip the final GC on shutdown, which ought to shave off a few seconds: #10712
We saw the same 30 second delay after initialising the stream generator for
After looking at postgres logs, two slow queries on startup were identified, accounting for ~30 seconds.
From discussions with @richvdh and @erikjohnston:
Our next step is to check the impact on restart times once these two changes have made it to matrix.org.
Related to rolling restarts: #10793
In today's restart, we saw that the 30 second delay (actually 35 seconds) has been halved down to 18 seconds:
These 3 previously seen queries were responsible for most of those 18 seconds, as expected.
Anecdata: as an end-user, what I witnessed today was 3+ minutes (I think closer to 6-7 min, but I didn't pay that much attention after the 3-minute mark) during which a dialog said the connection was lost. Maybe there are also problems elsewhere in the chain.
The reactor really shouldn't be stalling for 2 minutes, even during shutdown. If it is, that is another bug we should investigate. On a recent restart of
I'd love to know what the bg worker and the main process were doing for those 10 seconds. I still also think we should remove the dependency between main process and workers: there is no need to wait for all the workers to shut down before beginning to stop the main process.
so yes, it looks like the first
I think the very slow response - and a lot of the slow worker startup time - is just because we have tens of synapse processes hammering the database after a restart. Rolling restarts should help considerably here.
At the time I made that observation, #10703 was still an issue, so it may very well have been the cause.
Data point: @erikjohnston and @reivilibre were surprised that the restart to upgrade to 1.47.1 seemed noticeably quicker than expected. |
Deploying 1.59rc2 to matrix.org today took 42 minutes for 141 workers to sequentially shutdown, restart and notify systemd. I make that roughly 18 seconds per worker. |
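The per-worker figure checks out as simple arithmetic:

```python
# 141 workers restarted sequentially in 42 minutes
total_seconds = 42 * 60
workers = 141
per_worker = total_seconds / workers
print(f"{per_worker:.1f} s per worker")  # prints "17.9 s per worker"
```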
Why does it take 70 seconds to initialise the database? It might be some sort of database connection contention but it would be nice if there were some log messages or something.
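If connection contention is the suspect, timing each startup phase and logging it would make the 70 seconds visible in the logs rather than a mystery. A minimal sketch — the `log_duration` helper is hypothetical, not something synapse provides:

```python
import contextlib
import time

@contextlib.contextmanager
def log_duration(label):
    """Log how long the enclosed block takes, so startup phases
    like 'initialise database' show up in the logs."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        print(f"{label} took {elapsed:.1f}s")

# usage: wrap the suspect startup phase
with log_duration("initialise database"):
    time.sleep(0.01)  # stand-in for connecting and preparing the db
```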
what's going on there?