Feature: add synapse worker support #642
· needs documentation; no checks yet for port clashes or typos in worker name
· according to https://github.com/matrix-org/synapse/wiki/Workers-setup-with-nginx#results about 90% of requests go to the synchrotron endpoint
· thus, the synchrotron worker is especially suited to be load-balanced
· most of the other workers are documented to support only a single instance
· https://github.com/matrix-org/synapse/blob/master/docs/workers.md
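A check for port clashes and worker-name typos might look roughly like the sketch below. The shape of a worker entry (a dict with `type` and `port` keys), the `KNOWN_WORKER_TYPES` set, and the function name `validate_workers` are all assumptions made for illustration; they are not taken from the playbook.

```python
from collections import Counter

# Hypothetical set of accepted worker types, for illustration only.
# The authoritative list is in the Synapse workers documentation.
KNOWN_WORKER_TYPES = {
    "generic_worker",
    "synchrotron",
    "media_repository",
    "federation_sender",
}


def validate_workers(workers):
    """Return a list of human-readable problems found in a worker list.

    Each worker is assumed to be a dict with 'type' and 'port' keys.
    """
    problems = []

    # Flag worker types that are not in the known set (likely typos).
    for worker in workers:
        if worker["type"] not in KNOWN_WORKER_TYPES:
            problems.append(f"unknown worker type: {worker['type']}")

    # Flag ports that are assigned to more than one worker.
    port_counts = Counter(worker["port"] for worker in workers)
    for port, count in sorted(port_counts.items()):
        if count > 1:
            problems.append(f"port {port} is used by {count} workers")

    return problems
```

An empty return value means no clash or typo was found; a playbook could fail early whenever the list is non-empty.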
· 😅 How to keep this in sync with the matrix-synapse documentation?
· regex location matching is expensive
· nginx syntax limit: one location only per block / statement
· thus, lots of duplicate statements in this file
FIXME: horrid duplication in template file
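One way to reduce that duplication is to keep the routes in a single list and render one location block per entry. The sketch below uses plain Python rather than the playbook's template language, and the function name `render_locations` and the upstream name in the usage note are hypothetical.

```python
def render_locations(routes, upstream):
    """Render one nginx location block per route regex.

    All blocks proxy to the same upstream, so each route is written only once.
    """
    blocks = []
    for route in routes:
        blocks.append(
            f"location ~ {route} {{\n"
            f"    proxy_pass http://{upstream};\n"
            f"}}"
        )
    return "\n\n".join(blocks)
```

For example, `render_locations(["^/_matrix/client/(v2_alpha|r0)/sync$"], "synchrotron_upstream")` yields a single block for the sync endpoint. This removes the duplication from the source file, not the runtime cost of regex matching.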
Thanks for chiming in Max! Some of the checkboxes (redis) can be ticked, please have a look at the original PR #456 .. had been working on this last week but got too deep into trying to automatically parse the synapse workers docs .. converting my awk script* into perl and such nonsense. Also have a bunch of other stuff going on, and a tmux session with >60 windows (so maybe 150 terminal commands/apps running) really gets a bit confusing when combined with a sleep deficit 🤪 [*]
|
Hey @eMPee584, nice to see that you picked it up again. I would like to help out. For example, your nginx configuration still contains the old workers/routes. |
You mean, [it still contains the old routes judging from the parts I have committed as yet..] |
how can we help you guys push this over the finish line? |
I am running this branch in a beta test with about 15000 users right now to see if everything runs smoothly under load. |
I'll have another go at it today.. 😅 |
an idiot-proof guide on how to get the latest workers pull:
Then add this to your vars.yml:
Then just re-run the playbook with setup-all,start. Seems to run okay so far, good work Max. Edit: This setup caused CPU load to spike badly, probably because 10 workers for about two dozen people is overkill. xD Tried reducing the number of workers like so:
This caused the service file for the postgres container to break like so:
Which of course meant the synapse container wouldn't load. I ended up reverting to the no-worker setup and had to delete the image+container for redis manually. Back online now, I'll have another play tomorrow. Edit 2: Half of these were shut off when I deployed the second time with fewer workers, so it doesn't seem to be cleaning them up properly:
It's possible these workers are 'all-or-nothing' so you can set them all up, or remove them all, but you can't pick and choose. I didn't run the same PR with workers disabled so that might have cleaned up all of these in one swoop. |
Had a bit more of a play, here is how it breaks the postgres service file:
The bottom line and preceding slash is what breaks it, this might just be because of my crude merging though. xD Also noticed with these variables the dockers start up but clients won't start syncing:
It does a good job cleaning everything up at least when you set workers_enabled to false. |
I tested this PR with my homeserver. It did not work with the defaults.
|
Have been running 3 workers for a few days now with no real issues:
Just ran with these settings, then fixed the broken postgres service file. CPU time seems to be climbing on the synapse processes but the service is very responsive now. :) Thanks again. |
One error I think might be related: about two thirds of the time, uploads fail to complete with the error message "The file 'XXX' failed to upload." in Element. I believe this is the synapse workers failing to handle the request, and it only completes when the main synapse process handles the request. I wasn't able to find a relevant error message in their logs though. :( |
I tried to revert to 1 worker like so, yet I still see the 2 generic workers in htop:
Interestingly enough the port mapping is correct, so these 2 generic workers shouldn't be able to access the web:
We also see that the services are still available:
I ended up manually removing these services. Edit: In retrospect I wouldn't be surprised if the 2 generic workers were climbing in load constantly because they weren't port mapped before either! derp :) Still, they don't get cleaned up properly when you reduce the number of workers in this list. |
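The cleanup problem described above comes down to a set difference: anything installed on the host that is no longer in the configured list is a leftover. A minimal sketch, assuming worker services can be identified by name; the function name and the service names in the usage note are hypothetical, not the playbook's real naming scheme.

```python
def find_stale_workers(installed, configured):
    """Return worker names present on the host but absent from the configured list.

    Both arguments are iterables of names; the result is sorted for stable output.
    """
    return sorted(set(installed) - set(configured))
```

With `installed = ["generic-1", "generic-2", "generic-3"]` and `configured = ["generic-1"]`, the two extra generic workers are reported as stale and could then be stopped and removed.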
Have you tried to add a media_repository worker to split the upload from generic workers? |
I rebased my branch on this PR and added proper (hopefully) cleansing of left-over workers, please try if the problem persists with the recent commits on #456 .. |
I'll close this PR in favour of finishing the work in #456 |
The PR of @eMPee584 has been stale for quite a while. I would like to take over to implement this feature.
Docs:
Usage
To enable worker support, simply set the matrix_synapse_workers_enabled flag to true. There are defaults for how many workers of which kind are started in matrix_synapse_workers_enabled_list. In addition, your database should allow more connections to support all workers.
Tasks: