
Synapse workers #456

Merged
merged 88 commits into from Feb 19, 2021

Conversation

eMPee584
Contributor

@eMPee584 eMPee584 commented Apr 14, 2020

· first version: grokking this was a bit of 🤯 .. and there are probably still faults in the wiring
· ⚠️ not tested impact on UX yet
· update 2020/04/21: live testing on our 7k+ uni matrix instance 😅 💦
· update 2020/06/30: works very well to spread load, now with 10k+ users
· update 2020/09/09: currently working on expanding the patch queue to cover recent changes to synapse workers.. not much left but it might need another week.
· update 2020/09/29: about to merge with Max's contributions in #642 PR ✊♻️
· update 2020/10/14: ok I've merged and mostly reconciled Max's & my branch (this PR) with MDAD* mainline, now working on getting the frigging PIDs from the worker processes and fixing the removal of leftovers
· update 2020/10/28: nearly finished, only missing support for setups not using inbuilt nginx.. REVIEWS, PLEASE
· update 2021/01/08: yes, still nearly finished 😁

  • (hey @spantaleev this project needs an official abbreviation/nick name lol)

@eMPee584
Contributor Author

eMPee584 commented Apr 19, 2020

Turns out, even with a when condition on an import_tasks action evaluating to false, loops within these tasks will be executed. How !clever a design choice is that. 🥳

[ok, further reading revealed the distinctive logic behind import_tasks and include_tasks .. c.f. Server Fault - What's the difference between include_tasks and import_tasks? ]
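
For illustration, here's a minimal sketch of that distinction (file and variable names are made up for the example, not taken from this PR):

# import_tasks is static: the `when` is copied onto every task inside
# setup_workers.yml, so `loop:` expressions in those tasks are still
# templated and iterated (each item merely skipped) even when the
# condition is false.
- name: Statically import worker setup tasks
  import_tasks: setup_workers.yml
  when: matrix_synapse_workers_enabled | bool

# include_tasks is dynamic: a false `when` skips the whole file, so
# loops inside it are never evaluated at all.
- name: Dynamically include worker setup tasks
  include_tasks: setup_workers.yml
  when: matrix_synapse_workers_enabled | bool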

eMPee584 and others added 8 commits April 19, 2020 19:05
· needs documentation; no checks yet for port clashes or typos in worker name
· according to https://github.com/matrix-org/synapse/wiki/Workers-setup-with-nginx#results
  about 90% of requests go to the synchrotron endpoint
· thus, the synchrotron worker is especially suited to be load-balanced
· most of the other workers are documented to support only a single instance
· https://github.com/matrix-org/synapse/blob/master/docs/workers.md
· 😅 How to keep this in sync with the matrix-synapse documentation?
· regex location matching is expensive
· nginx syntax limit: one location only per block / statement
· thus, lots of duplicate statements in this file
FIXME: horrid duplication in template file
@eMPee584
Contributor Author

eMPee584 commented Apr 20, 2020

@spantaleev this template file has a lot of duplicate statements.. I had started to factor some of that out into a {% macro render_proxy_directives() %}, but oh the indentation .. and oh the TABs... this would have been nice: pallets/jinja#919 🤔
We tried it on our test instance though and it seems to work fine. Still waiting for a colleague to provide us with his load testing script..

This was referenced Apr 20, 2020
@spantaleev
Owner

One of the things that grabs my attention is the use of matrix_synapse_* variables in the matrix-nginx-proxy role (e.g. matrix_synapse_use_presence, matrix_nginx_proxy_synapse_workers_list).

We try to keep roles as separate as possible and to instead pass data along via group_vars/matrix_servers.

The easiest way to fix that would probably be by defining a new default like this in the matrix-nginx-proxy role:

# List of Synapse workers.
# Example: [{"worker": "federation_reader", "port": 5000}]
matrix_nginx_proxy_synapse_workers_list: []

.. and then connecting it to matrix_synapse_workers_enabled_list from group_vars/matrix_servers:

matrix_nginx_proxy_synapse_workers_list: "{{ matrix_synapse_workers_enabled_list }}"

We probably don't need to pass matrix_synapse_use_presence to matrix-nginx-proxy. It doesn't seem to hurt having a rule capturing location ~ ^/_matrix/client/(api/v1|r0|unstable)/presence/[^/]+/status and forwarding it to Synapse in any case (even if presence is disabled).


I'm guessing when workers are enabled, we don't support matrix_nginx_proxy_enabled: false?
If the proxy is disabled, we attempt to generate a configuration that can be used with a local nginx server (one not running in a container). In such a case, we tell other services to expose their ports on 127.0.0.1 and we reverse-proxy to those ports, and not to container-name:port.

I guess making workers work with this would be complicated.

I see that you've modified roles/matrix-synapse/templates/synapse/systemd/matrix-synapse.service.j2 to expose all worker ports (on 0.0.0.0 even). Is it necessary to do this? We may wish to expose them on 127.0.0.1 in some rare cases (like matrix_nginx_proxy_enabled: false), but I don't see why we'd do it on 0.0.0.0 all of the time.


I'm guessing that when workers are enabled, matrix-corporal likely breaks, as it's no longer capturing all traffic. Rather, some traffic gets captured in these new location blocks and goes to Synapse workers instead.

Using matrix-corporal + Synapse workers is some advanced use-case that will be difficult to solve. You should ignore it. I'm just mentioning it here for completeness.

@lovelaced

I don't have anything to add here, but just wanted to say this would be massively useful to my organization, and workers are a REALLY important feature. I would love to be able to use the officially supported ansible playbook for this rather than throwing something together.

@eMPee584
Contributor Author

eMPee584 commented Apr 25, 2020

( ok thanks for the feedback.. this will need a couple more days to get right.. also, we're hitting some issues, matrix-org/synapse#7345 [issues were resolved by updating to synapse 1.12.4 👍 😅 ] .. so for the time being, it's strictly for the bold and courageous 😅 )

@ptman
Contributor

ptman commented Jun 3, 2020

@eMPee584 any update? smooth sailing?

@eMPee584
Contributor Author

Ah, sorry, I was busy with school these past weeks, and the convoluted nginx template also made me avoid finally tackling this. I'll try and get it done in the coming week..
Yes, it does work very well.

@spantaleev
Owner

Ah, good to hear you're still on it! Looking forward to merging this!

{% set federation_reader_worker = matrix_synapse_workers_enabled_list|selectattr('worker', 'equalto', 'federation_reader')|first %}
{% if federation_reader_worker %}
{# c.f. https://github.com/matrix-org/synapse/blame/master/docs/workers.md#L160 #}
location ~ ^(/_matrix/federation/v1/event/|/_matrix/federation/v1/state/|/_matrix/federation/v1/state_ids/|/_matrix/federation/v1/backfill/|/_matrix/federation/v1/get_missing_events/|/_matrix/federation/v1/publicRooms|/_matrix/federation/v1/query/|/_matrix/federation/v1/make_join/|/_matrix/federation/v1/make_leave/|/_matrix/federation/v1/send_join/|/_matrix/federation/v2/send_join/|/_matrix/federation/v1/send_leave/|/_matrix/federation/v2/send_leave/|/_matrix/federation/v1/invite/|/_matrix/federation/v2/invite/|/_matrix/federation/v1/query_auth/|/_matrix/federation/v1/event_auth/|/_matrix/federation/v1/exchange_third_party_invite/|/_matrix/federation/v1/user/devices/|/_matrix/federation/v1/send/|/_matrix/federation/v1/get_groups_publicised$|/_matrix/key/v2/query|/_matrix/federation/v1/groups/) {

@lovelaced lovelaced Jul 7, 2020


Only GET requests to /_matrix/federation/v1/groups/ should be routed to the federation_reader; this configuration will cause federated community invites to fail. Reference here: https://github.com/matrix-org/synapse/blob/master/docs/workers.md#synapseappfederation_reader

Contributor Author


Thanks for the feedback.. Is this theoretical or have you experienced it on an actual system?
I think there probably is a different mistake in the config to cause this.. we are also having some minor troubles with our setup.
I've just created an awk script to parse the endpoints directly from the upstream synapse workers documentation.. but it remains kinda nightmarish to work with.. 🤔


Yes, I experienced it myself when trying to set up a fresh homeserver, so I just removed the groups endpoint so that the main synapse process handles it instead for now.

{% set client_reader_worker = matrix_synapse_workers_enabled_list|selectattr('worker', 'equalto', 'client_reader')|first %}
{% if client_reader_worker %}
{# c.f. https://github.com/matrix-org/synapse/blame/master/docs/workers.md#L252 #}
location ^/_matrix/client/(versions$|(api/v1|r0|unstable)/(publicRooms$|rooms/.*/joined_me|rooms/.*/context/.|rooms/.*/members$|rooms/.*/messages$|rooms/.*/state$|login$|account/3pid$|keys/query$|keys/changes$|voip/turnServer$|joined_groups$|publicised_groups$|publicised_groups/|pushrules/.*$|groups/.*$|register$|auth/.*/fallback/web$)) {

@lovelaced lovelaced Jul 13, 2020


Two of the endpoints here can only handle GET requests, and there are also two more not included here. The following list is GET-only for the client_reader worker:

^/_matrix/client/(api/v1|r0|unstable)/pushrules/.*$
^/_matrix/client/(api/v1|r0|unstable)/groups/.*$
^/_matrix/client/(api/v1|r0|unstable)/user/[^/]*/account_data/
^/_matrix/client/(api/v1|r0|unstable)/user/[^/]*/rooms/[^/]*/account_data/

Contributor


The "location ^/..." here, for a regex match needs to be "location ~ ^..." doesn't it?

@maxklenk
Contributor

Redis support was added in v1.13.0, and it became the recommended method in v1.18.0.
https://github.com/matrix-org/synapse/blob/master/docs/workers.md

The workers in the current synapse version now communicate via Redis, so we should set up Redis as well to add worker support.
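
For context, a rough sketch of the Redis-based replication wiring described in the Synapse workers documentation (hostnames, ports and the worker name here are illustrative):

# homeserver.yaml (main process): enable Redis-based replication
redis:
  enabled: true
  host: matrix-redis
  port: 6379

# homeserver.yaml (main process): replication listener that workers talk to
listeners:
  - port: 9093
    bind_address: '127.0.0.1'
    type: http
    resources:
      - names: [replication]

# worker config file: a generic worker pointing back at the main process
worker_app: synapse.app.generic_worker
worker_name: generic_worker1
worker_replication_host: 127.0.0.1
worker_replication_http_port: 9093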

@maxklenk maxklenk mentioned this pull request Sep 9, 2020
`::` leads to errors like:

> socket.gaierror: [Errno -9] Address family for hostname not supported
This leads to much easier management and potential safety
features (validation). In the future, we could try to avoid port
conflicts as well, but it didn't seem worth the effort to do it now.
Our port ranges seem large enough.

This can also pave the way for a "presets" feature
(similar to `matrix_nginx_proxy_ssl_presets`) which makes it even easier
for people to configure worker counts.
@spantaleev
Owner

spantaleev commented Feb 15, 2021

85a05f3 added support for dynamically generating matrix_synapse_workers_enabled_list based on other variables.
I believe that having this makes it much easier for people to configure workers.

I think having some presets might be a useful feature that can be added easily now. Most people would then be able to just switch a single variable (matrix_synapse_workers_preset) which controls all of the many matrix_synapse_workers_*_workers_count variables.

It'd be interesting to brainstorm what some good presets would be. I guess it could be like this:

  • minimal - a very minimal (but still useful) one is to just enable a federation_sender worker and leave all other workers disabled
  • one-of-each - the next preset could be "one of each" worker type
  • larger - after that, there could be one that is like "one of each", but has multiple generic workers (3 or 5, for example)

Note that we currently don't support running multiple federation_sender workers (see 61e427d), so we can't increase the number of those right now. We may consider fixing this.

People that need more workers than these 3 presets can go for a hybrid approach: start with a preset (say "one of each") and then increase the number of a certain worker that they need more of (e.g. matrix_synapse_workers_generic_workers_count: 15).
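
As a sketch, such a hybrid vars.yml configuration could look like this (matrix_synapse_workers_enabled is assumed here; the other variable names are the ones mentioned above):

matrix_synapse_workers_enabled: true
matrix_synapse_workers_preset: one-of-each
# bump just the worker type you need more of
matrix_synapse_workers_generic_workers_count: 15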

Another question is, which should be the default preset? I think one-of-each is most helpful and reasonable. People that enable workers are obviously on a somewhat bigger server, so going for the middle option is good. People who want to save resources can go the extra mile and go for minimal or for additional tweaking of matrix_synapse_workers_*_workers_count variables. People who want a bigger worker setup can also use one-of-each and tweak the variables that seem worth tweaking.


Do you have other ideas for useful presets? Or comments about these ones? Does having a preset option sound useful at all or should we just go for "one of each" by default and ask people to tweak the worker counts?

@ptman
Contributor

ptman commented Feb 15, 2021

It would be great to get (voluntary) feedback on what actual worker counts end up being popular. One could start with one, or no, presets and add more as patterns are identified.

This gives us the possibility to run multiple instances of
workers that don't expose a port.

Right now, we don't support that, but in the future we could
run multiple `federation_sender` or `pusher` workers, without
them fighting over naming (previously, they'd all be named
something like `matrix-synapse-worker-pusher-0`, because
they'd all define `port` as `0`).
@spantaleev
Owner

I've made a breaking change to the format of matrix_synapse_workers_enabled_list in d6c4d41.

People that define that manually will need to add an instanceId property to each worker instance.


In any case, it's better to refrain from defining matrix_synapse_workers_enabled_list, if possible, in favor of the matrix_synapse_workers_*_workers_count variables.
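
As a sketch, a manually defined list would now need to look something like this (the instanceId property is the new requirement; the other field names follow the earlier example in this thread and may differ from the final role defaults):

matrix_synapse_workers_enabled_list:
  - worker: generic_worker
    instanceId: 18111
    port: 18111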

We're talking about a webserver running on the same machine, which
imports the configuration files generated by the `matrix-nginx-proxy`
in the `/matrix/nginx-proxy/conf.d` directory.

Users who run an nginx webserver on some other machine will need to do
something different.
Adding more presets in the future would be nice.
@spantaleev spantaleev merged commit d94d0e2 into spantaleev:master Feb 19, 2021
@spantaleev
Owner

I think we have a cause for celebration! This is now complete and merged to master! 🎉

Huge thanks to everyone who has worked on making this happen!


There are minor things that could be improved.

For example, pusher and federation_sender workers can only run as a single instance (at most) for us. This is solvable, but requires more work and testing. It's probably not too important as well, given that TU Dresden scaled to 10k+ even without it.

Another thing that may be improved is metrics. Since #838, we do support metrics collection. When workers are enabled, graphs become somewhat sparse. I'm guessing we'd need to collect data from the various workers when workers are enabled. Maybe we need some different graphs (splitting them per worker or something?).

@jolly-jump
Contributor

Congratulations to those who worked on this! This saved my butt. We have only 1k users, but it seems this was a necessary step.

I have two other questions:

  • nginx worker connections
  • postgres worker connections

They are not handled by this configuration of workers, I suppose. But if I tune postgres to use "max_parallel_workers" as proposed in #532, and if I use more nginx worker connections by tampering with /matrix/nginx-proxy/nginx.conf, is it wise to count all the workers together to avoid too much overprovisioning of CPU cores (I know, that is not a real thing)?

On the other hand: would it be difficult to let the nginx config (at least) be managed by this workers configuration, or is this all about synapse only and should it therefore be a separate configuration?

spantaleev added a commit that referenced this pull request Feb 19, 2021
This probably got lost somehow in all the work that happened in
#456
@spantaleev
Owner

We can adjust both of these (nginx workers; Postgres max connections) automatically when Synapse workers are enabled via group_vars/matrix_servers settings.

We can define variables for these things in their respective roles and then override them from group_vars/matrix_servers to some better values (for higher scale). We can even do it per preset (especially if we add larger presets that will require changes). Still, people may wish to adjust the values manually if they're running many additional workers.
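
As a sketch, such a group_vars/matrix_servers override could look like this (the variable names are hypothetical placeholders for whatever the respective roles end up exposing, not confirmed role variables):

# raise nginx's per-worker connection limit for larger setups
matrix_nginx_proxy_worker_connections: 4096
# pass a higher connection limit to Postgres
matrix_postgres_process_extra_arguments:
  - "-c max_connections=200"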

spantaleev added a commit that referenced this pull request Feb 24, 2021
spantaleev added a commit that referenced this pull request Mar 1, 2021
Mostly affects people who disable the integrated `matrix-nginx-proxy`.

Related to #456
and more specifically 4d62a75.
spantaleev added a commit that referenced this pull request Mar 2, 2021
This is something that got lost during
#456
and more specifically 4d62a75.

Fixes #914
@rltas

rltas commented Nov 11, 2022

It's probably not too important as well, given that TU Dresden scaled to 10k+ even without it.

Just to get an idea, how many resources/workers are needed at that scale?

@eMPee584
Contributor Author

It's probably not too important as well, given that TU Dresden scaled to 10k+ even without it.

Just to get an idea, how many resources/workers are needed at that scale?

From what I remember, very few.. a couple (3-6) synchrotron workers, plus the parallelization gained from running just a single worker of each of the other types, already had a sufficient impact compared to a monolithic synapse process.
I'll ask the current admin team for the worker config in use now..

@eMPee584 eMPee584 deleted the synapse-workers branch November 24, 2022 22:07