Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: add synapse worker support #642

Closed

Conversation

maxklenk
Copy link
Contributor

@maxklenk maxklenk commented Sep 9, 2020

The PR of @eMPee584 is stale for quite a while. I would like to take over to implement this feature.

Docs:

Usage

To enable worker support, simply set the matrix_synapse_workers_enabled flag to true. There are defaults on how many workers of which kind are started in matrix_synapse_workers_enabled_list. In addition your database should allow more connections to support all workers.

#
# Synapse config
#
matrix_synapse_workers_enabled: true

#
# Postgres config
#
matrix_postgres_process_extra_arguments: [
  "-c 'max_connections=200'"
]

Tasks:

  • use new worker names
  • use new worker routes
  • deploy redis when workers are enabled
  • add redis configuration for workers
  • check cleanup of worker processes/config when they are changed/renamed

eMPee584 and others added 12 commits April 19, 2020 19:05
· needs documentation; no checks yet for port clashes or typos in worker name
· according to https://github.com/matrix-org/synapse/wiki/Workers-setup-with-nginx#results
  about 90% of requests go to the synchrotron endpoint
· thus, the synchrotron worker is especially suited to be load-balanced
· most of the other workers are documented to support only a single instance
· https://github.com/matrix-org/synapse/blob/master/docs/workers.md
· 😅 How to keep this in sync with the matrix-synapse documentation?
· regex location matching is expensive
· nginx syntax limit: one location only per block / statement
· thus, lots of duplicate statements in this file
FIXME: horrid duplication in template file
@eMPee584
Copy link
Contributor

eMPee584 commented Sep 9, 2020

Thanks for chiming in Max! Some of the checkboxes (redis) can be ticked, please have a look at the original PR #456 .. had been working on this last week but got too deep into trying to automatically parse the synapse workers dox .. converting my awk script* into perl and such non-sense. Also have a bunch of other stuff going on, and a tmux session with >60 windows (so maybe 150 terminal commands/apps running) really gets a bit confusing when combined with a sleep deficit 🤪
Anyway, I just force-pushed my rebased branch and some fragments of the necessary updates and will try to have it up running ASAP.. there's not much more missing.

[*] parse-workers-docs.awk

#!/usr/bin/awk
# ^ the most hackish approach to get a machine-readable list
# of current matrix REST API endpoint regular expressions from the »source«
# invoke in shell with:
# URL=https://github.com/matrix-org/synapse/raw/master/docs/workers.md
# curl -L ${URL} | awk -f parse-workers-docs.awk -

# Enable further processing only at the end of the preamble.
# Read each synapse worker section as record and its lines as fields.
/Available worker applications/ {
    enable_parsing = 1
    # set record separator to markdown sub-header
    RS = "### "
    # set field separator to newline
    FS = "\n"
}

# Each worker record should start with a synapse.app.X headline
enable_parsing && /synapse.app/ {
    # get rid of the backticks, then store the worker type
    gsub("`", "", $1)
    worker_type = $1

    # loop through the lines (1 - number of fields in record)
    for (i = 1; i <= NF; i++) {
        # and parse those looking like URLs
        if ($i ~ /^ +[\^/].*\//) {
            gsub(/^ +/, "", $i)
            api_endpoint_regex = $i
            print worker_type ": '" api_endpoint_regex "'"
        }
    }
}

# vim: tabstop=4 shiftwidth=4 expandtab autoindent

@maxklenk
Copy link
Contributor Author

Hey @eMPee584 nice to see that you picked it up again. I would like to support, eg. your nginx configuration still contains the old workers/routes.

@eMPee584
Copy link
Contributor

eMPee584 commented Sep 10, 2020

You mean, [it still contains the old routes judging from the parts I have committed as yet..]
i usually try to have proper, atomic & tested commits.. so sometimes, things lie around in my work tree for a while before they are being committed even though practically ready..
Ah well, redundant implementation may be effort duplication in some cases, but also provides the benefit of validation. So keep going if you like, even though I might already have done a part of a task and it just remains to be committed and pushed.. 😉

@maxklenk maxklenk changed the title [WIP] Feature: add synapse worker support Feature: add synapse worker support Sep 11, 2020
@PC-Admin
Copy link
Contributor

PC-Admin commented Sep 19, 2020

Big thanks to @eMPee584 and @maxklenk for trying to get this working, it's a very important addition to this deploy script.

@ptman
Copy link
Contributor

ptman commented Sep 23, 2020

how can we help you guys push this over the finish line?

@maxklenk
Copy link
Contributor Author

I am running this branch in a beta test with about 15000 users right now to see if everything runs smooth under load.
I tried to include the feedback from PR #456 but it would be great to get a review of my changes to finish this.

@eMPee584
Copy link
Contributor

I'll have another go at it today.. 😅

@eMPee584 eMPee584 mentioned this pull request Sep 29, 2020
@PC-Admin
Copy link
Contributor

PC-Admin commented Oct 6, 2020

an idiot-proof guide on how to get the latest workers pull:

$ git clone https://github.com/spantaleev/matrix-docker-ansible-deploy.git
$ cd ./matrix-docker-ansible-deploy/
$ git fetch origin pull/642/head:maxtesting
$ git checkout maxtesting

Then add this to your vars.yml:

matrix_synapse_workers_enabled: true

then just re-run the playbook with setup-all,start

seems to run okay so far, good work max

Edit: This setup caused CPU load to spike badly, probably because 10 workers for like 2 dozen people is overkill. xD Tried reducing the amount of workers like so:

matrix_synapse_workers_enabled: true
matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: generic_worker, port: 18102 }
  - { worker: federation_sender, port: 18401 }

This caused the service file for the postgres container to break like so:

-- A start job for unit matrix-postgres.service has finished successfully.
-- 
-- The job identifier is 78513.
Oct 07 01:29:56 pagrus matrix-postgres[1688]: chmod: /var/run/postgresql: Operation not permitted
Oct 07 01:29:56 pagrus matrix-postgres[1688]: PostgreSQL Database directory appears to contain a database; Skipping initialization
Oct 07 01:29:56 pagrus matrix-postgres[1688]: postgres: invalid argument: "ExecStop=-/usr/bin/env"
Oct 07 01:29:56 pagrus matrix-postgres[1688]: Try "postgres --help" for more information.
Oct 07 01:30:10 pagrus systemd[1]: matrix-postgres.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited

Which of course meant the synapse container wouldn't load. I ended up reverting to the no-worker setup and had to delete the image+container for redis manually. Back online now, I'll have another play tomorrow.

Edit 2: Half these were shut off when i deployed the second time with less workers, so it doesn't seem to be cleaning them up properly:

matrix-redis.service            loaded active running Matrix Redis server
matrix-synapse-worker@frontend_proxy:18701.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18101.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18102.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18103.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18104.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18105.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@generic_worker:18106.service loaded active running Synapse Matrix Worker    
matrix-synapse-worker@media_repository:18501.service loaded active running Synapse Matrix Worker 

It's possible these workers are 'all-or-nothing' so you can set them all up, or remove them all, but you can't pick and choose. I didn't run the same PR with workers disabled so that might have cleaned up all of these in one swoop.

@PC-Admin
Copy link
Contributor

PC-Admin commented Oct 7, 2020

Had a bit more of a play, here is how it breaks the postgres service file:

ExecStart=/usr/bin/env docker run --rm --name matrix-postgres \
                        --log-driver=none \
                        --user=1001:1001 \
                        --cap-drop=ALL \
                        --read-only \
                        --tmpfs=/tmp:rw,noexec,nosuid,size=100m \
                        --tmpfs=/run/postgresql:rw,noexec,nosuid,size=100m \
                        --network=matrix \
                        --env-file=/matrix/postgres/env-postgres-server \
                        -v /matrix/postgres/data:/var/lib/postgresql/data:rw \
                        -v /etc/passwd:/etc/passwd:ro \
                        postgres:12.4-alpine \
                        postgres \

The bottom line and preceding slash is what breaks it, this might just be because of my crude merging though. xD

Also noticed with these variables the dockers start up but clients won't start syncing:

matrix_synapse_workers_enabled: true
matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: generic_worker, port: 18102 }
  - { worker: federation_sender, port: 18401 }

It does a good job cleaning everything up at least when you set workers_enabled to false.

@bobbythefett
Copy link

I tested this PR with my homeserver. It did not work with the defaults.

  • My server went from a load average of 2 to 20+ (maybe too much workers for a small/medium server by default?)
  • Nginx configuration did not work with my own nginx. The lines added for the workers were not working, even after the load reduced.

@PC-Admin
Copy link
Contributor

Have been running 3 workers for a few days now with no real issues:

# Synapse/Redit Workers
matrix_synapse_workers_enabled: true
matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: generic_worker, port: 18102 }
  - { worker: federation_sender, port: 18401 }

Just ran with these settings, then fixed the broken postgres service file. CPU time seems to be climbing on the synapse processes but the service is very responsive now. :) Thanks again.

@PC-Admin
Copy link
Contributor

PC-Admin commented Oct 11, 2020

One error i think might be related is 2/3rds of the time uploads fail to complete with error message: "The file 'XXX' failed to upload." in Element.

I believe this is the synapse workers failing to handle the request, and it only completes when the main synapse thread handles the request. I wasn't able to find a relevant error message in their logs though. :(

@PC-Admin
Copy link
Contributor

generic_workers_overloaded

The generic workers seem to be causing runaway CPU load, i wasn't able to locate the cause. Reverting to 1 federation sender for now.

@PC-Admin
Copy link
Contributor

PC-Admin commented Oct 12, 2020

I tried to revert to 1 worker like so yet i still see the 2 generic workers in htop:

# Synapse/Redit Workers
matrix_synapse_workers_enabled: false
matrix_synapse_workers_enabled_list:
  - { worker: generic_worker, port: 18101 }
  - { worker: generic_worker, port: 18102 }
  - { worker: federation_sender, port: 18401 }

~THEN~

# Synapse/Redit Workers
matrix_synapse_workers_enabled: true
matrix_synapse_workers_enabled_list:
  - { worker: federation_sender, port: 18401 }

Interestingly enough the port mapping is correct, so these 2 generic workers shouldn't be able to access the web:

CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS              PORTS                                                                                                                                  NAMES
76d2e06cf4a5        matrixdotorg/synapse:v1.19.1    "python -m synapse.a…"   3 minutes ago       Up 3 minutes        8008-8009/tcp, 8448/tcp, 0.0.0.0:18401->18401/tcp, 0.0.0.0:9000->9100/tcp                                                              matrix-synapse

We also see that the services are still available:

# systemctl list-units --type=service --state=running
matrix-synapse-worker@federation_sender:18401.service loaded active running Synapse Matrix Worker                       
matrix-synapse-worker@generic_worker:18101.service    loaded active running Synapse Matrix Worker                       
matrix-synapse-worker@generic_worker:18102.service    loaded active running Synapse Matrix Worker                       
matrix-synapse-worker@user_dir:18601.service          loaded active running Synapse Matrix Worker

I ended up manually removing these services.

Edit: In retrospect i wouldn't be surprised if the 2 generic workers were climbing in load constantly because they weren't port mapped before either! derp :) Still they don't get cleaned up properly when you reduce the amount of workers in this list.

@localguru
Copy link

@PC-Admin

One error i think might be related is 2/3rds of the time uploads fail to complete with error message: "The file 'XXX' failed to upload." in Element.

Have you tried to add a media_repository worker to split the upload from generic workers?

@eMPee584
Copy link
Contributor

I rebased my branch on this PR and added proper (hopefully) cleansing of left-over workers, please try if the problem persists with the recent commits on #456 ..

@maxklenk
Copy link
Contributor Author

maxklenk commented Nov 9, 2020

I'll close this PR in favour to finish the work in #456

@maxklenk maxklenk closed this Nov 9, 2020
@maxklenk maxklenk deleted the feature/add-worker-support branch January 5, 2021 09:21
@spantaleev spantaleev mentioned this pull request Feb 26, 2021
@PC-Admin PC-Admin mentioned this pull request Mar 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants