Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Move catchup of replication streams to worker. #7024

Merged
merged 17 commits into from
Mar 25, 2020

Conversation

erikjohnston
Copy link
Member

@erikjohnston erikjohnston commented Mar 3, 2020

This changes the replication protocol so that the server does not send down RDATA for rows that happened before the client connected. Instead, the server will send a POSITION and clients then query the database (or master out of band) to get up to date.

Based on #7010 and #7011

@erikjohnston
Copy link
Member Author

(Test failures are due to sytest being broken w.r.t. CAS login)

@erikjohnston erikjohnston requested a review from a team March 20, 2020 13:05
@erikjohnston erikjohnston force-pushed the erikj/catchup_on_worker branch 4 times, most recently from 0de854f to 58e3642 Compare March 20, 2020 14:57
@erikjohnston erikjohnston removed the request for review from a team March 20, 2020 15:02
@erikjohnston erikjohnston force-pushed the erikj/catchup_on_worker branch 2 times, most recently from 9781b56 to 545a299 Compare March 20, 2020 15:30
The catchup will in future happen on workers, so master process won't
need to protect itself by dropping the connection.
@erikjohnston erikjohnston requested a review from a team March 20, 2020 15:38
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is too big for me to keep in my head, so I can't really check it for correctness; I'll have to take your word for it.

It seems generally sane, but a few thoughts below.

@@ -390,6 +390,9 @@ def process_replication_rows(self, token, rows):
self._room_serials[row.room_id] = token
self._room_typing[row.room_id] = row.user_ids

def get_current_token(self) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets used by the TypingStream.current_token, which now gets called on workers (this is the equivalent of calling a slaved ID gen current_token function).

synapse/replication/http/streams.py Show resolved Hide resolved
synapse/replication/http/streams.py Outdated Show resolved Hide resolved
synapse/replication/http/streams.py Outdated Show resolved Hide resolved
synapse/replication/tcp/commands.py Show resolved Hide resolved
synapse/replication/tcp/protocol.py Show resolved Hide resolved
synapse/replication/tcp/streams/_base.py Outdated Show resolved Hide resolved
synapse/replication/tcp/streams/_base.py Outdated Show resolved Hide resolved
if from_token == upto_token:
return [], upto_token, False

if self._is_worker and self._QUERY_MASTER:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't help but feel that this would be much easier to follow if it were implemented in the individual Streams:

if hs.config.worker_app is None:
    self.update_function = handler.get_updates
else:
    self.update_function = <something to do with ReplicationGetStreamUpdates>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turns out to be surprisingly faffy, given that what the API call returns is not quite the same as what the update_function returns (as can be seen from the differences in the two branches below the commented line). I guess we could make them be identical (i.e. make the API hit only return the rows), but I'm not sure that really makes things clearer. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the spec for update_function be that it return the same (updates, new_last_token, limited) tuple.

Have a helper function:

def db_query_to_update_function(query_function):
    """Wraps a db query function which returns a list of rows to make it suitable for use as an `update_function` for the Stream class"""
    async def update_function(from_token, upto_token, limit):
            rows = await query_function(from_token, upto_token, limit=limit)
            updates = [(row[0], row[1:]) for row in rows]
            limited = False
            if len(updates) == limit:
                upto_token = rows[-1][0]
                limited = True

            return updates, upto_token, limited
    return update_function

Do something similar with the ReplicationClient.

Use that to wrap the db query function in the Stream class:

        self.update_function = db_query_to_update_function(store.get_all_new_backfill_event_rows)

Suggest sorting out the type annotations to help with that.

I think it will be clearer than adding more and more fields to the Stream class to control how it is supposed to work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let me give that a go. (Annoyingly mypy refuses to accept setting functions like we do in stream, hence the type: ignore going on, which is apain)

docs/tcp_replication.md Show resolved Hide resolved
@erikjohnston erikjohnston requested review from richvdh and removed request for richvdh March 24, 2020 16:57
@erikjohnston
Copy link
Member Author

Oh, I messed that up.

@@ -174,7 +170,7 @@ class BackfillStream(Stream):
def __init__(self, hs):
store = hs.get_datastore()
self.current_token = store.get_current_backfill_token # type: ignore
self.update_function = store.get_all_new_backfill_event_rows # type: ignore
self.update_function = db_query_to_update_function(store.get_all_new_backfill_event_rows) # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

presumably if update_function wasn't defined as a method in the base class, the type:ignore wouldn't be necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, then it complains that Stream doesn't have a update_function, so you add one as a property and then it complains when you try and declare async def update_function(..) in sub classes.

@@ -151,6 +127,26 @@ def update_function(self, from_token, current_token, limit):
raise NotImplementedError()


def db_query_to_update_function(
query_function: Callable[[int, int, int], Awaitable[List[tuple]]]
) -> Callable[[int, int, int], Awaitable[Tuple[List[Tuple[int, tuple]], int, bool]]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like we should make some type definitions somewhere...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeeeeeeeeeeeeeees.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've added some type aliases, although it didn't really simplify things very much in the end :/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well I was really thinking we should have something like:

UpdateFunctionResult = Tuple[List[StreamRow], Token, bool]
UpdateFunction = Callable[[Token, Token, int], Awaitable[UpdateFunctionResult]]

but shrugface

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@erikjohnston
Copy link
Member Author

Thanking you! ❤️

@erikjohnston erikjohnston merged commit 4cff617 into develop Mar 25, 2020
@erikjohnston erikjohnston deleted the erikj/catchup_on_worker branch March 25, 2020 14:54
erikjohnston added a commit that referenced this pull request Apr 1, 2020
This broke in a recent PR (#7024) and is no longer useful due to all
replication clients implicitly subscribing to all streams, so let's
just remove it.
erikjohnston added a commit that referenced this pull request Apr 1, 2020
This broke in a recent PR (#7024) and is no longer useful due to all
replication clients implicitly subscribing to all streams, so let's
just remove it.
richvdh added a commit that referenced this pull request Apr 16, 2020
Some of the query functions return generators rather than lists, so we can't
index into the result. Happily we already have a copy of the results.

(think this was introduced in #7024)
richvdh added a commit that referenced this pull request Apr 16, 2020
Some of the query functions return generators rather than lists, so we can't
index into the result. Happily we already have a copy of the results.

(think this was introduced in #7024)
richvdh added a commit that referenced this pull request Apr 20, 2020
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.

This code was introduced in #7024, and I hope this fixes #7206.
clokep added a commit that referenced this pull request May 19, 2020
Synapse 1.13.0 (2020-05-19)
===========================

This release brings some potential changes necessary for certain
configurations of Synapse:

* If your Synapse is configured to use SSO and have a custom
  `sso_redirect_confirm_template_dir` configuration option set, you will need
  to duplicate the new `sso_auth_confirm.html`, `sso_auth_success.html` and
  `sso_account_deactivated.html` templates into that directory.
* Synapse plugins using the `complete_sso_login` method of
  `synapse.module_api.ModuleApi` should instead switch to the async/await
  version, `complete_sso_login_async`, which includes additional checks. The
  former version is now deprecated.
* A bug was introduced in Synapse 1.4.0 which could cause the room directory
  to be incomplete or empty if Synapse was upgraded directly from v1.2.1 or
  earlier, to versions between v1.4.0 and v1.12.x.

Please review [UPGRADE.rst](https://github.com/matrix-org/synapse/blob/master/UPGRADE.rst)
for more details on these changes and for general upgrade guidance.

Notice of change to the default `git` branch for Synapse
--------------------------------------------------------

With the release of Synapse 1.13.0, the default `git` branch for Synapse has
changed to `develop`, which is the development tip. This is more consistent with
common practice and modern `git` usage.

The `master` branch, which tracks the latest release, is still available. It is
recommended that developers and distributors who have scripts which run builds
using the default branch of Synapse should therefore consider pinning their
scripts to `master`.

Features
--------

- Extend the `web_client_location` option to accept an absolute URL to use as a redirect. Adds a warning when running the web client on the same hostname as homeserver. Contributed by Martin Milata. ([\#7006](#7006))
- Set `Referrer-Policy` header to `no-referrer` on media downloads. ([\#7009](#7009))
- Add support for running replication over Redis when using workers. ([\#7040](#7040), [\#7325](#7325), [\#7352](#7352), [\#7401](#7401), [\#7427](#7427), [\#7439](#7439), [\#7446](#7446), [\#7450](#7450), [\#7454](#7454))
- Admin API `POST /_synapse/admin/v1/join/<roomIdOrAlias>` to join users to a room like `auto_join_rooms` for creation of users. ([\#7051](#7051))
- Add options to prevent users from changing their profile or associated 3PIDs. ([\#7096](#7096))
- Support SSO in the user interactive authentication workflow. ([\#7102](#7102), [\#7186](#7186), [\#7279](#7279), [\#7343](#7343))
- Allow server admins to define and enforce a password policy ([MSC2000](matrix-org/matrix-spec-proposals#2000)). ([\#7118](#7118))
- Improve the support for SSO authentication on the login fallback page. ([\#7152](#7152), [\#7235](#7235))
- Always whitelist the login fallback in the SSO configuration if `public_baseurl` is set. ([\#7153](#7153))
- Admin users are no longer required to be in a room to create an alias for it. ([\#7191](#7191))
- Require admin privileges to enable room encryption by default. This does not affect existing rooms. ([\#7230](#7230))
- Add a config option for specifying the value of the Accept-Language HTTP header when generating URL previews. ([\#7265](#7265))
- Allow `/requestToken` endpoints to hide the existence (or lack thereof) of 3PID associations on the homeserver. ([\#7315](#7315))
- Add a configuration setting to tweak the threshold for dummy events. ([\#7422](#7422))

Bugfixes
--------

- Don't attempt to use an invalid sqlite config if no database configuration is provided. Contributed by @nekatak. ([\#6573](#6573))
- Fix single-sign on with CAS systems: pass the same service URL when requesting the CAS ticket and when calling the `proxyValidate` URL. Contributed by @Naugrimm. ([\#6634](#6634))
- Fix missing field `default` when fetching user-defined push rules. ([\#6639](#6639))
- Improve error responses when accessing remote public room lists. ([\#6899](#6899), [\#7368](#7368))
- Transfer alias mappings on room upgrade. ([\#6946](#6946))
- Ensure that a user interactive authentication session is tied to a single request. ([\#7068](#7068), [\#7455](#7455))
- Fix a bug in the federation API which could cause occasional "Failed to get PDU" errors. ([\#7089](#7089))
- Return the proper error (`M_BAD_ALIAS`) when a non-existant canonical alias is provided. ([\#7109](#7109))
- Fix a bug which meant that groups updates were not correctly replicated between workers. ([\#7117](#7117))
- Fix starting workers when federation sending not split out. ([\#7133](#7133))
- Ensure `is_verified` is a boolean in responses to `GET /_matrix/client/r0/room_keys/keys`. Also warn the user if they forgot the `version` query param. ([\#7150](#7150))
- Fix error page being shown when a custom SAML handler attempted to redirect when processing an auth response. ([\#7151](#7151))
- Avoid importing `sqlite3` when using the postgres backend. Contributed by David Vo. ([\#7155](#7155))
- Fix excessive CPU usage by `prune_old_outbound_device_pokes` job. ([\#7159](#7159))
- Fix a bug which could cause outbound federation traffic to stop working if a client uploaded an incorrect e2e device signature. ([\#7177](#7177))
- Fix a bug which could cause incorrect 'cyclic dependency' error. ([\#7178](#7178))
- Fix a bug that could cause a user to be invited to a server notices (aka System Alerts) room without any notice being sent. ([\#7199](#7199))
- Fix some worker-mode replication handling not being correctly recorded in CPU usage stats. ([\#7203](#7203))
- Do not allow a deactivated user to login via SSO. ([\#7240](#7240), [\#7259](#7259))
- Fix --help command-line argument. ([\#7249](#7249))
- Fix room publish permissions not being checked on room creation. ([\#7260](#7260))
- Reject unknown session IDs during user interactive authentication instead of silently creating a new session. ([\#7268](#7268))
- Fix a SQL query introduced in Synapse 1.12.0 which could cause large amounts of logging to the postgres slow-query log. ([\#7274](#7274))
- Persist user interactive authentication sessions across workers and Synapse restarts. ([\#7302](#7302))
- Fixed backwards compatibility logic of the first value of `trusted_third_party_id_servers` being used for `account_threepid_delegates.email`, which occurs when the former, deprecated option is set and the latter is not. ([\#7316](#7316))
- Fix a bug where event updates might not be sent over replication to worker processes after the stream falls behind. ([\#7337](#7337), [\#7358](#7358))
- Fix bad error handling that would cause Synapse to crash if it's provided with a YAML configuration file that's either empty or doesn't parse into a key-value map. ([\#7341](#7341))
- Fix incorrect metrics reporting for `renew_attestations` background task. ([\#7344](#7344))
- Prevent non-federating rooms from appearing in responses to federated `POST /publicRoom` requests when a filter was included. ([\#7367](#7367))
- Fix a bug which would cause the room durectory to be incorrectly populated if Synapse was upgraded directly from v1.2.1 or earlier to v1.4.0 or later. Note that this fix does not apply retrospectively; see the [upgrade notes](UPGRADE.rst#upgrading-to-v1130) for more information. ([\#7387](#7387))
- Fix bug in `EventContext.deserialize`. ([\#7393](#7393))
- Fix a long-standing bug which could cause messages not to be sent over federation, when state events with state keys matching user IDs (such as custom user statuses) were received. ([\#7376](#7376))
- Restore compatibility with non-compliant clients during the user interactive authentication process, fixing a problem introduced in v1.13.0rc1. ([\#7483](#7483))
- Hash passwords as early as possible during registration. ([\#7523](#7523))

Improved Documentation
----------------------

- Update Debian installation instructions to recommend installing the `virtualenv` package instead of `python3-virtualenv`. ([\#6892](#6892))
- Improve the documentation for database configuration. ([\#6988](#6988))
- Improve the documentation of application service configuration files. ([\#7091](#7091))
- Update pre-built package name for FreeBSD. ([\#7107](#7107))
- Update postgres docs with login troubleshooting information. ([\#7119](#7119))
- Clean up INSTALL.md a bit. ([\#7141](#7141))
- Add documentation for running a local CAS server for testing. ([\#7147](#7147))
- Improve README.md by being explicit about public IP recommendation for TURN relaying. ([\#7167](#7167))
- Fix a small typo in the `metrics_flags` config option. ([\#7171](#7171))
- Update the contributed documentation on managing synapse workers with systemd, and bring it into the core distribution. ([\#7234](#7234))
- Add documentation to the `password_providers` config option. Add known password provider implementations to docs. ([\#7238](#7238), [\#7248](#7248))
- Modify suggested nginx reverse proxy configuration to match Synapse's default file upload size. Contributed by @ProCycleDev. ([\#7251](#7251))
- Documentation of media_storage_providers options updated to avoid misunderstandings. Contributed by Tristan Lins. ([\#7272](#7272))
- Add documentation on monitoring workers with Prometheus. ([\#7357](#7357))
- Clarify endpoint usage in the users admin api documentation. ([\#7361](#7361))

Deprecations and Removals
-------------------------

- Remove nonfunctional `captcha_bypass_secret` option from `homeserver.yaml`. ([\#7137](#7137))

Internal Changes
----------------

- Add benchmarks for LruCache. ([\#6446](#6446))
- Return total number of users and profile attributes in admin users endpoint. Contributed by Awesome Technologies Innovationslabor GmbH. ([\#6881](#6881))
- Change device list streams to have one row per ID. ([\#7010](#7010))
- Remove concept of a non-limited stream. ([\#7011](#7011))
- Move catchup of replication streams logic to worker. ([\#7024](#7024), [\#7195](#7195), [\#7226](#7226), [\#7239](#7239), [\#7286](#7286), [\#7290](#7290), [\#7318](#7318), [\#7326](#7326), [\#7378](#7378), [\#7421](#7421))
- Convert some of synapse.rest.media to async/await. ([\#7110](#7110), [\#7184](#7184), [\#7241](#7241))
- De-duplicate / remove unused REST code for login and auth. ([\#7115](#7115))
- Convert `*StreamRow` classes to inner classes. ([\#7116](#7116))
- Clean up some LoggingContext code. ([\#7120](#7120), [\#7181](#7181), [\#7183](#7183), [\#7408](#7408), [\#7426](#7426))
- Add explicit `instance_id` for USER_SYNC commands and remove implicit `conn_id` usage. ([\#7128](#7128))
- Refactored the CAS authentication logic to a separate class. ([\#7136](#7136))
- Run replication streamers on workers. ([\#7146](#7146))
- Add tests for outbound device pokes. ([\#7157](#7157))
- Fix device list update stream ids going backward. ([\#7158](#7158))
- Use `stream.current_token()` and remove `stream_positions()`. ([\#7172](#7172))
- Move client command handling out of TCP protocol. ([\#7185](#7185))
- Move server command handling out of TCP protocol. ([\#7187](#7187))
- Fix consistency of HTTP status codes reported in log lines. ([\#7188](#7188))
- Only run one background database update at a time. ([\#7190](#7190))
- Remove sent outbound device list pokes from the database. ([\#7192](#7192))
- Add a background database update job to clear out duplicate `device_lists_outbound_pokes`. ([\#7193](#7193))
- Remove some extraneous debugging log lines. ([\#7207](#7207))
- Add explicit Python build tooling as dependencies for the snapcraft build. ([\#7213](#7213))
- Add typing information to federation server code. ([\#7219](#7219))
- Extend room admin api (`GET /_synapse/admin/v1/rooms`) with additional attributes. ([\#7225](#7225))
- Unblacklist '/upgrade creates a new room' sytest for workers. ([\#7228](#7228))
- Remove redundant checks on `daemonize` from synctl. ([\#7233](#7233))
- Upgrade jQuery to v3.4.1 on fallback login/registration pages. ([\#7236](#7236))
- Change log line that told user to implement onLogin/onRegister fallback js functions to a warning, instead of an info, so it's more visible. ([\#7237](#7237))
- Correct the parameters of a test fixture. Contributed by Isaiah Singletary. ([\#7243](#7243))
- Convert auth handler to async/await. ([\#7261](#7261))
- Add some unit tests for replication. ([\#7278](#7278))
- Improve typing annotations in `synapse.replication.tcp.streams.Stream`. ([\#7291](#7291))
- Reduce log verbosity of url cache cleanup tasks. ([\#7295](#7295))
- Fix sample SAML Service Provider configuration. Contributed by @frcl. ([\#7300](#7300))
- Fix StreamChangeCache to work with multiple entities changing on the same stream id. ([\#7303](#7303))
- Fix an incorrect import in IdentityHandler. ([\#7319](#7319))
- Reduce logging verbosity for successful federation requests. ([\#7321](#7321))
- Convert some federation handler code to async/await. ([\#7338](#7338))
- Fix collation for postgres for unit tests. ([\#7359](#7359))
- Convert RegistrationWorkerStore.is_server_admin and dependent code to async/await. ([\#7363](#7363))
- Add an `instance_name` to `RDATA` and `POSITION` replication commands. ([\#7364](#7364))
- Thread through instance name to replication client. ([\#7369](#7369))
- Convert synapse.server_notices to async/await. ([\#7394](#7394))
- Convert synapse.notifier to async/await. ([\#7395](#7395))
- Fix issues with the Python package manifest. ([\#7404](#7404))
- Prevent methods in `synapse.handlers.auth` from polling the homeserver config every request. ([\#7420](#7420))
- Speed up fetching device lists changes when handling `/sync` requests. ([\#7423](#7423))
- Run group attestation renewal in series rather than parallel for performance. ([\#7442](#7442))
- Fix linting errors in new version of Flake8. ([\#7470](#7470))
- Update the version of dh-virtualenv we use to build debs, and add focal to the list of target distributions. ([\#7526](#7526))
phil-flex pushed a commit to phil-flex/synapse that referenced this pull request Jun 16, 2020
This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
phil-flex pushed a commit to phil-flex/synapse that referenced this pull request Jun 16, 2020
This broke in a recent PR (matrix-org#7024) and is no longer useful due to all
replication clients implicitly subscribing to all streams, so let's
just remove it.
phil-flex pushed a commit to phil-flex/synapse that referenced this pull request Jun 16, 2020
Some of the query functions return generators rather than lists, so we can't
index into the result. Happily we already have a copy of the results.

(think this was introduced in matrix-org#7024)
phil-flex pushed a commit to phil-flex/synapse that referenced this pull request Jun 16, 2020
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.

This code was introduced in matrix-org#7024, and I hope this fixes matrix-org#7206.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants