Fix MultiWriteIdGenerator's handling of restarts. #8374
I suspect this would be helpful to shrink the size a bit, but isn't required.
Done: #8383
On startup `MultiWriterIdGenerator` fetches the maximum stream ID for each instance from the table and uses that as its initial "current position" for each writer. This is problematic as a) it involves either a scan of the events table or adding a new index (neither of which is ideal), and b) if rows are being persisted out of order elsewhere while the process restarts, then using the maximum stream ID is not correct. This could theoretically lead to race conditions where, e.g., events that are persisted out of order are not sent down sync streams.
We fix this by creating a new table that tracks the current position of each writer to the stream, and updating it each time we finish persisting a new entry. This is a relatively small overhead when persisting events. For the cache invalidation stream, however, it is a much bigger relative overhead, so instead we note that for invalidation we don't actually care about reliability over restarts (as there are no caches to invalidate) and simply don't bother reading from or writing to the new table in that particular case.
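As a rough illustration of the approach described above, the following sketch uses an in-memory SQLite database. The table name `stream_positions`, its columns, the helper names and the `track_positions` flag are assumptions made for this example only, not the schema or API added by the PR.

```python
import sqlite3

# Hypothetical schema: one row per (stream, writer instance).
SCHEMA = """
CREATE TABLE stream_positions (
    stream_name TEXT NOT NULL,
    instance_name TEXT NOT NULL,
    stream_id BIGINT NOT NULL,
    UNIQUE (stream_name, instance_name)
)
"""


def record_position(conn, stream_name, instance_name, stream_id):
    """Record that `instance_name` has finished persisting up to `stream_id`."""
    # The UNIQUE constraint makes this replace any existing row for the writer.
    conn.execute(
        "INSERT OR REPLACE INTO stream_positions"
        " (stream_name, instance_name, stream_id) VALUES (?, ?, ?)",
        (stream_name, instance_name, stream_id),
    )


def load_positions(conn, stream_name, track_positions=True):
    """Return {instance_name: stream_id} to use as the initial positions.

    Streams that do not need reliability over restarts (such as cache
    invalidation) pass track_positions=False and never touch the table.
    """
    if not track_positions:
        return {}
    rows = conn.execute(
        "SELECT instance_name, stream_id FROM stream_positions"
        " WHERE stream_name = ?",
        (stream_name,),
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_position(conn, "events", "writer_a", 15)
record_position(conn, "events", "writer_b", 17)
record_position(conn, "events", "writer_a", 16)  # replaces writer_a's row
positions = load_positions(conn, "events")
cache_positions = load_positions(conn, "caches", track_positions=False)
```

Reading the positions back is a lookup of one small row per writer, rather than a scan of the events table.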
I think it looks pretty reasonable, but I'm not sure I understand the impact of the different failure modes.
txn.call_after(
    run_as_background_process,
    "MultiWriterIdGenerator._update_table",
    self._db.runInteraction,
    "MultiWriterIdGenerator._update_table",
    self._update_stream_positions_table_txn,
)
Are there any issues with this being a different transaction? (What if this transaction fails?) I suspect it is OK since we always update to the largest position anyway.
We want to do it after we've marked the associated ID as "persisted", which happens after the transaction has already finished.
A follow-up question is whether we can mark the ID as persisted before we finish the transaction. The answer to that is probably yes, if we faff around a bit.
One thing to note is that the only stream that actually uses `get_next_txn` is the cache invalidation stream, for which we explicitly don't record the stream positions. The only reason I've added updating the stream positions table here is for consistency and to avoid future foot guns.
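The ordering discussed here (persist inside the transaction, then mark the ID as persisted and update the positions table once it has finished) can be sketched with a toy transaction wrapper. `FakeTransaction` and `run_interaction` are hypothetical stand-ins written for this example; they are not Synapse's database classes.

```python
class FakeTransaction:
    """Hypothetical stand-in for a database transaction wrapper."""

    def __init__(self):
        self.after_callbacks = []

    def call_after(self, callback, *args):
        # Queue work that must only run once the transaction has committed.
        self.after_callbacks.append((callback, args))


def run_interaction(func, log):
    """Run func(txn), record the commit, then fire queued callbacks in order."""
    txn = FakeTransaction()
    result = func(txn)
    log.append("commit")
    for callback, args in txn.after_callbacks:
        callback(*args)
    return result


log = []


def persist_row(txn):
    log.append("insert row")
    txn.call_after(log.append, "mark id as persisted")
    txn.call_after(log.append, "update stream positions (separate transaction)")
    return 42


result = run_interaction(persist_row, log)
```

In this sketch the positions update is queued behind the commit, so a failure there cannot roll back the row that was just persisted; it only leaves the stored position stale.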
else:
    self._current_positions = {}
It looks like `_current_positions` is initialised to an empty dict. We probably don't need this.
Yeah, it's a bit of a tricky thing to reason about. Would a call to walk through it help? Or do you want a second opinion on this?
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
self._current_positions = {
    instance: stream_id * self._return_factor
    for instance, stream_id in cur
    if instance in self._writers
By including only the known writers here, could there be a data-loss situation when removing a writer, if that writer is at the minimum position?
If you have writers:
- A at pos 15
- B at pos 17
- C at pos 12
And then remove "C", `_current_positions` will be {A: 15, B: 17} and `min_stream_id` will be 15. I'm not sure if this is OK or not?
It seems you also don't want to include writers that you no longer care about, or else you can end up in a situation where an old writer always says it is far behind, which I'm guessing is what this if-statement guards against?
As per discussion elsewhere: this is OK because if `C` has been removed from the list it means it's been turned off, and so we know it's up to date.
It's worth noting that these positions aren't really "where `C` has gotten up to in the rooms", but are more "where `C` is currently persisting events". I.e. we're not worried that if we remove `C` from the deployment then `A` or `B` will start persisting at position 12, as either `C` finished persisting a row at 12 or the request gets retried and `A` or `B` will persist it with a new position.
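The A/B/C scenario from this thread can be reproduced in a few lines. This is a simplified sketch: `initial_positions` is a hypothetical helper that mirrors the dict comprehension in the diff but omits the `_return_factor` multiplication.

```python
def initial_positions(rows, writers):
    """Keep only the positions of instances still configured as writers."""
    return {
        instance: stream_id
        for instance, stream_id in rows
        if instance in writers
    }


# Rows as they might be read back from the positions table.
rows = [("A", 15), ("B", 17), ("C", 12)]

with_c = initial_positions(rows, {"A", "B", "C"})
without_c = initial_positions(rows, {"A", "B"})

# The minimum position jumps from 12 to 15 once C is no longer a writer.
min_with_c = min(with_c.values())
min_without_c = min(without_c.values())
```

The jump to 15 is the behaviour the reply argues is safe: a removed writer has been shut down, so nothing is still in flight at position 12.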
# We also check if any of the later rows are from this instance, in
# which case we use that for this instance's current position. This
# is to handle the case where we didn't finish persisting to the
# stream positions table before restart (or the stream position
# table otherwise got out of date).
Isn't this true of any instance in `_current_positions`? I'm having trouble following why this is only true of the current instance.
We know that if the current instance has just been restarted then we don't have any rows that are currently being persisted, so it's safe to set the current position of itself to the max stream ID. Other instances may not have been restarted so may still be persisting things.
(We don't just set the current position of the instance to the max stream ID as in future we want every entry in `current_positions` to have a matching instance in the DB, i.e. if we have `{A: 5, B: 6}` then we want row 5 in the DB to have an instance of `A` and row 6 to have an instance of `B`. This will allow us to serialise the current position to `(5, 6)`, as we can then just look the rows up in the DB to get back to `{A: 5, B: 6}`.)
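A minimal sketch of the two ideas in this reply, assuming in-memory data in place of database queries: `recover_positions` and `positions_from_ids` are hypothetical helper names, and `later_rows` stands in for whatever query returns the rows written after the saved positions.

```python
def recover_positions(saved, later_rows, this_instance):
    """Advance only this instance's position using rows it wrote itself.

    saved: {instance: stream_id} as read from the positions table.
    later_rows: iterable of (stream_id, instance) for newer rows.
    """
    positions = dict(saved)
    for stream_id, instance in later_rows:
        # Other instances may still be persisting earlier rows, so their
        # positions are left exactly as they were saved.
        if instance == this_instance:
            current = positions.get(instance, stream_id)
            positions[instance] = max(current, stream_id)
    return positions


def positions_from_ids(stream_ids, row_instances):
    """Rebuild {instance: stream_id} from bare IDs via each row's instance."""
    return {row_instances[stream_id]: stream_id for stream_id in stream_ids}


saved = {"A": 5, "B": 6}
later_rows = [(7, "A"), (8, "B"), (9, "A")]
recovered = recover_positions(saved, later_rows, this_instance="A")

# Every position points at a row written by that instance, so the
# positions can be serialised as bare IDs and looked up again later.
row_instances = {5: "A", 6: "B"}
round_trip = positions_from_ids((5, 6), row_instances)
```

Because A's recovered position (9) is a row A itself wrote, the same lookup would still work after the restart.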
if not self._writers:
    return
Doesn't really matter, but this is checked in the callers too. I'm assuming this was just double checking?
Yes, just in case a caller forgets to do the check.
I took another look (and went back to more of the code this calls into) and I think I have a decent handle on it. I asked a couple more questions -- if they're easier to answer over a call that sounds fine with me!
Synapse 1.21.0rc1 (2020-10-01)
==============================

Features
--------

- Require the user to confirm that their password should be reset after clicking the email confirmation link. ([\#8004](#8004))
- Add an admin API `GET /_synapse/admin/v1/event_reports` to read entries of table `event_reports`. Contributed by @dklimpel. ([\#8217](#8217))
- Consolidate the SSO error template across all configuration. ([\#8248](#8248), [\#8405](#8405))
- Add a configuration option to specify a whitelist of domains that a user can be redirected to after validating their email or phone number. ([\#8275](#8275), [\#8417](#8417))
- Add experimental support for sharding event persister. ([\#8294](#8294), [\#8387](#8387), [\#8396](#8396), [\#8419](#8419))
- Add the room topic and avatar to the room details admin API. ([\#8305](#8305))
- Add an admin API for querying rooms where a user is a member. Contributed by @dklimpel. ([\#8306](#8306))
- Add `uk.half-shot.msc2778.login.application_service` login type to allow appservices to login. ([\#8320](#8320))
- Add a configuration option that allows existing users to log in with OpenID Connect. Contributed by @BBBSnowball and @OmmyZhang. ([\#8345](#8345))
- Add prometheus metrics for replication requests. ([\#8406](#8406))
- Support passing additional single sign-on parameters to the client. ([\#8413](#8413))
- Add experimental reporting of metrics on expensive rooms for state-resolution. ([\#8420](#8420))
- Add experimental prometheus metric to track numbers of "large" rooms for state resolution. ([\#8425](#8425))
- Add prometheus metrics to track federation delays. ([\#8430](#8430))

Bugfixes
--------

- Fix a bug in the media repository where remote thumbnails with the same size but different crop methods would overwrite each other. Contributed by @deepbluev7. ([\#7124](#7124))
- Fix inconsistent handling of non-existent push rules, and stop tracking the `enabled` state of removed push rules. ([\#7796](#7796))
- Fix a longstanding bug when storing a media file with an empty `upload_name`. ([\#7905](#7905))
- Fix messages not being sent over federation until an event is sent into the same room. ([\#8230](#8230), [\#8247](#8247), [\#8258](#8258), [\#8272](#8272), [\#8322](#8322))
- Fix a longstanding bug where files that could not be thumbnailed would result in an Internal Server Error. ([\#8236](#8236), [\#8435](#8435))
- Upgrade minimum version of `canonicaljson` to version 1.4.0, to fix a unicode encoding issue. ([\#8262](#8262))
- Fix longstanding bug which could lead to incomplete database upgrades on SQLite. ([\#8265](#8265))
- Fix stack overflow when stderr is redirected to the logging system, and the logging system encounters an error. ([\#8268](#8268))
- Fix a bug which caused the logging system to report errors, if `DEBUG` was enabled and no `context` filter was applied. ([\#8278](#8278))
- Fix edge case where push could get delayed for a user until a later event was pushed. ([\#8287](#8287))
- Fix fetching malformed events from remote servers. ([\#8324](#8324))
- Fix `UnboundLocalError` from occurring when appservices send a malformed register request. ([\#8329](#8329))
- Don't send push notifications to expired user accounts. ([\#8353](#8353))
- Fix a regression in v1.19.0 with reactivating users through the admin API. ([\#8362](#8362))
- Fix a bug where during device registration the length of the device name wasn't limited. ([\#8364](#8364))
- Include `guest_access` in the fields that are checked for null bytes when updating `room_stats_state`. Broke in v1.7.2. ([\#8373](#8373))
- Fix theoretical race condition where events are not sent down `/sync` if the synchrotron worker is restarted without restarting other workers. ([\#8374](#8374))
- Fix a bug which could cause errors in rooms with malformed membership events, on servers using sqlite. ([\#8385](#8385))
- Fix "Re-starting finished log context" warning when receiving an event we already had over federation. ([\#8398](#8398))
- Fix incorrect handling of timeouts on outgoing HTTP requests. ([\#8400](#8400))
- Fix a regression in v1.20.0 in the `synapse_port_db` script regarding the `ui_auth_sessions_ips` table. ([\#8410](#8410))
- Remove unnecessary 3PID registration check when resetting password via an email address. Bug introduced in v0.34.0rc2. ([\#8414](#8414))

Improved Documentation
----------------------

- Add `/_synapse/client` to the reverse proxy documentation. ([\#8227](#8227))
- Add note to the reverse proxy settings documentation about disabling Apache's mod_security2. Contributed by Julian Fietkau (@jfietkau). ([\#8375](#8375))
- Improve description of `server_name` config option in `homeserver.yaml`. ([\#8415](#8415))

Deprecations and Removals
-------------------------

- Drop support for `prometheus_client` older than 0.4.0. ([\#8426](#8426))

Internal Changes
----------------

- Fix tests on distros which disable TLSv1.0. Contributed by @danc86. ([\#8208](#8208))
- Simplify the distributor code to avoid unnecessary work. ([\#8216](#8216))
- Remove the `populate_stats_process_rooms_2` background job and restore functionality to `populate_stats_process_rooms`. ([\#8243](#8243))
- Clean up type hints for `PaginationConfig`. ([\#8250](#8250), [\#8282](#8282))
- Track the latest event for every destination and room for catch-up after federation outage. ([\#8256](#8256))
- Fix non-user visible bug in implementation of `MultiWriterIdGenerator.get_current_token_for_writer`. ([\#8257](#8257))
- Switch to the JSON implementation from the standard library. ([\#8259](#8259))
- Add type hints to `synapse.util.async_helpers`. ([\#8260](#8260))
- Simplify tests that mock asynchronous functions. ([\#8261](#8261))
- Add type hints to `StreamToken` and `RoomStreamToken` classes. ([\#8279](#8279))
- Change `StreamToken.room_key` to be a `RoomStreamToken` instance. ([\#8281](#8281))
- Refactor notifier code to correctly use the max event stream position. ([\#8288](#8288))
- Use slotted classes where possible. ([\#8296](#8296))
- Support testing the local Synapse checkout against the [Complement homeserver test suite](https://github.com/matrix-org/complement/). ([\#8317](#8317))
- Update outdated usages of `metaclass` to python 3 syntax. ([\#8326](#8326))
- Move lint-related dependencies to package-extra field, update CONTRIBUTING.md to utilise this. ([\#8330](#8330), [\#8377](#8377))
- Use the `admin_patterns` helper in additional locations. ([\#8331](#8331))
- Fix test logging to allow braces in log output. ([\#8335](#8335))
- Remove `__future__` imports related to Python 2 compatibility. ([\#8337](#8337))
- Simplify `super()` calls to Python 3 syntax. ([\#8344](#8344))
- Fix bad merge from `release-v1.20.0` branch to `develop`. ([\#8354](#8354))
- Factor out a `_send_dummy_event_for_room` method. ([\#8370](#8370))
- Improve logging of state resolution. ([\#8371](#8371))
- Add type annotations to `SimpleHttpClient`. ([\#8372](#8372))
- Refactor ID generators to use `async with` syntax. ([\#8383](#8383))
- Add `EventStreamPosition` type. ([\#8388](#8388))
- Create a mechanism for marking tests "logcontext clean". ([\#8399](#8399))
- A pair of tiny cleanups in the federation request code. ([\#8401](#8401))
- Add checks on startup that PostgreSQL sequences are consistent with their associated tables. ([\#8402](#8402))
- Do not include appservice users when calculating the total MAU for a server. ([\#8404](#8404))
- Typing fixes for `synapse.handlers.federation`. ([\#8422](#8422))
- Various refactors to simplify stream token handling. ([\#8423](#8423))
- Make stream token serializing/deserializing async. ([\#8427](#8427))
Synapse 1.21.0rc2 (2020-10-02)
==============================

Features
--------

- Convert additional templates from inline HTML to Jinja2 templates. ([\#8444](#8444))

Bugfixes
--------

- Fix a regression in v1.21.0rc1 which broke thumbnails of remote media. ([\#8438](#8438))
- Do not expose the experimental `uk.half-shot.msc2778.login.application_service` flow in the login API, which caused a compatibility problem with Element iOS. ([\#8440](#8440))
- Fix malformed log line in new federation "catch up" logic. ([\#8442](#8442))
- Fix DB query on startup for negative streams which caused long start up times. Introduced in [\#8374](#8374). ([\#8447](#8447))
Synapse 1.21.2 (2020-10-15)
===========================

Debian packages and Docker images have been rebuilt using the latest versions of dependency libraries, including authlib 0.15.1. Please see bugfixes below.

Security advisory
-----------------

* HTML pages served via Synapse were vulnerable to cross-site scripting (XSS) attacks. All server administrators are encouraged to upgrade. ([\#8444](matrix-org/synapse#8444)) ([CVE-2020-26891](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-26891))

  This fix was originally included in v1.21.0 but was missing a security advisory.

  This was reported by [Denis Kasak](https://github.com/dkasak).

Bugfixes
--------

- Fix rare bug where sending an event would fail due to a racey assertion. ([\#8530](matrix-org/synapse#8530))
- An updated version of the authlib dependency is included in the Docker and Debian images to fix an issue using OpenID Connect. See [\#8534](matrix-org/synapse#8534) for details.

Synapse 1.21.1 (2020-10-13)
===========================

This release fixes a regression in v1.21.0 that prevented debian packages from being built. It is otherwise identical to v1.21.0.

Synapse 1.21.0 (2020-10-12)
===========================

No significant changes since v1.21.0rc3.

As [noted in v1.20.0](https://github.com/matrix-org/synapse/blob/release-v1.21.0/CHANGES.md#synapse-1200-2020-09-22), a future release will drop support for accessing Synapse's [Admin API](https://github.com/matrix-org/synapse/tree/master/docs/admin_api) under the `/_matrix/client/*` endpoint prefixes. At that point, the Admin API will only be accessible under `/_synapse/admin`.

Synapse 1.21.0rc3 (2020-10-08)
==============================

Bugfixes
--------

- Fix duplication of events on high traffic servers, caused by PostgreSQL `could not serialize access due to concurrent update` errors. ([\#8456](matrix-org/synapse#8456))

Internal Changes
----------------

- Add Groovy Gorilla to the list of distributions we build `.deb`s for. ([\#8475](matrix-org/synapse#8475))
* commit '31acc5c30':
  Escape the error description on the sso_error template. (#8405)
  Fix occasional "Re-starting finished log context" from keyring (#8398)
  Allow existing users to login via OpenID Connect. (#8345)
  Fix schema delta for servers that have not backfilled (#8396)
  Fix MultiWriteIdGenerator's handling of restarts. (#8374)
  s/URLs/variables in changelog
  s/accidentally/incorrectly in changelog
  Update changelog wording
  Add type annotations to SimpleHttpClient (#8372)
  Add new sequences to port DB script (#8387)
  Add EventStreamPosition type (#8388)
  Mark the shadow_banned column as boolean in synapse_port_db. (#8386)
This has the side effect of fixing start up times on develop.
Probably best to review commits separately. If it helps I can move the first two commits to a dedicated PR.