Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix serialization errors when rotating notifications #13118

Merged
merged 14 commits into from
Jun 28, 2022

Conversation

erikjohnston
Copy link
Member

@erikjohnston erikjohnston commented Jun 27, 2022

Fixes #13117

The idea is that we include the last read receipt in event_push_summary when calculating push, and only use the counts if it matches the current read receipt. This allows us to move deleting of push actions out of sending a receipt and into a background task.

@erikjohnston erikjohnston changed the title Handle receipts async Fix serialization errors when rotating notifications Jun 27, 2022
@erikjohnston erikjohnston force-pushed the erikj/receipts_push branch 6 times, most recently from a3c49e8 to 5583dca Compare June 27, 2022 12:49
@erikjohnston erikjohnston marked this pull request as ready for review June 27, 2022 12:50
@erikjohnston erikjohnston requested a review from a team as a code owner June 27, 2022 12:50
@@ -150,7 +175,7 @@ def _mark_read(stream: int, depth: int) -> None:

_assert_counts(1, 0)

_mark_read(7, 7)
_mark_read(6, 6)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a change in behavior then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is just making sure we're inject a read receipt at an actual event (which we have to), rather than at some arbitrary place.

(Side note: we need to rewrite this test to not poke things directly into the DB, and instead create real rooms, events and receipts)

@erikjohnston erikjohnston requested review from clokep and a team and removed request for clokep June 27, 2022 14:26
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't understand wtf is going on here. Happy to hop on a call to talk about it, but also I'd generally like to see much clearer comments and docstrings about what we're doing and why.

Major concern: if we're now only processing read receipts every 30 seconds, does that mean it takes 30 seconds for the unread indicator to disappear?

@@ -800,6 +811,18 @@ async def _rotate_notifs(self) -> None:
self._doing_notif_rotation = True

try:
# First we handle any new receipts that have happened.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does "handle" mean here? have happened since when?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded

synapse/storage/databases/main/event_push_actions.py Outdated Show resolved Hide resolved
Comment on lines 18 to 19
-- are up to date after a new read receipt has been sent. Null means that we can
-- skip that check.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does null mean that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have expanded

synapse/storage/databases/main/event_push_actions.py Outdated Show resolved Hide resolved
# We always update `event_push_summary_last_receipt_stream_id` to
# ensure that we don't rescan the same receipts for remote users.
#
# This requires repeatable read to be safe.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to explain why it needs repeatable read.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added some words.

synapse/storage/databases/main/event_push_actions.py Outdated Show resolved Hide resolved
Comment on lines 242 to 243
# If `last_receipt_stream_ordering` is null then that means its up to
# date.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does null mean "up to date" ?

synapse/storage/databases/main/event_push_actions.py Outdated Show resolved Hide resolved
@erikjohnston
Copy link
Member Author

Major concern: if we're now only processing read receipts every 30 seconds, does that mean it takes 30 seconds for the unread indicator to disappear?

No, it means that for those 30s the event_push_summary table won't be used to calculate unread counts (which should be fine as there won't be any/many push actions to count after getting a receipt).

get_unread_event_push_actions_by_room_for_user works broadly by: a) querying event_push_summary for any usable counts, and then b) querying event_push_actions for actions that haven't been included in the summary. So in the first step won't return any counts, and so the second step will query event_push_actions for all push actions in the range from the receipt to now.

@richvdh
Copy link
Member

richvdh commented Jun 28, 2022

get_unread_event_push_actions_by_room_for_user works broadly by: a) querying event_push_summary for any usable counts, and then b) querying event_push_actions for actions that haven't been included in the summary. So in the first step won't return any counts, and so the second step will query event_push_actions for all push actions in the range from the receipt to now.

ok, makes sense. it would be great to include that sort of explanation in the comments in the code!

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit clearer now. I'm still not entirely grokking it though. At some point I think it would be good to write down how event_push_summary actually works, but that is probably a different PR.

synapse/storage/databases/main/event_push_actions.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/event_push_actions.py Outdated Show resolved Hide resolved
synapse/storage/databases/main/event_push_actions.py Outdated Show resolved Hide resolved
Comment on lines 241 to 242
# that. In the latter case we simply ignore `event_push_summary` counts
# and do a manual count of rows in `event_push_actions` table below.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I still don't get why this is an ok thing to do. Haven't we deleted rows from event_push_actions, so relying on it (rather than including event_push_summary) will give numbers that are too low?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We delete rows at the same time as we update the event_push_summary. After a user sends a receipt in the room there are two cases:

  1. The receipt has been processed in the _handle_new_receipts_for_notifs_txn background task, and so we have deleted the push actions but updated the event_push_summary.
  2. The receipt has not been processed, so event_push_summary is stale but we haven't yet deleted the push actions so we can correctly calculate the count from event_push_actions.

Maybe:

and do a manual count of rows in event_push_actions table below.
This gives the correct result as we won't have deleted the push actions yet.

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, but suppose:

  • a user has 5 unread messages in a room (say E1...E5)
  • A RR arrives for E2. We process it, update event_push_summary to say there are 3 remaining unread messages, and delete all the entries from event_push_actions.
  • A RR arrives for E4, but before we process it fully, we do this query. We ignore event_push_summary and there is nothing left in event_push_actions. So we return 0 unread messages instead of 1?

Or am I fundamentally misunderstanding how this works?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We delete all entries from event_push_actions from before E2, so there will still be the three push actions for the three unread messages in there. So in the final step we'd still have push actions E3, E4 and E5 in the event_push_actions table

synapse/storage/databases/main/event_push_actions.py Outdated Show resolved Hide resolved
# We always update `event_push_summary_last_receipt_stream_id` to
# ensure that we don't rescan the same receipts for remote users.
#
# This requires repeatable read to be safe, as we need the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'd be quite nice to do an assertion that we have the right isolation level, but that looks a bit fiddly and probably a thing for another time.

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense.

# we haven't yet updated the counts in `event_push_summary` to reflect
# that; in that case we simply ignore `event_push_summary` counts
# and do a manual count of all of the rows in the `event_push_actions` table
# for this user/room.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some edge case where this will go wrong if the previous RR was over 24 hours ago (so we've started removing things from event_push_actions)? Or do we not really care?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we don't care. The previous behaviour is that if you send a read receipt that is over 24 hours old then the updated counts will only include unread messages less than 24 hours old. I think this keeps that same behaviour.

@erikjohnston erikjohnston merged commit 7469824 into develop Jun 28, 2022
@erikjohnston erikjohnston deleted the erikj/receipts_push branch June 28, 2022 12:13
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Aug 23, 2022
Synapse 1.62.0 (2022-07-05)
===========================

No significant changes since 1.62.0rc3.

Authors of spam-checker plugins should consult the [upgrade notes](https://github.com/matrix-org/synapse/blob/release-v1.62/docs/upgrade.md#upgrading-to-v1620) to learn about the enriched signatures for spam checker callbacks, which are supported with this release of Synapse.

Synapse 1.62.0rc3 (2022-07-04)
==============================

Bugfixes
--------

- Update the version of the [ldap3 plugin](https://github.com/matrix-org/matrix-synapse-ldap3/) included in the `matrixdotorg/synapse` DockerHub images and the Debian packages hosted on `packages.matrix.org` to 0.2.1. This fixes [a bug](matrix-org/matrix-synapse-ldap3#163) with usernames containing uppercase characters. ([\matrix-org#13156](matrix-org#13156))
- Fix a bug introduced in Synapse 1.62.0rc1 affecting unread counts for users on small servers. ([\matrix-org#13168](matrix-org#13168))

Synapse 1.62.0rc2 (2022-07-01)
==============================

Bugfixes
--------

- Fix unread counts for users on large servers. Introduced in v1.62.0rc1. ([\matrix-org#13140](matrix-org#13140))
- Fix DB performance when deleting old push notifications. Introduced in v1.62.0rc1. ([\matrix-org#13141](matrix-org#13141))

Synapse 1.62.0rc1 (2022-06-28)
==============================

Features
--------

- Port the spam-checker API callbacks to a new, richer API. This is part of an ongoing change to let spam-checker modules inform users of the reason their event or operation is rejected. ([\matrix-org#12857](matrix-org#12857), [\matrix-org#13047](matrix-org#13047))
- Allow server admins to customise the response of the `/.well-known/matrix/client` endpoint. ([\matrix-org#13035](matrix-org#13035))
- Add metrics measuring the CPU and DB time spent in state resolution. ([\matrix-org#13036](matrix-org#13036))
- Speed up fetching of device list changes in `/sync` and `/keys/changes`. ([\matrix-org#13045](matrix-org#13045), [\matrix-org#13098](matrix-org#13098))
- Improve URL previews for sites which only provide Twitter Card metadata, e.g. LWN.net. ([\matrix-org#13056](matrix-org#13056))

Bugfixes
--------

- Update [MSC3786](matrix-org/matrix-spec-proposals#3786) implementation to check `state_key`. ([\matrix-org#12939](matrix-org#12939))
- Fix a bug introduced in Synapse 1.58 where Synapse would not report full version information when installed from a git checkout. This is a best-effort affair and not guaranteed to be stable. ([\matrix-org#12973](matrix-org#12973))
- Fix a bug introduced in Synapse 1.60 where Synapse would fail to start if the `sqlite3` module was not available. ([\matrix-org#12979](matrix-org#12979))
- Fix a bug where non-standard information was required when requesting the `/hierarchy` API over federation. Introduced
  in Synapse v1.41.0. ([\matrix-org#12991](matrix-org#12991))
- Fix a long-standing bug which meant that rate limiting was not restrictive enough in some cases. ([\matrix-org#13018](matrix-org#13018))
- Fix a bug introduced in Synapse 1.58 where profile requests for a malformed user ID would ccause an internal error. Synapse now returns 400 Bad Request in this situation. ([\matrix-org#13041](matrix-org#13041))
- Fix some inconsistencies in the event authentication code. ([\matrix-org#13087](matrix-org#13087), [\matrix-org#13088](matrix-org#13088))
- Fix a long-standing bug where room directory requests would cause an internal server error if given a malformed room alias. ([\matrix-org#13106](matrix-org#13106))

Improved Documentation
----------------------

- Add documentation for how to configure Synapse with Workers using Docker Compose. Includes example worker config and docker-compose.yaml. Contributed by @Thumbscrew. ([\matrix-org#12737](matrix-org#12737))
- Ensure the [Poetry cheat sheet](https://matrix-org.github.io/synapse/develop/development/dependencies.html) is available in the online documentation. ([\matrix-org#13022](matrix-org#13022))
- Mention removed community/group worker endpoints in upgrade.md. Contributed by @olmari. ([\matrix-org#13023](matrix-org#13023))
- Add instructions for running Complement with `gotestfmt`-formatted output locally. ([\matrix-org#13073](matrix-org#13073))
- Update OpenTracing docs to reference the configuration manual rather than the configuration file. ([\matrix-org#13076](matrix-org#13076))
- Update information on downstream Debian packages. ([\matrix-org#13095](matrix-org#13095))
- Remove documentation for the Delete Group Admin API which no longer exists. ([\matrix-org#13112](matrix-org#13112))

Deprecations and Removals
-------------------------

- Remove the unspecced `DELETE /directory/list/room/{roomId}` endpoint, which hid rooms from the [public room directory](https://spec.matrix.org/v1.3/client-server-api/#listing-rooms). Instead, `PUT` to the same URL with a visibility of `"private"`. ([\matrix-org#13123](matrix-org#13123))

Internal Changes
----------------

- Add tests for cancellation of `GET /rooms/$room_id/members` and `GET /rooms/$room_id/state` requests. ([\matrix-org#12674](matrix-org#12674))
- Report login failures due to unknown third party identifiers in the same way as failures due to invalid passwords. This prevents an attacker from using the error response to determine if the identifier exists. Contributed by Daniel Aloni. ([\matrix-org#12738](matrix-org#12738))
- Merge the Complement testing Docker images into a single, multi-purpose image. ([\matrix-org#12881](matrix-org#12881), [\matrix-org#13075](matrix-org#13075))
- Simplify the database schema for `event_edges`. ([\matrix-org#12893](matrix-org#12893))
- Clean up the test code for client disconnection. ([\matrix-org#12929](matrix-org#12929))
- Remove code generating comments in configuration. ([\matrix-org#12941](matrix-org#12941))
- Add `Cross-Origin-Resource-Policy: cross-origin` header to content repository's thumbnail and download endpoints. ([\matrix-org#12944](matrix-org#12944))
- Replace noop background updates with `DELETE` delta. ([\matrix-org#12954](matrix-org#12954), [\matrix-org#13050](matrix-org#13050))
- Use lower isolation level when inserting read receipts to avoid serialization errors. Contributed by Nick @ Beeper. ([\matrix-org#12957](matrix-org#12957))
- Reduce the amount of state we pull from the DB. ([\matrix-org#12963](matrix-org#12963))
- Enable testing against PostgreSQL databases in Complement CI. ([\matrix-org#12965](matrix-org#12965), [\matrix-org#13034](matrix-org#13034))
- Fix an inaccurate comment. ([\matrix-org#12969](matrix-org#12969))
- Remove the `delete_device` method and always call `delete_devices`. ([\matrix-org#12970](matrix-org#12970))
- Use a GitHub form for issues rather than a hard-to-read, easy-to-ignore template. ([\matrix-org#12982](matrix-org#12982))
- Move [MSC3715](matrix-org/matrix-spec-proposals#3715) behind an experimental config flag. ([\matrix-org#12984](matrix-org#12984))
- Add type hints to tests. ([\matrix-org#12985](matrix-org#12985), [\matrix-org#13099](matrix-org#13099))
- Refactor macaroon tokens generation and move the unsubscribe link in notification emails to `/_synapse/client/unsubscribe`. ([\matrix-org#12986](matrix-org#12986))
- Fix documentation for running complement tests. ([\matrix-org#12990](matrix-org#12990))
- Faster joins: add issue links to the TODO comments in the code. ([\matrix-org#13004](matrix-org#13004))
- Reduce DB usage of `/sync` when a large number of unread messages have recently been sent in a room. ([\matrix-org#13005](matrix-org#13005), [\matrix-org#13096](matrix-org#13096), [\matrix-org#13118](matrix-org#13118))
- Replaced usage of PyJWT with methods from Authlib in `org.matrix.login.jwt`. Contributed by Hannes Lerchl. ([\matrix-org#13011](matrix-org#13011))
- Modernize the `contrib/graph/` scripts. ([\matrix-org#13013](matrix-org#13013))
- Remove redundant `room_version` parameters from event auth functions. ([\matrix-org#13017](matrix-org#13017))
- Decouple `synapse.api.auth_blocking.AuthBlocking` from `synapse.api.auth.Auth`. ([\matrix-org#13021](matrix-org#13021))
- Add type annotations to `synapse.storage.databases.main.devices`. ([\matrix-org#13025](matrix-org#13025))
- Set default `sync_response_cache_duration` to two minutes. ([\matrix-org#13042](matrix-org#13042))
- Rename CI test runs. ([\matrix-org#13046](matrix-org#13046))
- Increase timeout of complement CI test runs. ([\matrix-org#13048](matrix-org#13048))
- Refactor entry points so that they all have a `main` function. ([\matrix-org#13052](matrix-org#13052))
- Refactor the Dockerfile-workers configuration script to use Jinja2 templates in Synapse workers' Supervisord blocks. ([\matrix-org#13054](matrix-org#13054))
- Add headers to individual options in config documentation to allow for linking. ([\matrix-org#13055](matrix-org#13055))
- Make Complement CI logs easier to read. ([\matrix-org#13057](matrix-org#13057), [\matrix-org#13058](matrix-org#13058), [\matrix-org#13069](matrix-org#13069))
- Don't instantiate modules with keyword arguments. ([\matrix-org#13060](matrix-org#13060))
- Fix type checking errors against Twisted trunk. ([\matrix-org#13061](matrix-org#13061))
- Allow MSC3030 `timestamp_to_event` calls from anyone on world-readable rooms. ([\matrix-org#13062](matrix-org#13062))
- Add a CI job to check that schema deltas are in the correct folder. ([\matrix-org#13063](matrix-org#13063))
- Avoid rechecking event auth rules which are independent of room state. ([\matrix-org#13065](matrix-org#13065))
- Reduce the duplication of code that invokes the rate limiter. ([\matrix-org#13070](matrix-org#13070))
- Add a Subject Alternative Name to the certificate generated for Complement tests. ([\matrix-org#13071](matrix-org#13071))
- Add more tests for room upgrades. ([\matrix-org#13074](matrix-org#13074))
- Pin dependencies maintained by matrix.org to [semantic version](https://semver.org/) bounds. ([\matrix-org#13082](matrix-org#13082))
- Correctly report prometheus DB stats for `get_earliest_token_for_stats`. ([\matrix-org#13085](matrix-org#13085))
- Fix a long-standing bug where a finished logging context would be re-started when Synapse failed to persist an event from federation. ([\matrix-org#13089](matrix-org#13089))
- Simplify the alias deletion logic as an application service. ([\matrix-org#13093](matrix-org#13093))
- Add type annotations to `tests.test_server`. ([\matrix-org#13124](matrix-org#13124))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SerializationFailure when rotating push actions
3 participants