Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Keep track when we try and fail to process a pulled event #13589

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Aug 23, 2022

Keep track when we try and fail to process a pulled event.

We can follow-up this PR with:

  1. Only try to backfill from an event if we haven't tried recently -> Only try to backfill event if we haven't tried before recently #13622
  2. When we decide to backfill that event again, process it in the background so it doesn't block and make /messages slow when we know it will probably fail again -> Process previously failed backfill events in the background #13623
  3. Generally track failures everywhere we try and fail to pull an event over federation -> Generalize tracking whenever we fail to fetch or process a federated event #13700

Fix #13621

Part of #13356

Mentioned in internal doc

Dev notes

  • Last contact time
  • Number of attempts

  • Custom upsert example in set_destination_retry_timings: synapse/storage/databases/main/transactions.py#L207-L330
  • simple_upsert ends up in the following SQL query:
    • INSERT INTO event_backfill_attempts (event_id, num_attempts, last_attempt_ts) VALUES (?, ?, ?) ON CONFLICT (event_id) DO UPDATE SET event_id=EXCLUDED.event_id, num_attempts=EXCLUDED.num_attempts, last_attempt_ts=EXCLUDED.last_attempt_ts
    • args=['$kfuSEOfoHjRUVZrJczZkms8rvFmVBZCceu-4TMIfqZ8', 1, 1661462351963]

What if the event is considered spam at the moment? What if it isn't considered spam in the future?

Can we ingest the event but add a rejection_reason for spam? Do we only use it for auth based rejections?

We have a rejections table:

$ psql synapse
synapse=# \d+ rejections
                                 Table "public.rejections"
   Column   | Type | Collation | Nullable | Default | Storage  | Stats target | Description
------------+------+-----------+----------+---------+----------+--------------+-------------
 event_id   | text |           | not null |         | extended |              |
 reason     | text |           | not null |         | extended |              |
 last_check | text |           | not null |         | extended |              |

backfill
_process_pulled_events
_process_pulled_event
_process_received_pdu
_run_push_actions_and_persist_event
persist_events_and_notify
persist_events
_event_persist_queue
_persist_event_batch
_persist_events_and_state_updates
_persist_events_txn
_store_event_txn
_update_metadata_tables_txn
on_receive_pdu
_process_received_pdu
_run_push_actions_and_persist_event
persist_events_and_notify
persist_events
...

    @trace
    async def clear_event_backfill_attempts(self, event_id: str) -> None:
        await self.db_pool.simple_delete(
            table="event_backfill_attempts",
            keyvalues={"event_id": event_id},
            desc="clear_event_backfill_attempts",
        )

    @trace
    async def clear_event_backfill_attempts_txn(
        self, txn: LoggingTransaction, event_id: str
    ) -> None:
        await self.db_pool.simple_delete_one_txn(
            txn,
            table="event_backfill_attempts",
            keyvalues={"event_id": event_id},
        )
        

# We've successfully persisted the event, no need to backfill it anymore
self.store.clear_event_backfill_attempts_txn(event.event_id)

Todo

  • Remove event from the table when we have successfully persisted it as a non-outlier

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

@MadLittleMods MadLittleMods added the A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) label Aug 23, 2022
@MadLittleMods MadLittleMods changed the title Keep track when we tried to backfill an event Keep track when we try to backfill an event Aug 23, 2022
@MadLittleMods MadLittleMods marked this pull request as ready for review August 26, 2022 02:30
@MadLittleMods MadLittleMods requested a review from a team as a code owner August 26, 2022 02:30
@MadLittleMods MadLittleMods added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Aug 27, 2022
MadLittleMods added a commit that referenced this pull request Aug 29, 2022
Discovered while working on #13589
and I had all the messages at the same timestamp

Part of matrix-org/matrix-spec-proposals#3030
if backfilled:
await self._store.record_event_failed_pull_attempt(
event.room_id, event_id, str(e)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is the right place to record this? What is the failure mode you're trying to catch here?

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this will catch the FederationError where We can't get valid state history,


But perhaps you also mean that we should also generally look for any exceptions and record there as well? That sounds reasonable for recording whatever failure that happens 👍:

try:
    ...
except FederationError as e:
    record_event_failed_pull_attempt(...)
    ...
except Exception as e:
    record_event_failed_pull_attempt(...)
    raise e

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was mostly just thinking that its not obvious that this particular spot is the right place in the handling. For example, I think by this point we've already attempted to try and refetch events that have bad signatures returned by /backfill.

I guess this is a good first step, and we can always add it the checks in more places later.

synapse/handlers/federation_event.py Outdated Show resolved Hide resolved
Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if you remove the if backfilled: check

@MadLittleMods MadLittleMods changed the title Keep track when we try to backfill an event Keep track when we try and fail to backfill an event Sep 14, 2022
@MadLittleMods MadLittleMods changed the title Keep track when we try and fail to backfill an event Keep track when we try and fail to process a pulled event Sep 14, 2022
@MadLittleMods MadLittleMods merged commit 957e3d7 into develop Sep 14, 2022
@MadLittleMods MadLittleMods deleted the madlittlemods/keep-track-when-we-tried-to-backfill-an-event branch September 14, 2022 18:57
@MadLittleMods
Copy link
Contributor Author

Thanks for the review @reivilibre, @richvdh, @DMRobertson, and @erikjohnston 🐮

MadLittleMods added a commit that referenced this pull request Sep 15, 2022
MadLittleMods added a commit that referenced this pull request Sep 23, 2022
Only try to backfill event if we haven't tried before recently (exponential backoff). No need to keep trying the same backfill point that fails over and over.

Fix #13622
Fix #8451

Follow-up to #13589

Part of #13356
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Sep 30, 2022
Synapse 1.68.0 (2022-09-27)
===========================

Please note that Synapse will now refuse to start if configured to use a version of SQLite older than 3.27.

In addition, please note that installing Synapse from a source checkout now requires a recent Rust compiler.
Those using packages will not be affected. On most platforms, installing with `pip install matrix-synapse` will not be affected.
See the [upgrade notes](https://matrix-org.github.io/synapse/v1.68/upgrade.html#upgrading-to-v1680).

Bugfixes
--------

- Fix packaging to include `Cargo.lock` in `sdist`. ([\matrix-org#13909](matrix-org#13909))

Synapse 1.68.0rc2 (2022-09-23)
==============================

Bugfixes
--------

- Fix building from packaged sdist. Broken in v1.68.0rc1. ([\matrix-org#13866](matrix-org#13866))

Internal Changes
----------------

- Fix the release script not publishing binary wheels. ([\matrix-org#13850](matrix-org#13850))
- Lower minimum supported rustc version to 1.58.1. ([\matrix-org#13857](matrix-org#13857))
- Lock Rust dependencies' versions. ([\matrix-org#13858](matrix-org#13858))

Synapse 1.68.0rc1 (2022-09-20)
==============================

Features
--------

- Keep track of when we fail to process a pulled event over federation so we can intelligently back off in the future. ([\matrix-org#13589](matrix-org#13589), [\matrix-org#13814](matrix-org#13814))
- Add an [admin API endpoint to fetch messages within a particular window of time](https://matrix-org.github.io/synapse/v1.68/admin_api/rooms.html#room-messages-api). ([\matrix-org#13672](matrix-org#13672))
- Add an [admin API endpoint to find a user based on their external ID in an auth provider](https://matrix-org.github.io/synapse/v1.68/admin_api/user_admin_api.html#find-a-user-based-on-their-id-in-an-auth-provider). ([\matrix-org#13810](matrix-org#13810))
- Cancel the processing of key query requests when they time out. ([\matrix-org#13680](matrix-org#13680))
- Improve validation of request bodies for the following client-server API endpoints: [`/account/3pid/msisdn/requestToken`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidmsisdnrequesttoken), [`/org.matrix.msc3720/account_status`](https://github.com/matrix-org/matrix-spec-proposals/blob/babolivier/user_status/proposals/3720-account-status.md#post-_matrixclientv1account_status), [`/account/3pid/add`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidadd), [`/account/3pid/bind`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidbind), [`/account/3pid/delete`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3piddelete) and [`/account/3pid/unbind`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidunbind). ([\matrix-org#13687](matrix-org#13687), [\matrix-org#13736](matrix-org#13736))
- Document the timestamp when a user accepts the consent, if [consent tracking](https://matrix-org.github.io/synapse/latest/consent_tracking.html) is used. ([\matrix-org#13741](matrix-org#13741))
- Add a `listeners[x].request_id_header` configuration option to specify which request header to extract and use as the request ID in order to correlate requests from a reverse proxy. ([\matrix-org#13801](matrix-org#13801))

Bugfixes
--------

- Fix a bug introduced in Synapse 1.41.0 where the `/hierarchy` API returned non-standard information (a `room_id` field under each entry in `children_state`). ([\matrix-org#13506](matrix-org#13506))
- Fix a long-standing bug where previously rejected events could end up in room state because they pass auth checks given the current state of the room. ([\matrix-org#13723](matrix-org#13723))
- Fix a long-standing bug where Synapse fails to start if a signing key file contains an empty line. ([\matrix-org#13738](matrix-org#13738))
- Fix a long-standing bug where Synapse would fail to handle malformed user IDs or room aliases gracefully in certain cases. ([\matrix-org#13746](matrix-org#13746))
- Fix a long-standing bug where device lists would remain cached when remote users left and rejoined the last room shared with the local homeserver. ([\matrix-org#13749](matrix-org#13749), [\matrix-org#13826](matrix-org#13826))
- Fix a long-standing bug that could cause stale caches in some rare cases on the first startup of Synapse with replication. ([\matrix-org#13766](matrix-org#13766))
- Fix a long-standing spec compliance bug where Synapse would accept a trailing slash on the end of `/get_missing_events` federation requests. ([\matrix-org#13789](matrix-org#13789))
- Delete associated data from `event_failed_pull_attempts`, `insertion_events`, `insertion_event_extremities`, `insertion_event_extremities`, `insertion_event_extremities` when purging the room. ([\matrix-org#13825](matrix-org#13825))

Improved Documentation
----------------------

- Note that `libpq` is required on ARM-based Macs. ([\matrix-org#13480](matrix-org#13480))
- Fix a mistake in the config manual introduced in Synapse 1.22.0: the `event_cache_size` _is_ scaled by `caches.global_factor`. ([\matrix-org#13726](matrix-org#13726))
- Fix a typo in the documentation for the login ratelimiting configuration. ([\matrix-org#13727](matrix-org#13727))
- Define Synapse's compatability policy for SQLite versions. ([\matrix-org#13728](matrix-org#13728))
- Add docs for the common fix of deleting the `matrix_synapse.egg-info/` directory for fixing Python dependency problems. ([\matrix-org#13785](matrix-org#13785))
- Update request log format documentation to mention the format used when the authenticated user is controlling another user. ([\matrix-org#13794](matrix-org#13794))

Deprecations and Removals
-------------------------

- Synapse will now refuse to start if configured to use SQLite < 3.27. ([\matrix-org#13760](matrix-org#13760))
- Don't include redundant `prev_state` in new events. Contributed by Denis Kariakin (@dakariakin). ([\matrix-org#13791](matrix-org#13791))

Internal Changes
----------------

- Add a stub Rust crate. ([\matrix-org#12595](matrix-org#12595), [\matrix-org#13734](matrix-org#13734), [\matrix-org#13735](matrix-org#13735), [\matrix-org#13743](matrix-org#13743), [\matrix-org#13763](matrix-org#13763), [\matrix-org#13769](matrix-org#13769), [\matrix-org#13778](matrix-org#13778))
- Bump the minimum dependency of `matrix_common` to 1.3.0 to make use of the `MXCUri` class. Use `MXCUri` to simplify media retention test code. ([\matrix-org#13162](matrix-org#13162))
- Add and populate the `event_stream_ordering` column on the `receipts` table for future optimisation of push action processing. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13703](matrix-org#13703))
- Rename the `EventFormatVersions` enum values so that they line up with room version numbers. ([\matrix-org#13706](matrix-org#13706))
- Update trial old deps CI to use Poetry 1.2.0. ([\matrix-org#13707](matrix-org#13707), [\matrix-org#13725](matrix-org#13725))
- Add experimental configuration option to allow disabling legacy Prometheus metric names. ([\matrix-org#13714](matrix-org#13714), [\matrix-org#13717](matrix-org#13717), [\matrix-org#13718](matrix-org#13718))
- Fix typechecking with latest types-jsonschema. ([\matrix-org#13724](matrix-org#13724))
- Strip number suffix from instance name to consolidate services that traces are spread over. ([\matrix-org#13729](matrix-org#13729))
- Instrument `get_metadata_for_events` for understandable traces in Jaeger. ([\matrix-org#13730](matrix-org#13730))
- Remove old queries to join room memberships to current state events. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13745](matrix-org#13745))
- Avoid raising an error due to malformed user IDs in `get_current_hosts_in_room`. Malformed user IDs cannot currently join a room, so this error would not be hit. ([\matrix-org#13748](matrix-org#13748))
- Update the docstrings for `get_users_in_room` and `get_current_hosts_in_room` to explain the impact of partial state. ([\matrix-org#13750](matrix-org#13750))
- Use an additional database query when persisting receipts. ([\matrix-org#13752](matrix-org#13752))
- Preparatory work for storing thread IDs for notifications and receipts. ([\matrix-org#13753](matrix-org#13753))
- Re-type hint some collections as read-only. ([\matrix-org#13754](matrix-org#13754))
- Remove unused Prometheus recording rules from `synapse-v2.rules` and add comments describing where the rest are used. ([\matrix-org#13756](matrix-org#13756))
- Add a check for editable installs if the Rust library needs rebuilding. ([\matrix-org#13759](matrix-org#13759))
- Tag traces with the instance name to be able to easily jump into the right logs and filter traces by instance. ([\matrix-org#13761](matrix-org#13761))
- Concurrently fetch room push actions when calculating badge counts. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13765](matrix-org#13765))
- Update the script which makes full schema dumps. ([\matrix-org#13770](matrix-org#13770))
- Deduplicate `is_server_notices_room`. ([\matrix-org#13780](matrix-org#13780))
- Simplify the dependency DAG in the tests workflow. ([\matrix-org#13784](matrix-org#13784))
- Remove an old, incorrect migration file. ([\matrix-org#13788](matrix-org#13788))
- Remove unused method in `synapse.api.auth.Auth`. ([\matrix-org#13795](matrix-org#13795))
- Fix a memory leak when running the unit tests. ([\matrix-org#13798](matrix-org#13798))
- Use partial indices on SQLite. ([\matrix-org#13802](matrix-org#13802))
- Check that portdb generates the same postgres schema as that in the source tree. ([\matrix-org#13808](matrix-org#13808))
- Fix Docker build when Rust .so has been built locally first. ([\matrix-org#13811](matrix-org#13811))
- Complement: Initialise the Postgres database directly inside the target image instead of the base Postgres image to fix building using Buildah. ([\matrix-org#13819](matrix-org#13819))
- Support providing an index predicate clause when doing upserts. ([\matrix-org#13822](matrix-org#13822))
- Minor speedups to linting in CI. ([\matrix-org#13827](matrix-org#13827))

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEE1508oLYUKainYFJakD7OEIo53t0FAmMy4FoACgkQkD7OEIo5
# 3t009g/6A4S26H6NG4GM44JD9+OB25fO59m9UAWWLrePmOKsBaGXVp86scPq9epI
# vQbr4Czi8WEqCJlMRxIWLXv7BL3TLXnLF1vC0wSE6YiJqrPU9vMZ0UYxWNErl8Sr
# eFBpuHXDlfppQUXs903iNmXVbdTpXCVjdTEwaZmgU1/FKydgU0o90PQseb/hnegX
# hcJrepL6xhcs37fP2zdlixissLQ85WE10x6h7FX+SkCEHGvkiKrqSvXsS4ZN0Scn
# vGCy3GD69j/ZpRu7RczdDwwzCRerg6r7spokRK/b6pzz9sXmCyY8SrrUyEEdzveN
# uEEe4A8vmR3v0sR4Ao2cGZ7zy/jq6WyrWjmfOd+hfYD9tXx/g8190RFkRQrrkYVU
# jTNdD6Zom0rtENEgHuFQ8joD96MNsaq4dvDefYYpcXDOh3YZnA/fiwfmG9XZRq6u
# B42RZEtUZ3sjZ3VdRb3AvhPTTrY4kiwEqVBFSUTBKEAKXQtdrsiqv2QPvQTSC5BJ
# YFXMwj32E7Zi3mTYjl60yggtBAxYt49KsHL8qAq75t6A38HpnYEyEW1R3S6VDmtp
# rIR5ktxyzoGvxpJ4YPUdC4A2hbaOztwPvGE3iEDiPgUdkfb6m8x3MhaWhShid4Df
# v2BQu+SFIl1wXPLxjlX2rmkQMhUGD2RGYmkUgYfoWnjdTtQV10w=
# =4eYj
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue Sep 27 12:36:58 2022 BST
# gpg:                using RSA key D79D3CA0B61429A8A760525A903ECE108A39DEDD
# gpg: Can't check signature: No public key

# Conflicts:
#	docker/Dockerfile
#	poetry.lock
#	synapse/push/bulk_push_rule_evaluator.py
#	synapse/push/push_tools.py
#	synapse/storage/databases/main/event_push_actions.py
#	synapse/storage/databases/main/events_worker.py
#	tests/replication/slave/storage/test_events.py
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Oct 28, 2022
Packaging changes: This package now require rust, because of upstream,
and thus will fail to build on many platforms where the previous one
worked.  Please feel free to address this with upstream or package an
old version in wip :-(

This version, if configured to use sqlite, will refuse to start if the
sqlite version is old, as it is in netbsd-9 base.  This is not a
regression from the previous package because while that would start,
it would fail to run, because the netbsd-9 base sqlite is missing
required features.   This package should work with pkgsrc sqlite.

Tested with pgsql on netbsd-9 amd64.

Upstream NEWS:

Synapse 1.68.0 (2022-09-27)
===========================

Please note that Synapse will now refuse to start if configured to use a version of SQLite older than 3.27.

In addition, please note that installing Synapse from a source checkout now requires a recent Rust compiler.
Those using packages will not be affected. On most platforms, installing with `pip install matrix-synapse` will not be affected.
See the [upgrade notes](https://matrix-org.github.io/synapse/v1.68/upgrade.html#upgrading-to-v1680).

Features
--------

- Keep track of when we fail to process a pulled event over federation so we can intelligently back off in the future. ([\#13589](matrix-org/synapse#13589), [\#13814](matrix-org/synapse#13814))
- Add an [admin API endpoint to fetch messages within a particular window of time](https://matrix-org.github.io/synapse/v1.68/admin_api/rooms.html#room-messages-api). ([\#13672](matrix-org/synapse#13672))
- Add an [admin API endpoint to find a user based on their external ID in an auth provider](https://matrix-org.github.io/synapse/v1.68/admin_api/user_admin_api.html#find-a-user-based-on-their-id-in-an-auth-provider). ([\#13810](matrix-org/synapse#13810))
- Cancel the processing of key query requests when they time out. ([\#13680](matrix-org/synapse#13680))
- Improve validation of request bodies for the following client-server API endpoints: [`/account/3pid/msisdn/requestToken`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidmsisdnrequesttoken), [`/org.matrix.msc3720/account_status`](https://github.com/matrix-org/matrix-spec-proposals/blob/babolivier/user_status/proposals/3720-account-status.md#post-_matrixclientv1account_status), [`/account/3pid/add`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidadd), [`/account/3pid/bind`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidbind), [`/account/3pid/delete`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3piddelete) and [`/account/3pid/unbind`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidunbind). ([\#13687](matrix-org/synapse#13687), [\#13736](matrix-org/synapse#13736))
- Document the timestamp when a user accepts the consent, if [consent tracking](https://matrix-org.github.io/synapse/latest/consent_tracking.html) is used. ([\#13741](matrix-org/synapse#13741))
- Add a `listeners[x].request_id_header` configuration option to specify which request header to extract and use as the request ID in order to correlate requests from a reverse proxy. ([\#13801](matrix-org/synapse#13801))

Deprecations and Removals
-------------------------

- Synapse will now refuse to start if configured to use SQLite < 3.27. ([\#13760](matrix-org/synapse#13760))
- Don't include redundant `prev_state` in new events. Contributed by Denis Kariakin (@dakariakin). ([\#13791](matrix-org/synapse#13791))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Keep track when we try to backfill an event
5 participants