Regression in federation sender instance calls to get joined profile #13444

Fizzadar · 2022-08-03T09:11:45Z

v1.61.1 release dramatically reduced federation sender calls to the _get_joined_profile_from_event_id cache (see: #12955 (comment)). This seems to have returned in 1.63.1, the jump below is when we upgraded from 1.61.1 -> 1.63.1:

The text was updated successfully, but these errors were encountered:

reivilibre · 2022-08-03T10:58:25Z

It would appear that matrix.org has also seen something like that:

I don't currently know of any relevant changes that led to this.

This is currently at roughly 1% of DB time on the federation sender, so also not sure it's urgent.

Fizzadar · 2022-08-12T19:51:54Z

Out of interest - how many federation senders are running on matrix.org? I wonder if something is weird about our setup; we have 3x of them each doing ~1k/s on this cache.

richvdh · 2022-08-13T21:15:56Z

I've spent a while looking into this.

On matrix.org there is a notable uptick in miss rate on 12th July (coinciding with a deployment of these changes, roughly 1.62.0+ to 1.63.0rc1-), followed by a second uptick on 22nd July (coinciding with a deployment with these changes, roughly equivalent to v1.63.0 to v1.64.0rc1). However, both increases are an order of magnitude smaller than those reported by @Fizzadar (300 requests/second vs 2000.) So: bad news for @Fizzadar in that it looks like this gets even worse for 1.64.0.

The federation senders primarily use this lookup to determine which servers they should send events to. Normally, it's expected that the event creators should precache the list of servers in redis, so we'd only expect to see requests to _get_joined_profile_from_event_id where a server list was not found in the redis cache, and indeed we can see a corresponding uptick in cache misses on the get_joined_hosts external cache:

I've not been able to find any obvious reason why we might be seeing more cache misses here. In particular, I'm not seeing any change to the rate of cache population, and given it's a redis cache, I think it's unlikely there will have been any changes to the expiry characteristics. What's particularly odd is that the event_to_prev_state_group hit rate is unchanged, and the two are supposedly populated at pretty much the same time.

It might be instructive to add some logging to report which events are seeing a get_joined_hosts cache miss, and see if we can find any pattern.

@Fizzadar it would be interesting to know if you also see a corresponding pattern in cache misses for the get_joined_hosts external cache. Note that you'll probably have to update the grafana dashboard as I just added a "External cache miss rate" panel.

richvdh · 2022-08-15T13:36:42Z

Out of interest - how many federation senders are running on matrix.org? I wonder if something is weird about our setup; we have 3x of them each doing ~1k/s on this cache.

sorry, missed this. matrix.org apparently has 20 federation senders, handling a total of around 100 outgoing federation PDUs per second. But note that the code that hits get_joined_hosts is done before the work is sharded (ie, each of the federation senders does the lookup for every event).

Fizzadar · 2022-08-17T14:47:07Z

@richvdh sorry for the delay - some interesting charts from our end. Misses (bump is 1.64 rollout):

Hits for the same period (so we're ~99% hit rate):

And the joined profile cache hits:

What seems crazy to me is we're talking 1/s missed cache hit on get_joined_hosts resulting in ~3-4k requests back to the joined profile query. As far as I know our synapse has no changes within the federation logic. This feels like we're missing the cache for one or two huge rooms (#matrix:matrix.org perhaps?) which is resulting in the profile request.

Not had a chance yet to poke into the changes you linked which could be interesting, I wonder why this is getting missed...

erikjohnston · 2022-08-18T15:42:44Z

Next steps: as per Rich's #13444 (comment) let's add some log lines to figure out where this is coming from.

@Vetchu

Synapse 1.67.0 (2022-09-13) =========================== This release removes using the deprecated direct TCP replication configuration for workers. Server admins should use Redis instead. See the [upgrade notes](https://matrix-org.github.io/synapse/v1.67/upgrade.html#upgrading-to-v1670). The minimum version of `poetry` supported for managing source checkouts is now 1.2.0. **Notice:** from the next major release (1.68.0) installing Synapse from a source checkout will require a recent Rust compiler. Those using packages or `pip install matrix-synapse` will not be affected. See the [upgrade notes](https://matrix-org.github.io/synapse/v1.67/upgrade.html#upgrading-to-v1670). **Notice:** from the next major release (1.68.0), running Synapse with a SQLite database will require SQLite version 3.27.0 or higher. (The [current minimum version is SQLite 3.22.0](https://github.com/matrix-org/synapse/blob/release-v1.67/synapse/storage/engines/sqlite.py#L69-L78).) See [matrix-org#12983](matrix-org#12983) and the [upgrade notes](https://matrix-org.github.io/synapse/v1.67/upgrade.html#upgrading-to-v1670) for more details. No significant changes since 1.67.0rc1. Synapse 1.67.0rc1 (2022-09-06) ============================== Features -------- - Support setting the registration shared secret in a file, via a new `registration_shared_secret_path` configuration option. ([\matrix-org#13614](matrix-org#13614)) - Change the default startup behaviour so that any missing "additional" configuration files (signing key, etc) are generated automatically. ([\matrix-org#13615](matrix-org#13615)) - Improve performance of sending messages in rooms with thousands of local users. ([\matrix-org#13634](matrix-org#13634)) Bugfixes -------- - Fix a bug introduced in Synapse 1.13 where the [List Rooms admin API](https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#list-room-api) would return integers instead of booleans for the `federatable` and `public` fields when using a Sqlite database. ([\matrix-org#13509](matrix-org#13509)) - Fix bug that user cannot `/forget` rooms after the last member has left the room. ([\matrix-org#13546](matrix-org#13546)) - Faster Room Joins: fix `/make_knock` blocking indefinitely when the room in question is a partial-stated room. ([\matrix-org#13583](matrix-org#13583)) - Fix loading the current stream position behind the actual position. ([\matrix-org#13585](matrix-org#13585)) - Fix a longstanding bug in `register_new_matrix_user` which meant it was always necessary to explicitly give a server URL. ([\matrix-org#13616](matrix-org#13616)) - Fix the running of [MSC1763](matrix-org/matrix-spec-proposals#1763) retention purge_jobs in deployments with background jobs running on a worker by forcing them back onto the main worker. Contributed by Brad @ Beeper. ([\matrix-org#13632](matrix-org#13632)) - Fix a long-standing bug that downloaded media for URL previews was not deleted while database background updates were running. ([\matrix-org#13657](matrix-org#13657)) - Fix [MSC3030](matrix-org/matrix-spec-proposals#3030) `/timestamp_to_event` endpoint to return the correct next event when the events have the same timestamp. ([\matrix-org#13658](matrix-org#13658)) - Fix bug where we wedge media plugins if clients disconnect early. Introduced in v1.22.0. ([\matrix-org#13660](matrix-org#13660)) - Fix a long-standing bug which meant that keys for unwhitelisted servers were not returned by `/_matrix/key/v2/query`. ([\matrix-org#13683](matrix-org#13683)) - Fix a bug introduced in Synapse v1.20.0 that would cause the unstable unread counts from [MSC2654](matrix-org/matrix-spec-proposals#2654) to be calculated even if the feature is disabled. ([\matrix-org#13694](matrix-org#13694)) Updates to the Docker image --------------------------- - Update docker image to use a stable version of poetry. ([\matrix-org#13688](matrix-org#13688)) Improved Documentation ---------------------- - Improve the description of the ["chain cover index"](https://matrix-org.github.io/synapse/latest/auth_chain_difference_algorithm.html) used internally by Synapse. ([\matrix-org#13602](matrix-org#13602)) - Document how ["monthly active users"](https://matrix-org.github.io/synapse/latest/usage/administration/monthly_active_users.html) is calculated and used. ([\matrix-org#13617](matrix-org#13617)) - Improve documentation around user registration. ([\matrix-org#13640](matrix-org#13640)) - Remove documentation of legacy `frontend_proxy` worker app. ([\matrix-org#13645](matrix-org#13645)) - Clarify documentation that HTTP replication traffic can be protected with a shared secret. ([\matrix-org#13656](matrix-org#13656)) - Remove unintentional colons from [config manual](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html) headers. ([\matrix-org#13665](matrix-org#13665)) - Update docs to make enabling metrics more clear. ([\matrix-org#13678](matrix-org#13678)) - Clarify `(room_id, event_id)` global uniqueness and how we should scope our database schemas. ([\matrix-org#13701](matrix-org#13701)) Deprecations and Removals ------------------------- - Drop support for calling `/_matrix/client/v3/rooms/{roomId}/invite` without an `id_access_token`, which was not permitted by the spec. Contributed by @Vetchu. ([\matrix-org#13241](matrix-org#13241)) - Remove redundant `_get_joined_users_from_context` cache. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13569](matrix-org#13569)) - Remove the ability to use direct TCP replication with workers. Direct TCP replication was deprecated in Synapse v1.18.0. Workers now require using Redis. ([\matrix-org#13647](matrix-org#13647)) - Remove support for unstable [private read receipts](matrix-org/matrix-spec-proposals#2285). ([\matrix-org#13653](matrix-org#13653), [\matrix-org#13692](matrix-org#13692)) Internal Changes ---------------- - Extend the release script to wait for GitHub Actions to finish and to be usable as a guide for the whole process. ([\matrix-org#13483](matrix-org#13483)) - Add experimental configuration option to allow disabling legacy Prometheus metric names. ([\matrix-org#13540](matrix-org#13540)) - Cache user IDs instead of profiles to reduce cache memory usage. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13573](matrix-org#13573), [\matrix-org#13600](matrix-org#13600)) - Optimize how Synapse calculates domains to fetch from during backfill. ([\matrix-org#13575](matrix-org#13575)) - Comment about a better future where we can get the state diff between two events. ([\matrix-org#13586](matrix-org#13586)) - Instrument `_check_sigs_and_hash_and_fetch` to trace time spent in child concurrent calls for understandable traces in Jaeger. ([\matrix-org#13588](matrix-org#13588)) - Improve performance of `@cachedList`. ([\matrix-org#13591](matrix-org#13591)) - Minor speed up of fetching large numbers of push rules. ([\matrix-org#13592](matrix-org#13592)) - Optimise push action fetching queries. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13597](matrix-org#13597)) - Rename `event_map` to `unpersisted_events` when computing the auth differences. ([\matrix-org#13603](matrix-org#13603)) - Refactor `get_users_in_room(room_id)` mis-use with dedicated `get_current_hosts_in_room(room_id)` function. ([\matrix-org#13605](matrix-org#13605)) - Use dedicated `get_local_users_in_room(room_id)` function to find local users when calculating `join_authorised_via_users_server` of a `/make_join` request. ([\matrix-org#13606](matrix-org#13606)) - Refactor `get_users_in_room(room_id)` mis-use to lookup single local user with dedicated `check_local_user_in_room(...)` function. ([\matrix-org#13608](matrix-org#13608)) - Drop unused column `application_services_state.last_txn`. ([\matrix-org#13627](matrix-org#13627)) - Improve readability of Complement CI logs by printing failure results last. ([\matrix-org#13639](matrix-org#13639)) - Generalise the `@cancellable` annotation so it can be used on functions other than just servlet methods. ([\matrix-org#13662](matrix-org#13662)) - Introduce a `CommonUsageMetrics` class to share some usage metrics between the Prometheus exporter and the phone home stats. ([\matrix-org#13671](matrix-org#13671)) - Add some logging to help track down matrix-org#13444. ([\matrix-org#13679](matrix-org#13679)) - Update poetry lock file for v1.2.0. ([\matrix-org#13689](matrix-org#13689)) - Add cache to `is_partial_state_room`. ([\matrix-org#13693](matrix-org#13693)) - Update the Grafana dashboard that is included with Synapse in the `contrib` directory. ([\matrix-org#13697](matrix-org#13697)) - Only run trial CI on all python versions on non-PRs. ([\matrix-org#13698](matrix-org#13698)) - Fix typechecking with latest types-jsonschema. ([\matrix-org#13712](matrix-org#13712)) - Reduce number of CI checks we run for PRs. ([\matrix-org#13713](matrix-org#13713)) # -----BEGIN PGP SIGNATURE----- # # iQFEBAABCgAuFiEEBTGR3/RnAzBGUif3pULk7RsPrAkFAmMgR2QQHGVyaWtAbWF0 # cml4Lm9yZwAKCRClQuTtGw+sCfG7B/94PwW1ChsaI8hkz/3e+93PEl/mNJ6YFaEB # 5pP4Dh/0dipP/iKbpgNuj5xz/JFnIi8D49A8sKNnku3jk0/8AZHgqDiBgOkrN76z # Y3awo5Q9ag4xww/105V3bhdnX1NrX8Avf6F2jchDv6/9q8wQHGBPg6DMgfZ/m/BL # SB4dypbbNpgLykuwtWxx6YMUYH+trsXJOn/MoAqld3QcZsqkDR25wXCt9+Dr+6AT # dPd/czi8kV8ruU59tf2K5HB7XKzBW9S3Qb3dJJmGOTTJ7ccUkN/XuTwqnII950Mo # bSlMXjY2hqk8rKUNhGZpi9bqUkwNhMgOkZl9A0Y1XtsXx6yjy0T/ # =zSGi # -----END PGP SIGNATURE----- # gpg: Signature made Tue Sep 13 10:03:32 2022 BST # gpg: using RSA key 053191DFF4670330465227F7A542E4ED1B0FAC09 # gpg: issuer "erik@matrix.org" # gpg: Can't check signature: No public key # Conflicts: # synapse/config/experimental.py # synapse/push/bulk_push_rule_evaluator.py # synapse/storage/databases/main/event_push_actions.py # synapse/util/caches/descriptors.py

richvdh · 2022-09-22T22:11:45Z

We're certainly seeing quite a lot of "Unexpectedly did not have cached destinations" lines in the matrix.org logs.

I managed to track a couple down in jaeger. In both cases what appears to have happened is that the call to get_joined_hosts (https://github.com/matrix-org/synapse/blob/v1.68.0rc1/synapse/handlers/message.py#L1468) has taken so long (hundreds or even thousands of milliseconds) that the event has been persisted, and the federation sender has picked it up, before the calculation completes.

I wonder if we have somehow damaged the performance of get_joined_hosts. I note that there have been a few recent PRs in that area such as #13573 and #13569.

richvdh · 2022-09-22T22:15:35Z

I wonder if we have somehow damaged the performance of get_joined_hosts. I note that there have been a few recent PRs in that area such as #13573 and #13569.

except both of those significantly postdate the arrival of this issue :/

Fizzadar · 2022-10-10T07:20:16Z

Apologies for the very long update on this one - we discovered a quick hack that removed 99% of hits to this function in our setup as the federation senders don't ignore bridged events (beeper/synapse@6a1e423). This doesn't address the increase noticed above though so this issue still remains I believe.

erikjohnston · 2022-10-10T15:37:17Z

I wonder if we have somehow damaged the performance of get_joined_hosts. I note that there have been a few recent PRs in that area such as #13573 and #13569.

except both of those significantly postdate the arrival of this issue :/

It's also plausible that we've sped up event persistence sufficiently in large rooms that it now is faster than doing get_joined_hosts in large rooms? 🤷

richvdh · 2022-10-10T16:39:28Z

It's also plausible that we've sped up event persistence sufficiently in large rooms that it now is faster than doing get_joined_hosts in large rooms?

Possibly...

Is it possible that we now avoid pulling all of the state out of the database during the event-creation path... right up until we attempt to cache_joined_hosts_for_event, when we'll suddenly need to pull the entire membership list out of the database?

richvdh self-assigned this Aug 5, 2022

H-Shay added S-Minor Blocks non-critical functionality, workarounds exist. X-Regression Something broke which worked on a previous release A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db labels Aug 12, 2022

anoadragon453 added the X-Needs-Info This issue is blocked awaiting information from the reporter label Aug 17, 2022

richvdh removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Aug 18, 2022

erikjohnston self-assigned this Aug 18, 2022

richvdh removed their assignment Aug 18, 2022

anoadragon453 added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label Aug 22, 2022

erikjohnston added a commit that referenced this issue Aug 31, 2022

Add some logging to help track down #13444

f6c4832

erikjohnston mentioned this issue Aug 31, 2022

Add some logging to help track down #13444 #13679

Merged

erikjohnston added a commit that referenced this issue Sep 1, 2022

Add some logging to help track down #13444 (#13679)

2318603

richvdh assigned richvdh and unassigned erikjohnston Oct 3, 2022

MadLittleMods mentioned this issue Oct 6, 2022

Complement can_paginate_after_getting_remote_event_from_timestamp_to_event_endpoint is flakey(?) #13944

Closed

MadLittleMods added the A-Federation label Apr 25, 2023

matrixbot mentioned this issue Dec 21, 2023

Regression in federation sender instance calls to get joined profile element-hq/synapse#13444

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression in federation sender instance calls to get joined profile #13444

Regression in federation sender instance calls to get joined profile #13444

Fizzadar commented Aug 3, 2022

reivilibre commented Aug 3, 2022

Fizzadar commented Aug 12, 2022

richvdh commented Aug 13, 2022

richvdh commented Aug 15, 2022

Fizzadar commented Aug 17, 2022

erikjohnston commented Aug 18, 2022

richvdh commented Sep 22, 2022

richvdh commented Sep 22, 2022

Fizzadar commented Oct 10, 2022

erikjohnston commented Oct 10, 2022

richvdh commented Oct 10, 2022

Regression in federation sender instance calls to get joined profile #13444

Regression in federation sender instance calls to get joined profile #13444

Comments

Fizzadar commented Aug 3, 2022

reivilibre commented Aug 3, 2022

Fizzadar commented Aug 12, 2022

richvdh commented Aug 13, 2022

richvdh commented Aug 15, 2022

Fizzadar commented Aug 17, 2022

erikjohnston commented Aug 18, 2022

richvdh commented Sep 22, 2022

richvdh commented Sep 22, 2022

Fizzadar commented Oct 10, 2022

erikjohnston commented Oct 10, 2022

richvdh commented Oct 10, 2022