-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Don't wake up destination transaction queue if they're not due for retry. #16223
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would love to see some graphs showing perf gains if you have them!
changelog.d/16223.misc
Outdated
@@ -0,0 +1 @@ | |||
Improve resource usage when sending data to a large number of remote hosts that are marked as "down". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably .feature
) | ||
|
||
return { | ||
row.pop("destination"): DestinationRetryTimings(**row) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's critical here that the key (and hence the pop
) is evaluated before the value!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is a little hacky. I can update it to actually specify each key explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine, it just took me a moment to see what was going on. (And: it's the kind of thing that I'm paranoid might break in a future Python release)
retry_timings = await store.get_destination_retry_timings_batch(destinations) | ||
|
||
now = int(clock.time_msec()) | ||
|
||
return [ | ||
destination | ||
for destination, timings in retry_timings.items() | ||
if timings is None | ||
or timings.retry_last_ts + timings.retry_interval <= now + retry_due_within_ms | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was gonna say "it might be faster to do this logic in the query"... but I guess that would get in the way of using cachedList
to piggyback off the existing cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, exactly
@@ -909,8 +909,14 @@ def test_send_and_get(self) -> None: | |||
|
|||
prev_token = self.queue.get_current_token(self.instance_name) | |||
|
|||
self.queue.send_presence_to_destinations((state1, state2), ("dest1", "dest2")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and below: was the test broken by not having get_success
(or similar) here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, as calling a coroutine doesn't do anything unless you await on it
self.datastore.get_destination_retry_timings = AsyncMock(return_value=None) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this just unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it must have broken the test somehow but I've forgotten how
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to add the filtering to
- _send_pdu
- send_presence_to_destinations
- send_read_receipt
Are there other EDUs we should worry about here? (Typing? Device list stuff? to-device messages?)
What about the other methods on FederationSender?
- send_edu
- send_device_messages
is the point that only the former three handle multiple destinations in one call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Oh there is logic below for typing EDUs. But why don't they go via the federation sender too?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, mainly because a) send_edu
and send_device_messages
just take a single host, and b) I forgot about device messages 🤦
self.hs.get_federation_sender().send_device_messages("host2") | ||
self.get_success( | ||
self.hs.get_federation_sender().send_device_messages(["host2"]) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, more of these. Maybe we should resurrect #14519 , at least for the tests
for host in hosts: | ||
self.federation_sender.send_device_messages( | ||
host, immediate=False | ||
) | ||
# TODO: when called, this isn't in a logging context. | ||
# This leads to log spam, sentry event spam, and massive | ||
# memory usage. | ||
# See https://github.com/matrix-org/synapse/issues/12552. | ||
# log_kv( | ||
# {"message": "sent device update to host", "host": host} | ||
# ) | ||
await self.federation_sender.send_device_messages( | ||
hosts, immediate=False | ||
) | ||
# TODO: when called, this isn't in a logging context. | ||
# This leads to log spam, sentry event spam, and massive | ||
# memory usage. | ||
# See https://github.com/matrix-org/synapse/issues/12552. | ||
# log_kv( | ||
# {"message": "sent device update to host", "host": host} | ||
# ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7 -> 6 levels of indentation :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy if CI is and you're convinced we haven't missed out any other EDUs.
- Add configuration setting for CAS protocol version. Contributed by Aurélien Grimpard. ([\#15816](#15816)) - Suppress notifications from message edits per [MSC3958](matrix-org/matrix-spec-proposals#3958). ([\#16113](#16113)) - Return a `Retry-After` with `M_LIMIT_EXCEEDED` error responses. ([\#16136](#16136)) - Add `last_seen_ts` to the [admin users API](https://matrix-org.github.io/synapse/latest/admin_api/user_admin_api.html). ([\#16218](#16218)) - Improve resource usage when sending data to a large number of remote hosts that are marked as "down". ([\#16223](#16223)) - Fix IPv6-related bugs on SMTP settings, adding groundwork to fix similar issues. Contributed by @evilham and @telmich (ungleich.ch). ([\#16155](#16155)) - Fix a spec compliance issue where requests to the `/publicRooms` federation API would specify `include_all_networks` as a string. ([\#16185](#16185)) - Fix inaccurate error message while attempting to ban or unban a user with the same or higher PL by spliting the conditional statements. Contributed by @leviosacz. ([\#16205](#16205)) - Fix a rare bug that broke looping calls, which could lead to e.g. linearly increasing memory usage. Introduced in v1.90.0. ([\#16210](#16210)) - Fix a long-standing bug where uploading images would fail if we could not generate thumbnails for them. ([\#16211](#16211)) - Fix a long-standing bug where we did not correctly back off from servers that had "gone" if they returned 4xx series error codes. ([\#16221](#16221)) - Update links to the [matrix.org blog](https://matrix.org/blog/). ([\#16008](#16008)) - Document which [admin APIs](https://matrix-org.github.io/synapse/latest/usage/administration/admin_api/index.html) are disabled when experimental [MSC3861](matrix-org/matrix-spec-proposals#3861) support is enabled. ([\#16168](#16168)) - Document [`exclude_rooms_from_sync`](https://matrix-org.github.io/synapse/v1.92/usage/configuration/config_documentation.html#exclude_rooms_from_sync) configuration option. ([\#16178](#16178)) - Prepare unit tests for Python 3.12. ([\#16099](#16099)) - Fix nightly CI jobs. ([\#16121](#16121), [\#16213](#16213)) - Describe which rate limiter was hit in logs. ([\#16135](#16135)) - Simplify presence code when using workers. ([\#16170](#16170)) - Track per-device information in the presence code. ([\#16171](#16171), [\#16172](#16172)) - Stop using the `event_txn_id` table. ([\#16175](#16175)) - Use `AsyncMock` instead of custom code. ([\#16179](#16179), [\#16180](#16180)) - Improve error reporting of invalid data passed to `/_matrix/key/v2/query`. ([\#16183](#16183)) - Task scheduler: add replication notify for new task to launch ASAP. ([\#16184](#16184)) - Improve type hints. ([\#16186](#16186), [\#16188](#16188), [\#16201](#16201)) - Bump black version to 23.7.0. ([\#16187](#16187)) - Log the details of background update failures. ([\#16212](#16212)) - Cache device resync requests over replication. ([\#16241](#16241)) * Bump anyhow from 1.0.72 to 1.0.75. ([\#16141](#16141)) * Bump furo from 2023.7.26 to 2023.8.19. ([\#16238](#16238)) * Bump phonenumbers from 8.13.18 to 8.13.19. ([\#16237](#16237)) * Bump psycopg2 from 2.9.6 to 2.9.7. ([\#16196](#16196)) * Bump regex from 1.9.3 to 1.9.4. ([\#16195](#16195)) * Bump ruff from 0.0.277 to 0.0.286. ([\#16198](#16198)) * Bump sentry-sdk from 1.29.2 to 1.30.0. ([\#16236](#16236)) * Bump serde from 1.0.184 to 1.0.188. ([\#16194](#16194)) * Bump serde_json from 1.0.104 to 1.0.105. ([\#16140](#16140)) * Bump types-psycopg2 from 2.9.21.10 to 2.9.21.11. ([\#16200](#16200)) * Bump types-pyyaml from 6.0.12.10 to 6.0.12.11. ([\#16199](#16199))
Based on #16221.
When we send an EDU to a large number of hosts (e.g. presence or a receipt in a large room), we instansiate a
PerDestinationQueue
per host. This has a bunch of overhead, and is pointless for hosts that are marked as "down".To avoid this, we make sure to only wake up destinations that we will actually try to contact with the new data. This dramatically reduces CPU and memory usages of federation sending in large rooms.