This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Don't wake up destination transaction queue if they're not due for retry. #16223

Merged
merged 12 commits into develop from erikj/prune_destinations
Sep 4, 2023

Conversation

erikjohnston
Member

Based on #16221.

When we send an EDU to a large number of hosts (e.g. presence or a receipt in a large room), we instantiate a `PerDestinationQueue` per host. This has a bunch of overhead, and is pointless for hosts that are marked as "down".

To avoid this, we make sure to only wake up destinations that we will actually try to contact with the new data. This dramatically reduces the CPU and memory usage of federation sending in large rooms.

Base automatically changed from erikj/dont_reset_retry_on_403 to develop September 4, 2023 13:04
@erikjohnston erikjohnston marked this pull request as ready for review September 4, 2023 14:04
@erikjohnston erikjohnston requested a review from a team as a code owner September 4, 2023 14:04
Contributor

@DMRobertson DMRobertson left a comment

Would love to see some graphs showing perf gains if you have them!

@@ -0,0 +1 @@
Improve resource usage when sending data to a large number of remote hosts that are marked as "down".
Contributor

Arguably .feature

)

return {
row.pop("destination"): DestinationRetryTimings(**row)
Contributor

I guess it's critical here that the key (and hence the pop) is evaluated before the value!

Member Author

Yeah, it is a little hacky. I can update it to actually specify each key explicitly?

Contributor

I think it's fine, it just took me a moment to see what was going on. (And: it's the kind of thing that I'm paranoid might break in a future Python release)
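(For reference, the behaviour the comprehension relies on is easy to check with a self-contained sketch. The row dicts below are illustrative stand-ins for database rows, not Synapse's actual schema; on current CPython the key expression of a dict comprehension is evaluated before the value.)

```python
# Plain dicts stand in for database rows; field names are illustrative.
rows = [
    {"destination": "example.com", "retry_last_ts": 1_000, "retry_interval": 5_000},
]

# The key expression is evaluated before the value, so pop() has already
# removed "destination" by the time the value dict is constructed.
result = {row.pop("destination"): dict(row) for row in rows}
print(result)
# {'example.com': {'retry_last_ts': 1000, 'retry_interval': 5000}}
```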

Comment on lines +132 to +141
retry_timings = await store.get_destination_retry_timings_batch(destinations)

now = int(clock.time_msec())

return [
destination
for destination, timings in retry_timings.items()
if timings is None
or timings.retry_last_ts + timings.retry_interval <= now + retry_due_within_ms
]
Contributor

I was gonna say "it might be faster to do this logic in the query"... but I guess that would get in the way of using cachedList to piggyback off the existing cache?

Member Author

Yeah, exactly
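For reference, the filtering condition quoted above can be exercised standalone. This is a minimal sketch with hypothetical hostnames and timing values, reducing `DestinationRetryTimings` to a plain dataclass:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DestinationRetryTimings:
    retry_last_ts: int   # when the last failed attempt happened (ms)
    retry_interval: int  # current backoff interval (ms)


def filter_destinations_due(
    retry_timings: Dict[str, Optional[DestinationRetryTimings]],
    now: int,
    retry_due_within_ms: int = 0,
) -> List[str]:
    """Keep hosts with no recorded failures, plus hosts whose retry is due soon."""
    return [
        destination
        for destination, timings in retry_timings.items()
        if timings is None
        or timings.retry_last_ts + timings.retry_interval <= now + retry_due_within_ms
    ]


timings = {
    "up.example.com": None,  # never failed: always woken
    "down.example.com": DestinationRetryTimings(retry_last_ts=90_000, retry_interval=60_000),
    "due.example.com": DestinationRetryTimings(retry_last_ts=10_000, retry_interval=5_000),
}
print(filter_destinations_due(timings, now=100_000))
# ['up.example.com', 'due.example.com']
```

The "down" host's next retry (150,000 ms) is still in the future at `now=100_000`, so it is skipped rather than having a queue instantiated for it.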

@@ -909,8 +909,14 @@ def test_send_and_get(self) -> None:

prev_token = self.queue.get_current_token(self.instance_name)

self.queue.send_presence_to_destinations((state1, state2), ("dest1", "dest2"))
Contributor

Here and below: was the test broken by not having get_success (or similar) here?

Member Author

Yeah, as calling a coroutine doesn't do anything unless you await on it
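The failure mode is easy to demonstrate in isolation (a generic sketch, not Synapse code): calling a coroutine function merely creates a coroutine object, and the body never runs unless something awaits or drives it.

```python
import asyncio


async def send() -> str:
    return "sent"


coro = send()               # creates a coroutine object; the body has NOT run
print(type(coro).__name__)  # coroutine
coro.close()                # tidy up, avoiding a "never awaited" RuntimeWarning

result = asyncio.run(send())  # the body only executes when something drives it
print(result)                 # sent
```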

Comment on lines -123 to -124
self.datastore.get_destination_retry_timings = AsyncMock(return_value=None)

Contributor

Was this just unused?

Member Author

Yeah, it must have broken the test somehow but I've forgotten how

Contributor

This seems to add the filtering to

  • _send_pdu
  • send_presence_to_destinations
  • send_read_receipt

Are there other EDUs we should worry about here? (Typing? Device list stuff? to-device messages?)

What about the other methods on FederationSender?

  • send_edu
  • send_device_messages

Is the point that only the former three handle multiple destinations in one call?

Contributor

(Oh there is logic below for typing EDUs. But why don't they go via the federation sender too?)

Member Author

Oh, mainly because a) send_edu and send_device_messages just take a single host, and b) I forgot about device messages 🤦

@erikjohnston
Member Author

Would love to see some graphs showing perf gains if you have them!

I have a memory graph, but given I deployed it at the start of the weekend I don't think it's particularly useful.

[Screenshot from 2023-09-04 16-08-50]

Comment on lines -346 to +348
self.hs.get_federation_sender().send_device_messages("host2")
self.get_success(
self.hs.get_federation_sender().send_device_messages(["host2"])
)
Contributor

Ugh, more of these. Maybe we should resurrect #14519 , at least for the tests

Comment on lines -839 to +848
for host in hosts:
self.federation_sender.send_device_messages(
host, immediate=False
)
# TODO: when called, this isn't in a logging context.
# This leads to log spam, sentry event spam, and massive
# memory usage.
# See https://github.com/matrix-org/synapse/issues/12552.
# log_kv(
# {"message": "sent device update to host", "host": host}
# )
await self.federation_sender.send_device_messages(
hosts, immediate=False
)
# TODO: when called, this isn't in a logging context.
# This leads to log spam, sentry event spam, and massive
# memory usage.
# See https://github.com/matrix-org/synapse/issues/12552.
# log_kv(
# {"message": "sent device update to host", "host": host}
# )
Contributor

7 -> 6 levels of indentation :)

Contributor

@DMRobertson DMRobertson left a comment

Happy if CI is happy and you're convinced we haven't missed any other EDUs.

@erikjohnston erikjohnston merged commit d35bed8 into develop Sep 4, 2023
37 checks passed
@erikjohnston erikjohnston deleted the erikj/prune_destinations branch September 4, 2023 16:14
DMRobertson pushed a commit that referenced this pull request Sep 5, 2023
- Add configuration setting for CAS protocol version. Contributed by Aurélien Grimpard. ([\#15816](#15816))
- Suppress notifications from message edits per [MSC3958](matrix-org/matrix-spec-proposals#3958). ([\#16113](#16113))
- Return a `Retry-After` with `M_LIMIT_EXCEEDED` error responses. ([\#16136](#16136))
- Add `last_seen_ts` to the [admin users API](https://matrix-org.github.io/synapse/latest/admin_api/user_admin_api.html). ([\#16218](#16218))
- Improve resource usage when sending data to a large number of remote hosts that are marked as "down". ([\#16223](#16223))

- Fix IPv6-related bugs on SMTP settings, adding groundwork to fix similar issues. Contributed by @evilham and @telmich (ungleich.ch). ([\#16155](#16155))
- Fix a spec compliance issue where requests to the `/publicRooms` federation API would specify `include_all_networks` as a string. ([\#16185](#16185))
- Fix inaccurate error message while attempting to ban or unban a user with the same or higher PL by splitting the conditional statements. Contributed by @leviosacz. ([\#16205](#16205))
- Fix a rare bug that broke looping calls, which could lead to e.g. linearly increasing memory usage. Introduced in v1.90.0. ([\#16210](#16210))
- Fix a long-standing bug where uploading images would fail if we could not generate thumbnails for them. ([\#16211](#16211))
- Fix a long-standing bug where we did not correctly back off from servers that had "gone" if they returned 4xx series error codes. ([\#16221](#16221))

- Update links to the [matrix.org blog](https://matrix.org/blog/). ([\#16008](#16008))
- Document which [admin APIs](https://matrix-org.github.io/synapse/latest/usage/administration/admin_api/index.html) are disabled when experimental [MSC3861](matrix-org/matrix-spec-proposals#3861) support is enabled. ([\#16168](#16168))
- Document [`exclude_rooms_from_sync`](https://matrix-org.github.io/synapse/v1.92/usage/configuration/config_documentation.html#exclude_rooms_from_sync) configuration option. ([\#16178](#16178))

- Prepare unit tests for Python 3.12. ([\#16099](#16099))
- Fix nightly CI jobs. ([\#16121](#16121), [\#16213](#16213))
- Describe which rate limiter was hit in logs. ([\#16135](#16135))
- Simplify presence code when using workers. ([\#16170](#16170))
- Track per-device information in the presence code. ([\#16171](#16171), [\#16172](#16172))
- Stop using the `event_txn_id` table. ([\#16175](#16175))
- Use `AsyncMock` instead of custom code. ([\#16179](#16179), [\#16180](#16180))
- Improve error reporting of invalid data passed to `/_matrix/key/v2/query`. ([\#16183](#16183))
- Task scheduler: add replication notify for new task to launch ASAP. ([\#16184](#16184))
- Improve type hints. ([\#16186](#16186), [\#16188](#16188), [\#16201](#16201))
- Bump black version to 23.7.0. ([\#16187](#16187))
- Log the details of background update failures. ([\#16212](#16212))
- Cache device resync requests over replication. ([\#16241](#16241))

* Bump anyhow from 1.0.72 to 1.0.75. ([\#16141](#16141))
* Bump furo from 2023.7.26 to 2023.8.19. ([\#16238](#16238))
* Bump phonenumbers from 8.13.18 to 8.13.19. ([\#16237](#16237))
* Bump psycopg2 from 2.9.6 to 2.9.7. ([\#16196](#16196))
* Bump regex from 1.9.3 to 1.9.4. ([\#16195](#16195))
* Bump ruff from 0.0.277 to 0.0.286. ([\#16198](#16198))
* Bump sentry-sdk from 1.29.2 to 1.30.0. ([\#16236](#16236))
* Bump serde from 1.0.184 to 1.0.188. ([\#16194](#16194))
* Bump serde_json from 1.0.104 to 1.0.105. ([\#16140](#16140))
* Bump types-psycopg2 from 2.9.21.10 to 2.9.21.11. ([\#16200](#16200))
* Bump types-pyyaml from 6.0.12.10 to 6.0.12.11. ([\#16199](#16199))