Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster joins: smarter algorithm for picking a server to resync from #12999

Closed
Tracked by #14030
richvdh opened this issue Jun 9, 2022 · 5 comments · Fixed by #14126
Closed
Tracked by #14030

Faster joins: smarter algorithm for picking a server to resync from #12999

richvdh opened this issue Jun 9, 2022 · 5 comments · Fixed by #14126
Assignees
Labels
A-Federated-Join joins over federation generally suck T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@richvdh
Copy link
Member

richvdh commented Jun 9, 2022

Per #12813 (comment): when we restart resyncing a room, it can take a while before we find an appropriate server.

# TODO(faster_joins): we need some way of prioritising which homeservers in
# `other_destinations` to try first, otherwise we'll spend ages trying dead
# homeservers for large rooms.
# https://github.com/matrix-org/synapse/issues/12999

@richvdh richvdh added this to the Faster joins (further work) milestone Jun 9, 2022
@squahtx squahtx added A-Federated-Join joins over federation generally suck T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. labels Jun 9, 2022
@richvdh
Copy link
Member Author

richvdh commented Oct 4, 2022

We should prioritise the server that we did the join through initially.

@DMRobertson DMRobertson self-assigned this Oct 7, 2022
@DMRobertson
Copy link
Contributor

To check: is the given list of other_destinations acceptable---we just need to prioritise that list? Or do we need to change the callers to provide a wider list of other_destinations?

I think we have some table that tracks "we failed to make a federation request from this server recently"---destinations? So should exclude anyone in that?

Other thoughts:

  • Prioritise servers that people have sent event from recently
  • Prioritise the server that sent the m.room.create event(?)

@MadLittleMods
Copy link
Contributor

I think we have some table that tracks "we failed to make a federation request from this server recently"---destinations? So should exclude anyone in that?

This happens automatically for you by anything that uses synapse/http/matrixfederationclient.py because it uses the RetryDestinationLimiter (retryutils limiter). It does use the destinations table under the hood 👍

You just need to catch the NotRetryingDestination exception and continue to the next destination to try.


Also of interest is the generic pattern we have for this _try_destination_list(...),

async def _try_destination_list(
self,
description: str,
destinations: Iterable[str],
callback: Callable[[str], Awaitable[T]],
failover_errcodes: Optional[Container[str]] = None,
failover_on_unknown_endpoint: bool = False,
) -> T:
"""Try an operation on a series of servers, until it succeeds


we just need to prioritise that list?

One of the heuristics we have around prioritizing destinations to try is

The heuristic of sorting by servers who have been in the room the
longest is good because they're most likely to have anything we ask
about.

But that doesn't exactly solve the problem if the goal is to avoid long-gone dead servers. Does our NotRetryingDestination tracking already solve this though?

We should prioritise the server that we did the join through initially.

This makes sense since it's a recent alive server that had the state to authorize the join 👍

@richvdh
Copy link
Member Author

richvdh commented Oct 10, 2022

Note that the particular problem here comes when we resume resyncing after a restart, which currently causes us to lose our place in the list of potential destinations (initial_destination will be None in this case).

So one solution here is just to arrange for initial_destination to be populated.

To check: is the given list of other_destinations acceptable---we just need to prioritise that list? Or do we need to change the callers to provide a wider list of other_destinations?

other_destinations remains our best guess of servers that might be able to handle the request, so I think the answer is "yes, we just need to prioritise that list".

I think we have some table that tracks "we failed to make a federation request from this server recently"---destinations? So should exclude anyone in that?

As @MadLittleMods, the destinations table is already factored in, at a lower layer of abstraction. However, it's a very lossy abstraction because it assumes that all requests and all failure modes are created equal, which really isn't the case (cf #8917). But that's something we should aim to improve in general, rather than for one particular usecase. So, I'd just ignore destinations for now, and as @MadLittleMods, just make sure we catch the relevant exceptions, and move on.

Other thoughts:

* Prioritise servers that people have sent event from recently

Possibly, though really this sort of thing is meant to be handled via the destinations mechanism (successfully sending a transaction removes your failure flag from the destinations table).

* Prioritise the server that sent the m.room.create event(?)

I'm not sure that the server that sent the m.room.create event is much more or less likely to be reachable than any of the other servers in the room.

So: yes, let's just keep using other_destinations, but prioritise:

  1. the server that we joined through
  2. Ideally: the server that we most recently did a successful resync from.

@DMRobertson
Copy link
Contributor

Discussed with @richvdh. Actions here:

  1. Catch NotRetryingDestination during the resync process so that we'll try another server rather than fail immediately.
  2. When we first partial-join to a room, persist the server we join via in the partial_state_rooms table. Read from this to determine the server that we joined through. Prioritise that server.
  3. Rich doesn't believe we'll ever call /state as part of the resync process for any event other than the join we requested in the beginning. Therefore there's no point tracking or persisting the server we have most recently synced state from.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federated-Join joins over federation generally suck T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
4 participants