It seems inappropriate that a single failed request can cause all subsequent requests to a server to fail for the next 10 minutes.
(See also matrix-org/synapse#8915 which asks why it's a single failure rather than at least a few)
For example, when we are handling a federation transaction, we can end up needing to make many requests to /v1/event. If any one of these hundreds of requests fails, all subsequent requests also fail. The upshot is that it's hard to make progress in populating complex rooms over federation: if we did a better job of persisting the events we did receive rather than aborting halfway through the operation, we might be able to make progress in the right direction so that subsequent federation transactions have a better chance of succeeding.
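To illustrate the "persist what we did receive" idea, here is a rough sketch of fetching events one at a time and keeping the successes instead of aborting on the first failure. This is not Synapse's actual code: `fetch_events_best_effort` and the `fetch_one` callable are hypothetical names made up for the example, and a real implementation would be async and would catch narrower exception types.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def fetch_events_best_effort(
    event_ids: Iterable[str],
    fetch_one: Callable[[str], dict],
) -> Tuple[Dict[str, dict], List[str]]:
    """Fetch each event independently, keeping whatever succeeds.

    `fetch_one` is a hypothetical stand-in for a single /v1/event request.
    Returns the events that were fetched plus the IDs that failed.
    """
    fetched: Dict[str, dict] = {}
    failed: List[str] = []
    for event_id in event_ids:
        try:
            fetched[event_id] = fetch_one(event_id)
        except Exception:
            # Record the failure and carry on, rather than aborting the batch.
            failed.append(event_id)
    return fetched, failed
```

The caller could then persist everything in `fetched` and retry only the IDs in `failed` on a later transaction.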
Essentially I think we should consider that there are different sorts of requests that need different "backoff" behaviour:
- stuff we "push" (ie, /v1/send requests) vs stuff that we "pull".
- stuff we pull from a specific server, vs stuff we could get from any server in the room. This is actually a spectrum, ranging from "claim E2E keys" which cannot possibly go anywhere else, through "fetch an event" which should probably go back to the server originating a transaction, to "join a room" where almost any server is as good as any other.
Obviously repeated failures to /send should mean we back off from further /send attempts; it should maybe also mean that the target server is moved down the preference list for "join a room" requests. But should it affect key-claim requests or /v1/event requests?
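One way to picture this is backoff state tracked per (destination, request class) rather than per destination, with several failures needed before backing off. The sketch below is only an illustration under those assumptions: `RequestClass`, `BackoffTracker`, and the default threshold of 3 are all invented for the example and are not existing Synapse names or settings. The 600 second interval matches the 10 minutes mentioned above.

```python
import enum
from dataclasses import dataclass
from typing import Dict, Tuple


class RequestClass(enum.Enum):
    """Hypothetical request categories; not an existing Synapse type."""
    PUSH = "push"                    # e.g. /v1/send
    PULL_SPECIFIC = "pull_specific"  # e.g. claiming E2E keys
    PULL_ANY = "pull_any"            # e.g. joining a room


@dataclass
class _State:
    failures: int = 0
    retry_after: float = 0.0


class BackoffTracker:
    """Tracks backoff separately for each (destination, request class)."""

    def __init__(self, failure_threshold: int = 3, base_interval: float = 600.0):
        # Both defaults are assumptions chosen for illustration.
        self.failure_threshold = failure_threshold
        self.base_interval = base_interval
        self._states: Dict[Tuple[str, RequestClass], _State] = {}

    def record_failure(self, destination: str, request_class: RequestClass, now: float) -> None:
        state = self._states.setdefault((destination, request_class), _State())
        state.failures += 1
        # Only start backing off once failures reach the threshold.
        if state.failures >= self.failure_threshold:
            state.retry_after = now + self.base_interval

    def record_success(self, destination: str, request_class: RequestClass) -> None:
        # A success clears the failure history for this class only.
        self._states.pop((destination, request_class), None)

    def is_backing_off(self, destination: str, request_class: RequestClass, now: float) -> bool:
        state = self._states.get((destination, request_class))
        return state is not None and now < state.retry_after
```

With something like this, a run of failed /send requests would block further `PUSH` requests to that server without blocking key-claim or /v1/event requests to it. Whether failures in one class should also count against another is exactly the open question.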
We have some provision for this sort of thing with the "long retry" schedule, and the "ignore backoff" flag, but I don't think we use it consistently, and tbh I don't really think the larger picture has been considered: it's just been thrown together as the need arises.
This issue has been migrated from #8917.