This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Retry dead servers a lot less often #340

Merged 1 commit on Nov 5, 2015

Conversation

erikjohnston
Member

The individual federation HTTP requests have a retry schedule of: 5s, 25s, 2m, 10m

After that, we only retry the host after: 10m, 50m, 4h, 20h and then every 24h.

(All these values are subject to random fuzzing, scaling each one to between 80% and 140% of its nominal value.)
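As a rough illustration of the schedule described above, here is a minimal Python sketch. It assumes the quoted figures come from multiplying the interval by 5 at each step (5s, 25s, 125s, 625s, and so on), which matches the rounded numbers but is an inference rather than something stated in this PR. The names `backoff_schedule` and `fuzz` are hypothetical and are not taken from the Synapse codebase.

```python
import random

def backoff_schedule(initial_s, multiplier, cap_s, steps):
    """Return `steps` nominal retry intervals in seconds, each capped at `cap_s`."""
    intervals = []
    interval = initial_s
    for _ in range(steps):
        intervals.append(min(interval, cap_s))
        interval = min(interval * multiplier, cap_s)
    return intervals

def fuzz(interval_s, rng=random):
    """Scale an interval by a random factor between 0.8 and 1.4 (the 80%-140% fuzzing)."""
    return interval_s * rng.uniform(0.8, 1.4)

# Assumed x5 growth; the PR text only lists rounded values.
HTTP_RETRIES = backoff_schedule(5, 5, cap_s=float("inf"), steps=4)   # ~5s, 25s, 2m, 10m
HOST_RETRIES = backoff_schedule(600, 5, cap_s=24 * 3600, steps=6)    # ~10m, 50m, 4h, 20h, 24h, 24h

if __name__ == "__main__":
    for nominal in HOST_RETRIES:
        print(f"nominal {nominal}s, fuzzed {fuzz(nominal):.0f}s")
```

Under that assumption the nominal host intervals are 600s, 3000s, 15000s and 75000s before hitting the 24h cap, which round to the 10m, 50m, 4h and 20h quoted above.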

@NegativeMjark
Contributor

How do these times compare with what synapse did before?

@ara4n
Member

ara4n commented Nov 2, 2015

I'm not sure I follow the rationale here... is it just to make the back-off less aggressive? What does the random factor actually achieve (given we don't have thundering herd problems here)?

The actual bug we've been seeing is fast retries when trying to talk to tarpitted servers over federation - this is easily seen on matrix.org. Shouldn't we be fixing whatever that bug is rather than tuning the benign behaviour? I assume I'm missing something...

@erikjohnston
Member Author

@NegativeMjark The current times were:

  • 1s, 2s, 4s, 8s, 16s for HTTP retries
  • 5s, 10s, 20s, 40s, 1m20s, 2m40s, 5m, 10m, 21m, 42m and then every 1h for retrying the host.
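To put the two host schedules side by side, the sketch below counts how many retries fit into the first 24 hours after a host goes down. It is a simplification under stated assumptions: the old schedule is modelled as doubling from 5s with a 1h cap, the new one as multiplying by 5 from 10m with a 24h cap, fuzzing is ignored, and a retry is assumed to happen as soon as each interval elapses. The helper name `attempts_within` is hypothetical.

```python
def attempts_within(window_s, initial_s, multiplier, cap_s):
    """Count retries whose cumulative wait fits inside `window_s` seconds."""
    elapsed = 0
    interval = initial_s
    attempts = 0
    while elapsed + interval <= window_s:
        elapsed += interval
        attempts += 1
        interval = min(interval * multiplier, cap_s)
    return attempts

DAY = 24 * 60 * 60

# Old host schedule: 5s doubling, capped at 1h (assumed model of the list above).
old_attempts = attempts_within(DAY, initial_s=5, multiplier=2, cap_s=3600)
# New host schedule: 10m multiplied by 5, capped at 24h (assumed model).
new_attempts = attempts_within(DAY, initial_s=600, multiplier=5, cap_s=DAY)

if __name__ == "__main__":
    print(f"old: {old_attempts} retries in 24h, new: {new_attempts} retries in 24h")
```

With these assumptions the old schedule allows 32 retries of a dead host in the first day and the new one allows 3. In practice both figures are upper bounds, since a host is only retried when there is something to send to it.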

@ara4n The current retry times (above) seem overly aggressive. If retrying servers every few seconds causes noticeable performance issues then the current schedule will certainly exacerbate them.

This is most certainly not an attempt to fix the bug, but rather something I noticed while working on it. Given that I have not yet been able to track down exactly what is causing said bug, taking a few minutes to fix this now (while I remember it) seemed prudent.

The randomization is there because a) it's trivial to add and b) it does help spread subsequent retries out. Because we only retry hosts when we have something new to send them, retries will naturally batch up each time someone on that server sends a message into a large room (e.g. Matrix HQ).
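The spreading effect can be sketched as follows. This is a toy model, not Synapse code: it assumes 100 dead destinations all failed at the same instant (for example, after one message to a large room) and that each gets its own fuzz factor in the 0.8 to 1.4 range. The function name `next_retry_times` and the `deadN.example.org` hostnames are made up for illustration.

```python
import random

def next_retry_times(destinations, failed_at_s, interval_s, rng):
    """Map each destination to the earliest time (in seconds) it may be retried."""
    return {
        dest: failed_at_s + interval_s * rng.uniform(0.8, 1.4)
        for dest in destinations
    }

rng = random.Random(0)  # seeded only so the sketch is repeatable
dead_hosts = [f"dead{i}.example.org" for i in range(100)]

# All hosts failed together at t=0 with a nominal 10m back-off.
times = next_retry_times(dead_hosts, failed_at_s=0.0, interval_s=600, rng=rng)
spread_s = max(times.values()) - min(times.values())

if __name__ == "__main__":
    print(f"retries become eligible over a window of about {spread_s:.0f}s")
```

Without the fuzz factor every destination would become eligible again at exactly t=600s; with it, eligibility is spread across the window from 480s to 840s.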

@NegativeMjark
Contributor

LGTM

erikjohnston added a commit that referenced this pull request Nov 5, 2015
Retry dead servers a lot less often
@erikjohnston erikjohnston merged commit 5bc6904 into develop Nov 5, 2015
@erikjohnston erikjohnston deleted the erikj/server_retries branch November 19, 2015 16:33
3 participants