Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Allow configuring maximum interval between AS transaction retries #12685

Closed
wants to merge 11 commits into from
Closed

Allow configuring maximum interval between AS transaction retries #12685

wants to merge 11 commits into from

Conversation

tadzik
Copy link
Contributor

@tadzik tadzik commented May 10, 2022

Reducing this backoff period comes in handy in situations when the service (synapse and appservice) spawn order is hard to control (e.g. in k8s environments, where Synapse and Appservices would be (re)started concurrently, often with wildly different startup times) and in integration tests, where starting up and tearing down of the HS and appservices happens a lot, and every additional second of backoff contributes significant amounts of wasted time.

Essentially, I'm looking for a "low-latency" mode, where currently it's very common to see bridges being initially unresponsive due to AS backoff, or integration tests taking forever because Homerunner has to come up before the AS (to know what its URL actually is once it's passed to AS itself), and yet it gets quickly "bored" of trying to talk to the AS itself, resulting in tests taking forever.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

Signed-off-by: Tadeusz Sośnierz <tadeusz@sosnierz.com>
@tadzik tadzik requested a review from a team as a code owner May 10, 2022 10:54
tadzik added 2 commits May 10, 2022 13:06
Signed-off-by: Tadeusz Sośnierz <tadeusz@sosnierz.com>
Signed-off-by: Tadeusz Sośnierz <tadeusz@sosnierz.com>
@clokep
Copy link
Member

clokep commented May 10, 2022

@tadzik Looks like some of the unit tests are failing here!

@clokep clokep removed the request for review from a team May 10, 2022 11:47
@clokep clokep requested a review from a team May 10, 2022 14:01
@DMRobertson DMRobertson self-assigned this May 11, 2022
Copy link
Contributor

@DMRobertson DMRobertson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadly looks good, thanks. Could you add an entry to the config manual for this option, please? The source is https://github.com/matrix-org/synapse/blob/develop/docs/usage/configuration/config_documentation.md

synapse/appservice/scheduler.py Outdated Show resolved Hide resolved
Comment on lines 51 to 56

# Set to establish maximum backoff (in seconds) between HS -> AS connection attempts.
# Upon failing to push appservice events the homeserver will wait
# an increasing amount of seconds between retries. This sets an upper limit of that (in seconds)
#
#appservice_max_backoff: 60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explicitly say what the default value/behaviour is, if this config key is not provided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done now in 727bbf8

synapse/config/appservice.py Outdated Show resolved Hide resolved
tadzik and others added 4 commits May 13, 2022 11:54
Co-authored-by: David Robertson <david.m.robertson1@gmail.com>
Co-authored-by: David Robertson <david.m.robertson1@gmail.com>
@tadzik
Copy link
Contributor Author

tadzik commented May 13, 2022

Broadly looks good, thanks. Could you add an entry to the config manual for this option, please? The source is https://github.com/matrix-org/synapse/blob/develop/docs/usage/configuration/config_documentation.md

This seems to be duplicating information from the sample config – for appservice configs it seems to be copied from it almost verbatim. If you didn't tell me otherwise, I'd assume it was autogenerated this way :)

Is there any plan/effort to make it generated, or does it serve some broader purpose? Is it expected/acceptable to just copypaste the description from config sample?

@DMRobertson
Copy link
Contributor

Broadly looks good, thanks. Could you add an entry to the config manual for this option, please? The source is https://github.com/matrix-org/synapse/blob/develop/docs/usage/configuration/config_documentation.md

This seems to be duplicating information from the sample config – for appservice configs it seems to be copied from it almost verbatim. If you didn't tell me otherwise, I'd assume it was autogenerated this way :)

Is there any plan/effort to make it generated, or does it serve some broader purpose? Is it expected/acceptable to just copypaste the description from config sample?

I had wondered this too. I don't know what the idea is there; perhaps @H-Shay or @anoadragon453 have a plan? See also #12368.

@DMRobertson DMRobertson removed their assignment May 13, 2022
@anoadragon453
Copy link
Member

I had wondered this too. I don't know what the idea is there; perhaps @H-Shay or @anoadragon453 have a plan? See also #12368

We're in a bit of an awkward in-between state now. The referenced PR mentions then pulling out the documentation from the config file, so that we're left with a minimal configuration file, where people can add more options at their leisure.

#8159 is the original issue for this, and we should take further discussion of it there.

@H-Shay
Copy link
Contributor

H-Shay commented May 13, 2022

Is there any plan/effort to make it generated, or does it serve some broader purpose? Is it expected/acceptable to just copypaste the description from config sample?

The broader purpose is for the manual to replace the comments in the config, so that the config becomes a more manageable file and the config documentation is indexed, searchable, and easier to read and link to. Right now it's totally fine to copy/paste your description from the config sample into the config manual as long as you follow the formatting in the manual. If all goes to plan in the future there will only be the config manual to update, no comments in the config so hopefully that will be a little clearer!

Comment on lines 1597 to 1601
# Set to establish maximum backoff (in seconds) between HS -> AS connection attempts.
# Upon failing to push appservice events the homeserver will wait
# an increasing amount of seconds between retries. This sets an upper limit of that (in seconds)
#
#appservice_max_backoff: 60
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem to say what the default behaviour is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that pretty much everything in the config file which is a duration is in milliseconds, so having a setting in seconds is confusing. Consider making it take a suffix like "s" for seconds, for consistency with all the other time period settings.

@richvdh
Copy link
Member

richvdh commented May 17, 2022

please could this PR be updated with a bit more information about why this is a useful thing to have? Generally we should strive to have solutions that work for everyone, rather than requiring users to configure everything, so I'd like to understand why that isn't possible.

question: should this go in the per-appservice config file rather than the main one? (a) the main config file already has quite enough options, thank you; (b) depending on why this is needed, it might be better to have different settings for different ASes?

@tadzik
Copy link
Contributor Author

tadzik commented May 20, 2022

Is there any plan/effort to make it generated, or does it serve some broader purpose? Is it expected/acceptable to just copypaste the description from config sample?

The broader purpose is for the manual to replace the comments in the config, so that the config becomes a more manageable file and the config documentation is indexed, searchable, and easier to read and link to. Right now it's totally fine to copy/paste your description from the config sample into the config manual as long as you follow the formatting in the manual. If all goes to plan in the future there will only be the config manual to update, no comments in the config so hopefully that will be a little clearer!

Alright – done now in 08f7d32

@tadzik tadzik requested review from richvdh and DMRobertson May 20, 2022 08:17
@richvdh richvdh requested review from a team and removed request for richvdh May 23, 2022 11:01
Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise I think this makes sense

#
# Regardless of this setting, the delay will never be longer than 512 seconds (about 8.5 minutes).
#
#appservice_max_backoff_s: 60
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this take a duration please? I.e. appservice_max_backoff: 60s instead. When reading the config you can use self.parse_duration(...).

Comment on lines +2295 to +2307
Config option: `appservice_max_backoff_s`

Set to establish a maximum backoff (in seconds) between HS -> AS connection attempts.
Upon failing to push appservice events, the homeserver will reattempt connection to the
application service after a delay. The delay increases with subsequent retries.
This value sets an upper limit on that delay.

Regardless of this setting, the delay will never be longer than 512 seconds (about 8.5 minutes),
which is the default behaviour if this option is not set.

Example configuration:
```yaml
appservice_max_backoff_s: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you misunderstood my earlier comment. The name of the setting should not have an s suffix, rather the value of the setting should be able to take a suffix (such as s for seconds, m for minutes), and if no suffix is present, it should be in milliseconds. That will make it consistent with other durations in the config file.

see for example redaction_retention_period.

@@ -2292,6 +2292,21 @@ Example configuration:
track_appservice_user_ips: true
```
---
Config option: `appservice_max_backoff_s`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I saw an answer on this before, so I'll ask again:

should this go in the per-appservice config file rather than the main one? (a) the main config file already has quite enough options; (b) depending on why this is needed, it might be better to have different settings for different ASes?

@DMRobertson DMRobertson added the X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged label Jun 6, 2022
@DMRobertson DMRobertson removed their request for review July 12, 2022 15:10
@DMRobertson
Copy link
Contributor

What's the status of this? Is there still merit in having this option?

AFAICS the outstanding points are:

(Is there a need for an option at all? Can we do something else which e.g. resets the backoff timer? E.g. a heartbeat/ping-pong mechanism between HS and AS for each to be convinced that the other is alive?)

@tadzik
Copy link
Contributor Author

tadzik commented Jul 22, 2022

What's the status of this? Is there still merit in having this option?

I think the merit is still there, though in ideal world I'd expect this to Just Work Reliably, without needing to configure more aggressive polling.

Can we do something else which e.g. resets the backoff timer?

Perhaps that's what we're missing. Most bridges I'm aware of do some variant of pinging Synapse on startup (usually checking if its user is registered), which should be a clear enough signal for Synapse that the AS is alive and worth talking to.

I'm happy to close this in favour of some future AS-request-resets-backoff-timer PR.

@tadzik
Copy link
Contributor Author

tadzik commented Oct 20, 2022

Closing as per my last comment.

@tadzik tadzik closed this Oct 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants