Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Make sure synapse_rate_limit_reject_affected_hosts does what it says it does #13670

Closed
MadLittleMods opened this issue Aug 30, 2022 · 5 comments
Labels
A-Federation A-Metrics metrics, measures, stuff we put in Prometheus O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@MadLittleMods
Copy link
Contributor

Spawning from #13541 (comment)


The synapse_rate_limit_reject_affected_hosts gauge is always evaluating to 0. The raw data in Prometheus also shows 0 for reference.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=1661855196368&to=1661876796368&viewPanel=220

synapse_rate_limit_reject_affected_hosts all 0 in the graph

Even though we see individual requests being rejected (synapse_rate_limit_reject_total) which should mean at least 1 host,

synapse_rate_limit_reject_total with some activity in the graph

But this could be a mismatch in how the guages were being reported because we were accidentally registering them twice, #13641


Now that we fixed the duplicate metric registering issue in #13649 and the fix was put on matrix.org this morning, we're seeing both at 0 now. This could mean that the previous rejections we were seeing were all from the UsernameAvailabilityRestServlet which we are no longer tracking. And we're not rejecting any requests in the federation servlets.

It is a bit suspicious though.

synapse_rate_limit_reject_affected_hosts and synapse_rate_limit_reject_total being 0 in the graph after matrix.org deploy

How can we know if it's right?

In order to confirm that synapse_rate_limit_reject_affected_hosts is working, it would be nice to see a non-zero value.

The reject_limit is 50 which I think means there has to be more than 50 requests within the 1 second federation_rc_window_size to start rejecting.

We do see the rate of slept requests go above 70 sometimes which I would expect to trigger this 🤔

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=1661873701156&to=1661877301156&viewPanel=223

synapse_rate_limit_sleep_total showing spikes above the 50 reject_limit

Dev notes

The synapse_rate_limit_reject_affected_hosts metric was originally added in #13541 and updated in #13649

@MadLittleMods MadLittleMods added A-Federation A-Metrics metrics, measures, stuff we put in Prometheus labels Aug 30, 2022
@DMRobertson
Copy link
Contributor

Erm, isn't that sum is a sum across all workers?

I'm anxious that two different workers might separately decide to rate-limit example.com. If so, that'd mean that example.com is double-counted as two hosts in these graphs (I think?)

@DMRobertson DMRobertson added S-Minor Blocks non-critical functionality, workarounds exist. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. O-Uncommon Most users are unlikely to come across this or unexpected workflow labels Aug 30, 2022
@MadLittleMods
Copy link
Contributor Author

@DMRobertson Yes but it's the best we can do to count hosts AFAIK

@MadLittleMods
Copy link
Contributor Author

MadLittleMods commented Sep 6, 2022

Since the deploy, we have at least seen a blip of rejected requests on matrix.org but it looks like it wasn't sustained enough to be caught be the 15 second /collect interval that the rejected host gauge runs on.

Same blip (https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=1662486044397&to=1662486894994&viewPanel=206):

@MadLittleMods
Copy link
Contributor Author

MadLittleMods commented Oct 20, 2022

We finally saw something to show up on the "Number of hosts being rejected by the rate limiter" graph (the second graph the image). But it doesn't correspond to any spikes in the number of rate limit rejected requests 🤔

https://grafana.matrix.org/d/000000012/synapse?orgId=1&from=1664997566912&to=1666293566912&viewPanel=237
Blips of 1 on the "Number of hosts being rejected by the rate limiter" graph

@MadLittleMods
Copy link
Contributor Author

MadLittleMods commented Dec 12, 2022

Seeing this more active on other hosts (libera.chat):

Going to close as it seems to count something.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federation A-Metrics metrics, measures, stuff we put in Prometheus O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

No branches or pull requests

2 participants