-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Fix rate limit metrics registering twice and misreporting #13649
Fix rate limit metrics registering twice and misreporting #13649
Conversation
LaterGauge( | ||
"synapse_rate_limit_sleep_affected_hosts", | ||
"Number of hosts that had requests put to sleep", | ||
["rate_limiter_name"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made the label rate_limiter_name
instead of name
because name
seemed pretty ambiguous in the context of synapse_rate_limit_sleep_affected_hosts
. Trying to clarify the question "name of what?"
_rate_limiter_instances: Set["FederationRateLimiter"] = set() | ||
# Protects the _rate_limiter_instances set from concurrent access | ||
_rate_limiter_instances_lock = threading.Lock() | ||
|
||
|
||
def _get_counts_from_rate_limiter_instance( | ||
count_func: Callable[["FederationRateLimiter"], int] | ||
) -> Mapping[Tuple[str, ...], int]: | ||
"""Returns a count of something (slept/rejected hosts) by (metrics_name)""" | ||
# Cast to a list to prevent it changing while the Prometheus | ||
# thread is collecting metrics | ||
with _rate_limiter_instances_lock: | ||
rate_limiter_instances = list(_rate_limiter_instances) | ||
|
||
# Map from (metrics_name,) -> int, the number of something like slept hosts | ||
# or rejected hosts. The key type is Tuple[str], but we leave the length | ||
# unspecified for compatability with LaterGauge's annotations. | ||
counts: Dict[Tuple[str, ...], int] = {} | ||
for rate_limiter_instance in rate_limiter_instances: | ||
# Only track metrics if they provided a `metrics_name` to | ||
# differentiate this instance of the rate limiter. | ||
if rate_limiter_instance.metrics_name: | ||
key = (rate_limiter_instance.metrics_name,) | ||
counts[key] = count_func(rate_limiter_instance) | ||
|
||
return counts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same pattern as
synapse/synapse/http/request_metrics.py
Lines 110 to 142 in c4e29b6
_in_flight_requests: Set["RequestMetrics"] = set() | |
# Protects the _in_flight_requests set from concurrent access | |
_in_flight_requests_lock = threading.Lock() | |
def _get_in_flight_counts() -> Mapping[Tuple[str, ...], int]: | |
"""Returns a count of all in flight requests by (method, server_name)""" | |
# Cast to a list to prevent it changing while the Prometheus | |
# thread is collecting metrics | |
with _in_flight_requests_lock: | |
reqs = list(_in_flight_requests) | |
for rm in reqs: | |
rm.update_metrics() | |
# Map from (method, name) -> int, the number of in flight requests of that | |
# type. The key type is Tuple[str, str], but we leave the length unspecified | |
# for compatability with LaterGauge's annotations. | |
counts: Dict[Tuple[str, ...], int] = {} | |
for rm in reqs: | |
key = (rm.method, rm.name) | |
counts[key] = counts.get(key, 0) + 1 | |
return counts | |
LaterGauge( | |
"synapse_http_server_in_flight_requests_count", | |
"", | |
["method", "servlet"], | |
_get_in_flight_counts, | |
) |
maybe_metrics_cm: ContextManager = contextlib.nullcontext() | ||
if self.metrics_name: | ||
maybe_metrics_cm = queue_wait_timer.labels(self.metrics_name).time() | ||
with start_active_span("ratelimit wait"), maybe_metrics_cm: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly worried about context
issues here again just because it's going a bit off the beaten path (please double-check) 😰
I've made metrics_name
optional because I'm not concerned about tracking the UsernameAvailabilityRestServlet
which is used to rate limit people by IP. I only care about tracking the federation servlets which FederationRateLimiter
was made for.
We could make metrics_name
required and record metrics for the username availability servlet to get rid of this complexity though.
d8ea855
to
7921b1c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this @richvdh @MadLittleMods?
@DMRobertson: sgtm. Shall we just merge that and get an RC out? |
Let's do it.
|
Given that |
I sanity checked this by grepping for "already registered" in logs. No matches from today. |
Thanks for the review @richvdh and @DMRobertson for making the changes to make this mergeable and confirming the duplicate registering message wasn't in the logs 🐸 |
Synapse 1.66.0 (2022-08-31) =========================== No significant changes since 1.66.0rc2. This release removes the ability for homeservers to delegate email ownership verification and password reset confirmation to identity servers. This removal was originally planned for Synapse 1.64, but was later deferred until now. See the [upgrade notes](https://matrix-org.github.io/synapse/v1.66/upgrade.html#upgrading-to-v1660) for more details. Deployments with multiple workers should note that the direct TCP replication configuration was deprecated in Synapse v1.18.0 and will be removed in Synapse v1.67.0. In particular, the TCP `replication` [listener](https://matrix-org.github.io/synapse/v1.66/usage/configuration/config_documentation.html#listeners) type (not to be confused with the `replication` resource on the `http` listener type) and the `worker_replication_port` config option will be removed . To migrate to Redis, add the [`redis` config](https://matrix-org.github.io/synapse/v1.66/workers.html#shared-configuration), then remove the TCP `replication` listener from config of the master and `worker_replication_port` from worker config. Note that a HTTP listener with a `replication` resource is still required. See the [worker documentation](https://matrix-org.github.io/synapse/v1.66/workers.html) for more details. Synapse 1.66.0rc2 (2022-08-30) ============================== Bugfixes -------- - Fix a bug introduced in Synapse 1.66.0rc1 where the new rate limit metrics were misreported (`synapse_rate_limit_sleep_affected_hosts`, `synapse_rate_limit_reject_affected_hosts`). ([\matrix-org#13649](matrix-org#13649)) Synapse 1.66.0rc1 (2022-08-23) ============================== Features -------- - Improve validation of request bodies for the following client-server API endpoints: [`/account/password`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3accountpassword), [`/account/password/email/requestToken`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3accountpasswordemailrequesttoken), [`/account/deactivate`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3accountdeactivate) and [`/account/3pid/email/requestToken`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidemailrequesttoken). ([\matrix-org#13188](matrix-org#13188), [\matrix-org#13563](matrix-org#13563)) - Add forgotten status to [Room Details Admin API](https://matrix-org.github.io/synapse/latest/admin_api/rooms.html#room-details-api). ([\matrix-org#13503](matrix-org#13503)) - Add an experimental implementation for [MSC3852 (Expose user agents on `Device`)](matrix-org/matrix-spec-proposals#3852). ([\matrix-org#13549](matrix-org#13549)) - Add `org.matrix.msc2716v4` experimental room version with updated content fields. Part of [MSC2716 (Importing history)](matrix-org/matrix-spec-proposals#2716). ([\matrix-org#13551](matrix-org#13551)) - Add support for compression to federation responses. ([\matrix-org#13537](matrix-org#13537)) - Improve performance of sending messages in rooms with thousands of local users. ([\matrix-org#13522](matrix-org#13522), [\matrix-org#13547](matrix-org#13547)) Bugfixes -------- - Faster room joins: make `/joined_members` block whilst the room is partial stated. ([\matrix-org#13514](matrix-org#13514)) - Fix a bug introduced in Synapse 1.21.0 where the [`/event_reports` Admin API](https://matrix-org.github.io/synapse/develop/admin_api/event_reports.html) could return a total count which was larger than the number of results you can actually query for. ([\matrix-org#13525](matrix-org#13525)) - Fix a bug introduced in Synapse 1.52.0 where sending server notices fails if `max_avatar_size` or `allowed_avatar_mimetypes` is set and not `system_mxid_avatar_url`. ([\matrix-org#13566](matrix-org#13566)) - Fix a bug where the `opentracing.force_tracing_for_users` config option would not apply to [`/sendToDevice`](https://spec.matrix.org/v1.3/client-server-api/#put_matrixclientv3sendtodeviceeventtypetxnid) and [`/keys/upload`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3keysupload) requests. ([\matrix-org#13574](matrix-org#13574)) Improved Documentation ---------------------- - Add `openssl` example for generating registration HMAC digest. ([\matrix-org#13472](matrix-org#13472)) - Tidy up Synapse's README. ([\matrix-org#13491](matrix-org#13491)) - Document that event purging related to the `redaction_retention_period` config option is executed only every 5 minutes. ([\matrix-org#13492](matrix-org#13492)) - Add a warning to retention documentation regarding the possibility of database corruption. ([\matrix-org#13497](matrix-org#13497)) - Document that the `DOCKER_BUILDKIT=1` flag is needed to build the docker image. ([\matrix-org#13515](matrix-org#13515)) - Add missing links in `user_consent` section of configuration manual. ([\matrix-org#13536](matrix-org#13536)) - Fix the doc and some warnings that were referring to the nonexistent `custom_templates_directory` setting (instead of `custom_template_directory`). ([\matrix-org#13538](matrix-org#13538)) Deprecations and Removals ------------------------- - Remove the ability for homeservers to delegate email ownership verification and password reset confirmation to identity servers. See [upgrade notes](https://matrix-org.github.io/synapse/v1.66/upgrade.html#upgrading-to-v1660) for more details. Internal Changes ---------------- - Update the rejected state of events during de-partial-stating. ([\matrix-org#13459](matrix-org#13459)) - Avoid blocking lazy-loading `/sync`s during partial joins due to remote memberships. Pull remote memberships from auth events instead of the room state. ([\matrix-org#13477](matrix-org#13477)) - Refuse to start when faster joins is enabled on a deployment with workers, since worker configurations are not currently supported. ([\matrix-org#13531](matrix-org#13531)) - Allow use of both `@trace` and `@tag_args` stacked on the same function. ([\matrix-org#13453](matrix-org#13453)) - Instrument the federation/backfill part of `/messages` for understandable traces in Jaeger. ([\matrix-org#13489](matrix-org#13489)) - Instrument `FederationStateIdsServlet` (`/state_ids`) for understandable traces in Jaeger. ([\matrix-org#13499](matrix-org#13499), [\matrix-org#13554](matrix-org#13554)) - Track HTTP response times over 10 seconds from `/messages` (`synapse_room_message_list_rest_servlet_response_time_seconds`). ([\matrix-org#13533](matrix-org#13533)) - Add metrics to track how the rate limiter is affecting requests (sleep/reject). ([\matrix-org#13534](matrix-org#13534), [\matrix-org#13541](matrix-org#13541)) - Add metrics to time how long it takes us to do backfill processing (`synapse_federation_backfill_processing_before_time_seconds`, `synapse_federation_backfill_processing_after_time_seconds`). ([\matrix-org#13535](matrix-org#13535), [\matrix-org#13584](matrix-org#13584)) - Add metrics to track rate limiter queue timing (`synapse_rate_limit_queue_wait_time_seconds`). ([\matrix-org#13544](matrix-org#13544)) - Update metrics to track `/messages` response time by room size. ([\matrix-org#13545](matrix-org#13545)) - Refactor methods in `synapse.api.auth.Auth` to use `Requester` objects everywhere instead of user IDs. ([\matrix-org#13024](matrix-org#13024)) - Clean-up tests for notifications. ([\matrix-org#13471](matrix-org#13471)) - Add some miscellaneous comments to document sync, especially around `compute_state_delta`. ([\matrix-org#13474](matrix-org#13474)) - Use literals in place of `HTTPStatus` constants in tests. ([\matrix-org#13479](matrix-org#13479), [\matrix-org#13488](matrix-org#13488)) - Add comments about how event push actions are rotated. ([\matrix-org#13485](matrix-org#13485)) - Modify HTML template content to better support mobile devices' screen sizes. ([\matrix-org#13493](matrix-org#13493)) - Add a linter script which will reject non-strict types in Pydantic models. ([\matrix-org#13502](matrix-org#13502)) - Reduce the number of tests using legacy TCP replication. ([\matrix-org#13543](matrix-org#13543)) - Allow specifying additional request fields when using the `HomeServerTestCase.login` helper method. ([\matrix-org#13549](matrix-org#13549)) - Make `HomeServerTestCase` load any configured homeserver modules automatically. ([\matrix-org#13558](matrix-org#13558)) # -----BEGIN PGP SIGNATURE----- # # iQGzBAABCgAdFiEEWMTnW8Z8khaaf90R+84KzgcyGG8FAmMPT8QACgkQ+84Kzgcy # GG9CUAv+Pv/iDpE2jKlV7zQ/cagaKCGsFK5jy0+K9Wr215nP89tuhU37bJXsgvVu # GP3A8k1c/ENPhXwYHLCnnxV3jick1FuVE0W6h0j2PMYeIGNCQhDswytnsQO4JExg # fGLL4ygCzpe8bFX9+mhIM4z8xkZjZX3lIa8CN2LtRLIo0m7qoT1ZWqdt7kAjj5yL # XMk+3Y1yq/Y4SHHqgKurBNdwNcwnv7ynchWxTYa12WVTINt26dLV0Syk3p8u2SLl # 5YNzcDs2TAM7+VxAu7E0AQl426+Ufi122Oj1ZBUG2FxTPLH8Xr18cN2M/at6WxoX # 8pOkGiuahKKvahw1iCoHAGIC66gFIPxBE9xW4R2SKrQtG4sDuKJI0kvunRV8+cy5 # TuJ9cmdDmJR2vj3P3OULqLXGkWsGNJqfZZF8OWkHEI8LUIXZLrAZocFtlonkr9rV # Y8r8LxL8Id1rbHAnCXcJnYdaJ6ol0RIObDFpitY/D8BDUONVw/byeOyAEkq/XPrZ # Ke/9K8sy # =eg1L # -----END PGP SIGNATURE----- # gpg: Signature made Wed Aug 31 13:10:44 2022 BST # gpg: using RSA key 58C4E75BC67C92169A7FDD11FBCE0ACE0732186F # gpg: Can't check signature: No public key # Conflicts: # synapse/api/auth.py # synapse/push/baserules.py # synapse/push/bulk_push_rule_evaluator.py # synapse/push/push_rule_evaluator.py # synapse/storage/databases/main/event_push_actions.py # tests/server_notices/test_resource_limits_server_notices.py
Fix rate limit gauge metrics registering twice and misreporting by registering once and adding a label to differentiate the instances (although we only track the federation servlets at the moment). The gauge metrics were originally added in #13541
FederationRateLimiter
is being instantiated twice in Synapse:UsernameAvailabilityRestServlet
which uses it to rate limit people by IPFederationRateLimiter
was made forFix #13641
Dev notes
https://github.com/matrix-org/synapse/blob/c4e29b6908ac8ae57b5e9a3e7662ad638b61e94a/docs/metrics-howto.md
prometheus.yml
Pull Request Checklist
EventStore
toEventWorkerStore
.".code blocks
.Pull request includes a sign off(run the linters)