Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix incorrect juggling of logging contexts in _PerHostRatelimiter #13554

Merged
merged 9 commits into from
Aug 18, 2022

Conversation

squahtx
Copy link
Contributor

@squahtx squahtx commented Aug 18, 2022

Introduced in #13499.
Likely to be the cause of most of the errors seen in #13552.

Signed-off-by: Sean Quah seanq@matrix.org

@squahtx squahtx requested a review from a team as a code owner August 18, 2022 12:44
Introduced in #13499.

Signed-off-by: Sean Quah <seanq@matrix.org>
@squahtx squahtx force-pushed the squah/fix__perhostratelimiter_spans branch from 104fdf5 to a30c01b Compare August 18, 2022 12:48
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm either way

synapse/util/ratelimitutils.py Outdated Show resolved Hide resolved
Comment on lines 238 to 244
async def _trace_wait(self, d: "defer.Deferred[None]") -> None:
"""Waits for a given `Deferred` to resolve, while timing it for metrics.

Args:
d: The `Deferred` to wait for. Must not follow the Synapse logging context
rules.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finding this slightly hard to wrap my head around. Rather than passing in a deferred, I would pass in the request id and have it call _on_enter. And do some renaming of things.

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

synapse/util/ratelimitutils.py Show resolved Hide resolved
@squahtx squahtx merged commit b251cff into develop Aug 18, 2022
@squahtx squahtx deleted the squah/fix__perhostratelimiter_spans branch August 18, 2022 15:26
@@ -154,7 +154,7 @@ def ratelimit(self, host: str) -> "Iterator[defer.Deferred[None]]":
self.host = host

request_id = object()
ret = self._on_enter(request_id)
ret = defer.ensureDeferred(self._on_enter_with_tracing(request_id))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there discussion somewhere about how this fixes the log context? Trying to learn and understand.

Copy link
Member

@richvdh richvdh Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MadLittleMods there was some discussion in Backend Internal (https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$xqpsbzDjJpmqAt9xTBtpS6qAAWXrBv3N_HkjwwK2QoM?via=jki.re&via=matrix.org&via=element.io), but I don't think there's much detailed discussion.

As @squahtx said:

iirc make_deferred_yieldable stashes away the current logcontext and adds its own callback to the deferred to restore it
and each opentracing spans "wraps" its own logcontext (more or less)

In other words: make_deferred_yieldable adds a callback to the deferred, and that callback would restore the logcontext created by start_active_span once the ratelimit wait completed.

It might also be instructive to look at @squahtx's initial fix for this, which effectively just moved the call to make_deferred_yieldable to before the call to start_active_span. That was an entirely valid fix, we just decided that doing it this way felt more natural and more obviously correct.

The lesson I take here (other than that logcontexts are confusing, which we knew) is that __enter__ and __exit__ are hazardous materials and it's best to avoid messing with them directly where possible.

Copy link
Contributor Author

@squahtx squahtx Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a subtlety in the way spans, logging contexts and make_deferred_yieldable work.

Spans are implemented on top of logging contexts. That is, every span creates its own nested logging context (minus an optimization).

# parent span's logging context active
span = start_active_span(...)
span.__enter__(...)
# child span's logging context is now active

ret_defer.addCallbacks(on_start, on_err)
ret_defer.addBoth(on_both)  # <-- `span.__exit__` is called inside `on_both`
return make_deferred_yieldable(ret_defer)

Just before make_deferred_yieldable is called, the active logging context is the one associated with span.

make_deferred_yieldable is a little big magical. It clears the active logging context and attaches a callback to ret_defer to restore the logging context after the deferred has resolved. With the previous code, the order of callbacks on ret_defer was:

  1. [no logging context yet] (on_start, on_err)
  2. [no logging context yet] on_both, calling span.__exit__ and restoring the parent span's logging context.
  3. [parent span's logging context] make_deferred_yieldable's callback to restore the child span's logging context.
  4. [child span's logging context]

But we need 2 and 3 to be the other way around. The child logging context needs to be restored, then __exit__ed, which makes it restore the parent logging context.

Copy link
Contributor

@MadLittleMods MadLittleMods Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is subtle!

It might also be instructive to look at @squahtx's initial fix for this, which effectively just moved the call to make_deferred_yieldable to before the call to start_active_span. That was an entirely valid fix, we just decided that doing it this way felt more natural and more obviously correct.

@squahtx's initial fix is useful to look at 👍 But I think the initial fix does something a little different than this description. It moves the span.__exit__() after the make_deferred_yieldable so that make_deferred_yieldable's callback restores span's logging context. And then span.__exit__() restores the parent context. From my understanding, this also aligns with what @squahtx mentions is the subtlety.

But moving the call to make_deferred_yieldable to before the call to start_active_span sounds like it would work as well. That way the span.__exit__() would restore the nested context and then make_deferred_yieldable's callback would restore the parent logging context.

In any case, we want the parent logging context restored when we get out of everything (fold and unfold in the correct order).


Now in the final form of the fix, the span is entered via the with context manager so it has the outermost parent context, then make_deferred_yieldable is called which is in the span context. And because we are using the proper with nesting context manager stuff it also properly unfolds in the correct order: make_deferred_yieldable's callback restores the span context, and the span context manager exit restores the parent context.

🎉

Thanks for explaining @richvdh and @squahtx ❤️


The key pitfall in my change is not knowing the magic in make_deferred_yieldable:

make_deferred_yieldable is a little big magical. It clears the active logging context and attaches a callback to ret_defer to restore the logging context after the deferred has resolved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, sorry, my description of @squahtx's fix was incorrect.

But moving the call to make_deferred_yieldable to before the call to start_active_span sounds like it would work as well.

Actually it seems it wouldn't. We tried something equivalent in this thread and as @squahtx points out there are other problems.

@MadLittleMods MadLittleMods added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. A-Metrics metrics, measures, stuff we put in Prometheus A-Logging Synapse's logs (structured or otherwise). Not metrics. labels Aug 18, 2022
DMRobertson pushed a commit that referenced this pull request Aug 23, 2022
Synapse 1.66.0rc1 (2022-08-23)
==============================

This release removes the ability for homeservers to delegate email ownership
verification and password reset confirmation to identity servers. This removal
was originally planned for Synapse 1.64, but was later deferred until now.

See the [upgrade notes](https://matrix-org.github.io/synapse/v1.66/upgrade.html#upgrading-to-v1660) for more details.

Features
--------

- Improve validation of request bodies for the following client-server API endpoints: [`/account/password`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3accountpassword), [`/account/password/email/requestToken`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3accountpasswordemailrequesttoken), [`/account/deactivate`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3accountdeactivate) and [`/account/3pid/email/requestToken`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidemailrequesttoken). ([\#13188](#13188), [\#13563](#13563))
- Add forgotten status to [Room Details Admin API](https://matrix-org.github.io/synapse/latest/admin_api/rooms.html#room-details-api). ([\#13503](#13503))
- Add an experimental implementation for [MSC3852 (Expose user agents on `Device`)](matrix-org/matrix-spec-proposals#3852). ([\#13549](#13549))
- Add `org.matrix.msc2716v4` experimental room version with updated content fields. Part of [MSC2716 (Importing history)](matrix-org/matrix-spec-proposals#2716).  ([\#13551](#13551))
- Add support for compression to federation responses. ([\#13537](#13537))
- Improve performance of sending messages in rooms with thousands of local users. ([\#13522](#13522), [\#13547](#13547))

Bugfixes
--------

- Faster room joins: make `/joined_members` block whilst the room is partial stated. ([\#13514](#13514))
- Fix a bug introduced in Synapse 1.21.0 where the [`/event_reports` Admin API](https://matrix-org.github.io/synapse/develop/admin_api/event_reports.html) could return a total count which was larger than the number of results you can actually query for. ([\#13525](#13525))
- Fix a bug introduced in Synapse 1.52.0 where sending server notices fails if `max_avatar_size` or `allowed_avatar_mimetypes` is set and not `system_mxid_avatar_url`. ([\#13566](#13566))
- Fix a bug where the `opentracing.force_tracing_for_users` config option would not apply to [`/sendToDevice`](https://spec.matrix.org/v1.3/client-server-api/#put_matrixclientv3sendtodeviceeventtypetxnid) and [`/keys/upload`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3keysupload) requests. ([\#13574](#13574))

Improved Documentation
----------------------

- Add `openssl` example for generating registration HMAC digest. ([\#13472](#13472))
- Tidy up Synapse's README. ([\#13491](#13491))
- Document that event purging related to the `redaction_retention_period` config option is executed only every 5 minutes. ([\#13492](#13492))
- Add a warning to retention documentation regarding the possibility of database corruption. ([\#13497](#13497))
- Document that the `DOCKER_BUILDKIT=1` flag is needed to build the docker image. ([\#13515](#13515))
- Add missing links in `user_consent` section of configuration manual. ([\#13536](#13536))
- Fix the doc and some warnings that were referring to the nonexistent `custom_templates_directory` setting (instead of `custom_template_directory`). ([\#13538](#13538))

Deprecations and Removals
-------------------------

- Remove the ability for homeservers to delegate email ownership verification
  and password reset confirmation to identity servers. See [upgrade notes](https://matrix-org.github.io/synapse/v1.66/upgrade.html#upgrading-to-v1660) for more details.

Internal Changes
----------------

- Update the rejected state of events during de-partial-stating. ([\#13459](#13459))
- Avoid blocking lazy-loading `/sync`s during partial joins due to remote memberships. Pull remote memberships from auth events instead of the room state. ([\#13477](#13477))
- Refuse to start when faster joins is enabled on a deployment with workers, since worker configurations are not currently supported. ([\#13531](#13531))

- Allow use of both `@trace` and `@tag_args` stacked on the same function. ([\#13453](#13453))
- Instrument the federation/backfill part of `/messages` for understandable traces in Jaeger. ([\#13489](#13489))
- Instrument `FederationStateIdsServlet` (`/state_ids`) for understandable traces in Jaeger. ([\#13499](#13499), [\#13554](#13554))
- Track HTTP response times over 10 seconds from `/messages` (`synapse_room_message_list_rest_servlet_response_time_seconds`). ([\#13533](#13533))
- Add metrics to track how the rate limiter is affecting requests (sleep/reject). ([\#13534](#13534), [\#13541](#13541))
- Add metrics to time how long it takes us to do backfill processing (`synapse_federation_backfill_processing_before_time_seconds`, `synapse_federation_backfill_processing_after_time_seconds`). ([\#13535](#13535), [\#13584](#13584))
- Add metrics to track rate limiter queue timing (`synapse_rate_limit_queue_wait_time_seconds`). ([\#13544](#13544))
- Update metrics to track `/messages` response time by room size. ([\#13545](#13545))

- Refactor methods in `synapse.api.auth.Auth` to use `Requester` objects everywhere instead of user IDs. ([\#13024](#13024))
- Clean-up tests for notifications. ([\#13471](#13471))
- Add some miscellaneous comments to document sync, especially around `compute_state_delta`. ([\#13474](#13474))
- Use literals in place of `HTTPStatus` constants in tests. ([\#13479](#13479), [\#13488](#13488))
- Add comments about how event push actions are rotated. ([\#13485](#13485))
- Modify HTML template content to better support mobile devices' screen sizes. ([\#13493](#13493))
- Add a linter script which will reject non-strict types in Pydantic models. ([\#13502](#13502))
- Reduce the number of tests using legacy TCP replication. ([\#13543](#13543))
- Allow specifying additional request fields when using the `HomeServerTestCase.login` helper method. ([\#13549](#13549))
- Make `HomeServerTestCase` load any configured homeserver modules automatically. ([\#13558](#13558))
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Sep 1, 2022
Synapse 1.66.0 (2022-08-31)
===========================

No significant changes since 1.66.0rc2.

This release removes the ability for homeservers to delegate email ownership
verification and password reset confirmation to identity servers. This removal
was originally planned for Synapse 1.64, but was later deferred until now. See
the [upgrade notes](https://matrix-org.github.io/synapse/v1.66/upgrade.html#upgrading-to-v1660) for more details.

Deployments with multiple workers should note that the direct TCP replication
configuration was deprecated in Synapse v1.18.0 and will be removed in Synapse
v1.67.0. In particular, the TCP `replication` [listener](https://matrix-org.github.io/synapse/v1.66/usage/configuration/config_documentation.html#listeners)
type (not to be confused with the `replication` resource on the `http` listener
type) and the `worker_replication_port` config option will be removed .

To migrate to Redis, add the [`redis` config](https://matrix-org.github.io/synapse/v1.66/workers.html#shared-configuration),
then remove the TCP `replication` listener from config of the master and
`worker_replication_port` from worker config. Note that a HTTP listener with a
`replication` resource is still required. See the
[worker documentation](https://matrix-org.github.io/synapse/v1.66/workers.html)
for more details.

Synapse 1.66.0rc2 (2022-08-30)
==============================

Bugfixes
--------

- Fix a bug introduced in Synapse 1.66.0rc1 where the new rate limit metrics were misreported (`synapse_rate_limit_sleep_affected_hosts`, `synapse_rate_limit_reject_affected_hosts`). ([\matrix-org#13649](matrix-org#13649))

Synapse 1.66.0rc1 (2022-08-23)
==============================

Features
--------

- Improve validation of request bodies for the following client-server API endpoints: [`/account/password`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3accountpassword), [`/account/password/email/requestToken`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3accountpasswordemailrequesttoken), [`/account/deactivate`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3accountdeactivate) and [`/account/3pid/email/requestToken`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3account3pidemailrequesttoken). ([\matrix-org#13188](matrix-org#13188), [\matrix-org#13563](matrix-org#13563))
- Add forgotten status to [Room Details Admin API](https://matrix-org.github.io/synapse/latest/admin_api/rooms.html#room-details-api). ([\matrix-org#13503](matrix-org#13503))
- Add an experimental implementation for [MSC3852 (Expose user agents on `Device`)](matrix-org/matrix-spec-proposals#3852). ([\matrix-org#13549](matrix-org#13549))
- Add `org.matrix.msc2716v4` experimental room version with updated content fields. Part of [MSC2716 (Importing history)](matrix-org/matrix-spec-proposals#2716).  ([\matrix-org#13551](matrix-org#13551))
- Add support for compression to federation responses. ([\matrix-org#13537](matrix-org#13537))
- Improve performance of sending messages in rooms with thousands of local users. ([\matrix-org#13522](matrix-org#13522), [\matrix-org#13547](matrix-org#13547))

Bugfixes
--------

- Faster room joins: make `/joined_members` block whilst the room is partial stated. ([\matrix-org#13514](matrix-org#13514))
- Fix a bug introduced in Synapse 1.21.0 where the [`/event_reports` Admin API](https://matrix-org.github.io/synapse/develop/admin_api/event_reports.html) could return a total count which was larger than the number of results you can actually query for. ([\matrix-org#13525](matrix-org#13525))
- Fix a bug introduced in Synapse 1.52.0 where sending server notices fails if `max_avatar_size` or `allowed_avatar_mimetypes` is set and not `system_mxid_avatar_url`. ([\matrix-org#13566](matrix-org#13566))
- Fix a bug where the `opentracing.force_tracing_for_users` config option would not apply to [`/sendToDevice`](https://spec.matrix.org/v1.3/client-server-api/#put_matrixclientv3sendtodeviceeventtypetxnid) and [`/keys/upload`](https://spec.matrix.org/v1.3/client-server-api/#post_matrixclientv3keysupload) requests. ([\matrix-org#13574](matrix-org#13574))

Improved Documentation
----------------------

- Add `openssl` example for generating registration HMAC digest. ([\matrix-org#13472](matrix-org#13472))
- Tidy up Synapse's README. ([\matrix-org#13491](matrix-org#13491))
- Document that event purging related to the `redaction_retention_period` config option is executed only every 5 minutes. ([\matrix-org#13492](matrix-org#13492))
- Add a warning to retention documentation regarding the possibility of database corruption. ([\matrix-org#13497](matrix-org#13497))
- Document that the `DOCKER_BUILDKIT=1` flag is needed to build the docker image. ([\matrix-org#13515](matrix-org#13515))
- Add missing links in `user_consent` section of configuration manual. ([\matrix-org#13536](matrix-org#13536))
- Fix the doc and some warnings that were referring to the nonexistent `custom_templates_directory` setting (instead of `custom_template_directory`). ([\matrix-org#13538](matrix-org#13538))

Deprecations and Removals
-------------------------

- Remove the ability for homeservers to delegate email ownership verification
  and password reset confirmation to identity servers. See [upgrade notes](https://matrix-org.github.io/synapse/v1.66/upgrade.html#upgrading-to-v1660) for more details.

Internal Changes
----------------

- Update the rejected state of events during de-partial-stating. ([\matrix-org#13459](matrix-org#13459))
- Avoid blocking lazy-loading `/sync`s during partial joins due to remote memberships. Pull remote memberships from auth events instead of the room state. ([\matrix-org#13477](matrix-org#13477))
- Refuse to start when faster joins is enabled on a deployment with workers, since worker configurations are not currently supported. ([\matrix-org#13531](matrix-org#13531))

- Allow use of both `@trace` and `@tag_args` stacked on the same function. ([\matrix-org#13453](matrix-org#13453))
- Instrument the federation/backfill part of `/messages` for understandable traces in Jaeger. ([\matrix-org#13489](matrix-org#13489))
- Instrument `FederationStateIdsServlet` (`/state_ids`) for understandable traces in Jaeger. ([\matrix-org#13499](matrix-org#13499), [\matrix-org#13554](matrix-org#13554))
- Track HTTP response times over 10 seconds from `/messages` (`synapse_room_message_list_rest_servlet_response_time_seconds`). ([\matrix-org#13533](matrix-org#13533))
- Add metrics to track how the rate limiter is affecting requests (sleep/reject). ([\matrix-org#13534](matrix-org#13534), [\matrix-org#13541](matrix-org#13541))
- Add metrics to time how long it takes us to do backfill processing (`synapse_federation_backfill_processing_before_time_seconds`, `synapse_federation_backfill_processing_after_time_seconds`). ([\matrix-org#13535](matrix-org#13535), [\matrix-org#13584](matrix-org#13584))
- Add metrics to track rate limiter queue timing (`synapse_rate_limit_queue_wait_time_seconds`). ([\matrix-org#13544](matrix-org#13544))
- Update metrics to track `/messages` response time by room size. ([\matrix-org#13545](matrix-org#13545))

- Refactor methods in `synapse.api.auth.Auth` to use `Requester` objects everywhere instead of user IDs. ([\matrix-org#13024](matrix-org#13024))
- Clean-up tests for notifications. ([\matrix-org#13471](matrix-org#13471))
- Add some miscellaneous comments to document sync, especially around `compute_state_delta`. ([\matrix-org#13474](matrix-org#13474))
- Use literals in place of `HTTPStatus` constants in tests. ([\matrix-org#13479](matrix-org#13479), [\matrix-org#13488](matrix-org#13488))
- Add comments about how event push actions are rotated. ([\matrix-org#13485](matrix-org#13485))
- Modify HTML template content to better support mobile devices' screen sizes. ([\matrix-org#13493](matrix-org#13493))
- Add a linter script which will reject non-strict types in Pydantic models. ([\matrix-org#13502](matrix-org#13502))
- Reduce the number of tests using legacy TCP replication. ([\matrix-org#13543](matrix-org#13543))
- Allow specifying additional request fields when using the `HomeServerTestCase.login` helper method. ([\matrix-org#13549](matrix-org#13549))
- Make `HomeServerTestCase` load any configured homeserver modules automatically. ([\matrix-org#13558](matrix-org#13558))

# -----BEGIN PGP SIGNATURE-----
#
# iQGzBAABCgAdFiEEWMTnW8Z8khaaf90R+84KzgcyGG8FAmMPT8QACgkQ+84Kzgcy
# GG9CUAv+Pv/iDpE2jKlV7zQ/cagaKCGsFK5jy0+K9Wr215nP89tuhU37bJXsgvVu
# GP3A8k1c/ENPhXwYHLCnnxV3jick1FuVE0W6h0j2PMYeIGNCQhDswytnsQO4JExg
# fGLL4ygCzpe8bFX9+mhIM4z8xkZjZX3lIa8CN2LtRLIo0m7qoT1ZWqdt7kAjj5yL
# XMk+3Y1yq/Y4SHHqgKurBNdwNcwnv7ynchWxTYa12WVTINt26dLV0Syk3p8u2SLl
# 5YNzcDs2TAM7+VxAu7E0AQl426+Ufi122Oj1ZBUG2FxTPLH8Xr18cN2M/at6WxoX
# 8pOkGiuahKKvahw1iCoHAGIC66gFIPxBE9xW4R2SKrQtG4sDuKJI0kvunRV8+cy5
# TuJ9cmdDmJR2vj3P3OULqLXGkWsGNJqfZZF8OWkHEI8LUIXZLrAZocFtlonkr9rV
# Y8r8LxL8Id1rbHAnCXcJnYdaJ6ol0RIObDFpitY/D8BDUONVw/byeOyAEkq/XPrZ
# Ke/9K8sy
# =eg1L
# -----END PGP SIGNATURE-----
# gpg: Signature made Wed Aug 31 13:10:44 2022 BST
# gpg:                using RSA key 58C4E75BC67C92169A7FDD11FBCE0ACE0732186F
# gpg: Can't check signature: No public key

# Conflicts:
#	synapse/api/auth.py
#	synapse/push/baserules.py
#	synapse/push/bulk_push_rule_evaluator.py
#	synapse/push/push_rule_evaluator.py
#	synapse/storage/databases/main/event_push_actions.py
#	tests/server_notices/test_resource_limits_server_notices.py
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Logging Synapse's logs (structured or otherwise). Not metrics. A-Metrics metrics, measures, stuff we put in Prometheus T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants