Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix incorrect juggling of logging contexts in _PerHostRatelimiter #13554

Merged
merged 9 commits into from
Aug 18, 2022
1 change: 1 addition & 0 deletions changelog.d/13554.misc
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Instrument `FederationStateIdsServlet` (`/state_ids`) for understandable traces in Jaeger.
15 changes: 5 additions & 10 deletions synapse/util/ratelimitutils.py
Original file line number Diff line number Diff line change
Expand Up @@ -154,7 +154,7 @@ def ratelimit(self, host: str) -> "Iterator[defer.Deferred[None]]":
self.host = host

request_id = object()
ret = self._on_enter(request_id)
ret = defer.ensureDeferred(self._on_enter_with_tracing(request_id))
MadLittleMods marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there discussion somewhere about how this fixes the log context? Trying to learn and understand.

Copy link
Member

@richvdh richvdh Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MadLittleMods there was some discussion in Backend Internal (https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$xqpsbzDjJpmqAt9xTBtpS6qAAWXrBv3N_HkjwwK2QoM?via=jki.re&via=matrix.org&via=element.io), but I don't think there's much detailed discussion.

As @squahtx said:

iirc make_deferred_yieldable stashes away the current logcontext and adds its own callback to the deferred to restore it
and each opentracing spans "wraps" its own logcontext (more or less)

In other words: make_deferred_yieldable adds a callback to the deferred, and that callback would restore the logcontext created by start_active_span once the ratelimit wait completed.

It might also be instructive to look at @squahtx's initial fix for this, which effectively just moved the call to make_deferred_yieldable to before the call to start_active_span. That was an entirely valid fix, we just decided that doing it this way felt more natural and more obviously correct.

The lesson I take here (other than that logcontexts are confusing, which we knew) is that __enter__ and __exit__ are hazardous materials and it's best to avoid messing with them directly where possible.

Copy link
Contributor Author

@squahtx squahtx Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a subtlety in the way spans, logging contexts and make_deferred_yieldable work.

Spans are implemented on top of logging contexts. That is, every span creates its own nested logging context (minus an optimization).

# parent span's logging context active
span = start_active_span(...)
span.__enter__(...)
# child span's logging context is now active

ret_defer.addCallbacks(on_start, on_err)
ret_defer.addBoth(on_both)  # <-- `span.__exit__` is called inside `on_both`
return make_deferred_yieldable(ret_defer)

Just before make_deferred_yieldable is called, the active logging context is the one associated with span.

make_deferred_yieldable is a little big magical. It clears the active logging context and attaches a callback to ret_defer to restore the logging context after the deferred has resolved. With the previous code, the order of callbacks on ret_defer was:

  1. [no logging context yet] (on_start, on_err)
  2. [no logging context yet] on_both, calling span.__exit__ and restoring the parent span's logging context.
  3. [parent span's logging context] make_deferred_yieldable's callback to restore the child span's logging context.
  4. [child span's logging context]

But we need 2 and 3 to be the other way around. The child logging context needs to be restored, then __exit__ed, which makes it restore the parent logging context.

Copy link
Contributor

@MadLittleMods MadLittleMods Aug 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is subtle!

It might also be instructive to look at @squahtx's initial fix for this, which effectively just moved the call to make_deferred_yieldable to before the call to start_active_span. That was an entirely valid fix, we just decided that doing it this way felt more natural and more obviously correct.

@squahtx's initial fix is useful to look at 👍 But I think the initial fix does something a little different than this description. It moves the span.__exit__() after the make_deferred_yieldable so that make_deferred_yieldable's callback restores span's logging context. And then span.__exit__() restores the parent context. From my understanding, this also aligns with what @squahtx mentions is the subtlety.

But moving the call to make_deferred_yieldable to before the call to start_active_span sounds like it would work as well. That way the span.__exit__() would restore the nested context and then make_deferred_yieldable's callback would restore the parent logging context.

In any case, we want the parent logging context restored when we get out of everything (fold and unfold in the correct order).


Now in the final form of the fix, the span is entered via the with context manager so it has the outermost parent context, then make_deferred_yieldable is called which is in the span context. And because we are using the proper with nesting context manager stuff it also properly unfolds in the correct order: make_deferred_yieldable's callback restores the span context, and the span context manager exit restores the parent context.

🎉

Thanks for explaining @richvdh and @squahtx ❤️


The key pitfall in my change is not knowing the magic in make_deferred_yieldable:

make_deferred_yieldable is a little big magical. It clears the active logging context and attaches a callback to ret_defer to restore the logging context after the deferred has resolved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, sorry, my description of @squahtx's fix was incorrect.

But moving the call to make_deferred_yieldable to before the call to start_active_span sounds like it would work as well.

Actually it seems it wouldn't. We tried something equivalent in this thread and as @squahtx points out there are other problems.

try:
yield ret
finally:
Expand All @@ -175,6 +175,10 @@ def should_sleep(self) -> bool:
"""
return len(self.request_times) > self.sleep_limit

async def _on_enter_with_tracing(self, request_id: object) -> None:
with start_active_span("ratelimit wait"), queue_wait_timer.time():
await self._on_enter(request_id)
richvdh marked this conversation as resolved.
Show resolved Hide resolved

def _on_enter(self, request_id: object) -> "defer.Deferred[None]":
time_now = self.clock.time_msec()

Expand Down Expand Up @@ -257,17 +261,8 @@ def on_both(r: object) -> object:
# Ensure that we've properly cleaned up.
self.sleeping_requests.discard(request_id)
self.ready_request_queue.pop(request_id, None)
wait_span_scope.__exit__(None, None, None)
wait_timer_cm.__exit__(None, None, None)
return r

# Tracing
wait_span_scope = start_active_span("ratelimit wait")
wait_span_scope.__enter__()
# Metrics
wait_timer_cm = queue_wait_timer.time()
wait_timer_cm.__enter__()

ret_defer.addCallbacks(on_start, on_err)
ret_defer.addBoth(on_both)
return make_deferred_yieldable(ret_defer)
Expand Down