-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Test cancellation at every await
during request handling
#12674
Conversation
Signed-off-by: Sean Quah <seanq@element.io>
Signed-off-by: Sean Quah <seanq@element.io>
Signed-off-by: Sean Quah <seanq@element.io>
Signed-off-by: Sean Quah <seanq@element.io>
bc61e33
to
b7aa039
Compare
Wow, this is quite magical. Broadly, I think if this is the least-worst option for testing this stuff then we should merge it. In my mind awful testing strategies are better than no testing strategies? 🤷 |
It is quite magical, but I can still make sense of it so it's not too bad. For a testing strategy I think this is reasonable and I wouldn't turn it down if the alternative is weaker or no coverage. I think it'd be better knowing about these problems rather than covering them up by not having such a testing scheme, so I'd say 'merge it'. If we really regret it for some reason later, we can just rip them out — they're only tests after all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As others have said: it's magical, but not awful, and I don't think you're digging into anything that is likely to change in the near future.
It's definitely a net win.
Thanks for your thoughts everyone! Sounds like we think the benefits outweigh the drawbacks, so I'll put this up for real review now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a start on looking at this and made some suggestions. Could you have a look at what I've said so far, and then I'll take a closer look?
tests/http/server/_base.py
Outdated
@@ -98,3 +111,413 @@ def _test_disconnect( | |||
self.assertEqual(code, expected_code) | |||
self.assertEqual(request.code, expected_code) | |||
self.assertEqual(body, expected_body) | |||
|
|||
@logcontext_clean | |||
def _test_cancellation_at_every_await( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd avoid the underscore here, if this is meant to be used by test methods - otherwise it's hard to see what is meant to be the usable api, and what are just the helper methods like _log_await_stack
.
In fact, as far as I can tell the only reason this should be in a mixin rather than a completely separate module is so that you can call self.assertEqual
and friends. All assertEqual
does is raise an AssertionError
if the condition fails, so you could do that yourself without having to inherit from TestCase
.
tests/http/server/_base.py
Outdated
with mock.patch( | ||
"synapse.http.server.respond_with_json", wraps=respond_with_json | ||
) as respond_mock: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to patch this? Can we not just check the status of channel
(in particular, check channel.is_finished
instead of respond_mock.called
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately no, disconnecting detaches the channel
from the Request
and so it never gets the response.
tests/http/server/_base.py
Outdated
message = f"{' ' * i}> *{note}* at {frame}" | ||
self._log_for_request(request_number, message) | ||
|
||
def _format_stack_frame(self, frame_info: inspect.FrameInfo) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this and the other methods below can be turned into regular functions. There's no need for it all to be part of the same object.
tests/http/server/_base.py
Outdated
) -> Tuple[ | ||
Callable[["Deferred[T]"], "Deferred[T]"], | ||
Callable[[], None], | ||
Callable[[], int], | ||
Callable[[], bool], | ||
]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest making this whole thing an object, and turn create_deferred___next___patch
into its constructor.
tests/http/server/_base.py
Outdated
self.assertFalse( | ||
respond_mock.called, | ||
"Request finished before we could disconnect - " | ||
"was `await_result=False` passed to `make_request`?", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misunderstood this the first time I read it, and thought that await_result=False
was a bad thing to do.
"was `await_result=False` passed to `make_request`?", | |
"ensure `await_result=False` is passed to `make_request`.", |
tests/http/server/_base.py
Outdated
while not respond_mock.called and not has_seen_new_await(): | ||
previous_awaits_seen = get_awaits_seen() | ||
|
||
reactor.pump([0.0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd inline this (and the other calls to reactor.pump
) for clarity:
reactor.pump([0.0]) | |
reactor.advance(0.0) |
tests/http/server/_base.py
Outdated
self.assertEqual(channel.code, expected_code) | ||
return channel.json_body |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just return channel
and leave these up to the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly? There may be an unfortunate interaction where we block the await
s in _async_write_json_to_request_in_thread
.
I'll have to test it.
tests/http/server/_base.py
Outdated
def _test_cancellation_at_every_await( | ||
self, | ||
reactor: ThreadedMemoryReactorClock, | ||
make_request: Callable[[], FakeChannel], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than passing this in, can we just take a *args, **kwargs
to pass through to make_request
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that we might want to test federation requests in future, which have their own make_signed_federation_request
function
Line 795 in 9d8e380
def make_signed_federation_request( |
though there is nothing stopping us from defining a make_signed_federation_request_with_cancellation_test
variant.
tests/http/server/_base.py
Outdated
@@ -98,3 +111,413 @@ def _test_disconnect( | |||
self.assertEqual(code, expected_code) | |||
self.assertEqual(request.code, expected_code) | |||
self.assertEqual(body, expected_body) | |||
|
|||
@logcontext_clean | |||
def _test_cancellation_at_every_await( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this should be called make_request_with_cancellation_test
or something
tests/http/server/_base.py
Outdated
|
||
self.assertEqual(channel.code, expected_code) | ||
return channel.json_body | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can drop this else
, since there's a return
above.
…et the caller test the status code
This reverts commit 92045c8.
…llation_test` and have it make the request
Thanks for the feedback. I think I've addressed all of it now. Mind taking another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks generally good. A few comments.
The tests seem to have broken after #12872 |
… that we forced to block
It looks like a bug in the test. Before #12872, when running
After #12872, we see:
and the test got stuck here, since the await we've forced to block sits within a
for reasons I cannot explain. I've added code to unblock force-blocked awaits while waiting for a cancellation response. |
... which implies that in the before case, this is happening after our patch has been removed, implying some sort of timing change? Do you know which callback I'm not sure it matters too much, if you have something that works. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
It's running
The patch is definitely still installed in both cases. If I add the following debug to `callFromThread` debugging hack--- a/tests/server.py
+++ b/tests/server.py
@@ -411,7 +411,18 @@ class ThreadedMemoryReactorClock(MemoryReactorClock):
"""
Make the callback fire in the next reactor iteration.
"""
- cb = lambda: callback(*args, **kwargs)
+ import types
+ def cb():
+ if isinstance(callback, types.MethodType) and callback.__func__ == Deferred.callback:
+ d = callback.__self__
+ gotResult_callbacks = d.callbacks[1] # `d.callbacks[0]` is `_set_context_cb`
+ gotResult_callback, gotResult_errback = gotResult_callbacks
+ gotResult, _args, _kwargs = gotResult_callback
+ coroutine = gotResult.__closure__[1].cell_contents
+ import sys
+ sys.stdout.write(f" callFromThread: resuming coroutine {coroutine}\n")
+ sys.stdout.write(f" d.__next__ = {d.__next__.__qualname__}\n")
+ callback(*args, **kwargs)
# it's not safe to call callLater() here, so we append the callback to a I get: Before #12872:
After #12872:
More logs with extra debugging:
After #12872:
I genuinely cannot see a difference between the before/after except whether |
Anyway I'll get this merged after the long weekend. |
Synapse 1.62.0 (2022-07-05) =========================== No significant changes since 1.62.0rc3. Authors of spam-checker plugins should consult the [upgrade notes](https://github.com/matrix-org/synapse/blob/release-v1.62/docs/upgrade.md#upgrading-to-v1620) to learn about the enriched signatures for spam checker callbacks, which are supported with this release of Synapse. Synapse 1.62.0rc3 (2022-07-04) ============================== Bugfixes -------- - Update the version of the [ldap3 plugin](https://github.com/matrix-org/matrix-synapse-ldap3/) included in the `matrixdotorg/synapse` DockerHub images and the Debian packages hosted on `packages.matrix.org` to 0.2.1. This fixes [a bug](matrix-org/matrix-synapse-ldap3#163) with usernames containing uppercase characters. ([\matrix-org#13156](matrix-org#13156)) - Fix a bug introduced in Synapse 1.62.0rc1 affecting unread counts for users on small servers. ([\matrix-org#13168](matrix-org#13168)) Synapse 1.62.0rc2 (2022-07-01) ============================== Bugfixes -------- - Fix unread counts for users on large servers. Introduced in v1.62.0rc1. ([\matrix-org#13140](matrix-org#13140)) - Fix DB performance when deleting old push notifications. Introduced in v1.62.0rc1. ([\matrix-org#13141](matrix-org#13141)) Synapse 1.62.0rc1 (2022-06-28) ============================== Features -------- - Port the spam-checker API callbacks to a new, richer API. This is part of an ongoing change to let spam-checker modules inform users of the reason their event or operation is rejected. ([\matrix-org#12857](matrix-org#12857), [\matrix-org#13047](matrix-org#13047)) - Allow server admins to customise the response of the `/.well-known/matrix/client` endpoint. ([\matrix-org#13035](matrix-org#13035)) - Add metrics measuring the CPU and DB time spent in state resolution. ([\matrix-org#13036](matrix-org#13036)) - Speed up fetching of device list changes in `/sync` and `/keys/changes`. ([\matrix-org#13045](matrix-org#13045), [\matrix-org#13098](matrix-org#13098)) - Improve URL previews for sites which only provide Twitter Card metadata, e.g. LWN.net. ([\matrix-org#13056](matrix-org#13056)) Bugfixes -------- - Update [MSC3786](matrix-org/matrix-spec-proposals#3786) implementation to check `state_key`. ([\matrix-org#12939](matrix-org#12939)) - Fix a bug introduced in Synapse 1.58 where Synapse would not report full version information when installed from a git checkout. This is a best-effort affair and not guaranteed to be stable. ([\matrix-org#12973](matrix-org#12973)) - Fix a bug introduced in Synapse 1.60 where Synapse would fail to start if the `sqlite3` module was not available. ([\matrix-org#12979](matrix-org#12979)) - Fix a bug where non-standard information was required when requesting the `/hierarchy` API over federation. Introduced in Synapse v1.41.0. ([\matrix-org#12991](matrix-org#12991)) - Fix a long-standing bug which meant that rate limiting was not restrictive enough in some cases. ([\matrix-org#13018](matrix-org#13018)) - Fix a bug introduced in Synapse 1.58 where profile requests for a malformed user ID would ccause an internal error. Synapse now returns 400 Bad Request in this situation. ([\matrix-org#13041](matrix-org#13041)) - Fix some inconsistencies in the event authentication code. ([\matrix-org#13087](matrix-org#13087), [\matrix-org#13088](matrix-org#13088)) - Fix a long-standing bug where room directory requests would cause an internal server error if given a malformed room alias. ([\matrix-org#13106](matrix-org#13106)) Improved Documentation ---------------------- - Add documentation for how to configure Synapse with Workers using Docker Compose. Includes example worker config and docker-compose.yaml. Contributed by @Thumbscrew. ([\matrix-org#12737](matrix-org#12737)) - Ensure the [Poetry cheat sheet](https://matrix-org.github.io/synapse/develop/development/dependencies.html) is available in the online documentation. ([\matrix-org#13022](matrix-org#13022)) - Mention removed community/group worker endpoints in upgrade.md. Contributed by @olmari. ([\matrix-org#13023](matrix-org#13023)) - Add instructions for running Complement with `gotestfmt`-formatted output locally. ([\matrix-org#13073](matrix-org#13073)) - Update OpenTracing docs to reference the configuration manual rather than the configuration file. ([\matrix-org#13076](matrix-org#13076)) - Update information on downstream Debian packages. ([\matrix-org#13095](matrix-org#13095)) - Remove documentation for the Delete Group Admin API which no longer exists. ([\matrix-org#13112](matrix-org#13112)) Deprecations and Removals ------------------------- - Remove the unspecced `DELETE /directory/list/room/{roomId}` endpoint, which hid rooms from the [public room directory](https://spec.matrix.org/v1.3/client-server-api/#listing-rooms). Instead, `PUT` to the same URL with a visibility of `"private"`. ([\matrix-org#13123](matrix-org#13123)) Internal Changes ---------------- - Add tests for cancellation of `GET /rooms/$room_id/members` and `GET /rooms/$room_id/state` requests. ([\matrix-org#12674](matrix-org#12674)) - Report login failures due to unknown third party identifiers in the same way as failures due to invalid passwords. This prevents an attacker from using the error response to determine if the identifier exists. Contributed by Daniel Aloni. ([\matrix-org#12738](matrix-org#12738)) - Merge the Complement testing Docker images into a single, multi-purpose image. ([\matrix-org#12881](matrix-org#12881), [\matrix-org#13075](matrix-org#13075)) - Simplify the database schema for `event_edges`. ([\matrix-org#12893](matrix-org#12893)) - Clean up the test code for client disconnection. ([\matrix-org#12929](matrix-org#12929)) - Remove code generating comments in configuration. ([\matrix-org#12941](matrix-org#12941)) - Add `Cross-Origin-Resource-Policy: cross-origin` header to content repository's thumbnail and download endpoints. ([\matrix-org#12944](matrix-org#12944)) - Replace noop background updates with `DELETE` delta. ([\matrix-org#12954](matrix-org#12954), [\matrix-org#13050](matrix-org#13050)) - Use lower isolation level when inserting read receipts to avoid serialization errors. Contributed by Nick @ Beeper. ([\matrix-org#12957](matrix-org#12957)) - Reduce the amount of state we pull from the DB. ([\matrix-org#12963](matrix-org#12963)) - Enable testing against PostgreSQL databases in Complement CI. ([\matrix-org#12965](matrix-org#12965), [\matrix-org#13034](matrix-org#13034)) - Fix an inaccurate comment. ([\matrix-org#12969](matrix-org#12969)) - Remove the `delete_device` method and always call `delete_devices`. ([\matrix-org#12970](matrix-org#12970)) - Use a GitHub form for issues rather than a hard-to-read, easy-to-ignore template. ([\matrix-org#12982](matrix-org#12982)) - Move [MSC3715](matrix-org/matrix-spec-proposals#3715) behind an experimental config flag. ([\matrix-org#12984](matrix-org#12984)) - Add type hints to tests. ([\matrix-org#12985](matrix-org#12985), [\matrix-org#13099](matrix-org#13099)) - Refactor macaroon tokens generation and move the unsubscribe link in notification emails to `/_synapse/client/unsubscribe`. ([\matrix-org#12986](matrix-org#12986)) - Fix documentation for running complement tests. ([\matrix-org#12990](matrix-org#12990)) - Faster joins: add issue links to the TODO comments in the code. ([\matrix-org#13004](matrix-org#13004)) - Reduce DB usage of `/sync` when a large number of unread messages have recently been sent in a room. ([\matrix-org#13005](matrix-org#13005), [\matrix-org#13096](matrix-org#13096), [\matrix-org#13118](matrix-org#13118)) - Replaced usage of PyJWT with methods from Authlib in `org.matrix.login.jwt`. Contributed by Hannes Lerchl. ([\matrix-org#13011](matrix-org#13011)) - Modernize the `contrib/graph/` scripts. ([\matrix-org#13013](matrix-org#13013)) - Remove redundant `room_version` parameters from event auth functions. ([\matrix-org#13017](matrix-org#13017)) - Decouple `synapse.api.auth_blocking.AuthBlocking` from `synapse.api.auth.Auth`. ([\matrix-org#13021](matrix-org#13021)) - Add type annotations to `synapse.storage.databases.main.devices`. ([\matrix-org#13025](matrix-org#13025)) - Set default `sync_response_cache_duration` to two minutes. ([\matrix-org#13042](matrix-org#13042)) - Rename CI test runs. ([\matrix-org#13046](matrix-org#13046)) - Increase timeout of complement CI test runs. ([\matrix-org#13048](matrix-org#13048)) - Refactor entry points so that they all have a `main` function. ([\matrix-org#13052](matrix-org#13052)) - Refactor the Dockerfile-workers configuration script to use Jinja2 templates in Synapse workers' Supervisord blocks. ([\matrix-org#13054](matrix-org#13054)) - Add headers to individual options in config documentation to allow for linking. ([\matrix-org#13055](matrix-org#13055)) - Make Complement CI logs easier to read. ([\matrix-org#13057](matrix-org#13057), [\matrix-org#13058](matrix-org#13058), [\matrix-org#13069](matrix-org#13069)) - Don't instantiate modules with keyword arguments. ([\matrix-org#13060](matrix-org#13060)) - Fix type checking errors against Twisted trunk. ([\matrix-org#13061](matrix-org#13061)) - Allow MSC3030 `timestamp_to_event` calls from anyone on world-readable rooms. ([\matrix-org#13062](matrix-org#13062)) - Add a CI job to check that schema deltas are in the correct folder. ([\matrix-org#13063](matrix-org#13063)) - Avoid rechecking event auth rules which are independent of room state. ([\matrix-org#13065](matrix-org#13065)) - Reduce the duplication of code that invokes the rate limiter. ([\matrix-org#13070](matrix-org#13070)) - Add a Subject Alternative Name to the certificate generated for Complement tests. ([\matrix-org#13071](matrix-org#13071)) - Add more tests for room upgrades. ([\matrix-org#13074](matrix-org#13074)) - Pin dependencies maintained by matrix.org to [semantic version](https://semver.org/) bounds. ([\matrix-org#13082](matrix-org#13082)) - Correctly report prometheus DB stats for `get_earliest_token_for_stats`. ([\matrix-org#13085](matrix-org#13085)) - Fix a long-standing bug where a finished logging context would be re-started when Synapse failed to persist an event from federation. ([\matrix-org#13089](matrix-org#13089)) - Simplify the alias deletion logic as an application service. ([\matrix-org#13093](matrix-org#13093)) - Add type annotations to `tests.test_server`. ([\matrix-org#13124](matrix-org#13124))
I'm throwing up this PR for discussion rather than to get it merged as-is.The code's pretty magical and I'm not sure if it's something we want to merge.EDIT: Consensus seems to be that we're happy to have integration tests, even if they're magical.
I'm putting this up for a real review now.
The interesting commit is "Add
_test_cancellation_at_every_await
helper method".Motivation
Once we start enabling cancellation for various endpoints, it would be desirable to have some sort of integration testing for it as part of CI.
Improper handling of cancellation can manifest as anything from logging contexts going missing to stuck requests to certain endpoints. As it's uncommon for clients to disconnect before a request finishes processing, any bugs won't be immediately obvious after deployment to production and will bite at random times.
Testing Approach
It turns out that when you
await
aDeferred
, Python callsDeferred.__await__().__next__()
.Deferred.__await__()
just returnsself
.And
__next__
controls what happens to theawait
:Deferred
, which blocks the coroutine.Python returns the
Deferred
out of thecoroutine.send()
call which started the coroutine.https://github.com/twisted/twisted/blob/7d6387b2c0d6c8111277a96cc56c50efdff2a4bc/src/twisted/internet/defer.py#L1687
The twisted coroutine-runner,
inlineCallbacks
, sees the return value and adds callbacks to resume the coroutine.StopIteration
with the result of the await.We can patch
Deferred.__next__
to interceptawait
s and force them to block. Which gives us time to throw a cancellation into the request processing!With that power in hand, we can construct a test that will run a request up to the first
await
(on aDeferred
) and cancel the request. And then we can re-run the request, allow it to get a little further, then cancel it again once it hits the nextawait
, and so on. ie. we can test a cancellation at everyawait
on aDeferred
along a single code path.We identify when we see a new
await
s by looking at the stack. If it's a stack we've seen before, we let theawait
behave as normal.This PR implements this approach.
Logging
The test process is very magical and it's not clear what it's actually doing for a given endpoint. So I've added some debug logging at INFO level that will print a call tree for the endpoint under test, showing every
await
the test sees. There's an example below for theGET /rooms/$room_id/members
endpoint.PartialStateEventsTracker.await_full_state
shows up, as you might hope, and gets cancel-tested.Logging from a run through `GET /rooms/$room_id/members`
Only the new parts of the call tree are printed for each request.
Limitations
ReadWriteLock
s.Pros of merging something like this
@cached
and@cachedList
decorators #12183.Cons of merging something like this
Deferred.__next__
and inspecting the stack.DatabasePool.runInteraction()
#12199.Alternatives
defer.inlineCallbacks
instead, like we do for_check_yield_points
? We won't be able to forceawait
s to block if we go for this.