Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

we shouldn't process requests where the requester has timed out/gone away #3528

Closed
richvdh opened this issue Jul 13, 2018 · 7 comments
Closed
Labels
A-Performance Performance, both client-facing and admin-facing T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. z-p2 (Deprecated Label)

Comments

@richvdh
Copy link
Member

richvdh commented Jul 13, 2018

when we get busy, sometimes we queue up requests, which means that once we get round to processing them, there is no point in doing so. See #3444 for a good example.

@richvdh
Copy link
Member Author

richvdh commented Jul 13, 2018

a nice way to do this might be via cancellation of Deferreds?

@richvdh richvdh added the A-Performance Performance, both client-facing and admin-facing label Jul 13, 2018
@neilisfragile neilisfragile added z-p2 (Deprecated Label) z-minor (Deprecated Label) labels Jul 23, 2018
@richvdh
Copy link
Member Author

richvdh commented Jan 15, 2020

keywords, since I just spent 10 minutes searching for this: cancel, disconnect

@squahtx squahtx self-assigned this Feb 17, 2022
@erikjohnston erikjohnston added T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. and removed z-minor (Deprecated Label) T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. labels Feb 23, 2022
@squahtx
Copy link
Contributor

squahtx commented Feb 25, 2022

Cancelling Deferreds to abort processing looks very promising!

How Deferred.cancel() works

When Deferred.cancel() is called on a Deferred wrapping a coroutine, the call propagates down into the Deferred the coroutine is waiting on. That Deferred will either propagate the cancel() further or fail with a twisted.internet.defer.CancelledError, which gets raised inside the coroutine. This means that any await expression may raise a CancelledError, including trivial ones like sleep()!

The coroutine may then deal with the CancelledError using normal exception handling. The Deferred wrapping the coroutine will not automatically fail with a CancelledError unless it bubbles up through the coroutine.

asyncio has a similar feature, where an asyncio.CancelledError will be raised instead. It's interesting to note that twisted CancelledErrors are Exceptions, while asyncio CancelledErrors are BaseExceptions.

Complications

Unfortunately aborting processing of requests isn't as simple as calling Deferred.cancel() when clients disconnect:

  • In many places, we catch and swallow plain Exceptions without regard for CancelledErrors. The result is that cancellation is ignored, which isn't too bad, since that's almost what happens today.
  • More importantly, we have quite a few constructs (gatherResults, ObservableDeferred, ReadWriteLock, DeferredCache, run_in_background, defer_to_thread and more) which do fancy things with Deferreds and may not have been designed with cancellation in mind.
    Some of them allow cancel() calls to propagate when they shouldn't, don't propagate cancel() calls when they should or don't clean up after CancelledErrors correctly when they come from unexpected places.
    This can result in stuck requests piling up and all sorts of general weirdness. Ideally each construct needs to be checked, tested and fixed which is going to be very time-consuming.

Perhaps we could limit cancellation to a handful of interesting endpoints, like /sync? That way we wouldn't have to make all of synapse support cancellation right now. Though that might not be much help since all the interesting endpoints probably make use of many of the constructs that need checking.

@richvdh
Copy link
Member Author

richvdh commented Feb 25, 2022

yeah it might be good to start this with one endpoint, and work from there. /sync isn't a great candidate though, since it goes through a ResponseCache. /members or /state or somesuch might work better.

@squahtx
Copy link
Contributor

squahtx commented Mar 30, 2022

In general, for some code to support cancellation:

  • Deferred.cancel() calls must propagate downwards through awaits correctly.
    This happens automatically for awaits on coroutines.
    Care must be taken around Deferreds. Sometimes cancellation ought to propagate to another Deferred and has to be wired up manually. And sometimes the Deferred being awaited on must not be cancelled, since it may be shared between multiple requests.
    The new stop_cancellation and delay_cancellation wrapper functions can be useful here.
  • CancelledErrors coming out of awaits are handled correctly, ie:
  • Logging contexts must not be finished while there are uncancelled tasks using them. The delay_cancellation wrapper function can be useful for this.

Checking all of this usually involves exploring all awaits in the code of interest recursively (and watching out for decorators!).

TODO

In the call trees below, an [x] means that a function has been checked to handle cancellation correctly and that all its awaits have been listed below it.
It does not necessarily mean that the other functions it awaits on have been checked.

✔️ The @cancellable flag for endpoints

Cancellation support for /sync

Let's not bother with this, since we rely on syncs completing in the background when clients time out their requests.

✔️ Cancellation support for /rooms/$room_id/members

Summary

Call tree
  • Auth.get_user_by_req (see below)
  • StreamToken.from_string (Make StreamToken and RoomStreamToken methods propagate cancellations #12366)
    except Exception needs fixing
  • MessageHandler.get_state_events
    • StreamWorkerStore.get_recent_events_for_room
    • StateGroupStorage.get_state_for_event -> StateGroupStorage.get_state_for_events
      • @cachedList StateGroupWorkerStore._get_state_group_for_events (#12183) -> DatabasePool.simple_select_many_batch -> DatabasePool.runInteraction (#12199)
      • StateGroupDataStore._get_state_for_groups -> StateGroupDataStore._get_state_groups_from_groups -> DatabasePool.runInteraction (#12199)
      • EventsWorkerStore.get_events -> EventsWorkerStore.get_events_as_list -> ... (see above)
    • filter_events_for_client
      • StateGroupStorage.get_state_for_events -> ... (see above)
      • @cached AccountDataWorkerStore.ignored_users (#12183) -> DatabasePool.simple_select_onecol -> DatabasePool.runInteraction (#12199)
      • @cachedList UserErasureWorkerStore.are_users_erased (#12183) -> DatabasePool.simple_select_many_batch -> DatabasePool.runInteraction (#12199)
      • @cached RoomWorkerStore.get_retention_policy_for_room (#12183) -> DatabasePool.runInteraction (#12199)
    • StateGroupStorage.get_state_for_events -> ... (see above)
    • Auth.check_user_in_room_or_world_readable
      • Auth.check_user_in_room
        • StateHandler.get_current_state
          • @cached EventFederationWorkerStore.get_latest_event_ids_in_room (#12183) -> DatabasePool.simple_select_onecol -> DatabasePool.runInteraction (#12199)
          • @measure_func StateHandler.resolve_state_groups_for_events
            • StateGroupStorage.get_state_groups_ids
              • @cachedList StateGroupWorkerStore._get_state_group_for_events (#12183) -> DatabasePool.simple_select_many_batch -> DatabasePool.runInteraction (#12199)
              • StateGroupDataStore._get_state_for_groups -> StateGroupDataStore._get_state_groups_from_groups -> DatabasePool.runInteraction (#12199)
            • StateGroupStorage.get_state_group_delta -> @cached StateGroupDataStore.get_state_group_delta (#12183) -> DatabasePool.runInteraction (#12199)
            • @cached StateGroupWorkerStore.get_room_version_id (#12183) -> DatabasePool.runInteraction (#12199)
            • StateResolutionHandler.resolve_state_groups
              • Linearizer.queue (Linearizer can get stuck due to cancellation #12114)
              • StateResolutionHandler.resolve_events_with_store
                • v1.resolve_events_with_store
                • StateResolutionStore.get_events -> EventsWorkerStore.get_events -> ...
                • v2.resolve_events_with_store
                  • _get_auth_chain_difference -> StateResolutionStore.get_auth_chain_difference -> EventFederationWorkerStore.get_auth_chain_difference
                    • RoomWorkerStore.get_room -> DatabasePool.simple_select_one -> DatabasePool.runInteraction (#12199)
                    • DatabasePool.runInteraction (#12199)
                  • StateResolutionStore.get_events -> EventsWorkerStore.get_events -> ...
                  • _reverse_topological_power_sort
                    • _add_event_and_auth_chain_to_graph -> _get_event -> StateResolutionStore.get_events -> EventsWorkerStore.get_events -> ...
                    • Clock.sleep
                    • _get_power_level_for_sender -> _get_event -> StateResolutionStore.get_events -> EventsWorkerStore.get_events -> ...
                    • Clock.sleep
                  • _iterative_auth_checks
                    • _get_event -> StateResolutionStore.get_events -> EventsWorkerStore.get_events -> ...
                    • Clock.sleep
                  • _mainline_sort
                    • _get_event -> StateResolutionStore.get_events -> EventsWorkerStore.get_events -> ...
                    • Clock.sleep
                    • _get_mainline_depth_for_event -> _get_event -> StateResolutionStore.get_events -> EventsWorkerStore.get_events -> ...
                  • _iterative_auth_checks -> ...
          • EventsWorkerStore.get_event -> EventsWorkerStore.get_events_as_list -> ... (see above)
          • EventsWorkerStore.get_events -> EventsWorkerStore.get_events_as_list -> ... (see above)
        • @cached RoomMemberWorkerStore.did_forget (#12183) -> DatabasePool.runInteraction (#12199)
      • StateHandler.get_current_state -> ...
    • StateGroupWorkerStore.get_filtered_current_state_ids
      • @cached StateGroupWorkerStore.get_current_state_ids (#12183) -> DatabasePool.runInteraction (#12199)
      • DatabasePool.runInteraction (#12199)
    • EventsWorkerStore.get_events -> EventsWorkerStore.get_events_as_list -> ... (see above)
    • StateGroupStorage.get_state_for_events -> ... (see above)

✔️ Cancellation support for /rooms/$room_id/state

Comes for free once we deal with /rooms/$room_id/members, since it calls the same methods.

✔️ Cancellation support for /rooms/$room_id/state/$state_key/*

Comes for free once we deal with /rooms/$room_id/members, since it calls the same methods.

Call tree
  • Auth.get_user_by_req (see below)
  • MessageHandler.get_room_data
    • Auth.check_user_in_room_or_world_readable -> ... (see /rooms/$room_id/members above)
    • StateHandler.get_current_state -> ... (see /rooms/$room_id/members above)
    • StateGroupStorage.get_state_for_events -> ... (see /rooms/$room_id/members above)

✔️ Cancellation support for Auth.get_user_by_req

This method is called by almost every endpoint.

Call tree
  • Auth._wrapped_get_user_by_req
    • Auth.get_appservice_user_id_and_device_id
      • Auth.validate_appservice_can_control_user_id -> @cached RegistrationWorkerStore.get_user_by_id (#12183) -> DatabasePool.simple_select_one -> DatabasePool.runInteraction (#12199)
      • DeviceWorkerStore.get_device -> DatabasePool.simple_select_one -> DatabasePool.runInteraction (#12199)
    • SlavedClientIpStore.insert_client_ip
    • ClientIpStore.insert_client_ip -> MonthlyActiveUsersStore.populate_monthly_active_users
      • @cached RegistrationBackgroundUpdateStore.is_guest (#12183) -> DatabasePool.simple_select_one_onecol -> DatabasePool.runInteraction (#12199)
      • RegistrationWorkerStore.is_trial_user -> @cached RegistrationWorkerStore.get_user_by_id (#12183) -> DatabasePool.simple_select_one -> DatabasePool.runInteraction (#12199)
      • @cached MonthlyActiveUsersWorkerStore.user_last_seen_monthly_active (#12183) -> DatabasePool.simple_select_one_onecol -> DatabasePool.runInteraction (#12199)
      • MonthlyActiveUsersStore.upsert_monthly_active_user
        • @cached RegistrationWorkerStore.is_support_user (#12183) -> DatabasePool.runInteraction (#12199)
        • DatabasePool.runInteraction (#12199)
      • @cached get_monthly_active_count (#12183) -> DatabasePool.runInteraction (#12199)
      • MonthlyActiveUsersStore.upsert_monthly_active_user -> ...
      • MonthlyActiveUsersStore.upsert_monthly_active_user -> ...
    • Auth.get_user_by_access_token
      • @cached RegistrationWorkerStore.get_user_by_access_token (#12183) -> DatabasePool.runInteraction (#12199)
      • @cached RegistrationWorkerStore.get_user_by_id (#12183) -> DatabasePool.simple_select_one -> DatabasePool.runInteraction (#12199)
    • AccountValidityHandler.is_user_expired
      • Module callback: IS_USER_EXPIRED_CALLBACK (see below)
      • RegistrationWorkerStore.is_account_expired -> @cached RegistrationWorkerStore.get_expiration_ts_for_user (#12183) -> DatabasePool.simple_select_one_onecol -> DatabasePool.runInteraction (#12199)
    • {SlavedClientIpStore,ClientIpStore}.insert_client_ip -> ...
    • @cached RegistrationWorkerStore.mark_access_token_as_used (#12183) -> DatabasePool.simple_update_one -> DatabasePool.runInteraction (#12199)

✔️ Cancellation support for module callbacks

Hypothetically, module callbacks can call anything exposed by the module API, which is large and going to take forever to check.
For now, we can wrap module callbacks with delay_cancellation(...).

Cancellation support for the module-facing module API

I've given up reviewing the module API for now. The unchecked items either need fixing or haven't been reviewed.

Call trees
  • ModuleApi.get_userinfo_by_id -> RegistrationWorkerStore.get_userinfo_by_id -> @cached RegistrationWorkerStore.get_user_by_id (#12183) -> DatabasePool.simple_select_one -> DatabasePool.runInteraction (#12199)
  • ModuleApi.get_user_by_req -> Auth.get_user_by_req (see above)
  • ModuleApi.is_user_admin -> RegistrationWorkerStore.is_server_admin -> DatabasePool.simple_select_one_onecol -> DatabasePool.runInteraction (#12199)
  • ModuleApi.get_profile_for_user -> ProfileWorkerStore.get_profileinfo -> DatabasePool.simple_select_one -> DatabasePool.runInteraction (#12199)
  • ModuleApi.get_threepids_for_user -> RegistrationWorkerStore.user_get_threepids -> DatabasePool.simple_select_list -> DatabasePool.runInteraction (#12199)
  • ModuleApi.check_user_exists -> AuthHandler.check_user_exists -> AuthHandler._find_user_id_and_pwd_hash -> RegistrationWorkerStore.get_users_by_id_case_insensitive -> DatabasePool.runInteraction (#12199)
  • ModuleApi.register -> ModuleApi.{register_user,register_device} -> ... (see below)
  • ModuleApi.register_user -> RegistrationHandler.register_user
  • ModuleApi.register_device -> RegistrationHandler.register_device
  • ModuleApi.record_user_external_id -> RegistrationWorkerStore.record_user_external_id -> DatabasePool.runInteraction (#12199)
  • ModuleApi.invalidate_access_token
    • Auth.get_user_by_access_token -> ...
    • @trace DeviceHandler.delete_device
      • DeviceStore.delete_device -> DeviceStore.delete_devices
        Cache invalidation needs to be turned into a transaction callback, otherwise it will be missed when cancellation happens.
        • DatabasePool.runInteraction (#12199)
      • AuthHandler.delete_access_tokens_for_user
        • RegistrationStore.user_delete_access_tokens -> DatabasePool.runInteraction (#12199)
        • PasswordAuthProvider.on_logged_out
          except Exception needs fixing
          • Module callback: ON_LOGGED_OUT_CALLBACK (see above)
        • PusherPool.remove_pushers_by_access_token
          • PusherWorkerStore.get_pushers_by_user_id -> PusherWorkerStore.get_pushers_by -> DatabasePool.simple_select_list -> DatabasePool.runInteraction (#12199)
          • PusherPool.remove_pusher
            • ReplicationRemovePusherRestServlet.make_client()
              Ignoring this, since we can't get here unless we're on the main process.
            • PusherStore.delete_pusher_by_app_id_pushkey_user_id
              • MultiWriterIdGenerator.get_next -> _MultiWriterCtxManager
                The DatabasePool.runInteraction call in __aenter__ may raise a CancelledError. We need to ensure that IDs are marked as finished if this happens.
              • DatabasePool.runInteraction (#12199)
      • EndToEndKeyStore.delete_e2e_keys_by_device -> DatabasePool.runInteraction (#12199)
      • @trace @measure_func DeviceHandler.notify_device_update
        • @cached RoomMemberWorkerStore.get_users_who_share_room_with_user (#12183)
          • @cached RoomMemberWorkerStore.get_rooms_for_user (#12183) -> DatabasePool.runInteraction (#12199)
          • @cached RoomMemberWorkerStore.get_users_in_room (#12183) -> DatabasePool.runInteraction (#12199)
        • DeviceStore.add_device_change_to_streams
          • MultiWriterIdGenerator.get_next_mult -> _MultiWriterCtxManager (see above)
          • DatabasePool.runInteraction (#12199)
    • AuthHandler.delete_access_token
      • Auth.get_user_by_access_token -> ...
      • RegistrationStore.delete_access_token -> DatabasePool.runInteraction (#12199)
      • PasswordAuthProvider.on_logged_out (see above)
      • PusherPool.remove_pushers_by_access_token
        • PusherWorkerStore.get_pushers_by_user_id -> PusherWorkerStore.get_pushers_by -> DatabasePool.simple_select_list -> DatabasePool.runInteraction (#12199)
        • PusherPool.remove_pusher -> ... (see above)
  • ModuleApi.run_db_interaction -> DatabasePool.runInteraction (#12199)
  • ModuleApi.complete_sso_login_async
  • ModuleApi.get_state_events_in_room
  • ModuleApi.update_room_membership
  • ModuleApi.create_and_send_event_into_room
  • ModuleApi.send_local_online_presence_to
  • ModuleApi.sleep -> Clock.sleep
  • ModuleApi.send_mail
  • ModuleApi.get_user_ip_and_agents
  • ModuleApi.get_room_state
  • ModuleApi.defer_to_thread
  • ModuleApi.check_username
  • ModuleApi.http_client
  • PublicRoomListManager.room_is_in_public_room_list
  • PublicRoomListManager.add_room_to_public_room_list
  • PublicRoomListManager.remove_room_from_public_room_list

Cancellation support for everything

Probably not happening soon. There's a lot of code to check so we'll have to pick which endpoints to prioritise.

An [x] here means that a construct has been made to support cancellation.
Some of these may already support cancellation by accident and may just warrant some unit tests.

@squahtx
Copy link
Contributor

squahtx commented May 11, 2022

GET /rooms/$room_id/{members,state} is now cancellable!

I did some exploration of how we'd test cancellation of entire endpoints at #12674.
Otherwise, we're done with cancellation for the near future.

@richvdh
Copy link
Member Author

richvdh commented Jun 13, 2022

Given that all the groundwork is done here, I don't think keeping this issue open any longer is much use. We can add cancellation support to other APIs as and when it turns out to be useful.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Performance Performance, both client-facing and admin-facing T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. z-p2 (Deprecated Label)
Projects
None yet
Development

No branches or pull requests

4 participants