Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster room joins: avoid blocking when pulling events with missing prevs #13355

Conversation

squahtx
Copy link
Contributor

@squahtx squahtx commented Jul 21, 2022

Avoid blocking on full state in _resolve_state_at_missing_prevs and
return a new flag indicating whether the resolved state is partial.
Thread that flag around so that it makes it into the event context.

Resolves #13002.


Corresponding Complement tests: matrix-org/complement#419
Best reviewed commit by commit.
The first commit is from #13354.

Sean Quah added 5 commits July 21, 2022 19:12
When a room has the partial state flag, we may not have an accurate
`m.room.member` event for event senders in the room's current state, and
so cannot perform soft fail checks correctly. Skip the soft fail check
entirely in this case.

As an alternative, we could block until we have full state, but that
would prevent us from receiving incoming events over federation, which
is undesirable.

Signed-off-by: Sean Quah <seanq@matrix.org>
Previously, `partial_state=False` either meant "full state" or
"calculate the flag yourself", depending on whether
`state_ids_before_event` was provided. Split out the latter meaning into
`partial_state=None`.

Signed-off-by: Sean Quah <seanq@matrix.org>
Signed-off-by: Sean Quah <seanq@matrix.org>
…t can return partial state

Signed-off-by: Sean Quah <seanq@matrix.org>
…rwise we can block when receiving or backfilling events over federation

Signed-off-by: Sean Quah <seanq@matrix.org>
@squahtx squahtx requested a review from a team as a code owner July 21, 2022 19:37
Sean Quah added 2 commits July 21, 2022 20:51
…e race

Now that we persist events with partial state, we have to handle
`PartialStateConflictError`s, by retrying just once.

Signed-off-by: Sean Quah <seanq@matrix.org>
@squahtx squahtx force-pushed the squah/faster_room_joins_calculate_partial_state_flag_for_backfill branch from 32e5e09 to b879d8c Compare July 21, 2022 19:51
@richvdh richvdh self-assigned this Jul 22, 2022
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally looks great!

Comment on lines +836 to +837
# We ought to have full state now, barring some unlikely race where we left and
# rejoned the room in the background.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check that partial_state is false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, keeping in mind that it won't catch the "calculate-the-state-yourself" case. I can't think of how to handle a second partial state apart from raising an exception.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry yes, I'd envisaged raising an exception, on the grounds we could give a clearer idea of what's going wrong.

But if you'd expect it to behave correctly (ish) in the case where we've rejoined, then I guess that's not appropriate and we should just leave this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've yet to figure out what happens when we rejoin (not that we can even leave right now!) so am happy to raise an exception.

if context.partial_state:

context = None
if not partial_state:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd find it helpful to enumerate the possibilities here. AIUI, they are:

  • state_ids and partial_state are both None, if we had all the prev_events (some of which may have partial state).
  • state_ids is non-empty, and partial_state is False, meaning we were missing some or all prev_events, and have calculated the state after the prev_events
  • state_ids is non-empty, and partial_state is True, meaning we were missing some prev_events, but others had partial-state. We have calculated the state after all the prev_events, but it is still partial. There is no point building an event context in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I can't help feeling that the interface for _resolve_state_at_missing_prevs is excessively baroque and coupled to that of compute_event_context. (Should we replace _resolve_state_at_missing_prevs with something that returns an EventContext, and we pass an EventContext into _process_received_pdu instead of state_ids and partial_state?)

But that probably requires a bunch of refactoring that isn't justified right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a big comment listing the cases, borrowing some of your wording here.

I agree that _resolve_state_at_missing_prevs is really unwieldy after this change. Is the proposal to have it *always* return an EventContext? I'll try that refactor in a follow on PR.

@@ -82,13 +82,15 @@ async def get_state_group_delta(
return state_group_delta.prev_group, state_group_delta.delta_ids

async def get_state_groups_ids(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a random point for future reference: there is a convention that the subject line of a git commit message should be limited to 50 characters. That helps avoid this sort of thing:

image

There's an art to writing such a pithy summary, of course. In this case I would probably just say:

`get_state_groups_ids`: add `await_full_state` parameter

... and put "so that it can return partial state" in the body of the commit message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, I wasn't aware that the subject line had a shorter recommended length.

Up until now I've been using a limit of 72 characters in merge commits (and ignoring it in commits that get squashed, like the one above). 50 characters certainly makes things tricky!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignoring it in commits that get squashed, like the one above

Yeah, the problem comes when we're doing a commit-by-commit review: it's much nicer if there's a pithy summary for each commit. It's certainly not worth stressing overmuch over though.

50 characters certainly makes things tricky!

Indeed. github actually seems to truncate at 66 chars, so that seems a more reasonable compromise.

@squahtx squahtx requested a review from richvdh July 22, 2022 15:46
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

synapse/handlers/federation_event.py Outdated Show resolved Hide resolved
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
@richvdh richvdh removed their assignment Jul 26, 2022
@squahtx squahtx merged commit 335ebb2 into develop Jul 26, 2022
@squahtx squahtx deleted the squah/faster_room_joins_calculate_partial_state_flag_for_backfill branch July 26, 2022 11:39
azmeuk pushed a commit to azmeuk/synapse that referenced this pull request Aug 8, 2022
…evs (matrix-org#13355)

Avoid blocking on full state in `_resolve_state_at_missing_prevs` and
return a new flag indicating whether the resolved state is partial.
Thread that flag around so that it makes it into the event context.

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Aug 25, 2022
Synapse 1.65.0 (2022-08-16)
===========================

No significant changes since 1.65.0rc2.

Synapse 1.65.0rc2 (2022-08-11)
==============================

Internal Changes
----------------

- Revert 'Remove the unspecced `room_id` field in the `/hierarchy` response. ([\matrix-org#13365](matrix-org#13365))' to give more time for clients to update. ([\matrix-org#13501](matrix-org#13501))

Synapse 1.65.0rc1 (2022-08-09)
==============================

Features
--------

- Add support for stable prefixes for [MSC2285 (private read receipts)](matrix-org/matrix-spec-proposals#2285). ([\matrix-org#13273](matrix-org#13273))
- Add new unstable error codes `ORG.MATRIX.MSC3848.ALREADY_JOINED`, `ORG.MATRIX.MSC3848.NOT_JOINED`, and `ORG.MATRIX.MSC3848.INSUFFICIENT_POWER` described in [MSC3848](matrix-org/matrix-spec-proposals#3848). ([\matrix-org#13343](matrix-org#13343))
- Use stable prefixes for [MSC3827](matrix-org/matrix-spec-proposals#3827). ([\matrix-org#13370](matrix-org#13370))
- Add a new module API method to translate a room alias into a room ID. ([\matrix-org#13428](matrix-org#13428))
- Add a new module API method to create a room. ([\matrix-org#13429](matrix-org#13429))
- Add remote join capability to the module API's `update_room_membership` method (in a backwards compatible manner). ([\matrix-org#13441](matrix-org#13441))

Bugfixes
--------

- Update the version of the LDAP3 auth provider module included in the `matrixdotorg/synapse` DockerHub images and the Debian packages hosted on packages.matrix.org to 0.2.2. This version fixes a regression in the module. ([\matrix-org#13470](matrix-org#13470))
- Fix a bug introduced in Synapse v1.41.0 where the `/hierarchy` API returned non-standard information (a `room_id` field under each entry in `children_state`). ([\matrix-org#13365](matrix-org#13365))
- Fix a bug introduced in Synapse 0.24.0 that would respond with the wrong error status code to `/joined_members` requests when the requester is not a current member of the room. Contributed by @andrewdoh. ([\matrix-org#13374](matrix-org#13374))
- Fix bug in handling of typing events for appservices. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13392](matrix-org#13392))
- Fix a bug introduced in Synapse 1.57.0 where rooms listed in `exclude_rooms_from_sync` in the configuration file would not be properly excluded from incremental syncs. ([\matrix-org#13408](matrix-org#13408))
- Fix a bug in the experimental faster-room-joins support which could cause it to get stuck in an infinite loop. ([\matrix-org#13353](matrix-org#13353))
- Faster room joins: fix a bug which caused rejected events to become un-rejected during state syncing. ([\matrix-org#13413](matrix-org#13413))
- Faster room joins: fix error when running out of servers to sync partial state with, so that Synapse raises the intended error instead. ([\matrix-org#13432](matrix-org#13432))

Updates to the Docker image
---------------------------

- Make Docker images build on armv7 by installing cryptography dependencies in the 'requirements' stage. Contributed by Jasper Spaans. ([\matrix-org#13372](matrix-org#13372))

Improved Documentation
----------------------

- Update the 'registration tokens' page to acknowledge that the relevant MSC was merged into version 1.2 of the Matrix specification. Contributed by @moan0s. ([\matrix-org#11897](matrix-org#11897))
- Document which HTTP resources support gzip compression. ([\matrix-org#13221](matrix-org#13221))
- Add steps describing how to elevate an existing user to administrator by manipulating the database. ([\matrix-org#13230](matrix-org#13230))
- Fix wrong headline for `url_preview_accept_language` in documentation. ([\matrix-org#13437](matrix-org#13437))
- Remove redundant 'Contents' section from the Configuration Manual. Contributed by @dklimpel. ([\matrix-org#13438](matrix-org#13438))
- Update documentation for config setting `macaroon_secret_key`. ([\matrix-org#13443](matrix-org#13443))
- Update outdated information on `sso_mapping_providers` documentation. ([\matrix-org#13449](matrix-org#13449))
- Fix example code in module documentation of `password_auth_provider_callbacks`. ([\matrix-org#13450](matrix-org#13450))
- Make the configuration for the cache clearer. ([\matrix-org#13481](matrix-org#13481))

Internal Changes
----------------

- Extend the release script to automatically push a new SyTest branch, rather than having that be a manual process. ([\matrix-org#12978](matrix-org#12978))
- Make minor clarifications to the error messages given when we fail to join a room via any server. ([\matrix-org#13160](matrix-org#13160))
- Enable Complement CI tests in the 'latest deps' test run. ([\matrix-org#13213](matrix-org#13213))
- Fix long-standing bugged logic which was never hit in `get_pdu` asking every remote destination even after it finds an event. ([\matrix-org#13346](matrix-org#13346))
- Faster room joins: avoid blocking when pulling events with partially missing prev events. ([\matrix-org#13355](matrix-org#13355))
- Instrument `/messages` for understandable traces in Jaeger. ([\matrix-org#13368](matrix-org#13368))
- Remove an unused argument to `get_relations_for_event`. ([\matrix-org#13383](matrix-org#13383))
- Add a `merge-back` command to the release script, which automates merging the correct branches after a release. ([\matrix-org#13393](matrix-org#13393))
- Adding missing type hints to tests. ([\matrix-org#13397](matrix-org#13397))
- Faster Room Joins: don't leave a stuck room partial state flag if the join fails. ([\matrix-org#13403](matrix-org#13403))
- Refactor `_resolve_state_at_missing_prevs` to compute an `EventContext` instead. ([\matrix-org#13404](matrix-org#13404), [\matrix-org#13431](matrix-org#13431))
- Faster Room Joins: prevent Synapse from answering federated join requests for a room which it has not fully joined yet. ([\matrix-org#13416](matrix-org#13416))
- Re-enable running Complement tests against Synapse with workers. ([\matrix-org#13420](matrix-org#13420))
- Prevent unnecessary lookups to any external `get_event` cache. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13435](matrix-org#13435))
- Add some tracing to give more insight into local room joins. ([\matrix-org#13439](matrix-org#13439))
- Rename class `RateLimitConfig` to `RatelimitSettings` and `FederationRateLimitConfig` to `FederationRatelimitSettings`. ([\matrix-org#13442](matrix-org#13442))
- Add some comments about how event push actions are stored. ([\matrix-org#13445](matrix-org#13445), [\matrix-org#13455](matrix-org#13455))
- Improve rebuild speed for the "synapse-workers" docker image. ([\matrix-org#13447](matrix-org#13447))
- Fix `@tag_args` being off-by-one with the arguments when tagging a span (tracing). ([\matrix-org#13452](matrix-org#13452))
- Update type of `EventContext.rejected`. ([\matrix-org#13460](matrix-org#13460))
- Use literals in place of `HTTPStatus` constants in tests. ([\matrix-org#13463](matrix-org#13463), [\matrix-org#13469](matrix-org#13469))
- Correct a misnamed argument in state res v2 internals. ([\matrix-org#13467](matrix-org#13467))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Faster joins: check for partial state when handling backfill
2 participants