
Persist large gappy state blocks as a single snapshot #211

Closed
DMRobertson opened this issue Jul 18, 2023 · 13 comments

Comments

@DMRobertson
Contributor

Ideally, we would instead make the events in the state block part of the state snapshot and then add new timeline events in Accumulate only. This:

  • more accurately maps onto what upstream is telling us.
  • is much faster, as we aren't calculating a state snapshot for each of the 10,000s of events.

This implies a third function in the accumulator between Initialise and Accumulate which can both snapshot and roll forward the timeline in a single txn.

Originally posted by @kegsay in #196 (comment)

Original gappy state impl in #71.
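
A minimal sketch of what that combined operation might look like, under hypothetical names (GappyAccumulator, persistSnapshot, accumulateTimeline); the real accumulator API in the proxy differs, so treat this as an illustration of the single-transaction idea rather than its implementation:

```go
// Sketch only: hypothetical types and helpers, not the real accumulator.
package accumulator

import (
	"database/sql"
	"encoding/json"
)

type GappyAccumulator struct {
	db *sql.DB
}

// persistSnapshot would write the whole state block as one snapshot row
// and return its ID (stubbed here).
func (a *GappyAccumulator) persistSnapshot(txn *sql.Tx, roomID string, state []json.RawMessage) (int64, error) {
	return 0, nil
}

// accumulateTimeline would roll the timeline forward on top of the given
// snapshot (stubbed here).
func (a *GappyAccumulator) accumulateTimeline(txn *sql.Tx, roomID string, snapshotID int64, timeline []json.RawMessage) error {
	return nil
}

// ProcessGappyState snapshots the state block and rolls the timeline
// forward in one transaction, so no other poller's work can interleave
// between the snapshot write and the timeline write.
func (a *GappyAccumulator) ProcessGappyState(roomID string, state, timeline []json.RawMessage) error {
	txn, err := a.db.Begin()
	if err != nil {
		return err
	}
	defer txn.Rollback()
	snapshotID, err := a.persistSnapshot(txn, roomID, state)
	if err != nil {
		return err
	}
	if err := a.accumulateTimeline(txn, roomID, snapshotID, timeline); err != nil {
		return err
	}
	return txn.Commit()
}
```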

@kegsay kegsay added the poller label Jul 20, 2023
@kegsay kegsay self-assigned this Jul 21, 2023
@kegsay
Member

kegsay commented Jul 21, 2023

Currently:

  • Initialise -> generate a state snapshot and return its ID, but do nothing with it. Ignored if we already have a state snapshot.
  • Accumulate -> load most recent snapshot, roll forward state.

This only works because we process Initialise before Accumulate, and we assume that either:
A) there is no Initialise because there is no history, or
B) history exists and this is a room join,

and that we won't process new timeline events until we've Initialise'd at least once. These assumptions break for gappy state.

It breaks because you can have someone:
A) join the room with events ahead of the users already joined to the room (slow poller), or
B) come back after a period of inactivity (token refresh or idle timeout), which will cause the state block to include events we have not seen before.

If we do not process the state blocks, we'll end up with bad room state. We partly fixed this by prepending unknown state events to the timeline section, to force snapshot calculation and ensure we converge on the correct room state. This breaks the timeline though (it won't show the right events). It also caused huge latency spikes (~9 min) when Synapse occasionally sends 13k unknown state events in one sync response, which wedges the executor entirely. This was because we were prepending these as timeline events and calculating a state snapshot for each one. Instead, we want to calculate a single state snapshot and use that for the timeline[0] event.

This complicates matters because of the following scenario:

  • Slow poller plodding through the room on timeline event T with room snapshot S.
  • Fast poller / catchup causes a new snapshot at S+100 (100 unknown events) with T+200:250.

We want the slow poller to use snapshot S when it processes T+1.
We want the fast poller to use snapshot S+100 when it processes T+200 (and, when it gets to the next sync loop, T+251).

This means we need to do Initialise/Accumulate in a single atomic operation, otherwise you could imagine a scenario where we process the fast poller's
Initialise S+100 and then the executor executes the slow poller's T+1, corrupting state even with a queue of work to do.

Separately, we need to have some way to know what the previous (last) state snapshot was for each room for each poller so we can ensure we base our
Accumulate operations on the right baseline. Currently we always assume that we operate on the latest snapshot if we see new events. This would mean:

  • Slow poller remembers it is on snapshot S when it receives T+1.
  • Fast poller remembers it is on snapshot S+100 when it receives T+251.

When the slow poller catches up to T+200, it will see that we have event T+200 in the DB already, and do no further work.
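
A rough sketch of the per-poller baseline idea, under hypothetical names (baselineTracker, pollerKey); this is not the proxy's actual code, just an illustration that the baseline is keyed per poller per room rather than per room globally:

```go
// Sketch only: hypothetical tracker for the snapshot each poller last used.
package accumulator

import "sync"

type pollerKey struct {
	DeviceID string
	RoomID   string
}

// baselineTracker remembers which state snapshot each poller last
// accumulated against in each room, so a slow poller keeps rolling
// forward from S while a fast poller rolls forward from S+100.
type baselineTracker struct {
	mu        sync.Mutex
	snapshots map[pollerKey]int64
}

func (t *baselineTracker) Baseline(deviceID, roomID string) (snapshotID int64, ok bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	snapshotID, ok = t.snapshots[pollerKey{deviceID, roomID}]
	return
}

func (t *baselineTracker) Remember(deviceID, roomID string, snapshotID int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.snapshots == nil {
		t.snapshots = make(map[pollerKey]int64)
	}
	t.snapshots[pollerKey{deviceID, roomID}] = snapshotID
}
```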

@kegsay
Member

kegsay commented Jul 21, 2023

Problem: the timeline ordering is messed up with this solution because the slow poller will be inserting unknown old events which have a higher event NID, which will be caught by the nid range calcs.

We could maybe say "if you're an old state snapshot then ignore your timeline events" BUT how do you detect who is old and who is not?

We cannot store max(event_nid) on the state snapshot to tell new from old, because a slower poller could receive old state which nonetheless gets a higher event NID, as it is an event we have not seen before.
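
A tiny self-contained illustration (made-up NIDs and event IDs) of the inversion being described:

```go
// Illustration only: NIDs reflect the order we insert events into the DB,
// not the order the events happened in the room.
package main

import "fmt"

type storedEvent struct {
	NID     int64  // allocated at insert time
	EventID string
}

func main() {
	// The fast poller inserts a recent event first: it gets NID 101.
	recent := storedEvent{NID: 101, EventID: "$new_event"}
	// The slow poller later inserts an *older* event we had never seen,
	// so it gets the *higher* NID 102.
	old := storedEvent{NID: 102, EventID: "$old_event"}
	fmt.Println(recent.NID < old.NID) // true, even though $old_event is older in the room
}
```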

@kegsay
Member

kegsay commented Jul 21, 2023

dmr sez we need an authoritative ordering from the HS to fix this for sure. Others want this too, e.g. https://github.com/matrix-org/matrix-spec-proposals/blob/andybalaam/event-thread-and-order/proposals/4033-event-thread-and-order.md

@kegsay
Member

kegsay commented Jul 21, 2023

Conclusion: use origin_server_ts to drop old events, but how?

@DMRobertson
Contributor Author

dmr sez we need an authoritative ordering from the HS to fix this for sure.

Otherwise we're just going to be guessing and using heuristics.

@kegsay
Member

kegsay commented Jul 24, 2023

Assuming we had matrix-org/matrix-spec-proposals#4033 - what would this look like?

  • Remember the highest ordering for each room, and drop timeline events < this ordering.
  • Could potentially cache this in the poller to reduce DB write frequency.

Something like:

  • On startup, load the latest event NID for each room and pull out its unsigned.ordering. Remember these in a map[string]int64.
  • When we get a new event, ignore timeline events with an ordering <= that value. This would replace the logic that drops "old" events based on timeline event anchors, as sketched below.
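
Roughly, the dedup could look like this, assuming events carry an authoritative unsigned.ordering as MSC4033 proposes; all names here are hypothetical since the MSC has not landed:

```go
// Sketch only: MSC4033-style ordering filter, assuming positive orderings.
package poller

// orderingFilter remembers the highest unsigned.ordering persisted per
// room and drops timeline events at or below it.
type orderingFilter struct {
	latest map[string]int64 // room ID -> highest ordering persisted
}

type unsignedData struct {
	Ordering int64 `json:"ordering"`
}

type timelineEvent struct {
	RoomID   string       `json:"room_id"`
	Unsigned unsignedData `json:"unsigned"`
}

// ShouldAccumulate reports whether the event is newer than anything we
// have already persisted for its room, recording its ordering if so.
func (f *orderingFilter) ShouldAccumulate(ev timelineEvent) bool {
	if ev.Unsigned.Ordering <= f.latest[ev.RoomID] {
		return false // stale event, e.g. from a slow poller: drop it
	}
	if f.latest == nil {
		f.latest = make(map[string]int64)
	}
	f.latest[ev.RoomID] = ev.Unsigned.Ordering
	return true
}
```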

@kegsay
Member

kegsay commented Jul 28, 2023

Problem: the timeline ordering is messed up with this solution because the slow poller will be inserting unknown old events which have a higher event NID, which will be caught by the nid range calcs.

We're going to mark this as a known problem and wait for MSC4033 to land to fix it properly, as we will then have an authoritative ordering. Because of this, we HAVE to handle the fact that the proxy can diverge from the HS wrt state (this has always been true due to sync v2 not being able to convey state resets correctly). To provide an escape hatch, I propose we do #232 so a leave/re-join would self-heal the room.

The problem with that is now we have two concurrent snapshots: whatever was there before and now the new join event. We want to make sure we drop the old snapshot and use the "latest" snapshot (read: most recent). To do this without racing, we need to handle:

This means we need to do Initialise/Accumulate in a single atomic operation, otherwise you could imagine a scenario where we process the fast poller's Initialise S+100 and then the executor executes the slow poller's T+1, corrupting state even with a queue of work to do.

@kegsay
Member

kegsay commented Jul 28, 2023

Conclusion, make 2 PRs:

  • 1: Do not make snapshots for lone leave events #235: rejig how invite rejection is handled. Do not insert the leave event into the events table. Instead, if len(timeline) == 1 && state_key == user_id && membership == leave and there is no snapshot ID for $room_id, bundle the leave event into the V2LeftRoom payload and thread it through to connstate, which will feed it down like we do for invites. Rationale: the proxy was never in the room, so it isn't safe to make snapshots / timeline entries.
  • 2: Checkpoint room state #232 and WIP: create new state snapshots on join #248: refactor initialise/accumulate into one function, processRoomEvents, which atomically makes state snapshots and assigns them to the timeline. Two tiers of state blocks: with a create event == roll forward from the empty set []; without a create event == gappy state block, roll forward from whatever we currently think is the latest state. Create-event snapshots are marked in the DB as authoritative for debugging purposes. State block events are marked is_state=true so they stay out of the timeline. IGNORE TIMELINE EVENTS IF WE HAVE NO SNAPSHOT. Do not make a snapshot if we lack a create event and don't have a create event from an existing snapshot. (A sketch of this rule follows at the end of this comment.)

This fixes the problem with prependStateEvents for all four scenarios:

  • 1: Malformed events. We fixed this in Handle malformed events more gracefully #224 - previously we'd fail to insert the state block but blithely continue on, meaning the timeline would make a new snapshot from the empty set.
  • 2: Invite rejections. The leave event would make a new snapshot for the room from the empty set. Now we won't be calling processRoomEvents for this edge case.
  • 3: Stray events sent by Synapse from federation. These events are all in the timeline. If we were previously joined to the room we'd add it to the snapshot, but this is fine as the poller isn't in the room anymore. If people are in the room then their pollers will keep things in sync.
  • 4: Genuine gappy state syncs (e.g. rejoining a room after a period of absence, or an expired OIDC token). The state block will roll forward correctly as there will be no create event. Those state events won't be in the timeline. The N timeline events will be added correctly.
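
A sketch of the "two tiers of state blocks" rule from PR 2 above, with hypothetical names (StateEvent, latestSnapshot, baseStateFor); this is not the real processRoomEvents, only an illustration of the branching it describes:

```go
// Sketch only: decide what state to roll a state block forward from.
package accumulator

type StateEvent struct {
	Type     string
	StateKey string
}

// latestSnapshot stands in for loading the room's most recent snapshot.
func latestSnapshot(roomID string) (state []StateEvent, ok bool) {
	return nil, false
}

// baseStateFor returns the state to roll the block forward from, and
// whether a snapshot should be made at all.
func baseStateFor(roomID string, stateBlock []StateEvent) ([]StateEvent, bool) {
	for _, ev := range stateBlock {
		if ev.Type == "m.room.create" && ev.StateKey == "" {
			// Block contains the create event: authoritative snapshot,
			// roll forward from the empty set.
			return nil, true
		}
	}
	prev, ok := latestSnapshot(roomID)
	if !ok {
		// No create event here and no existing snapshot: make no
		// snapshot and ignore the timeline events.
		return nil, false
	}
	// Gappy state block: roll forward from the latest known state.
	return prev, true
}
```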

@kegsay
Member

kegsay commented Jul 31, 2023

#235 for point 1.

@kegsay kegsay added the roadmap label Aug 1, 2023
kegsay added a commit that referenced this issue Aug 11, 2023
As per #211 - this combines Initialise and Accumulate into a single
ProcessRoomEvents function which can do snapshots / timelines.

Implements the "brand new snapshot on create event" logic, which
currently does not correctly invalidate caches.

E2E tests pass, but integ tests are broken.
@DMRobertson
Contributor Author

We think the approach in #248 is the correct way to fix this. But it touches a lot of core code; it is a high-risk change.

Taking a step back, we know about two ways to hit this:

  1. reject an invite, then someone else later joins the room.
  2. leave a room, have your token expire for ages, then rejoin the room. We'll get a gappy sync v2 response with the entire room state.

@kegsay says that we could detect (1) pretty easily and use a different means to avoid the large mass of prepended state events. That is a much lower risk change and might be preferable in the short term.

That wouldn't handle (2) though.

@DMRobertson
Contributor Author

Current thinking: only create a brand-new snapshot in Accumulate if the timeline is the start of the room. That avoids the pain of (2) and (3) of #211 (comment). (4) isn't changed: it'll work but will still be slow.

There's also the race described in #211 (comment) which would need rethinking to solve.
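
A small sketch of that check, with a hypothetical event type; the real Accumulate code differs:

```go
// Sketch only: only mint a brand-new snapshot in Accumulate when the
// timeline is the start of the room, i.e. its first event is m.room.create.
package accumulator

type RawEvent struct {
	Type     string  `json:"type"`
	StateKey *string `json:"state_key,omitempty"`
}

func timelineStartsRoom(timeline []RawEvent) bool {
	if len(timeline) == 0 {
		return false
	}
	first := timeline[0]
	return first.Type == "m.room.create" &&
		first.StateKey != nil && *first.StateKey == ""
}
```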

DMRobertson pushed a commit that referenced this issue Aug 17, 2023
Described in #211 (comment).

This is essentially cherry-picked from #248, in particular commit 8c7046e.

This should prevent creating new snapshots that don't reflect the state
of the room. We'll need a followup task to clean up bad snapshots.
@kegsay
Member

kegsay commented Aug 22, 2023

#255 should fix this for the case where we get stray events from upstream.

It still means that if a poller is expired and reconnects months later with 30k+ state events in its gappy sync, it will be slow, but this is an improvement on the status quo.

@kegsay
Member

kegsay commented Aug 24, 2023

Closing this, as this issue is far too large and consists of many moving parts, most of which are now fixed.

The remaining issue is:

  • 4: Genuine gappy state syncs (e.g. rejoining a room after a period of absence, or an expired OIDC token).

This can be fixed by #270 in the case where the poller is deactivated due to inactivity. It is trickier for OIDC-refreshed tokens though, as we don't want an expired token to cause rooms to be deleted.

Assuming we delete the device and its tokens when they expire after 30 days, and only that triggers room deletion, we may get away with just #270: expired tokens then won't delete rooms, and we allow a 30-day grace period for the client to reappear before nuking the rooms. This assumes that the state block for a room isn't much larger after 30 days.
