This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Remove _get_events_cache check optimisation from _have_seen_events_dict #14161

Merged

merged 4 commits into develop from anoa/have_seen_events_no_cache on Oct 18, 2022

Conversation


@anoadragon453 anoadragon453 commented Oct 12, 2022

Fixes #11521, credit to @richvdh for the idea!

When we join a room in Synapse (/send_join), we receive the auth chain and the current set of state from the room. Those events are passed through EventWorkerStore.have_seen_events() (via FederationEventHandler._auth_and_persist_outliers), and any events that we think we've already seen, we drop. Seems sensible.

As an optimisation, _have_seen_events_dict (called from have_seen_events) checks the _get_event_cache before checking the database:

# if the event cache contains the event, obviously we've seen it.
cache_results = {
    event_id
    for event_id in event_ids
    if await self._get_event_cache.contains((event_id,))
}
results = dict.fromkeys(cache_results, True)
remaining = [
    event_id for event_id in event_ids if event_id not in cache_results
]
if not remaining:
    return results

Unfortunately, due to #13476, entries in the _get_event_cache are not invalidated when a room is purged. What that means is that you'll:

  • have events from a room in your _get_event_cache
  • purge the room; the events are gone from the database but still live in the _get_event_cache ⚠️
  • rejoin the room over federation; you get a bunch of state from the remote homeserver, which you subsequently drop on the floor because have_seen_events thinks those events are already persisted
  • you are left with a broken room; the database only contains the (new) events since you left the room (you won't even have an m.room.create event!). A minimal simulation of this sequence follows below.
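
To make that concrete, here is a minimal toy simulation of the failure mode in plain Python (toy stand-ins, not actual Synapse code):

# Toy simulation of the bug, not Synapse code.
db = {"$create", "$message"}        # event IDs persisted in the database
event_cache = set(db)               # events are also cached when persisted

# Purge the room: the database rows go away, but (because of #13476) the
# cache entries are never invalidated.
db.clear()

def have_seen_event(event_id: str) -> bool:
    # The (now removed) optimisation: trust the cache before the database.
    return event_id in event_cache or event_id in db

# On rejoin, the remote homeserver sends "$create" again; we wrongly
# conclude we already have it and drop it on the floor.
assert have_seen_event("$create")   # True -- a stale cache entry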

Ideally we'd fix things so that entries in _get_event_cache are correctly invalidated when a room is purged. There is a WIP plan to do so, but it's a big job. For now, we can just remove this optimisation as a quick win, as it's causing more harm than good. (The optimisation was originally added in #9601).
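
For illustration, the shape of the remaining logic is roughly the following. This is a simplified sketch, not the literal diff: the real method keys on (room_id, event_id) pairs and sits behind a @cachedList decorator, but the point is that every lookup now goes to the database.

# Simplified sketch -- always ask the database whether we have the events.
async def _have_seen_events_dict(self, event_ids):
    rows = await self.db_pool.simple_select_many_batch(
        table="events",
        column="event_id",
        iterable=event_ids,
        keyvalues={},
        retcols=("event_id",),
        desc="have_seen_events_dict",
    )
    found = {row["event_id"] for row in rows}
    return {event_id: event_id in found for event_id in event_ids}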

(If you look closely, you'll notice that _have_seen_events_dict has a cache as well. Not to worry, that cache is correctly cleared when a room is purged.)
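
For the curious, that cache is keyed on (room_id, event_id), which is what makes clearing it on purge straightforward: drop every entry for the room. A toy illustration, not Synapse's real cache classes:

class HaveSeenCache:
    def __init__(self):
        self._entries = {}  # (room_id, event_id) -> bool

    def set(self, room_id, event_id, seen):
        self._entries[(room_id, event_id)] = seen

    def invalidate_room(self, room_id):
        # Drop every entry whose key belongs to this room.
        for key in [k for k in self._entries if k[0] == room_id]:
            del self._entries[key]

cache = HaveSeenCache()
cache.set("!room:a", "$create", True)
cache.invalidate_room("!room:a")  # called as part of a room purge
assert ("!room:a", "$create") not in cache._entries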


Note that when you backfill (non-state) events from a remote homeserver, those also go through the _get_event_cache check and will still be dropped on the floor. #14164 is the fix for that part.

Checking this cache is currently an invalid assumption, as the _get_event_cache is not correctly invalidated when purging events from a room. Remove this optimisation for now, as it's causing more harm than good.

We can re-add it after fixing _get_event_cache.
@anoadragon453 anoadragon453 marked this pull request as ready for review October 12, 2022 17:38
@anoadragon453 anoadragon453 requested a review from a team as a code owner October 12, 2022 17:38

anoadragon453 commented Oct 12, 2022

I wasn't really sure how to test this automatically, so I tested it manually. The steps I took were:

  1. Using the demo scripts, set up 2 federating homeservers. Alice and Bob are on different homeservers.
  2. Alice creates a room and invites Bob to it. They exchange a few messages.
  3. Bob leaves and forgets the room, then purges it from the database (I used the manhole here; see the sketch after this list).
  4. Bob then decides that it was all a mistake and wants to rejoin. Alice re-invites him and he joins.
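
For reference, the manhole purge in step 3 looked roughly like the following. The manhole exposes hs, and its REPL is not natively async, so the coroutine is wrapped in a Deferred; the exact handler and method names are an assumption based on Synapse around this time.

# Run inside the Synapse manhole. Handler/method names are assumptions.
from twisted.internet import defer

d = defer.ensureDeferred(
    hs.get_pagination_handler().purge_room(
        "!GoQuefrUNXjlKNLbGr:localhost:8481"
    )
)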

Before this change, Bob was met with a broken room. Here is how it looked in the database:

sqlite> select * from events where room_id = '!GoQuefrUNXjlKNLbGr:localhost:8481' order by depth;
55|14|$vewcwPGTiprnoUsQkdLCSYcyucl_VyBdVl1uTCu2sYg|m.room.member|!GoQuefrUNXjlKNLbGr:localhost:8481|||1|1|14|1665588462315|1665588462350|@admin:localhost:8481|0|master|@admin:localhost:8480|
56|15|$_uc0Q0oK_qRroV8Iv5a6RYh_H59-I0_qy3TvRWmMvYI|m.room.member|!GoQuefrUNXjlKNLbGr:localhost:8481|||1|0|15|1665588464994|1665588465064|@admin:localhost:8480|0|master|@admin:localhost:8480|
sqlite>

and if we explore the room state in Element:

[Screenshot from 2022-10-12 16-29-25: the room state as shown in Element]

There are only 2 events in total: Bob's invite and his subsequent join. All other events were in the _get_event_cache, so they weren't persisted.

After this PR, we instead see most of the events we were expecting. All state events are present, but messages are missing.

This is because backfilling events also checks the event cache. This case is fixed in #14164.

@anoadragon453

I successfully tested the following on my homeserver on develop (2c63cdc) + #14164 + #14161 (this PR):

  • Leaving Synapse Admins (#synapse:matrix.org)
  • Purging the room via the manhole
  • Rejoining the room
  • Sending and receiving messages

Worked as expected 🎉

@anoadragon453 anoadragon453 merged commit 828b550 into develop Oct 18, 2022
@anoadragon453 anoadragon453 deleted the anoa/have_seen_events_no_cache branch October 18, 2022 09:33
Development

Successfully merging this pull request may close these issues.

completely broken room after purge_room/rejoin