Missing sync events during historical batch imports #12281
Based upon reviewing the MSC and room batch code, for the sync itself it seems something is confusing the sync handler, because none of the batch events should ever come down via sync (noted as an issue on the MSC). In our live tests we do see batches come down with sync requests, which I now believe to be incorrect. Have just checked, and replaying the same sync request (the account is untouched since the original issue) does include the missing events.
Very curious, so you're saying that replaying the sync request is giving you all the expected state events now? Would you be happy to share a new trimmed extract of it? Also, for the sake of tracking what's going on, would you be able to post the event content of the MSC2716 batch and insertion events? I don't think they're critical but it may help to show how this fits together; I'm still getting up to speed with MSC2716 so having the real-world example would be beneficial personally.
My apologies, it does miss the same events (phew!), so the issue is consistent. Must've copied the wrong `since` token somewhere. Here's the sync response when I re-create it today: https://gist.github.com/Fizzadar/316dabfe9be30ebaa48e288de2380de1#file-2022-03-24-trimmed-sync-s2633573_17_338_6732159_1082514_541479_274711_265584_1-json Working on getting the event content too, will update here.
👍 Just pulled this down: https://gist.github.com/Fizzadar/316dabfe9be30ebaa48e288de2380de1#file-events_content-json
@Fizzadar It's very hard to follow that gist and what's expected vs missing. Is it possible to summarize that more in terms of the requests that happened, and in what order?
Then I can try to create a Complement test to reproduce and understand better.
They can come down an initial sync. But they shouldn't come down an incremental sync (one with a `since` token).
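For anyone following along, a minimal sketch of that distinction (the homeserver URL and access token below are placeholders, not from the reporter's setup):

```python
import requests

HOMESERVER = "https://example.org"                    # placeholder homeserver
HEADERS = {"Authorization": "Bearer <access_token>"}  # placeholder token

# Initial sync: no `since` parameter. Historical (MSC2716 batch-imported)
# state may legitimately show up here.
initial = requests.get(f"{HOMESERVER}/_matrix/client/v3/sync", headers=HEADERS).json()

# Incremental sync: pass the previous `next_batch` as `since`. Events that
# were inserted into the past via batch sends should not normally appear here.
incremental = requests.get(
    f"{HOMESERVER}/_matrix/client/v3/sync",
    headers=HEADERS,
    params={"since": initial["next_batch"], "timeout": 30000},
).json()
```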
@MadLittleMods sure thing, describing the order below; I've also included the raw bridge logs specific to the room in question:
Note that this is not consistent - of 20 rooms being created + backfilled in parallel there may be one or two that have missing state as described above. I've added the raw logs from the bridge specific to the affected room in the gist.
I'm curious if the issue is using a sync token at a particular point in the event graph, and the inconsistency of whether we can reproduce it or not is just the frequency of the client syncing and where the graph for a particular room gets divided between syncs. I wonder if we could take a room that "worked" for us and then manually bump back the sync token up the event graph until we get a sync that doesn't include the m.bridge event anymore but should.
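A rough sketch of how that bisection could be scripted - the assumption that the leading `s<N>` component of a Synapse sync token is the event stream position and can simply be decremented is mine, and the homeserver, token and room ID are placeholders:

```python
import re
import requests

HOMESERVER = "https://example.org"                    # placeholder
HEADERS = {"Authorization": "Bearer <access_token>"}  # placeholder
ROOM_ID = "!room:example.org"                         # placeholder

def sync_has_bridge_event(since: str) -> bool:
    """Replay an incremental sync and report whether the room's m.bridge
    state event comes down anywhere in the response for ROOM_ID."""
    resp = requests.get(
        f"{HOMESERVER}/_matrix/client/v3/sync",
        headers=HEADERS,
        params={"since": since},
    ).json()
    room = resp.get("rooms", {}).get("join", {}).get(ROOM_ID, {})
    events = room.get("state", {}).get("events", []) + room.get("timeline", {}).get("events", [])
    return any(ev.get("type") == "m.bridge" for ev in events)

def roll_back(token: str, step: int) -> str:
    """Assumed token format: `s<event_stream_position>_<other positions...>`."""
    match = re.match(r"s(\d+)(.*)", token)
    assert match is not None, "unexpected sync token format"
    position, rest = int(match.group(1)), match.group(2)
    return f"s{max(position - step, 0)}{rest}"

since = "s2633573_17_338_6732159_1082514_541479_274711_265584_1"  # token from the gist
for _ in range(500):  # bounded so this can't loop forever
    if not sync_has_bridge_event(since):
        print("m.bridge no longer comes down when syncing from:", since)
        break
    since = roll_back(since, step=10)
```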
This sounds pretty plausible.
When I look at the Synapse database dump of events to get the full picture and compare it against the stream tokens in the sync responses, things don't obviously line up. It's possible I'm interpreting those stream tokens wrong too 🤷
Lines 436 to 439 in 3c41d87
@Fizzadar I'm not confident in the correct case though. Can you post the Synapse db dump and the two sync responses?

Actual reproduction gist questions

I'm still fuzzy on some details:
Complement test

I wrote a Complement test to try to reproduce the problem but I am seeing some differences:
All correct.
Yes - for the bot user, the sync user
I expect to see the missing state events. However, this actually appears to be at odds with the spec; according to the spec I would not expect to see either of those events.
This looks correct - I think the only thing missing is the m.bridge event.
Agreed, I think it's just for display purposes but it's not clear. What is clear is any…
@Fizzadar Just to make sure we're not chasing after an X/Y problem, what is the actual root problem that you are dealing with? Why does it matter that the m.bridge event doesn't come down /sync?

@Fizzadar Can you post the Synapse db dump and the two sync responses?
Diving into this more on Synapse, it looks like…
👍 For my own reference, here are the API docs that explain that:
I think it's fine with… We do the same…
We're using this to display the correct bridge icon in the UI, which is how we discovered the issue.
Yep, sorry, my bad, the check is indeed against…
Absolutely, I've added the events and two syncs of a room that worked during the same bridge setup in this gist. I have also added annotations as you did above with the sync positions, highlighting where the state came down. This is where things get interesting - the second sync contains only events under… It does feel like when joining a room for the first time all previous state should be included, to allow clients to build a full picture of the room's state. This is indeed what appears to happen in Synapse (excluding this specific issue/situation). However, this does not appear to be defined within the spec at all…
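For reference, this is roughly the response shape in question (abridged, with placeholder identifiers); the expectation described above is that state predating the timeline still appears under `state.events` in the first sync after joining:

```python
# Abridged /sync fragment for a newly joined room (illustrative values only).
expected_first_joined_sync = {
    "rooms": {
        "join": {
            "!room:example.org": {
                "state": {
                    # Expectation: state set before the timeline window, such
                    # as the m.bridge event, is delivered here on first join.
                    "events": [
                        {"type": "m.bridge", "state_key": "...", "content": {}},
                    ],
                },
                "timeline": {
                    "events": [
                        {
                            "type": "m.room.member",
                            "state_key": "@user:example.org",
                            "content": {"membership": "join"},
                        },
                    ],
                    "limited": False,
                    "prev_batch": "t12-345_678",
                },
            },
        },
    },
}
```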
This feels like a hole in the spec, no? If you're joining a room for the first time, what's the expected behaviour of clients with respect to fetching the room state? It appears that clients are just assuming they'll get it as part of sync (this is what we're seeing with our matrix-js-sdk fork) but that doesn't seem to be specified.
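Until that's pinned down in the spec, one client-side option is to fetch the state explicitly rather than relying on /sync; a minimal sketch with placeholder identifiers:

```python
import requests

HOMESERVER = "https://example.org"                    # placeholder
HEADERS = {"Authorization": "Bearer <access_token>"}  # placeholder
ROOM_ID = "!room:example.org"                         # placeholder

# GET /rooms/{roomId}/state returns the room's full current state regardless
# of what /sync happened to deliver.
state_events = requests.get(
    f"{HOMESERVER}/_matrix/client/v3/rooms/{ROOM_ID}/state",
    headers=HEADERS,
).json()

bridge_events = [ev for ev in state_events if ev.get("type") == "m.bridge"]
```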
I think I've just identified the situation in which this is occurring - it appears to happen in rooms where the first joined sync is not limited (i.e. `limited: false`).

Was looking at the codebase, but it appears newly joined rooms should always get full state (and the limited flag should make no difference):

synapse/synapse/handlers/sync.py Lines 2041 to 2043 in 8fe930c
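In other words, the expected rule is roughly the following (a deliberately simplified sketch of the behaviour being described, not Synapse's actual implementation):

```python
def wants_full_state(full_state_requested: bool, is_initial_sync: bool, newly_joined: bool) -> bool:
    # A room should get its full current state when the client explicitly
    # asked for it, on an initial sync, or when the user has just joined;
    # whether the timeline was `limited` should make no difference.
    return full_state_requested or is_initial_sync or newly_joined
```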
Note: when I replay the working gist example above it no longer works (no m.bridge event comes down). In fact, this appears to be the case for every room that has been backfilled using MSC2716 batch sends.

Had some time for some in-depth investigations; I've bootstrapped a local sync worker pointing at our development database used in all these examples and tracked down the problem to the code below:

synapse/synapse/handlers/sync.py Lines 1764 to 1774 in 8fe930c
In the failing sync from the original message in this issue, this situation seems to occur when a sync starts in between the invite event and the join event. I can reliably replicate this for other rooms by rolling back the sync token to between the two events. The problem appears to be in this code:

synapse/synapse/handlers/sync.py Lines 686 to 692 in 8fe930c
This is where MSC2716 possibly comes in, because this returns the first… I made a small change here to handle this and the issue appears to be resolved in all my testing.
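My reading of the failure mode, as a hypothetical illustration (this is not Synapse's code, and the membership-lookup details are assumptions): if the `since` position lands between the invite and the join, the room must still be recognised as newly joined, otherwise the pre-existing state is never sent.

```python
from dataclasses import dataclass

@dataclass
class MembershipEvent:
    membership: str        # "invite", "join", ...
    stream_ordering: int   # position in the event stream

def is_newly_joined(history: list[MembershipEvent], since_pos: int, now_pos: int) -> bool:
    """Hypothetical check: newly joined means the user was not joined at the
    `since` position but is joined at the current position."""
    def membership_at(pos: int) -> str | None:
        before = [m for m in history if m.stream_ordering <= pos]
        return before[-1].membership if before else None

    return membership_at(since_pos) != "join" and membership_at(now_pos) == "join"

# The scenario from this issue: invite, then join, with the sync token
# falling between the two membership events.
history = [MembershipEvent("invite", 100), MembershipEvent("join", 110)]
assert is_newly_joined(history, since_pos=105, now_pos=200)  # must count as newly joined
```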
Great find @Fizzadar! Here are my own notes for understanding the problem as well:

Why is Synapse not sending all of the initial state when…
@Fizzadar Make a PR with your fix so we can get some more feedback from maintainers. Would be good to add a test for this so it doesn't regress either (other MSC2716 tests already exist to build from). Any interest in taking it further to fix…
@MadLittleMods great write-up, thank you for pulling that all together! I've gone ahead and implemented the change…
…provided. Part of #12281. Context: #12319 (comment)
Split out from #9424 (comment) onwards. I have inlined the relevant context from that issue below.
We have re-created the following issue; I've included the event list and trimmed sync responses in this gist: https://gist.github.com/Fizzadar/316dabfe9be30ebaa48e288de2380de1. Some notes:
Notably, this is only occurring during initial backfill using MSC2716 batch sends (a rough sketch of such a batch-send request follows after these notes). When this is disabled on the bridge we have not yet replicated the problem seen here.
The DAG:
Note that this DAG looks similar to rooms where this problem did not occur.
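For completeness, a rough sketch of the kind of batch-send request the bridge issues during this backfill; the endpoint path, query parameter and body shape follow MSC2716 but may differ between Synapse versions, and all identifiers here are placeholders:

```python
import requests

HOMESERVER = "https://example.org"                        # placeholder
HEADERS = {"Authorization": "Bearer <appservice_token>"}  # placeholder
ROOM_ID = "!room:example.org"                             # placeholder
PREV_EVENT_ID = "$someEventAlreadyInTheRoom"              # insertion point in the DAG

batch = {
    # Membership/state needed for the historical senders, applied at the
    # start of the batch.
    "state_events_at_start": [
        {
            "type": "m.room.member",
            "state_key": "@historical_user:example.org",
            "sender": "@historical_user:example.org",
            "origin_server_ts": 1640000000000,
            "content": {"membership": "join"},
        },
    ],
    # The historical messages being imported.
    "events": [
        {
            "type": "m.room.message",
            "sender": "@historical_user:example.org",
            "origin_server_ts": 1640000000001,
            "content": {"msgtype": "m.text", "body": "imported message"},
        },
    ],
}

resp = requests.post(
    f"{HOMESERVER}/_matrix/client/unstable/org.matrix.msc2716/rooms/{ROOM_ID}/batch_send",
    headers=HEADERS,
    params={"prev_event_id": PREV_EVENT_ID},
    json=batch,
)
print(resp.json())  # on success this includes the insertion/batch event IDs
```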