This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Missing sync events during historical batch imports #12281

Closed
Fizzadar opened this issue Mar 24, 2022 · 18 comments · Fixed by #12319
Labels
A-Sync defects related to /sync S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Fizzadar
Contributor

Fizzadar commented Mar 24, 2022

Split out from #9424 (comment) onwards. I have inlined the relevant context from that issue below.

We have re-created the following issue; I've included the event list and trimmed sync responses in this gist: https://gist.github.com/Fizzadar/316dabfe9be30ebaa48e288de2380de1. Some notes:

  • The first sync containing the invite seems to correctly pull everything
  • The second sync includes member information but misses out the prior state
  • The invite also appears to arrive after the room creation event
  • Prior syncs do not reference the room at all (can provide JSON dumps if useful)

Notably, this only occurs during initial backfill using MSC2716 batch sends. With batch sending disabled on the bridge, we have not yet been able to reproduce the problem seen here.

The DAG:

synapse=> SELECT stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
LEFT JOIN event_json AS ej USING (event_id)
WHERE events.room_id ='!SjekxtmsBigrUMcMkX:beeper-dev.com' ORDER BY stream_ordering ASC;
 stream_ordering |             type             |                          state_key                           |                   event_id                   |                   prev_events
-----------------+------------------------------+--------------------------------------------------------------+----------------------------------------------+--------------------------------------------------
         -113203 | org.matrix.msc2716.insertion |                                                    | $G5GLwX6lc8MXKqwauoM1p9lkLD6S5GfwcosFlPra0ws | []
         -113202 | m.room.encrypted             |                                                    | $I3PsF44AMu5I_XOZQ2Hm1cOhcdLJCbA8NcAMS69hLEE | ["$G5GLwX6lc8MXKqwauoM1p9lkLD6S5GfwcosFlPra0ws"]
         -113201 | m.room.encrypted             |                                                    | $hIO0ZI-ZYr8gWGkWtvBiPfCA0jKdjiZP7PCnEfuutTY | ["$I3PsF44AMu5I_XOZQ2Hm1cOhcdLJCbA8NcAMS69hLEE"]
         -113200 | m.room.encrypted             |                                                    | $-jWSmA7NJ1dbl5b86K8FAKN2ydt-qoOw8ftCdSEN4A4 | ["$hIO0ZI-ZYr8gWGkWtvBiPfCA0jKdjiZP7PCnEfuutTY"]
         -113199 | m.room.encrypted             |                                                    | $5EtpKKW7g5HV5oCweU6vlbGjptR095p-kZSSSy2tBgs | ["$-jWSmA7NJ1dbl5b86K8FAKN2ydt-qoOw8ftCdSEN4A4"]
         -113198 | m.room.encrypted             |                                                    | $5aVzqGvybXzV3Xj3AwA8SQKqp2X2Z7RLH7IGX5VBeX4 | ["$5EtpKKW7g5HV5oCweU6vlbGjptR095p-kZSSSy2tBgs"]
         -113197 | org.matrix.msc2716.batch     |                                                    | $KkSRlRidPiOkkLEuiHs3gWLyCqCdQTHAVYJxzhDIzcM | ["$5aVzqGvybXzV3Xj3AwA8SQKqp2X2Z7RLH7IGX5VBeX4"]
         -111135 | org.matrix.msc2716.insertion |                                                    | $LQBf_c9-1stBXkPOLqZwkJgMg5YBCp448qcUwKLN6ks | []
         -111134 | m.room.encrypted             |                                                    | $RzKCB8rV0a9jfQVfbJeGvgGIZtT_X7ODb-78LUhXpXs | ["$LQBf_c9-1stBXkPOLqZwkJgMg5YBCp448qcUwKLN6ks"]
         -111133 | org.matrix.msc2716.batch     |                                                    | $l_jGmPdlA73wDzS3_gRwC67co1pTnx9-QBufta10xWI | ["$RzKCB8rV0a9jfQVfbJeGvgGIZtT_X7ODb-78LUhXpXs"]
         -111132 | org.matrix.msc2716.insertion |                                                    | $_mT6Awgb70IjwseWpCUeU0O4tm_iT7OBe4IPLJYUHUU | ["$a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68"]
         2633404 | m.room.create                |                                                    | $ZMCw8gIYOeNmC2mGNKiAdxbRdwuhDiA9gNaM8EwDlRM | []
         2633421 | m.room.member                | @_adam4_whatsapp_xxx:beeper-dev.com                | $a6LZbYMvr5yYRcFc0uethSHLVC_0l7LRUr0nCZXB2lQ | ["$ZMCw8gIYOeNmC2mGNKiAdxbRdwuhDiA9gNaM8EwDlRM"]
         2633434 | m.room.power_levels          |                                                    | $S2rnt0m3dfbgC0ySGI3RGSdIjNqhgjjilZ2Gm7jGtvg | ["$a6LZbYMvr5yYRcFc0uethSHLVC_0l7LRUr0nCZXB2lQ"]
         2633448 | m.room.join_rules            |                                                    | $5hkAty1mFIMtuFINEh8wA3OyIgnO6Qxh1HgCTT9c1p0 | ["$S2rnt0m3dfbgC0ySGI3RGSdIjNqhgjjilZ2Gm7jGtvg"]
         2633461 | m.room.history_visibility    |                                                    | $nNR6iQSvWPpoKRvfpXJXU_sWpnk8eCwTFJD9zwC2rfo | ["$5hkAty1mFIMtuFINEh8wA3OyIgnO6Qxh1HgCTT9c1p0"]
         2633469 | m.room.guest_access          |                                                    | $iNJMDkjnFwBekoCZASqZmiNtH8UMLsPc_1VHJD6IFpU | ["$nNR6iQSvWPpoKRvfpXJXU_sWpnk8eCwTFJD9zwC2rfo"]
         2633478 | m.bridge                     | net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net | $p5fs4ik0Lnj-mIbw-cjPHX_aWzx0x6O_rIrhSi8_1C8 | ["$iNJMDkjnFwBekoCZASqZmiNtH8UMLsPc_1VHJD6IFpU"]
         2633492 | uk.half-shot.bridge          | net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net | $tAZ3y00piOU_54qs0MuK-HRPivXNRBmN8x6yY5KWXkk | ["$p5fs4ik0Lnj-mIbw-cjPHX_aWzx0x6O_rIrhSi8_1C8"]
         2633503 | m.room.avatar                |                                                    | $BZOaOEhozp28cCJtdh8A6R2hKgXiuDmd1NWI2vEuyPs | ["$tAZ3y00piOU_54qs0MuK-HRPivXNRBmN8x6yY5KWXkk"]
         2633515 | m.room.encryption            |                                                    | $qTv8IJZIdlH_3S7atbGTrd-7p7UEt4siocS0vz8-K3Y | ["$BZOaOEhozp28cCJtdh8A6R2hKgXiuDmd1NWI2vEuyPs"]
         2633523 | m.room.name                  |                                                    | $joClbFehqiqiVOuYYDWGJeYaozJnaccZgVog8j4bB6g | ["$qTv8IJZIdlH_3S7atbGTrd-7p7UEt4siocS0vz8-K3Y"]
         2633533 | m.room.topic                 |                                                    | $WP0ALeVvT2oCRSt9qs4_bE2O5AKsu_4GqbTCPFwaKa4 | ["$joClbFehqiqiVOuYYDWGJeYaozJnaccZgVog8j4bB6g"]
         2633547 | m.room.member                | @_adam4_whatsapp_bot:beeper-dev.com                | $hq4kZZPt4ItYdBF_I_AVITFHOZOZ2b_Av8ss3F06byg | ["$WP0ALeVvT2oCRSt9qs4_bE2O5AKsu_4GqbTCPFwaKa4"]
         2633571 | m.room.member                | @adam4:beeper-dev.com                              | $DyFoSr8CbRUyexbx6_yuZxgS4S7lAEHDYgldiRGnppc | ["$hq4kZZPt4ItYdBF_I_AVITFHOZOZ2b_Av8ss3F06byg"]
         2633582 | m.room.member                | @adam4:beeper-dev.com                              | $51gzIcw5d1fvbNLqYXdBwma669-H8DZlz0UzsavMBbU | ["$DyFoSr8CbRUyexbx6_yuZxgS4S7lAEHDYgldiRGnppc"]
         2633590 | m.room.member                | @_adam4_whatsapp_bot:beeper-dev.com                | $Sx9mswlxUJs1rRL4UyHARmY7V0h8_cH_c_ZDnwTn3cc | ["$51gzIcw5d1fvbNLqYXdBwma669-H8DZlz0UzsavMBbU"]
         2633601 | fi.mau.dummy.portal_created  |                                                    | $a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68 | ["$Sx9mswlxUJs1rRL4UyHARmY7V0h8_cH_c_ZDnwTn3cc"]
         2633614 | fi.mau.dummy.pre_backfill    |                                                    | $E97WApcmVmCJ6nt2WL3TnuxhyCLnPl1dt-t6p25ZLwY | ["$a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68"]
         2633633 | fi.mau.dummy.backfill_end    |                                                    | $NILDi9SfVDgaW72alIEsc5ESzdJSoLip_c7klMAeuPo | ["$E97WApcmVmCJ6nt2WL3TnuxhyCLnPl1dt-t6p25ZLwY"]
(30 rows)

Note that this DAG looks similar to rooms where this problem did not occur.

@Fizzadar
Contributor Author

Fizzadar commented Mar 24, 2022

Based on reviewing the MSC and the room batch code, the prev_events field should be present for all insertion batches. I think this is possibly a separate issue.

For the sync itself it seems something is confusing the sync handler because none of the batch events should ever come down via sync (noted as an issue on the MSC). In our live tests we do see batches come down with sync requests which I now believe to be incorrect.

Have just checked and replaying the same sync request (account is untouched since the original issue) does include the missing m.bridge events as expected. It also includes the very first MSC2716 event, the only one with any prev_event content ($_mT6Awgb70IjwseWpCUeU0O4tm_iT7OBe4IPLJYUHUU).

@reivilibre reivilibre self-assigned this Mar 24, 2022
@reivilibre
Contributor

reivilibre commented Mar 24, 2022

Have just checked and replaying the same sync request (account is untouched since the original issue) does include the missing m.bridge events as expected. It also includes the very first MSC2716 event, the only one with any prev_event content ($_mT6Awgb70IjwseWpCUeU0O4tm_iT7OBe4IPLJYUHUU).

Very curious, so you're saying that replaying the sync request is giving you all the expected state events now?

Would you be happy to share a new trimmed extract of it?

Also, for the sake of tracking what's going on, would you be able to post the event content of the MSC2716 batch and insertion events? I don't think they're critical but it may help to show how this fits together; I'm still getting up to speed with MSC2716 so having the real world example would be beneficial personally.

@reivilibre
Contributor

reivilibre commented Mar 24, 2022

I've been making scribbles on this graph of the DAG (→ = prev-events of, so newer events are generally at the top); I thought attaching it here may be of use if others want to look at this too.

[attached image: annotated DAG diagram (2022_03_12281_dagger)]

The scribbles themselves are probably less useful to others for the moment.

@Fizzadar
Contributor Author

Very curious, so you're saying that replaying the sync request is giving you all the expected state events now?
Would you be happy to share a new trimmed extract of it?

My apologies: it does miss the same events (phew!), so the issue is consistent. I must've copied the wrong since token somewhere. Here's the sync response when I re-create it today: https://gist.github.com/Fizzadar/316dabfe9be30ebaa48e288de2380de1#file-2022-03-24-trimmed-sync-s2633573_17_338_6732159_1082514_541479_274711_265584_1-json

Working on getting the event content too; will update here.

@Fizzadar
Contributor Author

Also, for the sake of tracking what's going on, would you be able to post the event content of the MSC2716 batch and insertion events? I don't think they're critical but it may help to show how this fits together; I'm still getting up to speed with MSC2716 so having the real world example would be beneficial personally.

👍 Just pulled this down: https://gist.github.com/Fizzadar/316dabfe9be30ebaa48e288de2380de1#file-events_content-json

@MadLittleMods
Contributor

MadLittleMods commented Mar 24, 2022

@Fizzadar It's very hard to follow that gist and what's expected vs missing. Is it possible to summarize it in terms of the requests that happened?

ex.

  1. Room create
  2. /batch_send?prev_event_id=(which event)
    • Includes a few messages
  3. /batch_send?batch_id=continue
  4. /sync
    • Everything as expected
  5. /sync?since=a
    • Missing $xxx (m.room.member invite)
  6. etc (with more detail)

Then I can try to create a Complement test to reproduce and understand better.


For the sync itself it seems something is confusing the sync handler because none of the batch events should ever come down via sync (noted as an issue on the MSC). In our live tests we do see batches come down with sync requests which I now believe to be incorrect.

They can come down in an initial sync. But they shouldn't come down an incremental sync (with ?since), because the stream-ordering token will be after the negative stream_ordering of the historical batch.
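To make that concrete, here's a minimal sketch (hypothetical code, not Synapse's; the event ids and orderings are illustrative) of why batch-sent events never satisfy an incremental sync's stream filter:

```python
# An incremental sync only returns events whose stream_ordering falls after
# the since token. Batch-sent MSC2716 events get negative stream_ordering
# values, so they can never land in that window.

def incremental_sync_events(events, since_ordering, now_ordering):
    """Return the events a /sync?since=... would include."""
    return [
        e for e in events
        if since_ordering < e["stream_ordering"] <= now_ordering
    ]

events = [
    {"event_id": "$historical", "stream_ordering": -113203},  # batch-sent
    {"event_id": "$live_old", "stream_ordering": 2633404},    # before the token
    {"event_id": "$live_new", "stream_ordering": 2633582},    # after the token
]

included = incremental_sync_events(events, since_ordering=2633573, now_ordering=2633700)
assert [e["event_id"] for e in included] == ["$live_new"]
```

An initial sync has no since token, so the historical events are not filtered out there.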

@Fizzadar
Contributor Author

@MadLittleMods sure thing, describing the order below and I've also included the raw bridge logs specific for the room in question:

  1. Client starts; initialSync happens before any bridge or backfill
  2. Bridge starts, client signin/etc flow
  3. Bridge starts backfill process - note this creates + backfills up to 20 rooms in parallel
  4. Room create, initial state includes the m.bridge event as expected
  5. /sync?since=s2633508_17_338_6732159_1082514_541479_274711_265584_1
    a. this file in the gist, contains all the room state as expected in the invite state dict
  6. Bridge adds the user to the room via m.room.member state
  7. First /batch_send?prev_event_id=$a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68
  8. /sync?since=s2633573_17_338_6732159_1082514_541479_274711_265584_1
    a. this file in the gist; this is where the m.bridge state event ($p5fs4ik0Lnj-mIbw-cjPHX_aWzx0x6O_rIrhSi8_1C8) is missing, among other events
    b. replaying this request today gives the same result
  9. Second /batch_send?batch_id=IepWImCQ&prev_event_id=$a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68

Note that this is not consistent - of 20 rooms being created + backfilled in parallel there may be one or two that have missing state as described above.

I've added the raw logs from the bridge specific to the affected room in the gist.

@bradtgmurray
Contributor

I'm curious if the issue is using a sync token at a particular point in the event graph, and whether the inconsistency in reproducing it is just down to the frequency of the client syncing and where the graph for a particular room gets divided between syncs. I wonder if we could take a room that "worked" for us and then manually bump the sync token back up the event graph until we get a sync that no longer includes the m.bridge event when it should.

@MadLittleMods
Contributor

MadLittleMods commented Mar 25, 2022

I'm curious if the issue is using a sync token at a particular point in the event graph, and whether the inconsistency in reproducing it is just down to the frequency of the client syncing and where the graph for a particular room gets divided between syncs. I wonder if we could take a room that "worked" for us and then manually bump the sync token back up the event graph until we get a sync that no longer includes the m.bridge event when it should.

This sounds pretty plausible. rooms.join.<roomId>.state.events should only include:

Updates to the state, between the time indicated by the since parameter, and the start of the timeline (or all state up to the start of the timeline, if since is not given, or full_state is true).

-- https://spec.matrix.org/v1.1/client-server-api/#get_matrixclientv3sync

When I look at the Synapse database dump of events to get the full picture and compare with the /sync responses, I'm not quite seeing that behavior. I've sorted the stream tokens from (sync 1 ...) and (sync 2 ...) into the list below, and added (1) and (2) markers to the events that appeared in the corresponding /sync state list (invite_state in the first sync's case).

It's possible I'm interpreting those stream tokens wrong too 🤷

  • s2633508_17_338_6732159_1082514_541479_274711_265584_1 -> 2633508
  • s2633573_17_338_6732159_1082514_541479_274711_265584_1 -> 2633573

synapse/synapse/types.py

Lines 436 to 439 in 3c41d87

Live tokens start with an "s" followed by the "stream_ordering" id of the
event it comes after. Historic tokens start with a "t" followed by the
"topological_ordering" id of the event it comes after, followed by "-",
followed by the "stream_ordering" id of the event it comes after.
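A minimal parser for the two token formats quoted above (a sketch; the real implementation is RoomStreamToken in synapse/types.py). The full sync tokens in this issue carry extra underscore-separated numbers, which appear to be positions of Synapse's other update streams; only the leading number is the events stream position:

```python
# Sketch parser for Synapse room stream tokens, per the docstring quoted
# above. Assumption: trailing "_..." segments are other stream positions
# and can be ignored for the events stream.

def parse_room_stream_token(token: str) -> dict:
    if token.startswith("s"):
        # Live token: "s" + stream_ordering
        return {"stream_ordering": int(token[1:].split("_")[0])}
    if token.startswith("t"):
        # Historic token: "t" + topological_ordering + "-" + stream_ordering
        topo, stream = token[1:].split("-", 1)
        return {"topological_ordering": int(topo), "stream_ordering": int(stream)}
    raise ValueError(f"unknown token format: {token}")

assert parse_room_stream_token(
    "s2633573_17_338_6732159_1082514_541479_274711_265584_1"
)["stream_ordering"] == 2633573
```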

synapse=> SELECT stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
LEFT JOIN event_json AS ej USING (event_id)
WHERE events.room_id ='!SjekxtmsBigrUMcMkX:beeper-dev.com' ORDER BY stream_ordering ASC;
 stream_ordering |             type             |                          state_key                           |                   event_id                   |                   prev_events
-----------------+------------------------------+--------------------------------------------------------------+----------------------------------------------+--------------------------------------------------
         -113203 | org.matrix.msc2716.insertion |                                                    | $G5GLwX6lc8MXKqwauoM1p9lkLD6S5GfwcosFlPra0ws | []
         -113202 | m.room.encrypted             |                                                    | $I3PsF44AMu5I_XOZQ2Hm1cOhcdLJCbA8NcAMS69hLEE | ["$G5GLwX6lc8MXKqwauoM1p9lkLD6S5GfwcosFlPra0ws"]
         -113201 | m.room.encrypted             |                                                    | $hIO0ZI-ZYr8gWGkWtvBiPfCA0jKdjiZP7PCnEfuutTY | ["$I3PsF44AMu5I_XOZQ2Hm1cOhcdLJCbA8NcAMS69hLEE"]
         -113200 | m.room.encrypted             |                                                    | $-jWSmA7NJ1dbl5b86K8FAKN2ydt-qoOw8ftCdSEN4A4 | ["$hIO0ZI-ZYr8gWGkWtvBiPfCA0jKdjiZP7PCnEfuutTY"]
         -113199 | m.room.encrypted             |                                                    | $5EtpKKW7g5HV5oCweU6vlbGjptR095p-kZSSSy2tBgs | ["$-jWSmA7NJ1dbl5b86K8FAKN2ydt-qoOw8ftCdSEN4A4"]
         -113198 | m.room.encrypted             |                                                    | $5aVzqGvybXzV3Xj3AwA8SQKqp2X2Z7RLH7IGX5VBeX4 | ["$5EtpKKW7g5HV5oCweU6vlbGjptR095p-kZSSSy2tBgs"]
         -113197 | org.matrix.msc2716.batch     |                                                    | $KkSRlRidPiOkkLEuiHs3gWLyCqCdQTHAVYJxzhDIzcM | ["$5aVzqGvybXzV3Xj3AwA8SQKqp2X2Z7RLH7IGX5VBeX4"]
         -111135 | org.matrix.msc2716.insertion |                                                    | $LQBf_c9-1stBXkPOLqZwkJgMg5YBCp448qcUwKLN6ks | []
         -111134 | m.room.encrypted             |                                                    | $RzKCB8rV0a9jfQVfbJeGvgGIZtT_X7ODb-78LUhXpXs | ["$LQBf_c9-1stBXkPOLqZwkJgMg5YBCp448qcUwKLN6ks"]
         -111133 | org.matrix.msc2716.batch     |                                                    | $l_jGmPdlA73wDzS3_gRwC67co1pTnx9-QBufta10xWI | ["$RzKCB8rV0a9jfQVfbJeGvgGIZtT_X7ODb-78LUhXpXs"]
         -111132 | org.matrix.msc2716.insertion |                                                    | $_mT6Awgb70IjwseWpCUeU0O4tm_iT7OBe4IPLJYUHUU | ["$a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68"]
         2633404 | m.room.create (1)            |                                                    | $ZMCw8gIYOeNmC2mGNKiAdxbRdwuhDiA9gNaM8EwDlRM | []
         2633421 | m.room.member (1) (2)        | @_adam4_whatsapp_xxx:beeper-dev.com                | $a6LZbYMvr5yYRcFc0uethSHLVC_0l7LRUr0nCZXB2lQ | ["$ZMCw8gIYOeNmC2mGNKiAdxbRdwuhDiA9gNaM8EwDlRM"]
         2633434 | m.room.power_levels          |                                                    | $S2rnt0m3dfbgC0ySGI3RGSdIjNqhgjjilZ2Gm7jGtvg | ["$a6LZbYMvr5yYRcFc0uethSHLVC_0l7LRUr0nCZXB2lQ"]
         2633448 | m.room.join_rules (1)        |                                                    | $5hkAty1mFIMtuFINEh8wA3OyIgnO6Qxh1HgCTT9c1p0 | ["$S2rnt0m3dfbgC0ySGI3RGSdIjNqhgjjilZ2Gm7jGtvg"]
         2633461 | m.room.history_visibility    |                                                    | $nNR6iQSvWPpoKRvfpXJXU_sWpnk8eCwTFJD9zwC2rfo | ["$5hkAty1mFIMtuFINEh8wA3OyIgnO6Qxh1HgCTT9c1p0"]
         2633469 | m.room.guest_access          |                                                    | $iNJMDkjnFwBekoCZASqZmiNtH8UMLsPc_1VHJD6IFpU | ["$nNR6iQSvWPpoKRvfpXJXU_sWpnk8eCwTFJD9zwC2rfo"]
         2633478 | m.bridge (1)                 | net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net | $p5fs4ik0Lnj-mIbw-cjPHX_aWzx0x6O_rIrhSi8_1C8 | ["$iNJMDkjnFwBekoCZASqZmiNtH8UMLsPc_1VHJD6IFpU"]
         2633492 | uk.half-shot.bridge (1)      | net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net | $tAZ3y00piOU_54qs0MuK-HRPivXNRBmN8x6yY5KWXkk | ["$p5fs4ik0Lnj-mIbw-cjPHX_aWzx0x6O_rIrhSi8_1C8"]
         2633503 | m.room.avatar (1)            |                                                    | $BZOaOEhozp28cCJtdh8A6R2hKgXiuDmd1NWI2vEuyPs | ["$tAZ3y00piOU_54qs0MuK-HRPivXNRBmN8x6yY5KWXkk"]
-------- 2633508 (sync 1 while invited)
         2633515 | m.room.encryption (1)        |                                                    | $qTv8IJZIdlH_3S7atbGTrd-7p7UEt4siocS0vz8-K3Y | ["$BZOaOEhozp28cCJtdh8A6R2hKgXiuDmd1NWI2vEuyPs"]
         2633523 | m.room.name (1)              |                                                    | $joClbFehqiqiVOuYYDWGJeYaozJnaccZgVog8j4bB6g | ["$qTv8IJZIdlH_3S7atbGTrd-7p7UEt4siocS0vz8-K3Y"]
         2633533 | m.room.topic (1)             |                                                    | $WP0ALeVvT2oCRSt9qs4_bE2O5AKsu_4GqbTCPFwaKa4 | ["$joClbFehqiqiVOuYYDWGJeYaozJnaccZgVog8j4bB6g"]
         2633547 | m.room.member (2)            | @_adam4_whatsapp_bot:beeper-dev.com                | $hq4kZZPt4ItYdBF_I_AVITFHOZOZ2b_Av8ss3F06byg | ["$WP0ALeVvT2oCRSt9qs4_bE2O5AKsu_4GqbTCPFwaKa4"]
         2633571 | m.room.member (1)            | @adam4:beeper-dev.com                              | $DyFoSr8CbRUyexbx6_yuZxgS4S7lAEHDYgldiRGnppc | ["$hq4kZZPt4ItYdBF_I_AVITFHOZOZ2b_Av8ss3F06byg"]
-------- 2633573 (sync 2 after being joined)
         2633582 | m.room.member (2)            | @adam4:beeper-dev.com                              | $51gzIcw5d1fvbNLqYXdBwma669-H8DZlz0UzsavMBbU | ["$DyFoSr8CbRUyexbx6_yuZxgS4S7lAEHDYgldiRGnppc"]
         2633590 | m.room.member                | @_adam4_whatsapp_bot:beeper-dev.com                | $Sx9mswlxUJs1rRL4UyHARmY7V0h8_cH_c_ZDnwTn3cc | ["$51gzIcw5d1fvbNLqYXdBwma669-H8DZlz0UzsavMBbU"]
         2633601 | fi.mau.dummy.portal_created  |                                                    | $a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68 | ["$Sx9mswlxUJs1rRL4UyHARmY7V0h8_cH_c_ZDnwTn3cc"]
         2633614 | fi.mau.dummy.pre_backfill    |                                                    | $E97WApcmVmCJ6nt2WL3TnuxhyCLnPl1dt-t6p25ZLwY | ["$a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68"]
         2633633 | fi.mau.dummy.backfill_end    |                                                    | $NILDi9SfVDgaW72alIEsc5ESzdJSoLip_c7klMAeuPo | ["$E97WApcmVmCJ6nt2WL3TnuxhyCLnPl1dt-t6p25ZLwY"]
(30 rows)

@Fizzadar I'm not confident what the correct case looks like, though. Can you post the Synapse db dump and the two /sync results when it's working correctly?

Actual reproduction gist questions

I'm still fuzzy on some details:

  1. Who is the creator of the room?
    • I assume @_adam4_whatsapp_xxx:beeper-dev.com (is this the same as @_adam4_whatsapp_bot:beeper-dev.com?)
  2. Who is the bot?
    • I assume @_adam4_whatsapp_bot:beeper-dev.com
  3. Who is doing the /sync'ing?
    • I assume @adam4:beeper-dev.com
  4. Are we using the invite field with /createRoom?
  5. In the /sync responses, are we expecting the m.bridge event to be in rooms.join.<roomId>.timeline.events or rooms.join.<roomId>.state.events?

Complement test

I wrote a Complement test to try to reproduce the problem but I am seeing some differences:

  1. Does the test setup look correct to what you're doing? Test case to try to reproduce missing state in sync (possibly related to MSC2716) complement#352
  2. With the current test, in the first incremental sync, I'm not seeing the m.bridge as part of the first rooms.invite.<roomId>.invite_state.events like you do.
    • Is this necessary?
    • The spec isn't clear to me what should be included in invite_state

@Fizzadar
Contributor Author

1. Who is the creator of the room?
   * I assume `@_adam4_whatsapp_xxx:beeper-dev.com` (is this the same as `@_adam4_whatsapp_bot:beeper-dev.com`?)
2. Who is the bot?
   * I assume `@_adam4_whatsapp_bot:beeper-dev.com`
3. Who is doing the `/sync`'ing? 
   * I assume `@adam4:beeper-dev.com`

All correct.

4. Are we using the `invite` field with `/createRoom`?

Yes - for the bot user, the sync user @adam4:beeper-dev.com is invited after the room is created.

5. In the `/sync` responses, are we expecting the `m.bridge` event to be in `rooms.join.<roomId>.timeline.events` or `rooms.join.<roomId>.state.events`?

I expect to see the m.bridge state within the rooms.join.<roomid>.state.events - this is the behavior we see elsewhere when not using MSC2716.

However, this actually appears to be at odds with the spec. According to the spec I would not expect to see either the m.bridge event or $a6LZbYMvr5yYRcFc0uethSHLVC_0l7LRUr0nCZXB2lQ (which is included). I'm unsure how, if the spec is followed exactly, an invited room member is ever able to receive any room state prior to their join (assuming one should ignore invite_state). Maybe I'm misinterpreting something here?

I wrote a Complement test to try to reproduce the problem but I am seeing some differences:

1. Does the test setup look correct to what you're doing? [Test case to try to reproduce missing state in sync (possibly related to MSC2716) complement#352](https://github.com/matrix-org/complement/pull/352)

This looks correct; I think the only thing missing is the state_key for the m.bridge event, which would explain why it never appears in the sync state.

   * The spec isn't clear to me what should be included in `invite_state`

Agreed, I think it's just for display purposes, but it's not clear. What is clear is that any invite_state should not replace rooms.join.<roomid>.state.events.

@MadLittleMods
Contributor

MadLittleMods commented Mar 25, 2022

@Fizzadar Just to make sure we're not chasing after an X/Y problem, what is the actual root problem that you are dealing with? Why does it matter that m.bridge is missing from the /sync state response?

@Fizzadar Can you post the Synapse db dump and the two /sync results when it's working correctly? It would be good to compare what's completely expected for you.


The spec isn't clear to me what should be included in invite_state

Diving into this more on Synapse, it looks like invite_state depends on what you have room_prejoin_state (previously room_invite_state_types) configured as in homeserver.yaml. Can you share whether you have something configured here? Although none of the invite_state stuff probably matters.

What is clear is any invite_state should not replace rooms.join.<roomid>.state.events.

👍 For my own reference, here are the API docs that explain that:

invite_state

These events do not replace any state that the client already has for the room, for example if the client has archived the room. Instead the client should keep two separate copies of the state: the one from the invite_state and one from the archived state. If the client joins the room then the current state will be given as a delta against the archived state not the invite_state.

-- https://spec.matrix.org/v1.1/client-server-api/#get_matrixclientv3sync


I think the only thing missing is the state_key for the m.bridge event, which would explain why it never appears in the sync state.

I think it's fine with "state_key": "", which stores it as the sort-of root global one for the room. It's unclear to me why you define it as "state_key": "net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net" in your case; why should it be namespaced there?

We do the same "state_key": "", with bridged Gitter rooms for example.
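As an aside, a toy illustration (not Synapse code; names are for demonstration) of why the state_key choice matters: current room state is a map keyed by (event type, state_key), so an m.bridge event with a namespaced state_key is a distinct state entry from one with "state_key": "", and neither replaces the other.

```python
# Room state keyed by (type, state_key): events with different state_keys
# coexist as separate entries instead of overwriting each other.

state = {}

def apply_state_event(state, event):
    state[(event["type"], event["state_key"])] = event["event_id"]

apply_state_event(state, {"type": "m.bridge", "state_key": "", "event_id": "$one"})
apply_state_event(state, {
    "type": "m.bridge",
    "state_key": "net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net",
    "event_id": "$two",
})

# Both entries survive; a client looking up ("m.bridge", "") only sees $one.
assert len(state) == 2
assert state[("m.bridge", "")] == "$one"
```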

@MadLittleMods MadLittleMods added the A-Sync defects related to /sync label Mar 26, 2022
@Fizzadar
Contributor Author

Fizzadar commented Mar 26, 2022

@Fizzadar Just to make sure we're not chasing after an X/Y problem, what is the actual root problem that you are dealing with? Why does it matter that m.bridge is missing from the /sync state response?

We're using this to display the correct bridge icon in the UI which is how we discovered the issue.

We do the same "state_key": "", with bridged Gitter rooms for example.

Yep, sorry, my bad: the check is indeed against None explicitly.

@Fizzadar Can you post the Synapse db dump and the two /sync results when it's working correctly? It would be good to compare what's completely expected for you.

Absolutely, I've added the events and two syncs of a room that worked during the same bridge setup in this gist. I have also added the annotations as you did above with the sync positions and highlighting where the state came down.

This is where things get interesting - the second sync contains only events under timeline.events - including events prior to the since token! In fact it looks like every room event is sent down in the timeline.

It does feel like, when joining a room for the first time, all previous state should be included to allow clients to build a full picture of the room's state. This is indeed what appears to happen in Synapse (excluding this specific issue/situation). However, this behavior does not appear to be defined within the spec at all…

@bradtgmurray
Contributor

This feels like a hole in the spec, no? If you're joining a room for the first time, what's the expected behaviour of clients with respect to fetching the room state? It appears that clients just assume they'll get it as part of /sync (this is what we're seeing with our matrix-js-sdk fork), but that doesn't seem to be specified.

@Fizzadar
Contributor Author

Fizzadar commented Mar 28, 2022

I think I've just identified the situation in which this occurs: it appears to happen in rooms where the first joined sync is not limited (i.e. timeline.limited=False). Looking at every room/sync we have a record of in our testing, this holds true.

I was looking at the codebase, and it appears newly joined rooms should always get full state (so the limited flag should make no difference):

full_state = (
    room_builder.full_state or newly_joined or sync_result_builder.full_state
)

@Fizzadar
Contributor Author

Note: when I replay the working gist example above, it no longer works (no m.bridge state event included). Dropping the sync token back one by one: as soon as both the invite and join events are ahead of the token, things work as expected.

In fact - this appears to be the case for every room that has been backfilled using batch_send. If I replay the sync using a token between invite and join events, the state is consistently missing. Given that most of the sync requests do not show this problem in our live client<>bridge setups (on which the gists are based), this must be related to the live ongoing batch sending at the time, which explains why this only happens to some rooms.


Had some time for some in-depth investigation; I've bootstrapped a local sync worker pointing at our development database used in all these examples and tracked the problem down to the full_state flag (previous comment), specifically where newly_joined is set. The following block is executed:

if room_id in sync_result_builder.joined_room_ids or has_join:
    old_state_ids = await self.get_state_at(room_id, since_token)
    old_mem_ev_id = old_state_ids.get((EventTypes.Member, user_id), None)
    old_mem_ev = None
    if old_mem_ev_id:
        old_mem_ev = await self.store.get_event(
            old_mem_ev_id, allow_none=True
        )
    if not old_mem_ev or old_mem_ev.membership != Membership.JOIN:
        newly_joined_rooms.append(room_id)

In the failing sync from the original message in this issue (since=s2633573_17_338_6732159_1082514_541479_274711_265584_1) the old_mem_ev_id is the same as the latest membership event for the room ($51gzIcw5d1fvbNLqYXdBwma669-H8DZlz0UzsavMBbU). Thus the check on L1773 does not match when it should.

This situation seems to occur when a sync starts in between the invite event and the join event. I can reliably replicate this for other rooms by rolling back the sync token to between the two events. The problem appears to be that get_state_at(since_token) is returning state that exists after the token. Following that leads to this:

# FIXME this claims to get the state at a stream position, but
# get_recent_events_for_room operates by topo ordering. This therefore
# does not reliably give you the state at the given stream position.
# (https://github.com/matrix-org/synapse/issues/3305)
last_events, _ = await self.store.get_recent_events_for_room(
    room_id, end_token=stream_position.room_key, limit=1
)

This is where MSC2716 likely comes in: this call returns the first org.matrix.msc2716.insertion event (_mT6Awgb70IjwseWpCUeU0O4tm_iT7OBe4IPLJYUHUU), and the get_state_after_event call based on it gets back the join event $51gzIcw5d1fvbNLqYXdBwma669-H8DZlz0UzsavMBbU.

I made a small change here to handle this and the issue appears to be resolved in all my testing.
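That small change (if the "old" membership we found is the new membership change itself, still treat the room as newly joined) can be sketched roughly as follows. This is a hypothetical simplification, not the actual diff:

```python
# Hypothetical simplification of the workaround, not the actual diff:
# if the "old" membership event returned by get_state_at(since_token) is the
# very join event that just happened, the state leaked forward past the
# token, so the room should still count as newly joined.
def is_newly_joined_fixed(old_mem_ev_id, old_membership, new_join_event_id):
    if old_mem_ev_id is None or old_membership != "join":
        return True
    # Workaround: the supposed pre-token membership *is* the new join.
    return old_mem_ev_id == new_join_event_id

# Buggy case handled: get_state_at leaked the join from after the token.
assert is_newly_joined_fixed("$join", "join", "$join") is True
# A join that genuinely predates the token is still not "newly joined".
assert is_newly_joined_fixed("$old_join", "join", "$new_join") is False
```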

@MadLittleMods

MadLittleMods commented Mar 28, 2022

Great find @Fizzadar! Here are my own notes for understanding the problem as well:

Why is Synapse not sending all of the initial state when /sync?since=s123 between an invite and join event?

get_state_at(since_token) -> get_recent_events_for_room -> _paginate_room_events_txn returns the most recent events by topological ordering (the same as depth in Synapse) (DESC, highest to lowest), tie-broken by stream_ordering (DESC, highest to lowest). The events are also filtered to those at or below the stream_ordering determined by the since token.

The SQL ends up looking like:

SELECT
    event.event_id, event.instance_name,
    event.topological_ordering, event.stream_ordering
FROM events AS event
WHERE event.outlier = FALSE AND event.room_id = ? AND (2633573 >= event.stream_ordering)
ORDER BY event.topological_ordering DESC,
    event.stream_ordering DESC
LIMIT 2

For the problem /sync missing state: at this point in time while backfilling, fi.mau.dummy.portal_created ($a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68) is the most recent event by depth in the room, but it's excluded because of its stream_ordering (2633573 >= 2633601 -> False). However, the /batch_send history is inserted next to it via /batch_send?prev_event_id=$a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68, which inherits the same depth as that event, but all of those events have a negative stream_ordering, so they are included in the results as the latest events in the room below the given since stream_ordering (2633573 >= any negative number -> True). You can see why the events are returned this way if we sort our database dump the same way get_state_at(since_token) sorts:

$ psql synapse

synapse=> SELECT depth, stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
LEFT JOIN event_json AS ej USING (event_id)
WHERE events.room_id ='!SjekxtmsBigrUMcMkX:beeper-dev.com'
ORDER BY depth DESC, stream_ordering DESC;

...

And since historical messages resolve all of the state from where they were inserted, plus whatever was set in the /batch_send state_events_at_start, the resolved state includes the m.room.member join event for the syncing user, which is what confuses the old_mem_ev logic.

The problem will occur for anyone who joins the room before the /batch_send and hasn't synced before that to get the initial state.
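The mis-selection can be reproduced with plain Python sorting over simplified tuples, using hypothetical values modelled on the problem case:

```python
# Simplified (depth, stream_ordering, description) tuples modelled on the
# problem case: a /sync token that falls between the invite and the join.
events = [
    (9, 929, "m.room.member invite"),
    (10, 941, "m.room.member join"),
    (11, 946, "m.room.message"),
    (11, -139028, "org.matrix.msc2716.insertion"),
]
since_stream_ordering = 935  # token between invite (929) and join (941)

# The stream filter drops the join and the later message...
candidates = [e for e in events if e[1] <= since_stream_ordering]
# ...but _paginate_room_events_txn then sorts what's left topologically:
# depth DESC, tie-broken by stream_ordering DESC.
candidates.sort(key=lambda e: (e[0], e[1]), reverse=True)

# The backfilled insertion event (depth 11, negative stream_ordering) beats
# the invite (depth 9), so the state resolved from it includes the join.
assert candidates[0][2] == "org.matrix.msc2716.insertion"
```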

Is this unique to MSC2716?

This problem doesn't seem unique to MSC2716, though. It could happen with federation-backfilled messages as well, since any message backfilled over federation also gets a negative stream_ordering (note that this is federation backfill, which is different from the "backfill" you're doing to fill out the room history with MSC2716).

And incremental sync is already known to have problems in this area: #3305

Normal room where it works ✅

Reproduction

Reproduction steps
  1. Add an application service to your Synapse instance
  2. With the application service as_token, create a room and invite a normal user
    POST http://localhost:8008/_matrix/client/v3/createRoom
    {
        "preset":       "public_chat",
        "name":         "test-synapse-12281-v0.0.1",
        "room_version": "9",
        "invite": ["@ericgittertester:my.synapse.server"],
        "initial_state": [
            {
                "type":      "m.other.state",
                "state_key": "",
                "content": {
                    "foo": "bar"
                }
            }
        ]
    }
  3. In another room, send some messages to advance the stream_ordering in your Synapse instance. This will add some space for us to make stream tokens in between and use with /sync?since=abc
  4. Join the room as the normal user
  5. Dump the database, to easily see the gap in stream_ordering between the m.room.member invite and join events.
    $ psql synapse
    
    SELECT depth, stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
    LEFT JOIN event_json AS ej USING (event_id)
    WHERE events.room_id ='!VpJGWzRaYnBJHwCaWd:my.synapse.server' ORDER BY stream_ordering ASC;
  6. As the normal user, sync in the problem gap spot, GET http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s877_205797_22_82_396_8_19_160_1
  7. Notice that rooms.join.<roomId>.timeline.events includes all initial state, including the m.other.state from the initial_state during room creation.

Notes

Here is an overview of the events in the room:

$ psql synapse

SELECT depth, stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
LEFT JOIN event_json AS ej USING (event_id)
WHERE events.room_id ='!VpJGWzRaYnBJHwCaWd:my.synapse.server' ORDER BY stream_ordering ASC;


depth | stream_ordering |           type            |              state_key              |                   event_id                   |                   prev_events
-------+-----------------+---------------------------+-------------------------------------+----------------------------------------------+--------------------------------------------------
     1 |             863 | m.room.create             |                                     | $i0U5s25xbRoLLqancRLFOUcVEaLwNjZSzvvT5bU4_sw | []
     2 |             864 | m.room.member             | @erictroupetester:my.synapse.server | $RQXO-19yjf_zprhDkN4CFjj76zNyLYVwHSJRgGAarrE | ["$i0U5s25xbRoLLqancRLFOUcVEaLwNjZSzvvT5bU4_sw"]
     3 |             865 | m.room.power_levels       |                                     | $vxSLpdP0dYGLfY4SF5ajpzm4HIGBSl8ShXFngP6dwNc | ["$RQXO-19yjf_zprhDkN4CFjj76zNyLYVwHSJRgGAarrE"]
     4 |             866 | m.room.join_rules         |                                     | $Nzph2kpEaIG2FBl-1_N7Yu7Wq7q-0pLG1o-sjPOqYL0 | ["$vxSLpdP0dYGLfY4SF5ajpzm4HIGBSl8ShXFngP6dwNc"]
     5 |             867 | m.room.history_visibility |                                     | $4Hqnlf9LuYYN28G6yjFz50WKJEDI8m5vOCirNAK4vcU | ["$Nzph2kpEaIG2FBl-1_N7Yu7Wq7q-0pLG1o-sjPOqYL0"]
     6 |             868 | m.bridge                  |                                     | $EOa43fMXoJHizHb_s7VwLhRdLs6FMqjIxpGMhDC7tjo | ["$4Hqnlf9LuYYN28G6yjFz50WKJEDI8m5vOCirNAK4vcU"]
     7 |             869 | m.other.state             |                                     | $6T8IYDLBdLGfYcrHW3Ntj9XL1WCJAN7u670hNgDhz2A | ["$EOa43fMXoJHizHb_s7VwLhRdLs6FMqjIxpGMhDC7tjo"]
     8 |             870 | m.room.name               |                                     | $YCc5eyjIjx_qNQPtsnklJ7prZmhRluimQ246ytO9B6s | ["$6T8IYDLBdLGfYcrHW3Ntj9XL1WCJAN7u670hNgDhz2A"]
     9 |             871 | m.room.member             | @ericgittertester:my.synapse.server | $us9V0eqQbu_AWtxyquIDn-QFVdRL0KBnZE_AnXjefSw | ["$YCc5eyjIjx_qNQPtsnklJ7prZmhRluimQ246ytO9B6s"]
    10 |             880 | m.room.member             | @ericgittertester:my.synapse.server | $Usp1bf1j4SLhZLQ1IvGFhr2o-CyBjtISkBTOEnyf7z8 | ["$us9V0eqQbu_AWtxyquIDn-QFVdRL0KBnZE_AnXjefSw"]
    11 |             889 | m.room.message            |                                     | $Pzm2WJDUBhja2cwjEXR_suCAsd_SsV8pGdIQZ1ljXQQ | ["$Usp1bf1j4SLhZLQ1IvGFhr2o-CyBjtISkBTOEnyf7z8"]
(11 rows)

/sync?since=s877_xxx (anything between the invite 871 and the join 880)

Using the get_state_at(since_token) sort (ORDER BY depth DESC, stream_ordering DESC;), you can see that the m.room.member invite is the most recent matching event in the room and will resolve to just the invite state, since the join happens later and is excluded because its stream_ordering is too high (see the filter note below).

  • The - red messages are filtered out because of the WHERE 877 >= event.stream_ordering in the SQL.
  • The green + is the selected event and is correct as expected
  $ psql synapse
  
  SELECT depth, stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
  LEFT JOIN event_json AS ej USING (event_id)
  WHERE events.room_id ='!VpJGWzRaYnBJHwCaWd:my.synapse.server'
  ORDER BY depth DESC, stream_ordering DESC;
  
  
  depth | stream_ordering |           type            |              state_key              |                   event_id                   |                   prev_events
  -------+-----------------+---------------------------+-------------------------------------+----------------------------------------------+--------------------------------------------------
-     11 |             889 | m.room.message            |                                     | $Pzm2WJDUBhja2cwjEXR_suCAsd_SsV8pGdIQZ1ljXQQ | ["$Usp1bf1j4SLhZLQ1IvGFhr2o-CyBjtISkBTOEnyf7z8"]
-     10 |             880 | m.room.member             | @ericgittertester:my.synapse.server | $Usp1bf1j4SLhZLQ1IvGFhr2o-CyBjtISkBTOEnyf7z8 | ["$us9V0eqQbu_AWtxyquIDn-QFVdRL0KBnZE_AnXjefSw"]
+      9 |             871 | m.room.member             | @ericgittertester:my.synapse.server | $us9V0eqQbu_AWtxyquIDn-QFVdRL0KBnZE_AnXjefSw | ["$YCc5eyjIjx_qNQPtsnklJ7prZmhRluimQ246ytO9B6s"]
       8 |             870 | m.room.name               |                                     | $YCc5eyjIjx_qNQPtsnklJ7prZmhRluimQ246ytO9B6s | ["$6T8IYDLBdLGfYcrHW3Ntj9XL1WCJAN7u670hNgDhz2A"]
       7 |             869 | m.other.state             |                                     | $6T8IYDLBdLGfYcrHW3Ntj9XL1WCJAN7u670hNgDhz2A | ["$EOa43fMXoJHizHb_s7VwLhRdLs6FMqjIxpGMhDC7tjo"]
       6 |             868 | m.bridge                  |                                     | $EOa43fMXoJHizHb_s7VwLhRdLs6FMqjIxpGMhDC7tjo | ["$4Hqnlf9LuYYN28G6yjFz50WKJEDI8m5vOCirNAK4vcU"]
       5 |             867 | m.room.history_visibility |                                     | $4Hqnlf9LuYYN28G6yjFz50WKJEDI8m5vOCirNAK4vcU | ["$Nzph2kpEaIG2FBl-1_N7Yu7Wq7q-0pLG1o-sjPOqYL0"]
       4 |             866 | m.room.join_rules         |                                     | $Nzph2kpEaIG2FBl-1_N7Yu7Wq7q-0pLG1o-sjPOqYL0 | ["$vxSLpdP0dYGLfY4SF5ajpzm4HIGBSl8ShXFngP6dwNc"]
       3 |             865 | m.room.power_levels       |                                     | $vxSLpdP0dYGLfY4SF5ajpzm4HIGBSl8ShXFngP6dwNc | ["$RQXO-19yjf_zprhDkN4CFjj76zNyLYVwHSJRgGAarrE"]
       2 |             864 | m.room.member             | @erictroupetester:my.synapse.server | $RQXO-19yjf_zprhDkN4CFjj76zNyLYVwHSJRgGAarrE | ["$i0U5s25xbRoLLqancRLFOUcVEaLwNjZSzvvT5bU4_sw"]
       1 |             863 | m.room.create             |                                     | $i0U5s25xbRoLLqancRLFOUcVEaLwNjZSzvvT5bU4_sw | []
  (11 rows)

Problem case where it fails ❌

Reproduction

Reproduction steps
  1. Add an application service to your Synapse instance
  2. With the application service as_token, create a room and invite a normal user
    POST http://localhost:8008/_matrix/client/v3/createRoom
    {
        "preset":       "public_chat",
        "name":         "test-synapse-12281-v0.0.1",
        "room_version": "9",
        "invite": ["@ericgittertester:my.synapse.server"],
        "initial_state": [
            {
                "type":      "m.other.state",
                "state_key": "",
                "content": {
                    "foo": "bar"
                }
            }
        ]
    }
  3. In another room, send some messages to advance the stream_ordering in your Synapse instance. This will add some space for us to make stream tokens in between and use with /sync?since=abc
  4. Join the room as the normal user
  5. As the application service, send a message in the room which we will insert our history next to
    PUT http://localhost:8008/_matrix/client/r0/rooms/!oFSfJJEqEAtDvOnelc:my.synapse.server/send/m.room.message/123transaction
    {
        "msgtype": "m.text",
        "body": "portal_created"
    }
    
  6. Use MSC2716 to insert a batch of historical messages
    POST http://localhost:8008/_matrix/client/unstable/org.matrix.msc2716/rooms/!oFSfJJEqEAtDvOnelc:my.synapse.server/batch_send?prev_event_id=%24BhECpS4WhzG8H3QbCohaiQeBVoyK4FmBS2XYrFMkQ0c
    {
      "state_events_at_start": [
        {
          "content": {
            "displayname": "some-display-name-for-@maria-01234:my.synapse.server",
            "membership": "join"
          },
          "origin_server_ts": 1642055555924,
          "sender": "@maria-01234:my.synapse.server",
          "state_key": "@maria-01234:my.synapse.server",
          "type": "m.room.member"
        }
      ],
      "events": [
        {
          "content": {
            "body": "Historical 0 (batch=0)",
            "msgtype": "m.text",
            "org.matrix.msc2716.historical": true
          },
          "origin_server_ts": 1642055555924,
          "sender": "@maria-01234:my.synapse.server",
          "type": "m.room.message"
        },
        {
          "content": {
            "body": "Historical 1 (batch=0)",
            "msgtype": "m.text",
            "org.matrix.msc2716.historical": true
          },
          "origin_server_ts": 1642055555925,
          "sender": "@maria-01234:my.synapse.server",
          "type": "m.room.message"
        },
        {
          "content": {
            "body": "Historical 2 (batch=0)",
            "msgtype": "m.text",
            "org.matrix.msc2716.historical": true
          },
          "origin_server_ts": 1642055555926,
          "sender": "@maria-01234:my.synapse.server",
          "type": "m.room.message"
        },
        {
          "content": {
            "body": "Historical 3 (batch=0)",
            "msgtype": "m.text",
            "org.matrix.msc2716.historical": true
          },
          "origin_server_ts": 1642055555927,
          "sender": "@maria-01234:my.synapse.server",
          "type": "m.room.message"
        }
      ]
    }
  7. Dump the database, to easily see the gap in stream_ordering between the m.room.member invite and join events.
    $ psql synapse
    
    SELECT depth, stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
    LEFT JOIN event_json AS ej USING (event_id)
    WHERE events.room_id ='!oFSfJJEqEAtDvOnelc:my.synapse.server' ORDER BY stream_ordering ASC;
  8. As the normal user, sync in the problem gap spot, GET http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s935_205797_22_82_396_8_19_160_1
  9. Notice that rooms.join.<roomId>.timeline.events and rooms.join.<roomId>.state.events do not include the m.other.state from the initial_state during room creation.

Notes

Here is an overview of the events in the room:

$ psql synapse

SELECT depth, stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
LEFT JOIN event_json AS ej USING (event_id)
WHERE events.room_id ='!oFSfJJEqEAtDvOnelc:my.synapse.server' ORDER BY stream_ordering ASC;

depth | stream_ordering |             type             |              state_key              |                   event_id                   |                   prev_events
-------+-----------------+------------------------------+-------------------------------------+----------------------------------------------+--------------------------------------------------
    11 |         -139034 | org.matrix.msc2716.insertion |                                     | $XwhXMeXCey1qng3_K2R-YhRwkYwyAo5CYnop1A-NiBA | []
    11 |         -139033 | m.room.message               |                                     | $-MW3opp5jrUNMZb82JizCmIKEHCJOImQZxJxWJiZdro | ["$XwhXMeXCey1qng3_K2R-YhRwkYwyAo5CYnop1A-NiBA"]
    11 |         -139032 | m.room.message               |                                     | $d8CvNegAiiFG8N-VlsSk-v3kI2MvCa8HZ2L7MC5WZ_8 | ["$-MW3opp5jrUNMZb82JizCmIKEHCJOImQZxJxWJiZdro"]
    11 |         -139031 | m.room.message               |                                     | $U20bBC5b0ItFzatm5eyQ2RXerkXJg7KFF7cfLvBACsw | ["$d8CvNegAiiFG8N-VlsSk-v3kI2MvCa8HZ2L7MC5WZ_8"]
    11 |         -139030 | m.room.message               |                                     | $NOozHNdRkxA5swkSa72gm0bA6ugPZc65AuE9M0IpMu0 | ["$U20bBC5b0ItFzatm5eyQ2RXerkXJg7KFF7cfLvBACsw"]
    11 |         -139029 | org.matrix.msc2716.batch     |                                     | $aDLWvwgiz_SIh4BVGs9ng1kst5d5f651GjWQ0Se5s-s | ["$NOozHNdRkxA5swkSa72gm0bA6ugPZc65AuE9M0IpMu0"]
    11 |         -139028 | org.matrix.msc2716.insertion |                                     | $4wxqppnSoQzMnm-oFmj_vfwDoRfVTkyU1xluSvN_-8I | ["$BhECpS4WhzG8H3QbCohaiQeBVoyK4FmBS2XYrFMkQ0c"]
     1 |         -139027 | m.room.member                | @maria-01234:my.synapse.server      | $2nAzxqa9gi6qQh_9z0jEZZLz6sTb4TGCo38dIHkBQm8 | []
     1 |             921 | m.room.create                |                                     | $p5PhsW7K1yP2ga4WMgsCk3vsmXMAhc4UpcUcsKOaZM0 | []
     2 |             922 | m.room.member                | @gitter-badger:my.synapse.server    | $NccDrVQCA1guredN-MUXT7Gf8m2n1_GP2t_fSMoveBU | ["$p5PhsW7K1yP2ga4WMgsCk3vsmXMAhc4UpcUcsKOaZM0"]
     3 |             923 | m.room.power_levels          |                                     | $hH_UQ3Qv8ZFHWg_Z5ge_nI5E35TKHL0JzYEP1dT6K-c | ["$NccDrVQCA1guredN-MUXT7Gf8m2n1_GP2t_fSMoveBU"]
     4 |             924 | m.room.join_rules            |                                     | $QPI5nUX19j4A09W8cszE4smeUESGZGM6KcHh8rzetCQ | ["$hH_UQ3Qv8ZFHWg_Z5ge_nI5E35TKHL0JzYEP1dT6K-c"]
     5 |             925 | m.room.history_visibility    |                                     | $lNHOvkQDv_0Fvh4smQNy6DrA6hkSfy_cEIJAHgamv80 | ["$QPI5nUX19j4A09W8cszE4smeUESGZGM6KcHh8rzetCQ"]
     6 |             926 | m.bridge                     |                                     | $3-VGN8ysxQilX1jq9KwM6mewL698Hytqwf6txGYDIVw | ["$lNHOvkQDv_0Fvh4smQNy6DrA6hkSfy_cEIJAHgamv80"]
     7 |             927 | m.other.state                |                                     | $9rgvaPPXxXfdwy1yEOaakCNiOfEpFbAm5eOq5j3GNkw | ["$3-VGN8ysxQilX1jq9KwM6mewL698Hytqwf6txGYDIVw"]
     8 |             928 | m.room.name                  |                                     | $tPjkQBhaRo3HkRruQQauSwExXXceQ0n5aDvmqWujXg4 | ["$9rgvaPPXxXfdwy1yEOaakCNiOfEpFbAm5eOq5j3GNkw"]
     9 |             929 | m.room.member                | @ericgittertester:my.synapse.server | $Rz0OJ7OWIEycom0xAe9Ky3mCWSeL1vQ4QiUy_9YpPmc | ["$tPjkQBhaRo3HkRruQQauSwExXXceQ0n5aDvmqWujXg4"]
    10 |             941 | m.room.member                | @ericgittertester:my.synapse.server | $RCs3RxSBLgmnxbpx-kf8m-Zy58_X68Y9FTvvuX3lbgI | ["$Rz0OJ7OWIEycom0xAe9Ky3mCWSeL1vQ4QiUy_9YpPmc"]
    11 |             946 | m.room.message               |                                     | $BhECpS4WhzG8H3QbCohaiQeBVoyK4FmBS2XYrFMkQ0c | ["$RCs3RxSBLgmnxbpx-kf8m-Zy58_X68Y9FTvvuX3lbgI"]
(19 rows)

/sync?since=s935_xxx (anything between the invite 929 and the join 941)

Using the get_state_at(since_token) sort (ORDER BY depth DESC, stream_ordering DESC;), you can see that the historical messages sort in front of everything else and match the query.

  • The - red messages are filtered out because of the WHERE 935 >= event.stream_ordering in the SQL.
  • +❌ is what the SQL query selects but it's wrong
  • +✅ is what the SQL query should have selected but it didn't
  $ psql synapse

  SELECT depth, stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
  LEFT JOIN event_json AS ej USING (event_id)
  WHERE events.room_id ='!oFSfJJEqEAtDvOnelc:my.synapse.server'
  ORDER BY depth DESC, stream_ordering DESC;
  
  depth | stream_ordering |             type             |              state_key              |                   event_id                   |                   prev_events
  -------+-----------------+------------------------------+-------------------------------------+----------------------------------------------+--------------------------------------------------
-     11 |             946 | m.room.message               |                                     | $BhECpS4WhzG8H3QbCohaiQeBVoyK4FmBS2XYrFMkQ0c | ["$RCs3RxSBLgmnxbpx-kf8m-Zy58_X68Y9FTvvuX3lbgI"]
+❌    11 |         -139028 | org.matrix.msc2716.insertion |                                     | $4wxqppnSoQzMnm-oFmj_vfwDoRfVTkyU1xluSvN_-8I | ["$BhECpS4WhzG8H3QbCohaiQeBVoyK4FmBS2XYrFMkQ0c"]
      11 |         -139029 | org.matrix.msc2716.batch     |                                     | $aDLWvwgiz_SIh4BVGs9ng1kst5d5f651GjWQ0Se5s-s | ["$NOozHNdRkxA5swkSa72gm0bA6ugPZc65AuE9M0IpMu0"]
      11 |         -139030 | m.room.message               |                                     | $NOozHNdRkxA5swkSa72gm0bA6ugPZc65AuE9M0IpMu0 | ["$U20bBC5b0ItFzatm5eyQ2RXerkXJg7KFF7cfLvBACsw"]
      11 |         -139031 | m.room.message               |                                     | $U20bBC5b0ItFzatm5eyQ2RXerkXJg7KFF7cfLvBACsw | ["$d8CvNegAiiFG8N-VlsSk-v3kI2MvCa8HZ2L7MC5WZ_8"]
      11 |         -139032 | m.room.message               |                                     | $d8CvNegAiiFG8N-VlsSk-v3kI2MvCa8HZ2L7MC5WZ_8 | ["$-MW3opp5jrUNMZb82JizCmIKEHCJOImQZxJxWJiZdro"]
      11 |         -139033 | m.room.message               |                                     | $-MW3opp5jrUNMZb82JizCmIKEHCJOImQZxJxWJiZdro | ["$XwhXMeXCey1qng3_K2R-YhRwkYwyAo5CYnop1A-NiBA"]
      11 |         -139034 | org.matrix.msc2716.insertion |                                     | $XwhXMeXCey1qng3_K2R-YhRwkYwyAo5CYnop1A-NiBA | []
-     10 |             941 | m.room.member                | @ericgittertester:my.synapse.server | $RCs3RxSBLgmnxbpx-kf8m-Zy58_X68Y9FTvvuX3lbgI | ["$Rz0OJ7OWIEycom0xAe9Ky3mCWSeL1vQ4QiUy_9YpPmc"]
+✅     9 |             929 | m.room.member                | @ericgittertester:my.synapse.server | $Rz0OJ7OWIEycom0xAe9Ky3mCWSeL1vQ4QiUy_9YpPmc | ["$tPjkQBhaRo3HkRruQQauSwExXXceQ0n5aDvmqWujXg4"]
       8 |             928 | m.room.name                  |                                     | $tPjkQBhaRo3HkRruQQauSwExXXceQ0n5aDvmqWujXg4 | ["$9rgvaPPXxXfdwy1yEOaakCNiOfEpFbAm5eOq5j3GNkw"]
       7 |             927 | m.other.state                |                                     | $9rgvaPPXxXfdwy1yEOaakCNiOfEpFbAm5eOq5j3GNkw | ["$3-VGN8ysxQilX1jq9KwM6mewL698Hytqwf6txGYDIVw"]
       6 |             926 | m.bridge                     |                                     | $3-VGN8ysxQilX1jq9KwM6mewL698Hytqwf6txGYDIVw | ["$lNHOvkQDv_0Fvh4smQNy6DrA6hkSfy_cEIJAHgamv80"]
       5 |             925 | m.room.history_visibility    |                                     | $lNHOvkQDv_0Fvh4smQNy6DrA6hkSfy_cEIJAHgamv80 | ["$QPI5nUX19j4A09W8cszE4smeUESGZGM6KcHh8rzetCQ"]
       4 |             924 | m.room.join_rules            |                                     | $QPI5nUX19j4A09W8cszE4smeUESGZGM6KcHh8rzetCQ | ["$hH_UQ3Qv8ZFHWg_Z5ge_nI5E35TKHL0JzYEP1dT6K-c"]
       3 |             923 | m.room.power_levels          |                                     | $hH_UQ3Qv8ZFHWg_Z5ge_nI5E35TKHL0JzYEP1dT6K-c | ["$NccDrVQCA1guredN-MUXT7Gf8m2n1_GP2t_fSMoveBU"]
       2 |             922 | m.room.member                | @gitter-badger:my.synapse.server    | $NccDrVQCA1guredN-MUXT7Gf8m2n1_GP2t_fSMoveBU | ["$p5PhsW7K1yP2ga4WMgsCk3vsmXMAhc4UpcUcsKOaZM0"]
       1 |             921 | m.room.create                |                                     | $p5PhsW7K1yP2ga4WMgsCk3vsmXMAhc4UpcUcsKOaZM0 | []
       1 |         -139027 | m.room.member                | @maria-01234:my.synapse.server      | $2nAzxqa9gi6qQh_9z0jEZZLz6sTb4TGCo38dIHkBQm8 | []
  (19 rows)

Potential solutions

  1. The quick fix @Fizzadar came up with: when the "old" membership event we found is the same as the new membership change, still treat the room as newly joined (because it is): Fizzadar@a9a6ed9
  2. Correct get_state_at(since_token) to actually query and sort by stream_ordering so it grabs the correct events in the first place
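Option 2 can be sketched in plain Python (a hypothetical simplification with illustrative event dicts, not the actual Synapse query): pick the latest event at or before the since token by stream_ordering alone, ignoring topological ordering, so backfilled events with negative stream_ordering can no longer win.

```python
# Hypothetical sketch of option 2: select the newest event at or before the
# since token purely by stream_ordering, not topological (depth) ordering.
def last_event_before(events, since_stream_ordering):
    eligible = [e for e in events if e["stream_ordering"] <= since_stream_ordering]
    return max(eligible, key=lambda e: e["stream_ordering"], default=None)

events = [
    {"event_id": "$invite", "stream_ordering": 929},
    {"event_id": "$join", "stream_ordering": 941},
    {"event_id": "$insertion", "stream_ordering": -139028},
]
# With the token between invite (929) and join (941), the invite is chosen,
# even though the backfilled insertion event would win on depth.
assert last_event_before(events, 935)["event_id"] == "$invite"
```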

@MadLittleMods MadLittleMods added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. S-Minor Blocks non-critical functionality, workarounds exist. labels Mar 28, 2022
@MadLittleMods

MadLittleMods commented Mar 28, 2022

@Fizzadar Make a PR with your fix so we can get some more feedback from maintainers. It would be good to add a test for this so it doesn't regress either (see the other MSC2716 /batch_send tests for reference). I think it would be possible to write a Complement test instead, but since the problem is so closely tied to stream_ordering, it would probably be best to look up and manipulate it from a trial unit test.

Any interest in taking it further to fix get_state_at(since_token)?

@Fizzadar

@MadLittleMods great write-up, thank you for pulling that all together!

I've gone ahead and implemented the get_state_at change in #12319. I don't think the other change is worth merging as it's a workaround/hack for the real bug here, but happy to PR it if there's still value in it?

erikjohnston pushed a commit that referenced this issue Apr 13, 2022
Discovered after much in-depth investigation in #12281.

Closes: #12281
Closes: #3305

Signed off by: Nick Mills-Barrett nick@beeper.com