
Synapse doesn't send out device list updates to previously unseen homeservers when joining a room #11374

Open
Tracked by #2411
matrixbot opened this issue Dec 19, 2023 · 5 comments

Comments

matrixbot (Collaborator) commented Dec 19, 2023

This issue has been migrated from #11374.


The spec says that:

Servers must send m.device_list_update EDUs to all the servers who share a room with a given local user, and must be sent whenever that user’s device list changes (i.e. for new or deleted devices, when that user joins a room which contains servers which are not already receiving updates for that user’s device list, or changes in device information such as the device’s human-readable name).

To be clear, this means that when a local user joins a room, we should:

  1. Check if any of the other homeservers in the room are new to us (their users don't share any other rooms with us)
  2. Send device list updates of the joining user to those homeservers.

It doesn't appear that Synapse actually does this anywhere, currently.

We also need to do this for presence (matrix-org/synapse#8956), but the current presence-related TODO in the code may be a good inspiration for what a device-list-related implementation would look like:

https://github.com/matrix-org/synapse/blob/75ca0a6168f92dab3255839cf85fb0df3a0076c3/synapse/handlers/presence.py#L1368-L1379
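For illustration only, here is a minimal sketch of the missing step, modelled loosely on the presence TODO linked above. All helper names (get_current_hosts_in_room, get_hosts_already_sharing_room_with_user, send_device_list_update) are hypothetical placeholders, not actual Synapse APIs:

```python
# Sketch only: every helper here is a hypothetical placeholder, not a real
# Synapse API. The idea is to run this whenever a local user joins a room.

async def notify_new_hosts_of_device_list(store, federation_sender, user_id, room_id):
    # Every remote homeserver currently in the room the user just joined.
    hosts_in_room = await store.get_current_hosts_in_room(room_id)

    # Remote homeservers that already share some other room with this user,
    # and are therefore already receiving device list updates for them.
    already_tracking = await store.get_hosts_already_sharing_room_with_user(user_id)

    # Hosts that have never seen this user before need a device list update.
    new_hosts = set(hosts_in_room) - set(already_tracking)

    for host in new_hosts:
        # Queue an m.device_list_update EDU covering all of the user's devices.
        await federation_sender.send_device_list_update(host, user_id)
```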

matrixbot changed the title from "Dummy issue" to "Synapse doesn't send out device list updates to previously unseen homeservers when joining a room" Dec 21, 2023
matrixbot reopened this Dec 21, 2023
richvdh (Member) commented Jan 12, 2024

@kegsay questions whether this is still true, and will test

kegsay self-assigned this Jan 12, 2024
kegsay (Contributor) commented Jan 24, 2024

I can unfortunately confirm that Synapse does not send an m.device_list_update EDU over federation to the servers already in the room when a local user joins it. At best, this is a spec violation. At worst, this could cause device lists to not sync correctly.

Next I'm trying to see what the impact of this is on a real homeserver, rather than a test federation server. For example, it could be that (sketched in the snippet after this list):

  • The federated server sees the join event.
  • When a user on the federated server syncs, it puts the joined user in device_lists.changed.
  • The user sees this and hits /keys/query.
  • The server hits the origin server to satisfy this query.
  • This then synchronises the device list correctly.
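A rough illustration of that lazy path at the Client-Server API level; the endpoints and response fields are from the spec, while the base URL, access token and since-token are placeholders:

```python
# Rough sketch of the lazy recovery path: Bob's client notices Alice under
# device_lists.changed in /sync and calls /keys/query, which forces Bob's
# server to fetch Alice's devices from her homeserver.
import requests

BASE = "https://bob-homeserver.example"  # placeholder homeserver
headers = {"Authorization": "Bearer ..."}  # placeholder access token

# 1. Incremental sync: Alice shows up under device_lists.changed.
sync = requests.get(
    f"{BASE}/_matrix/client/v3/sync",
    headers=headers,
    params={"since": "s_prev_batch"},  # placeholder since-token
).json()
changed_users = sync.get("device_lists", {}).get("changed", [])

# 2. Query keys for the changed users; Bob's server has nothing cached,
#    so it must contact Alice's server to answer this request.
if changed_users:
    keys = requests.post(
        f"{BASE}/_matrix/client/v3/keys/query",
        headers=headers,
        json={"device_keys": {user_id: [] for user_id in changed_users}},
    ).json()
```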

If this is happening, then aside from it being a bit messy, I think this isn't a UTD cause in the general case. It would be in the edge case where the origin server is dead/offline by the time the remote server hits the origin though...

Either way, writing more tests to confirm what is happening here...

kegsay (Contributor) commented Jan 24, 2024

...which is exactly what is happening here. Added another Complement test to assert this.

So what does this mean?

  • Synapse (and Dendrite!) are not spec compliant here.
  • The EDU is not eagerly sent to the other server, but this is okay because we fetch the device keys when the receiver asks for them via /keys/query. This is why this isn't a big problem in the wild.
  • But... this assumes the server is reachable then. It may not be, in which case we will be unable to secure an Olm session for that user, causing a UTD.

In general, I would advocate for eager sending of necessary data (see MSC4081 for reasoned arguments to this effect).

It's hard to say how frequently this would cause UTDs. The type of UTD would be m.no_olm, but existing telemetry via m.room_key.withheld is insufficient here, as we don't know the device ID to send the withheld message to! This would eventually fix itself when the sender manages to /keys/query and then /keys/claim for each device (a rough sketch of that recovery path follows below).
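For context, the sender-side recovery path boils down to querying devices and then claiming one-time keys so Olm sessions can finally be established. A sketch against the Client-Server API, with the base URL, token and user/device IDs as placeholders:

```python
# Rough sketch of the sender-side recovery: once /keys/query and /keys/claim
# succeed against the recipient's devices, Olm sessions can be set up and the
# m.no_olm UTDs stop. URLs, token and IDs are placeholders.
import requests

BASE = "https://alice-homeserver.example"  # placeholder homeserver
headers = {"Authorization": "Bearer ..."}  # placeholder access token

# 1. Discover the recipient's devices and identity keys.
query = requests.post(
    f"{BASE}/_matrix/client/v3/keys/query",
    headers=headers,
    json={"device_keys": {"@bob:example.org": []}},
).json()
bob_devices = query.get("device_keys", {}).get("@bob:example.org", {})

# 2. Claim a one-time key for each device so an Olm session can be created.
claim = requests.post(
    f"{BASE}/_matrix/client/v3/keys/claim",
    headers=headers,
    json={
        "one_time_keys": {
            "@bob:example.org": {d: "signed_curve25519" for d in bob_devices}
        }
    },
).json()
```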

kegsay (Contributor) commented Jan 24, 2024

The current behaviour (Alice and Bob on different homeservers):

  • Alice joins a room with Bob. Alice and Bob have previously never shared a room.
  • Alice's server does not send an m.device_list_update EDU to Bob's server.
  • Bob syncs.
  • Bob's server puts Alice's user ID in device_lists.changed in the /sync response.
  • Bob's client hits /keys/query as a result, asking about Alice's user ID.
  • Bob's server then asks Alice's server about Alice's devices and device keys, returning the response to Bob's client.
  • Bob's client now knows the keys and devices for Alice, so all is well.

Versus what the specification says:

  • Alice joins a room with Bob. Alice and Bob have previously never shared a room.
  • Alice's server sends an m.device_list_update EDU to Bob's server.
  • Bob syncs.
  • Bob's server puts Alice's user ID in device_lists.changed in the /sync response.
  • Bob's client hits /keys/query as a result, asking about Alice's user ID.
  • Bob's server has a cached copy it can return in case Alice's server is down.
  • Bob's server then asks Alice's server about Alice's devices and device keys, returning the response to Bob's client.
  • Bob's client now knows the keys and devices for Alice, so all is well.

Things break currently when:

  • Alice joins a room with Bob. Alice and Bob have previously never shared a room.
  • Bob does not sync because he is offline.
  • Alice's server goes down / is network partitioned.
  • Bob comes online and syncs.

In this scenario, Synapse does not have cached keys to give to Bob, so a UTD is inevitable. If keys were cached, there would be no UTD. In other words, the specification solution is more robust to network failures than Synapse is currently.
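To make the robustness argument concrete, here is a sketch of how an eagerly received m.device_list_update EDU would let the receiving server answer /keys/query from cache even if the origin later goes down. The handler and store names are hypothetical, not actual Synapse internals; the EDU fields (user_id, device_id, keys) are as per the spec:

```python
# Sketch only: handler and store method names are hypothetical placeholders,
# not real Synapse internals.

async def on_device_list_update_edu(store, origin, content):
    # Cache whatever the EDU tells us about this device so that later
    # /keys/query requests can be served without contacting the origin.
    if "keys" in content:
        await store.cache_remote_device_keys(
            content["user_id"], content["device_id"], content["keys"]
        )

async def answer_local_keys_query(store, federation_client, user_id):
    cached = await store.get_cached_remote_device_keys(user_id)
    if cached:
        # Spec behaviour: Alice's server being down no longer matters.
        return cached
    # Current Synapse behaviour: nothing cached, so we must hit the origin;
    # if it is unreachable, Bob's client gets nothing and a UTD follows.
    return await federation_client.query_user_devices(user_id)
```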

kegsay (Contributor) commented Sep 6, 2024

Simplified Sliding Sync currently exacerbates this because it seems not to return data to the client immediately. Because of this, the /keys/query request will be delayed. This is important because that request clears backoff timers on the other HS, and until that is done, any /keys/claim requests will immediately fail with failures={"hs2": Object {"status": Number(503), "message": String("Not ready for retry")}}.

See matrix-org/complement-crypto#129 for a test which failed in SSS due to this.
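To illustrate the failure mode: /keys/claim does not fail at the HTTP level; the per-server failures map has to be inspected and the request retried once the backoff has cleared. A rough sketch, with the base URL, token and IDs as placeholders:

```python
# Rough sketch: detect the per-server failure returned while the remote
# homeserver is still backed off, and retry with increasing delay.
import time
import requests

BASE = "https://hs1.example"  # placeholder homeserver
headers = {"Authorization": "Bearer ..."}  # placeholder access token
body = {"one_time_keys": {"@alice:hs2": {"DEVICEID": "signed_curve25519"}}}

for attempt in range(5):
    resp = requests.post(
        f"{BASE}/_matrix/client/v3/keys/claim", headers=headers, json=body
    ).json()
    if "hs2" not in resp.get("failures", {}):
        break  # one-time key claimed successfully
    # While hs2 is still backed off, the response contains something like:
    # failures={"hs2": {"status": 503, "message": "Not ready for retry"}}
    time.sleep(2 ** attempt)
```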
