
Epic: One-Time Keys can desync between client and server causing UTDs #2406

Open
kegsay opened this issue Apr 25, 2024 · 7 comments
Labels
Team: Crypto, Z-UISI (Unable to decrypt errors)

Comments


kegsay commented Apr 25, 2024

To establish a secure channel between a pair of devices, all Matrix clients use a double-ratchet algorithm called Olm. As part of this establishment, the sending device must "claim" a one-time key (OTK) via /keys/claim on the receiving device's homeserver. OTKs are public/private keypairs: the private key always stays in a database on the device, while the public key is uploaded to the user's homeserver so that anyone can establish a secure channel to that device even while it is offline. OTKs are used once and then deleted from both the client and server databases. Every E2EE device uploads a number of OTKs to the server and tries to keep that count above a certain threshold, since other devices constantly claim (and hence delete) them.

This issue describes many scenarios where the private keys stored on the device do not match the public keys stored on the server: in other words, the set of OTKs is no longer synchronised between client and server. If this happens, there will be at least one undecryptable message. If clients do not automatically recover from wedged Olm sessions, there will potentially be many more.


Client Issues

Clients may forget which OTKs were uploaded to the server, through idempotency failures, use of multiple clients in separate processes which do not talk to each other, or other unknown reasons. If this happens, things desync because the client does not have the private keys for the OTKs stored on the server.

Clients should only delete the private key when the OTK was successfully used. There have been bugs where the client has deleted the private key incorrectly on decryption failures.

Server Issues

The server database could become corrupted, through a hardware failure or via a bad homeserver update. Server admins will usually roll back to a "known good" version of the database when this happens. However, if anyone claimed an OTK in the period between the bad homeserver update and the rollback, the server will have forgotten that claim, causing a desync. The server may then hand the same OTK to a different user in response to /keys/claim. This is #2155.

Protocol Issues

Databases are not infinite. This is especially true on mobile devices. As a result, clients will often keep only the most recent N private OTKs, where N is very large (Vodozemac sets this via `const MAX_ONE_TIME_KEYS: usize = 100 * PUBLIC_MAX_ONE_TIME_KEYS;`, which is currently 100 * 50 = 5,000). Once this limit is reached, the client deletes its oldest OTKs. Unfortunately, there is no API to tell the server to also delete these OTKs, so things will desync. This is #2356.
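To make the failure mode concrete, here is a minimal sketch of the kind of bounded store involved (hypothetical types and names, not Vodozemac's actual code): the client silently prunes its oldest private keys once the cap is hit, but nothing tells the server to drop the matching public keys.

```rust
use std::collections::VecDeque;

/// Illustrative cap, matching the 5,000 figure above.
const MAX_ONE_TIME_KEYS: usize = 5_000;

/// Hypothetical client-side store of private one-time keys,
/// oldest at the front, newest at the back.
struct OneTimeKeyStore {
    keys: VecDeque<(String, Vec<u8>)>, // (key_id, private_key_bytes)
}

impl OneTimeKeyStore {
    /// Insert a freshly generated private key, pruning the oldest entries
    /// once the cap is exceeded. The pruned key IDs are returned, but with
    /// no API to tell the server to forget them, their public halves stay
    /// published: a guaranteed future UTD if anyone claims one of them.
    fn insert(&mut self, key_id: String, private_key: Vec<u8>) -> Vec<String> {
        self.keys.push_back((key_id, private_key));
        let mut pruned = Vec::new();
        while self.keys.len() > MAX_ONE_TIME_KEYS {
            if let Some((old_id, _)) = self.keys.pop_front() {
                pruned.push(old_id);
            }
        }
        pruned
    }
}
```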

Multiple users could hit /keys/claim at the same time the target user is uploading keys. This creates race conditions which can cause the same key to be used multiple times.


What do we do about this?

Some of these issues are just gnarly bugs. Some are failures in the protocol, and some are fundamental problems that cannot be resolved easily. The thing all of them have in common, though, is a desync between client and server. We can do something to detect this and take corrective action.

For example, the server could send a hash of all the public keys (or some other O(1) checksum which does not scale with the number of OTKs on the server) down /sync, which would allow clients to check that the keys stored server-side are in fact the same keys they have stored. If there is a mismatch, there needs to be some API for the client to control which keys are kept and which keys are discarded. If such an API existed, this would address the vast majority of these issues in one go. There are tradeoffs however:

  • It risks papering over race conditions and idempotency failures in client/server implementations, but this feels no worse than automatically recovering "wedged" Olm sessions: it improves reliability of the crypto stack.
  • It's unclear how to handle "in-flight" OTKs, where a remote user has claimed an OTK (so it has been deleted from the server) but has not yet sent a message encrypted with it (so the private key must be kept on the client). This relates to #2356 (Clients may throw away one-time keys which are still published on the server, or have messages in flight).
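As a rough illustration of the checksum idea above, the client-side comparison might look something like the following. The hash construction and the field carrying the server's value are invented for this sketch; a real design would use a cryptographic hash over a canonical encoding of the key IDs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Order-independent checksum over the public one-time key IDs the client
/// believes are still published. DefaultHasher keeps the sketch
/// dependency-free; a real design would use e.g. SHA-256.
fn otk_checksum(mut key_ids: Vec<String>) -> u64 {
    key_ids.sort();
    let mut hasher = DefaultHasher::new();
    for id in &key_ids {
        id.hash(&mut hasher);
    }
    hasher.finish()
}

/// Compare against the value the server would send down /sync. A mismatch
/// means the two key sets have desynced and some reconciliation API should
/// be invoked. Note the in-flight problem above: a claimed-but-unused key
/// makes the two sets legitimately differ.
fn keys_desynced(local_key_ids: Vec<String>, server_checksum: u64) -> bool {
    otk_checksum(local_key_ids) != server_checksum
}
```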

richvdh commented Apr 26, 2024

There's a lot of words here, and a lot of links to closed issues. I think the actual issues still outstanding are:

There is also #1992, but that is more of a discussion of how to recover from the situation than a root cause.

Is all that fair so far?

Of the four issues linked above:

So is your proposal really all about dealing with #2155?


kegsay commented Apr 29, 2024

  • A desync solution could help with #2356 (Clients may throw away one-time keys which are still published on the server, or have messages in flight) if the server included in-flight key IDs.
  • Re concurrency failures, there are other sources which we may end up hitting, e.g. multi-tab in Element-Web. We could fix each and every one of these issues, but that's not the point of this meta-issue. It's the same reason why we recover from wedged Olm sessions rather than, you know, "just" fixing the wedging in the first place.
  • Yes, this solution does help with server rollbacks, in addition to client rollbacks (e.g. restoring a phone from a backup).

So overall no, it's not really just about dealing with server rollbacks.


richvdh commented Apr 29, 2024

A desync solution could help with #2356 if the server included in-flight key IDs.

How would the server know which keys were in-flight though? The server knows that a given OTK has been claimed, but, particularly in the case of federation, it doesn't know any better than the receiving client whether that means there is a message on its way using that OTK, or if the client that claimed the OTK exploded in a ball of flames just after the claim, so the OTK will never actually be used. We could maybe have the server do some introspection of the to-device-message queue, but... eeeehrgh.

Incidentally, the main problem with #2356 is that the server can hand out OTKs in any order it likes, so the client has no idea whatsoever which keys it should consider useless. Fixing that is easy, and doesn't require any fancy synchronisation system.

Re concurrency failures, there are other sources which we may end up hitting, e.g. multi-tab in Element-Web. We could fix each and every one of these issues, but that's not the point of this meta-issue. It's the same reason why we recover from wedged Olm sessions rather than, you know, "just" fixing the wedging in the first place.

Well, in both cases I'd say that we wouldn't (and shouldn't) have the recovery mechanism if the only reason for it was to cover up concurrency bugs. The downsides (papering over implementation bugs) would outweigh the advantages in that situation. The recovery mechanism is inherently racy so you're still going to have failures; you just make the underlying bugs harder to track down and fix.

So sure, we can implement a "resync" mechanism, but IMHO "it will help cover up concurrency bugs" is not a compelling reason to do so.


One of the reasons I'm trying to get a handle on what we're actually trying to fix here is that I think it will determine the design of a fix:

For example, the server could send a hash of all the public keys (or some other O(1) checksum which does not scale with the number of OTKs on the server) down /sync, which would allow clients to check that the keys stored server-side are in fact the same keys they have stored.

I'm not convinced this hash mechanism will work. Given there can be a significant delay between an OTK being claimed and it being used in a message, it's not unusual for clients to have more keys stored than are still available on the server.

If we could agree that the thing we're trying to fix here is client- or server- rollbacks, then I think possibly what we need is a sequence number: the server can say "the most recent OTK I knew about was number 563". The client can then tell if the server has been rolled back and forgotten some (and the client should re-upload them), or if the client has been rolled back and forgotten some (and needs to tell the server to forget the lost keys asap). I think there's a lot of fiddly detail to sort out, but the principle is something like that.
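A sketch of that principle, assuming both sides track a monotonically increasing sequence number for uploaded OTKs (the names and types here are invented for illustration):

```rust
/// Hypothetical outcome of comparing sequence numbers after a /sync.
enum OtkRecovery {
    InSync,
    /// The server was rolled back and has forgotten keys the client still
    /// holds: re-upload the public keys from this sequence number onwards.
    ReuploadFrom(u64),
    /// The client was rolled back (e.g. phone restored from backup): ask
    /// the server to forget every key after the client's high-water mark.
    ForgetAfter(u64),
}

fn reconcile(client_latest_seq: u64, server_latest_seq: u64) -> OtkRecovery {
    if server_latest_seq < client_latest_seq {
        OtkRecovery::ReuploadFrom(server_latest_seq + 1)
    } else if server_latest_seq > client_latest_seq {
        OtkRecovery::ForgetAfter(client_latest_seq)
    } else {
        OtkRecovery::InSync
    }
}
```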


kegsay commented Jun 10, 2024

Regarding server rollbacks, @uhoreg had a rather simple and elegant solution: just drop the OTKs table on rollback. This will cause a zero OTK count to be sent down /sync, prompting clients to upload new OTKs. Keys claimed in the interim would have to be served from the fallback key.
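For context, clients already learn their server-side count from the `device_one_time_keys_count` field of the /sync response, so a wiped table shows up as a count of zero and naturally triggers a refill. A minimal sketch of that replenishment decision (the target value is illustrative):

```rust
/// Illustrative target for the number of signed_curve25519 OTKs to keep
/// published on the server.
const TARGET_OTK_COUNT: u64 = 50;

/// How many new one-time keys to generate and upload, given the
/// signed_curve25519 entry of device_one_time_keys_count from /sync.
/// After the server drops its OTK table on rollback, the count comes back
/// as 0 and the client simply refills it; /keys/claim requests in the
/// interim are answered with the fallback key.
fn keys_to_upload(server_count: u64) -> u64 {
    TARGET_OTK_COUNT.saturating_sub(server_count)
}
```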


kegsay commented Jun 20, 2024

Thinking more on this, we could also do a similar solution for clients: if the /keys/upload endpoint returns an HTTP 400 indicating "key already exists", delete all OTKs on the server for that device. This will then cause a fresh upload by the client, resyncing things.
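One possible reading of that proposal on the server side, sketched with a hypothetical storage trait (the exact error code and wording are not specified anywhere, so treat this purely as an illustration):

```rust
/// Hypothetical homeserver-side storage for one device's one-time keys.
trait OtkStore {
    fn key_exists(&self, device_id: &str, key_id: &str) -> bool;
    fn insert_key(&mut self, device_id: &str, key_id: &str, key: &str);
    fn delete_all_keys(&mut self, device_id: &str);
}

/// Handle a single uploaded key. A re-upload of an existing key ID is a
/// sure sign of desync, so rather than only rejecting it, the server also
/// wipes every stored OTK for the device. The next /sync then reports a
/// zero count and the client uploads a fresh, consistent set.
fn handle_uploaded_key(
    store: &mut impl OtkStore,
    device_id: &str,
    key_id: &str,
    key: &str,
) -> Result<(), &'static str> {
    if store.key_exists(device_id, key_id) {
        store.delete_all_keys(device_id);
        return Err("key already exists"); // surfaced to the client as HTTP 400
    }
    store.insert_key(device_id, key_id, key);
    Ok(())
}
```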

We are almost certainly going to need this (or something similar), because we have a growing list of devices which have desynced their OTKs. Even if we fix the cause of the desync, it doesn't help that there is this ticking time bomb unless we delete the OTKs.


richvdh commented Jun 20, 2024

if the /keys/upload endpoint returns an HTTP 400 indicating "key already exists", delete all OTKs on the server for that device.

I think you've proposed this, or something very like it, before (matrix-org/matrix-rust-sdk#1415 (comment)). I remain hard against it, at least until we have reasonable confidence that we've fixed the underlying bugs.

it doesn't help that there is this ticking time bomb unless we delete the OTKs.

Which ticking time bomb are you referring to? #2356? I don't think that deleting OTKs on a "key already exists" upload error would do anything to help with that.


kegsay commented Jul 2, 2024

I've made matrix-org/matrix-spec-proposals#4162, which I think addresses this.
