Clients may throw away one-time keys which are still published on the server, or have messages in flight #2356
The specific scenario I'm worrying about here is:
Synapse's algorithm for picking a key to return from /keys/claim is totally unsorted, so you've got a good chance of some very old keys sitting around for ages.
@richvdh Is this issue still useful to you? Besides fixing the issue itself, has logging around this been added with the latest crypto updates (perhaps in the Rust implementation)? Is this area easier to trace now?
It's still a potential problem. I don't really remember if more logging got added; the fact that nothing is linked here suggests that it didn't. Element-R won't necessarily fix the problem in itself, but it is rewriting this area, so it's not worth investigating in the legacy implementation.
The Rust SDK keeps a lot more OTKs (50 * 100 = 5000). libolm appears to keep 100.
This is mitigated in Element R, because the Rust SDK keeps up to 5000 OTKs before it starts throwing them away. However, it is still a problem, because that 5000-key limit will eventually fill up (from network blips etc. which cause keys to be claimed but not used), and then the Rust SDK will start throwing keys away. Whilst the Rust SDK will throw away the "oldest" key, we have no guarantee that that key is not still active on the server, so the problem recurs.
To further clarify why this is a problem, consider what happens when all 5000 key slots on the client are used up, 50 of those keys are still published on the server, and someone calls /keys/claim for one of them. In this case, when the caller encrypts for that device, the pre-key message will be undecryptable because the client threw away the OTK private key.
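To make the failure mode concrete, here is a toy sketch of that lifecycle. It is not the Rust SDK's actual code; the 5000 limit and the "evict the oldest" policy are taken from the discussion above, and all names are illustrative.

```python
# Toy model of the failure mode described above; not the Rust SDK's actual
# logic, just an illustration of why evicting the "oldest" private key can
# break decryption of a later pre-key message.
from collections import OrderedDict

MAX_PRIVATE_KEYS = 5000          # buffer limit discussed above

client_private = OrderedDict()   # key_id -> private half, oldest first
server_public = set()            # key_ids still published on the server

def upload_new_key(key_id: str) -> None:
    """The device keeps the private half and publishes the public half."""
    client_private[key_id] = f"private-{key_id}"
    server_public.add(key_id)
    # Once the local buffer is full, the oldest private key is dropped, even
    # if its public half is still sitting on the server waiting to be claimed.
    if len(client_private) > MAX_PRIVATE_KEYS:
        client_private.popitem(last=False)

def claim_and_send(key_id: str) -> str:
    """Another device claims a published key and sends a pre-key message."""
    server_public.discard(key_id)
    if key_id not in client_private:
        return "Unable To Decrypt"   # private half was already thrown away
    return "pre-key message decrypts"
```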
This is a ticking time bomb. We are likely to see it more as clients' DBs get older and more OTKs which are never claimed accumulate, consuming the 5000-key buffer. A heuristic like "discard the oldest" could work, or, perhaps more usefully, the server could send back the key IDs it has given out rather than just decrementing the count by 1. The latter is more invasive, though.
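A rough client-side sketch of the difference between those two options. Note that the `claimed_key_ids` field is hypothetical: today the server only reports a count (`device_one_time_keys_count` in /sync), and nothing like this field exists in the spec.

```python
# Hypothetical sketch only: today the server reports just a count
# ("device_one_time_keys_count" in /sync); the "claimed_key_ids" field below
# does NOT exist in the spec and is shown purely to illustrate the suggestion.

MAX_KEYS = 5000

def prune_private_keys(private_keys: dict, sync_response: dict) -> dict:
    claimed = sync_response.get("claimed_key_ids")  # hypothetical field
    if claimed is not None:
        # The client would know exactly which keys have left the server, so it
        # could discard those private halves and safely keep everything else.
        return {k: v for k, v in private_keys.items() if k not in claimed}

    # Today's reality: only a count is available, so once the buffer is full
    # the client has to guess (e.g. "discard the oldest"), possibly throwing
    # away a key whose public half is still published on the server.
    surplus = max(0, len(private_keys) - MAX_KEYS)
    oldest = list(private_keys)[:surplus]   # insertion order as a stand-in for age
    return {k: v for k, v in private_keys.items() if k not in oldest}
```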
Moving, as this seems to apply to all clients, including the Rust-SDK-based ones.
It's actually worse than "totally unsorted". In practice, Postgres will use the first matching entry from the relevant index, which means keys are handed out in lexicographical order of their key IDs rather than in the order they were uploaded. Now, the key ID is a base64-encoding of a 32-bit int, and the lexicographical order of those base64 strings has little to do with the order in which the keys were generated: for example, the key IDs for the 208th through 255th keys sort ahead of those of the keys generated before them, while key IDs later in the alphabet are handed out last and can sit on the server for a very long time.
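A small illustration of that ordering effect, assuming the key ID is the unpadded base64 of a big-endian 32-bit counter (which is how libolm-style key IDs such as `AAAAAQ` appear):

```python
# Sketch, assuming key IDs are the unpadded base64 of a big-endian 32-bit
# counter; this just shows why lexicographic order differs from creation order.
import base64
import struct

def key_id(n: int) -> str:
    return base64.b64encode(struct.pack(">I", n)).decode().rstrip("=")

ids = [key_id(n) for n in range(1, 257)]   # 256 keys, in creation order
claim_order = sorted(ids)                  # order a lexicographic index yields

print(ids[:3])           # ['AAAAAQ', 'AAAAAg', 'AAAAAw']
print(claim_order[:3])   # IDs containing '+', '/' or digits come out first
print(claim_order[-3:])  # lowercase-heavy IDs come out last, so those keys wait longest
```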
element-hq/synapse#17267 can cause us to have indefinitely claimed OTKs because the receiver timed out the request. |
What do you mean by "indefinitely claimed OTKs"? An OTK that is claimed but never used, causing it to get "stuck" on the client? That's true; the same can result from clients issuing a /keys/claim and then never sending anything with the claimed key.

Edit: well, it's not a separate issue: it's the reason that clients end up having to throw away the private part of OTKs. What I really mean is: it's pretty much expected behaviour.
Indeed; the reason I'm flagging it is that it outlines a real-world example of how we can accumulate keys on the client.
Currently, one-time-keys are issued in a somewhat random order. (In practice, they are issued according to the lexicographical order of their key IDs.) That can lead to a situation where a client gives up hope of a given OTK ever being used, whilst it is still on the server. Fixes: element-hq/element-meta#2356
Currently, one-time-keys are issued in a somewhat random order. (In practice, they are issued according to the lexicographical order of their key IDs.) That can lead to a situation where a client gives up hope of a given OTK ever being used, whilst it is still on the server. Related: element-hq/element-meta#2356
Background: On first login, clients generate about 50 one-time-keys. Each key consists of a public part and a secret part. The public parts are uploaded to the server, and the secret parts are retained on the device. A public one-time-key can then be "claimed" later by other devices, to initiate a secure Olm channel between the two devices. (The secure channel is then used for sharing Megolm message keys.) Whenever a client finds that there are less than 50 keys on the server, it will generate more key pairs and upload the public parts.
However, there is a limit to the number of secret keys that a client can keep hold of, and over time it may accumulate unused secret keys that appear to have never been used. The client has little alternative but to throw away such old keys.
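For reference, the upload/claim exchange described above, abridged from the Matrix client-server API (the key IDs, device IDs and key material are illustrative placeholders):

```python
# Abridged from the Matrix client-server API; the key IDs, device IDs and key
# material below are illustrative placeholders, not real values.

# 1. The device uploads the *public* halves of its one-time keys:
#    POST /_matrix/client/v3/keys/upload
upload_body = {
    "one_time_keys": {
        "signed_curve25519:AAAAHQ": {
            "key": "<curve25519 public key, base64>",
            "signatures": {"@alice:example.com": {"ed25519:JLAFKJWSCS": "<sig>"}},
        }
    }
}

# 2. Another device later claims one of those keys to set up an Olm channel:
#    POST /_matrix/client/v3/keys/claim
claim_body = {
    "one_time_keys": {"@alice:example.com": {"JLAFKJWSCS": "signed_curve25519"}}
}

# The claimed key is removed from the server, and only the uploading device
# still holds the matching *private* half. If that private half has been
# thrown away, a pre-key message built on this claim cannot be decrypted.
```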
Problem: there is nothing in the spec about the order in which servers give out one-time keys. In particular, Synapse does not give them out in the obvious first-in first-out order (see below). It is therefore possible for some very old public keys to hang around on the server. By the time those keys are eventually claimed by the sender, it is quite possible that the receiving device has forgotten the secret part of the corresponding key.
The net effect is that the recipient cannot read the messages sent over the Olm channel, does not receive the message keys, and the user sees an Unable To Decrypt error.
(Worse: we have mitigations to "unwedge" such broken Olm channels, but they don't actually work very well in this case due to matrix-org/matrix-rust-sdk#3427.)
Original description:
element-hq/element-web#2782 was a special case of this, but there are other cases where we can throw away important one-time keys. It may be impossible to make this completely water-tight, but I'd like to add some logging when we throw away one-time keys, and reconsider the heuristic. (Does /claim give out the oldest key, which is the one we expire first?)