e2e: fetch cross-signing keys when device is signed but no key cached #10668
I'm not sure if the signature device being in the
8f8c6f2 to 23511b1
When, for a remote user, we don't have cached `e2e_cross_signing_keys` but we do have the devices cached in `device_lists_remote_cache`, and the user has cross-signing set up, the missing keys are not fetched and cross-signing verification will fail. With this patch, the missing cross-signing keys are requested if the cached device list does have cross-signing signatures. Signed-off-by: Jonas Jelten <jj@sft.lol>
23511b1 to f24b399
Thanks very much for the contribution!
This code is really hard to follow - a single function should not be over 250 lines long, and some of the code has 11 levels of indentation. That's not entirely your fault - it was pretty bad before - but we have to draw a line somewhere and I don't think I want to let it get any worse. In short, I'm afraid I think we need to refactor it a bit to split it up.
However, more to the point, I'm not convinced this is a correct change. As far as I can see, there is nothing in the spec that says that server implementations are expected to parse the `signatures` field and fetch cross-signing keys - rather, it is left up to the clients to request the cross-signing keys if they are required. @uhoreg: do you have any memory of how this stuff is supposed to work?
[If this is a correct change, then we ought to add system tests to sytest or complement to ensure that other server implementations do it correctly.]
    if not signatures:
        continue

    for sig_user, sigs in signatures.items():
Given that this is pulled over federation from a potentially untrusted remote server, which may itself not have validated the content of the key object, what guarantee do we have that `signatures` is actually a `dict`? We should handle invalid responses more gracefully than throwing `no attribute 'items'` exceptions. Likewise, we should not make assumptions about the content of the dict.
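To illustrate the kind of defensive handling asked for here, the following is a minimal sketch, not the code from this PR. The helper name `iter_valid_signatures` is hypothetical, and it assumes only that the key object is parsed JSON whose `signatures` field should map user IDs to a mapping of key IDs to signature strings. Anything that does not match that shape is skipped instead of raising.

```python
from typing import Any, Iterator, Tuple


def iter_valid_signatures(key_json: Any) -> Iterator[Tuple[str, str, str]]:
    """Yield (signing_user, key_id, signature) triples from an untrusted key object.

    Hypothetical helper: malformed entries are silently skipped rather than
    triggering AttributeError/TypeError on attacker-controlled input.
    """
    if not isinstance(key_json, dict):
        return
    signatures = key_json.get("signatures")
    if not isinstance(signatures, dict):
        return
    for sig_user, sigs in signatures.items():
        # Both levels must have the expected types before we touch them.
        if not isinstance(sig_user, str) or not isinstance(sigs, dict):
            continue
        for key_id, signature in sigs.items():
            if isinstance(key_id, str) and isinstance(signature, str):
                yield sig_user, key_id, signature
```

A caller would loop over the yielded triples instead of calling `signatures.items()` directly, so a remote server sending a string or list in place of a dict produces no signatures at all rather than an exception.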
Thanks for your thoughts, I agree my patch is pretty hacky, but maybe we can come up with a better solution by the symptoms of the bug:
My approach was to invalidate the cache when devices have cross-signing signatures but the cross-signing keys are not cached yet. I mean, the critical point is to determine that the cache is invalid: devices and device keys are cached, but the cross-signing keys are not.
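The check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: the function names and the in-memory dicts standing in for `device_lists_remote_cache` and `e2e_cross_signing_keys` are hypothetical, and the heuristic that "a signature by the user under a key ID other than the device's own `ed25519:<device_id>` indicates cross-signing" is an assumption made for the example.

```python
from typing import Any, Dict, Set


def has_cross_signing_signature(user_id: str, device_keys: Any) -> bool:
    """Heuristic (assumed): the device carries a signature from its owner made
    with some key other than the device's own ed25519 key."""
    if not isinstance(device_keys, dict):
        return False
    signatures = device_keys.get("signatures")
    if not isinstance(signatures, dict):
        return False
    own_sigs = signatures.get(user_id)
    if not isinstance(own_sigs, dict):
        return False
    own_key_id = f"ed25519:{device_keys.get('device_id')}"
    return any(key_id != own_key_id for key_id in own_sigs)


def users_needing_key_refetch(
    cached_devices: Dict[str, Dict[str, Any]],
    cached_cross_signing_keys: Dict[str, Any],
) -> Set[str]:
    """Return users whose cached devices look cross-signed but for whom no
    cross-signing keys are cached (the inconsistent state described above)."""
    stale: Set[str] = set()
    for user_id, devices in cached_devices.items():
        if cached_cross_signing_keys.get(user_id):
            continue  # keys are cached, nothing to detect
        if any(has_cross_signing_signature(user_id, d) for d in devices.values()):
            stale.add(user_id)
    return stale
```

The returned set would be the users for whom a fresh key query to the remote server is worth making; users with no cross-signing signatures on any device are left alone.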
The spec doesn't say that implementations should parse the signatures and see there's a cross-signing key, but I think it's allowable for servers to do so. IIRC, even if clients request the cross-signing keys, the server won't re-fetch them if it thinks that it's already up to date.
So what is your suggestion on how we solve this bug? :) I know at least 5 users affected by this, so I guess it's plenty more.
@uhoreg: I'm not really following this, I'm afraid. Surely it must be up to either the client or the server to decide if it is missing a cross-signing key. If the spec doesn't say that the server needs to do it, then the clients can't rely on the server doing it, so the clients have to do it?
If it's already up to date, why would the server need to re-fetch them?
It may not be up to date, but it may think that it's up to date if it missed a message notifying it of new keys.
The current spec kind of assumes that EDUs don't get dropped, so it doesn't really define a way of dealing with that situation. (It has a way of dealing with dropped EDUs, but only after you've received another key update message - it doesn't help in the situation where the dropped message is the last one that was sent.) This is similar to the situation where the device list gets out of sync and you fail to encrypt to new devices because you don't know they exist. IIRC Erik added a thing a while back where if you get a message claiming to be from a device you don't know about, Synapse will automagically re-fetch the user's devices from the remote server.
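The "unknown device triggers a re-fetch" idea mentioned here can be sketched as below. This is only an illustration of the concept, not Synapse's implementation: the class `RemoteDeviceCache` and its methods are hypothetical, and the actual re-fetch over federation is left out.

```python
from typing import Dict, Iterable, Set


class RemoteDeviceCache:
    """Toy in-memory cache of remote users' device IDs (hypothetical)."""

    def __init__(self) -> None:
        self._devices: Dict[str, Set[str]] = {}
        self.needs_resync: Set[str] = set()

    def store_device_list(self, user_id: str, device_ids: Iterable[str]) -> None:
        # A freshly fetched list replaces the cache and clears the stale flag.
        self._devices[user_id] = set(device_ids)
        self.needs_resync.discard(user_id)

    def note_sender_device(self, user_id: str, device_id: str) -> bool:
        """Record that a message claimed to come from (user_id, device_id).

        Returns True if the device was unknown, in which case the user is
        flagged so their device list can be re-fetched from the remote server.
        """
        if device_id in self._devices.get(user_id, set()):
            return False
        self.needs_resync.add(user_id)
        return True
```

The point of the sketch is that a missed update is repaired the next time evidence of staleness shows up, which is exactly the case the spec does not cover when the dropped message was the last one sent and no further evidence arrives.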
Wait, what? Which EDUs are getting dropped? Surely that's not right? And we should be fixing that rather than layering more and more unspecified hacks on top of it? I still don't think we should be doing this unless it's specced, so if we should be doing it, then it needs to be specced.
I don't know the details of how EDUs can get dropped. Maybe Erik remembers something from investigating the device list?
@erikjohnston Could you please weigh in on this?
We have a bunch of logic to try and ensure that device list update EDUs don't get dropped and will be retried, though the cross-signing stuff was tacked on at a later date so the logic may be subtly broken somehow. I.e. if we're seeing cross-signing keys not being correctly cached then that's a bug we should investigate. It's probably worth noting we do have some paranoia code when we receive encrypted events over federation to check if our cache for the sending device is stale: https://github.com/matrix-org/synapse/blob/develop/synapse/handlers/federation_event.py#L918
So to be clear - you don't think we should be taking this approach of inspecting the signatures, and instead should find out why the cross-signing keys are going missing?
Yes. I'm not super against having extra checks to try and detect when things go wrong, but if we do that we should still investigate occurrences of it happening.
Thanks. I'm mostly opposed to adding yet more complication and magic to this already complicated, magical bit of code. If there is any way we can avoid doing so I think we should take it.
Sure, we should find the root cause, but we still need a way to clean up the out-of-sync state. And my signature check was the most obvious thing to me :) But I agree that doesn't seem like a good solution. What we should keep though is the bottom code hunk where newly fetched cross-signing keys are actually returned, because currently only cached keys make it to the
So how should we proceed here? If we're sure that state is now synced properly, just flushing the cache once in some db migration might do the trick already?
In terms of hunting down the root cause: step 1 would be to try and reproduce this reliably (and produce a test case). If that doesn't work out then we resort to trying to add enough logging such that if it happens again we have some hope of at least narrowing down what's going on. Unfortunately, we're a bit swamped at the minute so we likely won't have a chance to work on that ourselves right now, even though flaky cross-signing is a big concern.
If there are things in this PR that would be useful besides the inspection of signature stuff then it's probably easiest to open them in a separate PR? Naively that sounds like just a good bug fix that we should land regardless of anything else (which I'd missed) 🙂
Ok, I've submitted the return-fetched-keys pull request in #10912. About finding the root cause: I'm not sure this is the right approach. It may be that this bug no longer occurs, but there is said cache corruption for the keys in 3 homeservers I administer. It may be from the early days of the cross-signing implementation. Which is why we should try to detect the corruption and refetch the correct state, and that was my attempt in the patch.
I think all the changes we wanted to land in this PR have been landed separately, so I'm going to go ahead and close this (shout if that's wrong!). Thanks @TheJJ for spending the time investigating this and coming up with the various solutions 👍
Thanks! I'll test if it now works :)
No sorry, my cross-signing-keys have still not been stored in the other homeserver:
When, for a remote user, we don't have cached `e2e_cross_signing_keys` but we do have the devices cached in `device_lists_remote_cache`, and the user has cross-signing set up, the missing keys are not fetched and cross-signing verification will fail.
With this patch, the missing cross-signing keys are requested if the cached device list does have cross-signing signatures.
This is a follow-up for #8455.