Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

We don't retry E2EE to-device messages at all. #12851

Closed
Tracked by #673 ...
ara4n opened this issue Mar 24, 2020 · 10 comments · Fixed by matrix-org/matrix-js-sdk#2549
Closed
Tracked by #673 ...

We don't retry E2EE to-device messages at all. #12851

ara4n opened this issue Mar 24, 2020 · 10 comments · Fixed by matrix-org/matrix-js-sdk#2549
Assignees
Labels
A-E2EE O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Defect Z-Chronic

Comments

@ara4n
Copy link
Member

ara4n commented Mar 24, 2020

Looking at this series of utter failures to send E2E messages, it doesn't look healthy:

Screenshot 2020-03-24 at 17 14 57

specifically, the 500s never get retried. @uhoreg thinks this may in turn desync and wedge the olm sessions in question.

@t3chguy
Copy link
Member

t3chguy commented Oct 5, 2020

Does the order of the to_device messages need maintaining?

@uhoreg
Copy link
Member

uhoreg commented Oct 5, 2020

Sort of. Weird things may happen if you try to send an encrypted to_device message to some device, receive a to_device message from that same device, send another to_device message to that device, and then retry sending the first message. (If you don't have the "receive a to_device message from that same device" step, then everything should be fine.)

But it would probably be fine if we just retry sending immediately one or two times.

@t3chguy
Copy link
Member

t3chguy commented Oct 6, 2020

But it would probably be fine if we just retry sending immediately one or two times.

The issue I'm thinking is that if one request fails there is a decent probability an immediate retry will too, which might mean we need to queue and issue in-order retrying with a back-off or something and persisting so that we don't forget when the app gets restarted

@t3chguy
Copy link
Member

t3chguy commented Oct 16, 2020

send an encrypted to_device message to some device, receive a to_device message from that same device, send another to_device message to that device, and then retry sending the first message.

What should happen in this case
Should the retry be dropped?
Should the another message be queued to land after the retry?

@uhoreg
Copy link
Member

uhoreg commented Oct 16, 2020

Hmm. The problem with that case is that if you receive a message from a device, and then send a new message, then the ratchet would be advanced in such a way that any old messages that you had intended to send will be undecryptable. But if you retry the old message (and it succeeds) before sending the new message, then all should be fine. So perhaps one way to solve this would be to put all the to-be-sent to-device messages in a queue, and don't send the next one until the current one has been sent (or we've given up)?

@dkasak
Copy link
Member

dkasak commented Apr 19, 2022

The problem with that case is that if you receive a message from a device, and then send a new message, then the ratchet would be advanced in such a way that any old messages that you had intended to send will be undecryptable.

At least vodozemac would happily handle this case since it retains a number of old chains so that it can decrypt messages encrypted with the old chain but arriving late. So things should still work out even if this scenario happens.

@t3chguy t3chguy added S-Major Severely degrades major functionality or product features, with no satisfactory workaround O-Uncommon Most users are unlikely to come across this or unexpected workflow labels May 17, 2022
@ara4n
Copy link
Member Author

ara4n commented Jun 22, 2022

send an encrypted to_device message to some device, receive a to_device message from that same device, send another to_device message to that device, and then retry sending the first message.

What should happen in this case?
Should the retry be dropped?
Should the another message be queued to land after the retry?

I think all this was trying to say is:

  • We need to queue and retry outbound to-device messages
  • Retry schedule can be exponential back-off
  • If we receive a message from a given device, we should probably retry our queue of messages destined to that device.

There is no need to drop todevice messages on the floor.

Separately, in future we'll need the ability to cancel to-device messages which are no longer relevant - both from being stored on the server, and being queued up to be sent to the server in the first place. For instance, MSC3401 needs this for killing off irrelevant stale VoIP events. So, if the application knows a message should be unqueued/deleted, it can do so. But the stack itself should just do reliable delivery by default.

This causes a large number of my most obnoxious UISIs fwiw - c.f. the backlinks above. It can also undermine MSC3401 pretty badly.

@ara4n ara4n removed the X-Blocked label Jun 22, 2022
@kittykat kittykat removed the P1 label Jul 20, 2022
@dbkr
Copy link
Member

dbkr commented Jul 22, 2022

Having looked at this for a bit, an initial plan would be to just make the to-device API match the event sending API and make it retry automatically for a period of time, then reject with a mechanism to initiate a retry manually. However, if we really make it match then this probably wouldn't help a great deal since regular messages fail after about 30 seconds, after which the user manually retries sending them. The to-device messages would therefore either need to retry for (much) longer automatically, or we'd need to automatically retry the to-device message queue when retrying a regular message.

At the moment I'm a bit torn on whether to try & retry them maintaining API compatibility or whether to make a new API altogether (the current one is very basic and not very pretty). The main problem is that the current API is a bit fire-and-forget, so we need to avoid to-device messages building up in a queue if we started retrying them without changing the API.

@fkwp
Copy link
Contributor

fkwp commented Jul 27, 2022

Can we double-check that we also have reliable to-device messages on mobile.

@dbkr
Copy link
Member

dbkr commented Jul 29, 2022

Have created equivalent bugs on iOS & Android.

su-ex added a commit to SchildiChat/matrix-js-sdk that referenced this issue Aug 22, 2022
* Add txn_id support to sliding sync ([\matrix-org#2567](matrix-org#2567)).
* Emit an event when the client receives TURN servers ([\matrix-org#2529](matrix-org#2529)).
* Add support for stable prefixes for MSC2285 ([\matrix-org#2524](matrix-org#2524)).
* Remove stream-replacement ([\matrix-org#2551](matrix-org#2551)).
* Add support for sending user-defined encrypted to-device messages ([\matrix-org#2528](matrix-org#2528)).
* Retry to-device messages ([\matrix-org#2549](matrix-org#2549)). Fixes element-hq/element-web#12851.
* Sliding sync: add missing filters from latest MSC ([\matrix-org#2555](matrix-org#2555)).
* Use stable prefixes for MSC3827 ([\matrix-org#2537](matrix-org#2537)).
* Fix: Handle parsing of a beacon info event without asset ([\matrix-org#2591](matrix-org#2591)). Fixes element-hq/element-web#23078.
* Fix finding event read up to if stable private read receipts is missing ([\matrix-org#2585](matrix-org#2585)). Fixes element-hq/element-web#23027.
* Fixed a sliding sync issue where history could be interpreted as live events. ([\matrix-org#2583](matrix-org#2583)).
* Don't load the sync accumulator if there's already a sync persist in flight ([\matrix-org#2569](matrix-org#2569)).
t3chguy added a commit that referenced this issue Oct 17, 2024
Co-authored-by: github-merge-queue <118344674+github-merge-queue@users.noreply.github.com>
Co-authored-by: Michael Telatynski <7t3chguy@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-E2EE O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Defect Z-Chronic
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants