Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

to_device messages are going missing on matrix.org, causing UISIs 🔥 #6450

Closed
ara4n opened this issue Dec 2, 2019 · 4 comments
Closed
Labels
z-bug (Deprecated Label)

Comments

@ara4n
Copy link
Member

ara4n commented Dec 2, 2019

See #6433.

tl;dr: Just had pantalaimon in #moderation suddenly get UISIs from my riot-web. I can see my riot sending a to_device message with the to_devices, but the to_device never gets received by pantalaimon. Discarding the outbound session in riot-web via /discardsession resolves the problem.

Here's my riot-web trying to start a new outbound session and relay it to pan (from https://github.com/matrix-org/riot-web-rageshakes/issues/1989):

2019-12-02T18:08:12.821Z I sendEvent of type m.room.message in !NeoYFhhzYBFGYYuEau:matrix.org with txnId m1575310092821.248
2019-12-02T18:08:12.852Z I setting pendingEvent status to encrypting in !NeoYFhhzYBFGYYuEau:matrix.org
2019-12-02T18:08:12.874Z I Starting to encrypt event for !NeoYFhhzYBFGYYuEau:matrix.org
2019-12-02T18:08:12.874Z I downloadKeys: already have all necessary keys
2019-12-02T18:08:12.874Z I Starting new megolm session for room !NeoYFhhzYBFGYYuEau:matrix.org
2019-12-02T18:08:12.896Z I Started new megolm session 4GXWIRt76/jikfNg2E/aAMNz9A8MzTD2t05YluGQxzU for room !NeoYFhhzYBFGYYuEau:matrix.org

We try to share it with pan:

...
2019-12-02T18:08:14.243Z I share keys with device @abuse:matrix.org:BYJTXGGWIL
...
2019-12-02T18:08:15.018Z I Using sessionid x+KcQ3ZVqDbgTnM57gISv88bRcBjfha2OzNgCNWcLCk for device @abuse:matrix.org:BYJTXGGWIL
...
2019-12-02T18:08:15.028Z I encryptMessage: Olm Session ID x+KcQ3ZVqDbgTnM57gISv88bRcBjfha2OzNgCNWcLCk to mlEXv71Vm1vGhxdOLV/zZmojuU0jvOjPT4DWziDTjRw: sender chain index: 2 receiver chain indices: 1 1 3 6 2 skipped message keys:
...
2019-12-02T18:08:15.350Z I PUT /sendToDevice/m.room.encrypted/m1575310095350.250 {"@abuse:matrix.org":["BYJTXGGWIL","RWAQHVCHCJ"],"@travis:t2l.io":["HTONRWZZML","AOWVGVZKJR"],"@uhoreg:matrix.org":["PCIVCOGYIC","HAVWLGWFYJ","ODRMFSSXPK","RDYUSNUGZC"],"@x:riot.ovh":["XLOTVUUERG","YNLARNZJVE","ZNXPHZZFKQ"],"@jryans:matrix.org":["DQTIQSURIN","ACMSAVQJDN","BILFXFPGTB","RKGMEDDEPB","HLYGGVJBFF","MZMQZEKJJW","MYEMQYNXTU","JXYIPLZQEE","URATTINTVE"],"@Amandine:matrix.org":["ONLYYPJTBS"]}
...

Meanwhile, Synapse sees this to_device put:

homeserver.log:2019-12-02 18:08:15,537 - synapse.access.http.8080 - 233 - INFO - PUT-1905889 - 80.250.101.234 - 8080 - Received request: PUT /_matrix/client/r0/sendToDevice/m.room.encrypted/m1575310095350.250
homeserver.log:2019-12-02 18:08:15,653 - synapse.access.http.8080 - 302 - INFO - PUT-1905889 - 80.250.101.234 - 8080 - {@matthew:matrix.org} Processed request: 0.112sec/0.003sec (0.018sec, 0.002sec) (0.001sec/0.098sec/1) 2B 200 "PUT /_matrix/client/r0/sendToDevice/m.room.encrypted/m1575310095350.250 HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" [0 dbevts]

...which looks fine; the same as all the other shares in that slice. However, pantalaimon simply doesn't see it. If it had, it would have logged something like this (as seen after invoking /discardsession):

Dec  2 19:17:32 tenes docker-pantalaimon[598]: [2019-12-02 19:17:32.459825] DEBUG: nio.encryption: Decrypting event of type OlmEvent
Dec  2 19:17:32 tenes docker-pantalaimon[598]: [2019-12-02 19:17:32.461057] INFO: nio.encryption: Trying to decrypt olm message using existing session for @matthew:matrix.org and sender_key pX4igM5BYBKpPnKF9GfU7putQpeLXsDWvCuBBXQiN34
Dec  2 19:17:32 tenes docker-pantalaimon[598]: [2019-12-02 19:17:32.477972] INFO: nio.encryption: Succesfully decrypted olm message using existing session
Dec  2 19:17:32 tenes docker-pantalaimon[598]: [2019-12-02 19:17:32.479700] INFO: nio.encryption: Recieved Olm event of type: m.room_key
Dec  2 19:17:32 tenes docker-pantalaimon[598]: [2019-12-02 19:17:32.481020] INFO: nio.encryption: Recieved new group session key for room !NeoYFhhzYBFGYYuEau:matrix.org from @matthew:matrix.org
Dec  2 19:17:32 tenes docker-pantalaimon[598]: [2019-12-02 19:17:32.481799] INFO: nio.encryption: Creating inbound group session for !NeoYFhhzYBFGYYuEau:matrix.org from pX4igM5BYBKpPnKF9GfU7putQpeLXsDWvCuBBXQiN34

But instead there is not even 'Decrypting event of type OlmEvent', or any errors from a wedged olm session. The to_device message has simply gone missing.

This could also explain https://github.com/matrix-org/riot-web-rageshakes/issues/1943.

@ara4n
Copy link
Member Author

ara4n commented Dec 2, 2019

pan's logs at the time of the missing to_device are:

Dec  2 18:08:07 tenes docker-pantalaimon[598]: [2019-12-02 18:08:07.147847] INFO: pantalaimon: Trying to decrypt sync
Dec  2 18:08:07 tenes docker-pantalaimon[598]: [2019-12-02 18:08:07.148076] INFO: pantalaimon: Decrypting sync
Dec  2 18:08:07 tenes docker-pantalaimon[598]: [2019-12-02 18:08:07.148203] INFO: pantalaimon: Room Matrix HQ is not encrypted skipping...
Dec  2 18:08:09 tenes docker-pantalaimon[598]: [2019-12-02 18:08:09.234623] INFO: pantalaimon: Trying to decrypt sync
Dec  2 18:08:09 tenes docker-pantalaimon[598]: [2019-12-02 18:08:09.236359] INFO: pantalaimon: Decrypting sync
Dec  2 18:08:09 tenes docker-pantalaimon[598]: [2019-12-02 18:08:09.236721] INFO: pantalaimon: Room Matrix HQ is not encrypted skipping...
Dec  2 18:08:09 tenes docker-pantalaimon[598]: [2019-12-02 18:08:09.402855] INFO: pantalaimon: Trying to decrypt sync
Dec  2 18:08:09 tenes docker-pantalaimon[598]: [2019-12-02 18:08:09.403107] INFO: pantalaimon: Decrypting sync
Dec  2 18:08:09 tenes docker-pantalaimon[598]: [2019-12-02 18:08:09.403229] INFO: pantalaimon: Room Matrix HQ is not encrypted skipping...
Dec  2 18:08:09 tenes docker-pantalaimon[598]: [2019-12-02 18:08:09.771468] INFO: pantalaimon: Trying to decrypt sync
Dec  2 18:08:09 tenes docker-pantalaimon[598]: [2019-12-02 18:08:09.771711] INFO: pantalaimon: Decrypting sync
Dec  2 18:08:09 tenes docker-pantalaimon[598]: [2019-12-02 18:08:09.773567] INFO: pantalaimon: Room Reputation is not encrypted skipping...
Dec  2 18:08:17 tenes docker-pantalaimon[598]: [2019-12-02 18:08:17.610630] INFO: pantalaimon: Trying to decrypt sync
Dec  2 18:08:17 tenes docker-pantalaimon[598]: [2019-12-02 18:08:17.610876] INFO: pantalaimon: Decrypting sync
Dec  2 18:08:17 tenes docker-pantalaimon[598]: [2019-12-02 18:08:17.611215] INFO: pantalaimon: Room Matrix HQ is not encrypted skipping...
Dec  2 18:08:18 tenes docker-pantalaimon[598]: [2019-12-02 18:08:18.590372] INFO: pantalaimon: Trying to decrypt sync
Dec  2 18:08:18 tenes docker-pantalaimon[598]: [2019-12-02 18:08:18.590643] INFO: pantalaimon: Decrypting sync
Dec  2 18:08:18 tenes docker-pantalaimon[598]: [2019-12-02 18:08:18.590808] INFO: pantalaimon: Room Synapse Admins is not encrypted skipping...
Dec  2 18:08:18 tenes docker-pantalaimon[598]: [2019-12-02 18:08:18.590937] INFO: pantalaimon: Room Matrix HQ is not encrypted skipping...
Dec  2 18:08:19 tenes docker-pantalaimon[598]: [2019-12-02 18:08:19.335734] INFO: pantalaimon: Trying to decrypt sync
Dec  2 18:08:19 tenes docker-pantalaimon[598]: [2019-12-02 18:08:19.335996] INFO: pantalaimon: Decrypting sync
Dec  2 18:08:19 tenes docker-pantalaimon[598]: [2019-12-02 18:08:19.336977] WARNING: nio.encryption: Error decrypting megolm event, no session found with session id 4GXWIRt76/jikfNg2E/aAMNz9A8MzTD2t05YluGQxzU for room !NeoYFhhzYBFGYYuEau:matrix.org
Dec  2 18:08:19 tenes docker-pantalaimon[598]: [2019-12-02 18:08:19.337129] WARNING: pantalaimon: Error decrypting megolm event, no session found with session id 4GXWIRt76/jikfNg2E/aAMNz9A8MzTD2t05YluGQxzU for room !NeoYFhhzYBFGYYuEau:matrix.org
Dec  2 18:08:19 tenes docker-pantalaimon[598]: [2019-12-02 18:08:19.337250] INFO: pantalaimon: Error decrypting sync, waiting for next pan sync

In other words, it looks like /sync doesn't even return for the missing to_device msg.

@ara4n
Copy link
Member Author

ara4n commented Dec 3, 2019

Looking at the synchrotron logs (sync3) is a bit more revealing - it looks like pan is maintaining two separate overlapping /sync streams - one with full_state=false and a 10s timeout, and the other with lazy-loading on and a 30s timeout.

Here are the logs for each /sync request, grouped together (and thus separating all the overlap).

2019-12-02 18:08:09,370 - synapse.access.http.8088 - 233 - INFO - GET-1265378 - 94.237.61.73 - 8088 - Received request: GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373716_757269739_286253_390812579_237789178_1005260_36557672_19116394_64618
2019-12-02 18:08:09,371 - synapse.util.caches.response_cache - 148 - INFO - GET-1265378 - [sync]: no cached result for [(@abuse:matrix.org, 10000, 's1175373716_757269739_286253_390812579_237789178_1005260_36557672_19116394_64618', None, False, 'BYJTXGGWIL')], calculating new one
2019-12-02 18:08:09,371 - synapse.handlers.sync - 940 - INFO - GET-1265378 - Calculating sync response for @abuse:matrix.org between StreamToken(room_key='s1175373716', presence_key='757269739', typing_key='286253', receipt_key='390812579', account_data_key='237789178', push_rules_key='1005260', to_device_key='36557672', device_list_key='19116394', groups_key='64618') and StreamToken(room_key='s1175373723', presence_key=757269739, typing_key=286263, receipt_key=390812584, account_data_key=237789182, push_rules_key=1005260, to_device_key=36557673, device_list_key=19116394, groups_key=64618)
2019-12-02 18:08:09,396 - synapse.access.http.8088 - 302 - INFO - GET-1265378 - 94.237.61.73 - 8088 - {@abuse:matrix.org} Processed request: 0.014sec/0.012sec (0.006sec, 0.000sec) (0.000sec/0.000sec/0) 882B 200 "GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373716_757269739_286253_390812579_237789178_1005260_36557672_19116394_64618 HTTP/1.1" "Python/3.7 aiohttp/3.6.2" [0 dbevts]

2019-12-02 18:08:09,357 - synapse.access.http.8088 - 233 - INFO - GET-1265375 - 94.237.61.73 - 8088 - Received request: GET /_matrix/client/r0/sync?access_token=<redacted>&since=s1175373716_757269739_286253_390812579_237789178_1005260_36557672_19116394_64618&timeout=30000&filter=%7B%22room%22:%7B%22state%22:%7B%22lazy_load_members%22:true%7D%7D%7D
2019-12-02 18:08:09,361 - synapse.util.caches.response_cache - 148 - INFO - GET-1265375 - [sync]: no cached result for [(@abuse:matrix.org, 30000, 's1175373716_757269739_286253_390812579_237789178_1005260_36557672_19116394_64618', '{"room":{"state":{"lazy_load_members":true}}}', False, 'BYJTXGGWIL')], calculating new one
2019-12-02 18:08:09,361 - synapse.handlers.sync - 940 - INFO - GET-1265375 - Calculating sync response for @abuse:matrix.org between StreamToken(room_key='s1175373716', presence_key='757269739', typing_key='286253', receipt_key='390812579', account_data_key='237789178', push_rules_key='1005260', to_device_key='36557672', device_list_key='19116394', groups_key='64618') and StreamToken(room_key='s1175373723', presence_key=757269739, typing_key=286263, receipt_key=390812584, account_data_key=237789182, push_rules_key=1005260, to_device_key=36557673, device_list_key=19116394, groups_key=64618)
2019-12-02 18:08:09,396 - synapse.access.http.8088 - 302 - INFO - GET-1265375 - 94.237.61.73 - 8088 - {@abuse:matrix.org} Processed request: 0.035sec/0.004sec (0.008sec, 0.001sec) (0.005sec/0.003sec/1) 882B 200 "GET /_matrix/client/r0/sync?access_token=<redacted>&since=s1175373716_757269739_286253_390812579_237789178_1005260_36557672_19116394_64618&timeout=30000&filter=%7B%22room%22:%7B%22state%22:%7B%22lazy_load_members%22:true%7D%7D%7D HTTP/1.1" "Python/3.7 aiohttp/3.6.2" [0 dbevts]

2019-12-02 18:08:09,469 - synapse.access.http.8088 - 233 - INFO - GET-1265390 - 94.237.61.73 - 8088 - Received request: GET /_matrix/client/r0/sync?access_token=<redacted>&since=s1175373723_757269739_286263_390812584_237789182_1005260_36557673_19116394_64618&timeout=30000&filter=%7B%22room%22:%7B%22state%22:%7B%22lazy_load_members%22:true%7D%7D%7D
2019-12-02 18:08:09,474 - synapse.util.caches.response_cache - 148 - INFO - GET-1265390 - [sync]: no cached result for [(@abuse:matrix.org, 30000, 's1175373723_757269739_286263_390812584_237789182_1005260_36557673_19116394_64618', '{"room":{"state":{"lazy_load_members":true}}}', False, 'BYJTXGGWIL')], calculating new one
2019-12-02 18:08:09,475 - synapse.handlers.sync - 940 - INFO - GET-1265390 - Calculating sync response for @abuse:matrix.org between StreamToken(room_key='s1175373723', presence_key='757269739', typing_key='286263', receipt_key='390812584', account_data_key='237789182', push_rules_key='1005260', to_device_key='36557673', device_list_key='19116394', groups_key='64618') and StreamToken(room_key='s1175373726', presence_key=757269739, typing_key=286266, receipt_key=390812589, account_data_key=237789185, push_rules_key=1005260, to_device_key=36557673, device_list_key=19116394, groups_key=64618)
2019-12-02 18:08:09,563 - synapse.access.http.8088 - 302 - INFO - GET-1265390 - 94.237.61.73 - 8088 - {@abuse:matrix.org} Processed request: 0.090sec/0.004sec (0.014sec, 0.000sec) (0.028sec/0.020sec/1) 881B 200 "GET /_matrix/client/r0/sync?access_token=<redacted>&since=s1175373723_757269739_286263_390812584_237789182_1005260_36557673_19116394_64618&timeout=30000&filter=%7B%22room%22:%7B%22state%22:%7B%22lazy_load_members%22:true%7D%7D%7D HTTP/1.1" "Python/3.7 aiohttp/3.6.2" [0 dbevts]

2019-12-02 18:08:09,593 - synapse.access.http.8088 - 233 - INFO - GET-1265401 - 94.237.61.73 - 8088 - Received request: GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373723_757269739_286263_390812584_237789182_1005260_36557673_19116394_64618
2019-12-02 18:08:09,594 - synapse.util.caches.response_cache - 148 - INFO - GET-1265401 - [sync]: no cached result for [(@abuse:matrix.org, 10000, 's1175373723_757269739_286263_390812584_237789182_1005260_36557673_19116394_64618', None, False, 'BYJTXGGWIL')], calculating new one
2019-12-02 18:08:09,594 - synapse.handlers.sync - 940 - INFO - GET-1265401 - Calculating sync response for @abuse:matrix.org between StreamToken(room_key='s1175373723', presence_key='757269739', typing_key='286263', receipt_key='390812584', account_data_key='237789182', push_rules_key='1005260', to_device_key='36557673', device_list_key='19116394', groups_key='64618') and StreamToken(room_key='s1175373727', presence_key=757269739, typing_key=286267, receipt_key=390812589, account_data_key=237789186, push_rules_key=1005260, to_device_key=36557673, device_list_key=19116394, groups_key=64618)
2019-12-02 18:08:09,619 - synapse.access.http.8088 - 302 - INFO - GET-1265401 - 94.237.61.73 - 8088 - {@abuse:matrix.org} Processed request: 0.008sec/0.018sec (0.008sec, 0.000sec) (0.000sec/0.000sec/0) 881B 200 "GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373723_757269739_286263_390812584_237789182_1005260_36557673_19116394_64618 HTTP/1.1" "Python/3.7 aiohttp/3.6.2" [0 dbevts]

2019-12-02 18:08:09,685 - synapse.access.http.8088 - 233 - INFO - GET-1265406 - 94.237.61.73 - 8088 - Received request: GET /_matrix/client/r0/sync?access_token=<redacted>&since=s1175373726_757269739_286266_390812589_237789185_1005260_36557673_19116394_64618&timeout=30000&filter=%7B%22room%22:%7B%22state%22:%7B%22lazy_load_members%22:true%7D%7D%7D
2019-12-02 18:08:09,689 - synapse.util.caches.response_cache - 148 - INFO - GET-1265406 - [sync]: no cached result for [(@abuse:matrix.org, 30000, 's1175373726_757269739_286266_390812589_237789185_1005260_36557673_19116394_64618', '{"room":{"state":{"lazy_load_members":true}}}', False, 'BYJTXGGWIL')], calculating new one
2019-12-02 18:08:15,129 - synapse.handlers.sync - 940 - INFO - GET-1265406 - Calculating sync response for @abuse:matrix.org between StreamToken(room_key='s1175373726', presence_key='757269739', typing_key='286266', receipt_key='390812589', account_data_key='237789185', push_rules_key='1005260', to_device_key='36557673', device_list_key='19116394', groups_key='64618') and StreamToken(room_key='s1175373756', presence_key=757269739, typing_key=286284, receipt_key=390812599, account_data_key=237789194, push_rules_key=1005260, to_device_key=36557675, device_list_key=19116394, groups_key=64618)
2019-12-02 18:08:18,370 - synapse.access.http.8088 - 302 - INFO - GET-1265406 - 94.237.61.73 - 8088 - {@abuse:matrix.org} Processed request: 6.870sec/1.813sec (0.006sec, 0.000sec) (0.004sec/0.041sec/2) 896B 200 "GET /_matrix/client/r0/sync?access_token=<redacted>&since=s1175373726_757269739_286266_390812589_237789185_1005260_36557673_19116394_64618&timeout=30000&filter=%7B%22room%22:%7B%22state%22:%7B%22lazy_load_members%22:true%7D%7D%7D HTTP/1.1" "Python/3.7 aiohttp/3.6.2" [0 dbevts]

2019-12-02 18:08:17,877 - synapse.access.http.8088 - 233 - INFO - GET-1265601 - 94.237.61.73 - 8088 - Received request: GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373756_757269739_286284_390812599_237789194_1005260_36557675_19116394_64618
2019-12-02 18:08:17,878 - synapse.util.caches.response_cache - 148 - INFO - GET-1265601 - [sync]: no cached result for [(@abuse:matrix.org, 10000, 's1175373756_757269739_286284_390812599_237789194_1005260_36557675_19116394_64618', None, False, 'BYJTXGGWIL')], calculating new one
2019-12-02 18:08:17,878 - synapse.handlers.sync - 940 - INFO - GET-1265601 - Calculating sync response for @abuse:matrix.org between StreamToken(room_key='s1175373756', presence_key='757269739', typing_key='286284', receipt_key='390812599', account_data_key='237789194', push_rules_key='1005260', to_device_key='36557675', device_list_key='19116394', groups_key='64618') and StreamToken(room_key='s1175373882', presence_key=757269739, typing_key=286377, receipt_key=390812688, account_data_key=237789265, push_rules_key=1005260, to_device_key=36557682, device_list_key=19116397, groups_key=64618)
2019-12-02 18:08:18,936 - synapse.access.http.8088 - 302 - INFO - GET-1265601 - 94.237.61.73 - 8088 - {@abuse:matrix.org} Processed request: 0.292sec/0.766sec (0.011sec, 0.000sec) (0.043sec/0.052sec/3) 26586B 200 "GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373756_757269739_286284_390812599_237789194_1005260_36557675_19116394_64618 HTTP/1.1" "Python/3.7 aiohttp/3.6.2" [0 dbevts]

2019-12-02 18:08:09,956 - synapse.access.http.8088 - 233 - INFO - GET-1265411 - 94.237.61.73 - 8088 - Received request: GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373727_757269739_286267_390812589_237789186_1005260_36557673_19116394_64618
2019-12-02 18:08:09,957 - synapse.util.caches.response_cache - 148 - INFO - GET-1265411 - [sync]: no cached result for [(@abuse:matrix.org, 10000, 's1175373727_757269739_286267_390812589_237789186_1005260_36557673_19116394_64618', None, False, 'BYJTXGGWIL')], calculating new one
2019-12-02 18:08:15,128 - synapse.handlers.sync - 940 - INFO - GET-1265411 - Calculating sync response for @abuse:matrix.org between StreamToken(room_key='s1175373727', presence_key='757269739', typing_key='286267', receipt_key='390812589', account_data_key='237789186', push_rules_key='1005260', to_device_key='36557673', device_list_key='19116394', groups_key='64618') and StreamToken(room_key='s1175373756', presence_key=757269739, typing_key=286284, receipt_key=390812599, account_data_key=237789194, push_rules_key=1005260, to_device_key=36557675, device_list_key=19116394, groups_key=64618)
2019-12-02 18:08:19,157 - synapse.access.http.8088 - 302 - INFO - GET-1265411 - 94.237.61.73 - 8088 - {@abuse:matrix.org} Processed request: 6.603sec/2.596sec (0.008sec, 0.000sec) (0.017sec/0.047sec/2) 896B 200 "GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373727_757269739_286267_390812589_237789186_1005260_36557673_19116394_64618 HTTP/1.1" "Python/3.7 aiohttp/3.6.2" [0 dbevts]

2019-12-02 18:08:18,938 - synapse.access.http.8088 - 233 - INFO - GET-1265796 - 94.237.61.73 - 8088 - Received request: GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373882_757269739_286377_390812688_237789265_1005260_36557682_19116397_64618
2019-12-02 18:08:18,938 - synapse.util.caches.response_cache - 148 - INFO - GET-1265796 - [sync]: no cached result for [(@abuse:matrix.org, 10000, 's1175373882_757269739_286377_390812688_237789265_1005260_36557682_19116397_64618', None, False, 'BYJTXGGWIL')], calculating new one
2019-12-02 18:08:18,939 - synapse.handlers.sync - 940 - INFO - GET-1265796 - Calculating sync response for @abuse:matrix.org between StreamToken(room_key='s1175373882', presence_key='757269739', typing_key='286377', receipt_key='390812688', account_data_key='237789265', push_rules_key='1005260', to_device_key='36557682', device_list_key='19116397', groups_key='64618') and StreamToken(room_key='s1175373905', presence_key=757269739, typing_key=286388, receipt_key=390812699, account_data_key=237789276, push_rules_key=1005260, to_device_key=36557683, device_list_key=19116398, groups_key=64618)
2019-12-02 18:08:19,577 - synapse.access.http.8088 - 302 - INFO - GET-1265796 - 94.237.61.73 - 8088 - {@abuse:matrix.org} Processed request: 0.171sec/0.466sec (0.011sec, 0.001sec) (0.062sec/0.042sec/5) 1490B 200 "GET /_matrix/client/r0/sync?full_state=false&timeout=10000&since=s1175373882_757269739_286377_390812688_237789265_1005260_36557682_19116397_64618 HTTP/1.1" "Python/3.7 aiohttp/3.6.2" [0 dbevts]

The contract for to_device messages is that once the next /sync in a given stream has been received, the server considers the to_device messages delivered and can delete them serverside (and otherwise not deliver them again). So if your client has two parallel sync streams, if the to_device comes down on the wrong stream and gets discarded, it's game over. I assume this is what's happening here.

Alternatively, we have enough logging here to figure out where in the to_device stream the syncs kept asking from, and where the synchotrons calculated up to. Specifically:

Received Calc Processed GET Timeout From to_device_key To to_device_key
18:08:09,357 18:08:09,361 18:08:09,396 1265375 30 36557672 36557673
18:08:09,370 18:08:09,371 18:08:09,396 1265378 10 36557672 36557673
18:08:09,469 18:08:09,475 18:08:09,563 1265390 30 36557673 36557673
18:08:09,593 18:08:09,594 18:08:09,619 1265401 10 36557673 36557673
18:08:09,685 18:08:15,129 18:08:18,370 1265406 30 36557673 36557675
18:08:09,956 18:08:15,128 18:08:19,157 1265411 10 36557673 36557675
18:08:17,877 18:08:17,878 18:08:18,936 1265601 10 36557675 36557682
18:08:18,938 18:08:18,939 18:08:19,577 1265796 10 36557682 36557683

(ordered by the receive timestamp, but note that the processed timestamps have quite a different ordering).

Correlating this with the actual missing to_device message from the devices_inbox table on matrix.org, we can see a plausible to_device message with stream_id 36557679 to @abuse on device RWAQHVCHCJ from sender @matthew:matrix.org (pX4igM5BYBKpPnKF9GfU7putQpeLXsDWvCuBBXQiN34). However, the one we're looking for (device BYJTXGGWIL) is nowhere to be seen - it's either been deleted having been misdelivered to the wrong /sync stream, or never got there in the first place. The remaining items in the devices_inbox table are presumably for devices who haven't been seen for a while and so haven't synced. For instance, RWAQHVCHCJ hasn't been seen since Nov 14th.

TL;DR: i think the to_device msg went down the wrong sync stream and got discarded by pan.

Paging @poljar...

@neilisfragile neilisfragile added z-bug (Deprecated Label) p1 labels Dec 3, 2019
@ara4n
Copy link
Member Author

ara4n commented Dec 3, 2019

So this turns out to be a bug in pantalaimon - it's reusing the same access_token for the /sync streams for both its client & pan itself, which means that sometimes to_device messages get erroneously routed to the client which ignores them rather than pan. The fix is for pan to process to_device messages on both streams. @poljar is on the case.

@ara4n
Copy link
Member Author

ara4n commented Dec 3, 2019

moved over to matrix-org/pantalaimon#30

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
z-bug (Deprecated Label)
Projects
None yet
Development

No branches or pull requests

2 participants