Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Cross-signing signatures not being always federated correctly #7418

Closed
babolivier opened this issue May 5, 2020 · 14 comments · Fixed by #7453
Closed

Cross-signing signatures not being always federated correctly #7418

babolivier opened this issue May 5, 2020 · 14 comments · Fixed by #7453
Assignees
Labels
z-bug (Deprecated Label) z-cross-signing (Deprecated Label) z-p2 (Deprecated Label)

Comments

@babolivier
Copy link
Contributor

babolivier commented May 5, 2020

Over the past couple of days I've seen a few occurrences of people on other servers verifying a new device, but my server not receiving the resulting signature, thus leaving them with a red shield from my pov. I've also seen people saying they were seeing others with a red shield whereas I would see them with a green one.

@bwindels had a look at my Riot logs upon failing to see one of @anoadragon453's devices as verified, and could see that it was indeed missing a signature, with the signatures on Andrew's device being:

"signatures": {
  "@andrewm:amorgan.xyz": {
    "ed25519:ZYMXYYQZTP": "aBAnNnO3Qzz8xMLXtSjnwDummc4F4vewygUBvdRKJWd2/326Se4knYkJkp/TKa0sXjeGKIUko8JXNorN2m9TDw",
    "ed25519:QjSD8srN17RiDzIBgzVbncj+NMdvDHRY4N2b8w+oq9Y": "FV5bCKEdp1q9SFvy7MzUgTj3SzAeHwuK4vvtgnIp/WFxdnYQ4XL2C64dW28oJsK4yJ95NkuFcL9uCTPaDKPoAw"
  }
}

and the ones my Riot would see for that same device being:

"signatures": {
  "@andrewm:amorgan.xyz": {
    "ed25519:ZYMXYYQZTP": "aBAnNnO3Qzz8xMLXtSjnwDummc4F4vewygUBvdRKJWd2/326Se4knYkJkp/TKa0sXjeGKIUko8JXNorN2m9TDw"
  }
}

Looking at the device_lists_remote_cache on my homeserver's database, I can see that it's indeed missing the signature from Andrew's self-signing key (ed25519:QjSD8srN17RiDzIBgzVbncj+NMdvDHRY4N2b8w+oq9Y).

https://github.com/matrix-org/riot-web-rageshakes/issues/2740#issuecomment-623992322 provides more info about that specific occurrence.

@babolivier babolivier added z-bug (Deprecated Label) z-p2 (Deprecated Label) z-cross-signing (Deprecated Label) labels May 5, 2020
@babolivier
Copy link
Contributor Author

Possibly related: #7350

@jryans jryans added the story:1 label May 6, 2020
@babolivier babolivier self-assigned this May 6, 2020
@babolivier
Copy link
Contributor Author

Since this issue was opened, Andrew connected a new device to his account and verified it from Riot Web. I can now see ZYMXYYQZTP as verified, but not the new device.

afau two things could be happening here:

  • either we're using the wrong stream ID to send device updates and we're one ID short of the entry containing the new signature
  • or we don't correctly update the devices cache upon receiving an update

@babolivier
Copy link
Contributor Author

Right, I think I've figured out what's happening here.

I've noticed that Andrew's device_lists_outbound_last_success table had a higher stream ID for my homeserver than the one mine had in its device_lists_remote_extremeties table. This likely means that the latest update was sent to my homeserver, that something happened while my homeserver was processing the update, causing it to abort it, but it still returned 200 to the Andrew's HS.

So I think the explanation is the following: when server A sends out a device list update to server B, server B will look at the updates' stream ids to see if it has missed any, if it did it will try to get the missing keys via /_matrix/federation/v1/user/devices/....

If that request fails, server B will:

  1. Not retry it (as pointed out by this comment)
  2. Abort the update (because otherwise Synapse forgets that it's out of sync and will never care about updates it might have missed)
  3. Return a 200 to server A (because it's returning here rather than raising an exception), causing server A to record that update as successful and not resend it ever

In that specific issue it looks like abolivier.bzh's /devices/... requests to amorgan.xyz timed out, which produced this failure mode:

2020-05-04 18:15:50,575 - synapse.federation.transport.server - 406 - INFO - PUT-1487720- Received txn 1588613504503 from amorgan.xyz. (PDUs: 0, EDUs: 1)
2020-05-04 18:15:50,581 - synapse.handlers.device - 625 - INFO - PUT-1487720- Received device list update for @andrewm:amorgan.xyz, requiring resync: True. Devices: ZYMXYYQZTP
2020-05-04 18:15:50,582 - synapse.http.matrixfederationclient - 408 - INFO - PUT-1487720- {GET-O-118918} [amorgan.xyz] Sending request: GET matrix://amorgan.xyz/_matrix/federation/v1/user/devices/%40andrewm%3Aamorgan.xyz; timeout 30.000000s
2020-05-04 18:15:50,584 - synapse.http.federation.matrix_federation_agent - 242 - INFO - PUT-1487720- Connecting to matrix.amorgan.xyz:8448
2020-05-04 18:16:20,588 - synapse.http.matrixfederationclient - 434 - INFO - PUT-1487720- Failed to send request: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
2020-05-04 18:16:20,589 - synapse.http.matrixfederationclient - 491 - WARNING - PUT-1487720- {GET-O-118918} [amorgan.xyz] Request failed: GET matrix://amorgan.xyz/_matrix/federation/v1/user/devices/%40andrewm%3Aamorgan.xyz: ResponseNeverReceived:[CancelledError()]
2020-05-04 18:16:20,589 - synapse.util.retryutils - 218 - INFO - PUT-1487720- Connection to amorgan.xyz was unsuccessful (<class 'synapse.api.errors.RequestSendFailed'>(Failed to send request: ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >])); backoff now 600000
2020-05-04 18:16:20,590 - synapse.handlers.device - 696 - WARNING - PUT-1487720- Failed to handle device list update for @andrewm:amorgan.xyz
2020-05-04 18:16:20,601 - synapse.access.http.8008 - 302 - INFO - PUT-1487720- 127.0.0.1 - 8008 - {amorgan.xyz} Processed request: 30.028sec/0.001sec (0.011sec, 0.001sec) (0.004sec/0.008sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1588613504503 HTTP/1.1" "Synapse/1.12.4" [0 dbevts]

Thus server B is stuck with an outdated devices list that it'll never try to refresh, except if server A sends another update and isn't under too much load that it wouldn't time out on a /devices/... request at that specific time.

The correct fix here would either be to fail the whole transaction (which would also fail the PDUs so might not be the correct one) or have server B still send a 200 back and retry /devices/... periodically (which would give server A a wrong idea of the updates server B has successfully processed).

I believe the latter solution might be the most preferable one. I'm unfamiliar with the existing backoff/retry mechanism, so I might be forcing through an open door here, but we should absolutely persist that retry schedule in the database otherwise just restarting Synapse could lead to device lists staying out of sync.

@babolivier
Copy link
Contributor Author

This issue would be fixed by #7453

@ara4n
Copy link
Member

ara4n commented May 12, 2020

woo, thanks for hunting this down. device list sync is the most wobbly bit of synapse, imo :(

@babolivier
Copy link
Contributor Author

device list sync is the most wobbly bit of synapse, imo :(

Agreed. Hopefully #7453 will make it way less wobbly.

@flackr
Copy link

flackr commented May 13, 2020

Not sure if this is related, but I recently set up a new homeserver B and am missing everyone's master_keys and self_signing_keys except for people who have since set up their E2E cross signing keys.

When I compared a sample user in device_lists_remote_extremeties I found that their stream_id was much larger (59005256) than on my long-running homeserver A (26308567). Unfortunately I can't check the other end (as it's matrix.org).

I tested checking out #7453 and inserted their user_id into device_lists_remote_resync but after it was removed from the table I still didn't have master_keys or self_signing_keys for the user however I did notice a timeout entry for the user_id in the log:

2020-05-12 22:04:53,641 - synapse.http.matrixfederationclient - 408 - INFO -  - {GET-O-3986} [matrix.org] Sending request: GET matrix://matrix.org/_matrix/federation/v1/user/devices/<user_id>; timeout 30.000000s
2020-05-12 22:04:53,651 - synapse.handlers.presence - 343 - INFO - persist_presence_changes-1 - Persisting 3 unpersisted presence updates
2020-05-12 22:04:53,809 - synapse.http.matrixfederationclient - 164 - INFO -  - {GET-O-3986} [matrix.org] Completed: 200 OK
2020-05-12 22:04:53,810 - synapse.storage.database - 527 - WARNING -  - Starting db txn 'update_remote_device_list_cache' from sentinel context
2020-05-12 22:04:53,810 - synapse.storage.database - 566 - WARNING -  - Starting db connection from sentinel context: metrics will be lost

@babolivier
Copy link
Contributor Author

babolivier commented May 13, 2020

Not sure if your problem is the same issue - however thanks to your logs it's obvious that I didn't do the right thing wrt logging contexts in #7453, I'll try to fix that soon.

@flackr
Copy link

flackr commented May 13, 2020

Glad I could help. I am currently stuck in this situation of missing many people's signing keys, let me know if there's other things I can check.

If you determine it's not the same we should reopen #7350.

@flackr
Copy link

flackr commented May 14, 2020

I checked with one of my friend's running a homeserver. My server does not exist in their device_lists_outbound_last_success table, but their server's users do exist in my device_lists_remote_extremeties, though their keys aren't on my homeserver.

Is there any way to safely insert the keys into the database or force a refresh?

@babolivier
Copy link
Contributor Author

Hey, so after thinking about it, I don't think your issue share a common cause with this one (nor with #7350). I've opened #7504 to track its progress.

Is there any way to safely insert the keys into the database or force a refresh?

The only way I see of forcing a refresh would be to run #7453 and do the insertion mentioned at #7453 (comment), but if that doesn't solve it I can't think of anything else without more investigation.

@flackr
Copy link

flackr commented May 14, 2020

Is there any way to safely insert the keys into the database or force a refresh?

The only way I see of forcing a refresh would be to run #7453 and do the insertion mentioned at #7453 (comment), but if that doesn't solve it I can't think of anything else without more investigation.

Thanks, sadly it didn't seem to solve the issue though I'll try again when this lands. I'm a little hesitant to run development versions of synapse against my regular HS since it seems to perform a non-reversible upgrade to the database (which I'm guessing could also get federation out of sync when I revert the db) and I'd like to be able to run the stable version ongoing, however if there are other things you'd like me to try I could set up a development HS. Feel free to reach out to me, @flackr:serializer.ca for higher bandwidth discussion.

@babolivier
Copy link
Contributor Author

Will do when I'll get to investigating it 👍

babolivier added a commit that referenced this issue May 21, 2020
When a call to `user_device_resync` fails, we don't currently mark the remote user's device list as out of sync, nor do we retry to sync it.

#6776 introduced some code infrastructure to mark device lists as stale/out of sync.

This commit uses that code infrastructure to mark device lists as out of sync if processing an incoming device list update makes the device handler realise that the device list is out of sync, but we can't resync right now.

It also adds a looping call to retry all failed resync every 30s. This shouldn't cause too much spam in the logs as this commit also removes the "Failed to handle device list update for..." warning logs when catching `NotRetryingDestination`.

Fixes #7418
@babolivier
Copy link
Contributor Author

babolivier commented May 21, 2020

For people getting bitten by this issue after 1.14.0 is out, or people seeing users with stale device lists from before 1.14.0, just run the following SQL in Synapse's database:

INSERT INTO device_lists_remote_resync
VALUES ('USER_ID', (EXTRACT(epoch FROM NOW()) * 1000)::BIGINT);

where USER_ID is the Matrix ID of the user whose device list has gone stale. I've tried this successfully on PostgreSQL, not yet on SQLite.

phil-flex pushed a commit to phil-flex/synapse that referenced this issue Jun 16, 2020
When a call to `user_device_resync` fails, we don't currently mark the remote user's device list as out of sync, nor do we retry to sync it.

matrix-org#6776 introduced some code infrastructure to mark device lists as stale/out of sync.

This commit uses that code infrastructure to mark device lists as out of sync if processing an incoming device list update makes the device handler realise that the device list is out of sync, but we can't resync right now.

It also adds a looping call to retry all failed resync every 30s. This shouldn't cause too much spam in the logs as this commit also removes the "Failed to handle device list update for..." warning logs when catching `NotRetryingDestination`.

Fixes matrix-org#7418
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
z-bug (Deprecated Label) z-cross-signing (Deprecated Label) z-p2 (Deprecated Label)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants