Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uploading signed devices gets propagated over federation is flaky #1167

Closed
anoadragon453 opened this issue Nov 15, 2021 · 7 comments · Fixed by #1169
Closed

uploading signed devices gets propagated over federation is flaky #1167

anoadragon453 opened this issue Nov 15, 2021 · 7 comments · Fixed by #1169
Assignees
Labels
Z-Flaky Tests which seem to fail at random

Comments

@anoadragon453
Copy link
Member

https://github.com/matrix-org/synapse/runs/4213207252?check_suite_focus=true

@anoadragon453 anoadragon453 added the Z-Flaky Tests which seem to fail at random label Nov 15, 2021
@DMRobertson
Copy link
Contributor

@DMRobertson
Copy link
Contributor

Related to #1166?

@clokep
Copy link
Member

clokep commented Nov 16, 2021

@DMRobertson
Copy link
Contributor

Reproducable with

docker run --rm -v /home/dmr/workspace/synapse:/src:ro -v /home/dmr/workspace/sytest/:/sytest -v /tmp/logs:/logs -e POSTGRES=1 -e WORKERS=1 -e MULTI_POSTGRES=1 matrixdotorg/sytest-synapse:testing --torture-replication 50 -s tests/41end-to-end-keys/08-cross-signing.pl
2021-11-16T19:45:20.1749961Z 2021-11-16 19:45:20.174 UTC [1262] ERROR:  could not serialize access due to concurrent update
2021-11-16T19:45:20.1753150Z 2021-11-16 19:45:20.174 UTC [1262] STATEMENT:  INSERT INTO monthly_active_users (user_id, timestamp) VALUES ('@anon-20211116_193439-673:localhost:8881', 1637091920172) ON CONFLICT (user_id) DO UPDATE SET timestamp=EXCLUDED.timestamp
2021-11-16T19:45:21.8620459Z     Test 545 uploading signed devices gets propagated over federation... FAIL

@DMRobertson DMRobertson self-assigned this Nov 17, 2021
@DMRobertson
Copy link
Contributor

I can see

  • user2 uploading signatures.
  • server1 tries to notify server0 and logs
    * 2021-11-17 14:32:59,893 - synapse.federation.sender - 671 - WARNING - process-replication-data-251 - DMR: send_device_message please
    * 2021-11-17 14:32:59,676 - synapse.federation.sender.transaction_manager - 119 - INFO - federation_transaction_transmission_loop-0 - TX [localhost:8840] {1637159575489} Sending transaction [1637159575489], (PDUs: 0, EDUs: 1)
      2021-11-17 14:32:59,705 - synapse.http.matrixfederationclient - 291 - INFO - federation_transaction_transmission_loop-0 - {PUT-O-1} [localhost:8840] Completed request: 200 OK in 0.03 secs, got 11 bytes - PUT matrix://localhost:8840/_matrix/federation/v1/send/1637159575489
    * 2021-11-17 14:32:59,705 - synapse.federation.sender.transaction_manager - 158 - INFO - federation_transaction_transmission_loop-0 - TX [localhost:8840] {1637159575489} got 200 response
    * 2021-11-17 14:32:59,893 - synapse.federation.sender - 671 - WARNING - process-replication-data-251 - DMR: send_device_message please
    * 2021-11-17 14:32:59,920 - synapse.federation.sender.transaction_manager - 119 - INFO - federation_transaction_transmission_loop-1 - TX [localhost:8840] {1637159575490} Sending transaction [1637159575490], (PDUs: 0, EDUs: 1)
      2021-11-17 14:32:59,957 - twisted - 275 - INFO - sentinel - Received SIGINT, shutting down.
    
    
  • Looks like the second transaction doesn't complete. Does that explain things?

@DMRobertson
Copy link
Contributor

The second transaction is received by server-0. It contains the signatures. I can see the federation_reader posting the data to replication, at which point the process is killed.

So I think the CS query to server 0 is racing against the signature upload + federation + replication. I'm not sure the latter has any indirect side effects. So maybe just a "retry until true" will do.

  2021-11-17 15:37:49,640 - synapse.federation.transport.server._base - 106 - DEBUG - PUT-2 - Request from localhost:8881
  2021-11-17 15:37:49,640 - synapse.federation.transport.server.federation - 88 - DEBUG - PUT-2 - Decoded 1637163467237: {'edus': [{'content': {'device_id': 'XETGMHMAJD', 'keys': {'algorithms': ['m.olm.curve25519-aes-sha256', 'm.megolm.v1.aes-sha'], 'device_id': 'XETGMHMAJD', 'keys':   {'curve25519:XETGMHMAJD': 'IQ/Hu4GGOaxIpMavovFYGouVJeIP2miSfysv+Db3NXg', 'ed25519:XETGMHMAJD': 'MKkClRdltZlOHyCzxiDrm7MsDAsohXmAyeu2cYO6how'}, 'signatures': {'@anon-20211117_153737-2:localhost:8881': {'ed25519:EmkqvokUn8p+vQAGZitOk4PWjp7Ukp3txV2TbMPEiBQ': '62NDAYAi/fB1PQblpLU0zOec  KOsWCMD2RPquR8AHRLQYOW6xqNcpvRXAYqWPFByfh+WK8zKb0Gr3KmMHaDx8BQ'}}, 'user_id': '@anon-20211117_153737-2:localhost:8881'}, 'org.matrix.opentracing_context': '{}', 'prev_id': [], 'stream_id': 7, 'user_id': '@anon-20211117_153737-2:localhost:8881'}, 'edu_type': 'm.device_list_update'}]  , 'origin': 'localhost:8881', 'origin_server_ts': 1637163469628, 'pdus': []}
* 2021-11-17 15:37:49,640 - synapse.federation.transport.server.federation - 90 - INFO - PUT-2 - Received txn 1637163467237 from localhost:8881. (PDUs: 0, EDUs: 1)
  2021-11-17 15:37:49,640 - synapse.federation.federation_server - 237 - DEBUG - PUT-2 - [1637163467237] Got transaction
  2021-11-17 15:37:49,640 - synapse.util.caches.response_cache - 194 - DEBUG - PUT-2 - [fed_txn_handler]: no cached result for [('localhost:8881', '1637163467237')], calculating new one
  2021-11-17 15:37:49,641 - synapse.federation.persistence - 37 - DEBUG - PUT-2 - Invoked 'have_responded' with args: self=<synapse.federation.persistence.TransactionActions..., origin=localhost:8881, transaction=Transaction(transaction_id='1637163467237', origin...
  2021-11-17 15:37:49,642 - synapse.storage.txn - 548 - DEBUG - PUT-2 - [TXN START] {get_received_txn_response-16}
  2021-11-17 15:37:49,642 - synapse.storage.SQL - 320 - DEBUG - PUT-2 - [SQL] {get_received_txn_response-16} SELECT transaction_id, origin, ts, response_code, response_json, has_been_referenced FROM received_transactions WHERE transaction_id = ? AND origin = ?
  2021-11-17 15:37:49,642 - synapse.storage.SQL - 325 - DEBUG - PUT-2 - [SQL values] {get_received_txn_response-16} ['1637163467237', 'localhost:8881']
  2021-11-17 15:37:49,643 - synapse.storage.SQL - 346 - DEBUG - PUT-2 - [SQL time] {get_received_txn_response-16} 0.000691 sec
  2021-11-17 15:37:49,643 - synapse.storage.txn - 651 - DEBUG - PUT-2 - [TXN END] {get_received_txn_response-16} 0.001409 sec
  2021-11-17 15:37:49,644 - synapse.federation.federation_server - 314 - DEBUG - PUT-2 - [1637163467237] Transaction is new
  2021-11-17 15:37:49,644 - synapse.replication.http._base - 187 - WARNING - PUT-2 - DMR: send_request
  2021-11-17 15:37:49,644 - synapse.http.client - 538 - DEBUG - PUT-2 - HTTP POST b'{"content":{"device_id":"XETGMHMAJD","keys":{"algorithms":["m.olm.curve25519-aes-sha256","m.megolm.v1.aes-sha"],"device_id":"XETGMHMAJD","keys":{"curve25519:XETGMHMAJD":"IQ/Hu4GGOaxIpMavovFYGouVJeIP2m  iSfysv+Db3NXg","ed25519:XETGMHMAJD":"MKkClRdltZlOHyCzxiDrm7MsDAsohXmAyeu2cYO6how"},"signatures":{"@anon-20211117_153737-2:localhost:8881":{"ed25519:EmkqvokUn8p+vQAGZitOk4PWjp7Ukp3txV2TbMPEiBQ":"62NDAYAi/fB1PQblpLU0zOecKOsWCMD2RPquR8AHRLQYOW6xqNcpvRXAYqWPFByfh+WK8zKb0Gr3KmMHaDx8BQ"}  },"user_id":"@anon-20211117_153737-2:localhost:8881"},"org.matrix.opentracing_context":"{}","prev_id":[],"stream_id":7,"user_id":"@anon-20211117_153737-2:localhost:8881"},"origin":"localhost:8881"}' -> http://localhost:8839/_synapse/replication/fed_send_edu/m.device_list_update/JEG  RhGqhhC
  2021-11-17 15:37:49,644 - synapse.http.client - 398 - DEBUG - PUT-2 - Sending request POST http://localhost:8839/_synapse/replication/fed_send_edu/m.device_list_update/JEGRhGqhhC
  2021-11-17 15:37:49,645 - synapse.http.proxyagent - 223 - DEBUG - PUT-2 - Requesting b'http://localhost:8839/_synapse/replication/fed_send_edu/m.device_list_update/JEGRhGqhhC' via <HostnameEndpoint localhost:8839>
  2021-11-17 15:37:49,673 - twisted - 275 - INFO - sentinel - Received SIGINT, shutting down.

@clokep
Copy link
Member

clokep commented Nov 23, 2021

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Z-Flaky Tests which seem to fail at random
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants