Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rust: MemoryStore wedging? #77

Closed
kegsay opened this issue Jun 4, 2024 · 10 comments
Closed

rust: MemoryStore wedging? #77

kegsay opened this issue Jun 4, 2024 · 10 comments

Comments

@kegsay
Copy link
Member

kegsay commented Jun 4, 2024

See https://github.com/matrix-org/matrix-rust-sdk/actions/runs/9351153389/job/25736195990?pr=3496

Times out on pushNotifEventID = bob.SendMessage(t, roomID, text).

Upon investigation, the logs show that the rust SDK did not send the event. It seems to get stuck after GET /state/m.room.encryption/ as there is no log line for "Sending encrypted event because the room is encrypted" but there is a 200 OK response to that GET. There is 1 lock that is taken between the two bits of code:

https://github.com/matrix-org/matrix-rust-sdk/blob/2cf36c7a5ed60c75824bea80422a8dbf829b1bed/crates/matrix-sdk/src/room/mod.rs#L486

I guess something is stopping the send_raw span from taking that lock, probably the next_sync_with_lock > sync_once{pos="1"} span which seems to be doing stuff at the same time.

@kegsay
Copy link
Member Author

kegsay commented Jun 6, 2024

TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1} has the same failure mode: key_backup_test.go:102: SendMessage(rust) @user-24-alice:hs1: timed out after 11s - it fails on the sanity check that you can send messages into an encrypted room... fails in exactly the same way:

2024-06-04T12:01:27.286925Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1](rust) SendMessage !GtBcsajlajtqiFKSSY:hs1 => An encrypted message | rust.go:0
2024-06-04T12:01:27.287076Z DEBUG log: timeline    
2024-06-04T12:01:27.287153Z DEBUG log: message_event_content_from_html    
2024-06-04T12:01:27.287180Z DEBUG log: send    
2024-06-04T12:01:27.287594Z DEBUG matrix_sdk_ui::timeline::event_handler: Handling local event | crates/matrix-sdk-ui/src/timeline/event_handler.rs:296 | spans: send{room_id="!GtBcsajlajtqiFKSSY:hs1"} > handle_local_event > handle_local_event > handle_event{txn_id="8232fdcf818d424da8ce3ef774741505"}
2024-06-04T12:01:27.287700Z DEBUG log: processing CharacterTokens(NotSplit, Tendril<UTF8>(shared: \"An encrypted message\")) in insertion mode InBody     | spans: send{room_id="!GtBcsajlajtqiFKSSY:hs1"} > handle_local_event > handle_local_event > handle_event{txn_id="8232fdcf818d424da8ce3ef774741505"}
2024-06-04T12:01:27.287748Z DEBUG log: processing EOF in state Data     | spans: send{room_id="!GtBcsajlajtqiFKSSY:hs1"} > handle_local_event > handle_local_event > handle_event{txn_id="8232fdcf818d424da8ce3ef774741505"}
2024-06-04T12:01:27.287784Z DEBUG log: processing EOFToken in insertion mode InBody     | spans: send{room_id="!GtBcsajlajtqiFKSSY:hs1"} > handle_local_event > handle_local_event > handle_event{txn_id="8232fdcf818d424da8ce3ef774741505"}
2024-06-04T12:01:27.287995Z DEBUG matrix_sdk_ui::timeline::queue: Spawning message-sending task | crates/matrix-sdk-ui/src/timeline/queue.rs:233 | spans: send_queued_messages{room_id="!GtBcsajlajtqiFKSSY:hs1"}
2024-06-04T12:01:27.288072Z DEBUG log: log_event    

local echo:
2024-06-04T12:01:27.288103Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1]AddTimelineListener[!GtBcsajlajtqiFKSSY:hs1] TimelineDiff len=1 | rust.go:0
2024-06-04T12:01:27.288344Z DEBUG log: change    

check if room is encrypted:
2024-06-04T12:01:27.288338Z DEBUG matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:186 | spans: send_raw{room_id="!GtBcsajlajtqiFKSSY:hs1" transaction_id="8232fdcf818d424da8ce3ef774741505"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 30s } request_id="REQ-12" method=GET uri="http://127.0.0.1:32794/_matrix/client/v3/rooms/!GtBcsajlajtqiFKSSY:hs1/state/m.room.encryption/"}
2024-06-04T12:01:27.288381Z DEBUG log: push_back    
2024-06-04T12:01:27.288438Z DEBUG log: as_event    
2024-06-04T12:01:27.288478Z DEBUG log: event_id    
2024-06-04T12:01:27.288543Z DEBUG log: sender    
2024-06-04T12:01:27.288680Z DEBUG log: content    
2024-06-04T12:01:27.288717Z DEBUG log: kind    
2024-06-04T12:01:27.288746Z DEBUG log: content    
2024-06-04T12:01:27.288766Z DEBUG log: as_message    
2024-06-04T12:01:27.288795Z DEBUG log: body    
2024-06-04T12:01:27.288844Z DEBUG log: log_event    
2024-06-04T12:01:27.288870Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1]_______ PUSH BACK &{ID: Text:An encrypted message Sender:@user-24-alice:hs1 Target: Membership: FailedToDecrypt:false}
 | rust.go:0
2024-06-04T12:01:27.288910Z DEBUG log: log_event    
2024-06-04T12:01:27.288930Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1]TimelineDiff change: &{ID: Text:An encrypted message Sender:@user-24-alice:hs1 Target: Membership: FailedToDecrypt:false} | rust.go:0
2024-06-04T12:01:27.292769Z DEBUG matrix_sdk::sliding_sync: Received response | crates/matrix-sdk/src/sliding_sync/mod.rs:662 | spans: next_sync_with_lock > sync_once
2024-06-04T12:01:27.292871Z DEBUG matrix_sdk::sliding_sync: Start handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:678 | spans: next_sync_with_lock > sync_once
2024-06-04T12:01:27.292937Z DEBUG matrix_sdk::sliding_sync: Update position markers delta_token=None | crates/matrix-sdk/src/sliding_sync/mod.rs:294 | spans: next_sync_with_lock > sync_once > handle_response
2024-06-04T12:01:27.293129Z DEBUG matrix_sdk_crypto::olm::account: The fallback key either expired or we didn't have one: generated a new fallback key. removed_fallback_key=None | crates/matrix-sdk-crypto/src/olm/account.rs:596 | spans: next_sync_with_lock > sync_once > handle_response > preprocess_to_device_events > receive_sync_changes
2024-06-04T12:01:27.294744Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:192 | spans: sync_once{pos="1"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s } request_id="REQ-11" method=POST uri="http://127.0.0.1:32792/_matrix/client/unstable/org.matrix.msc3575/sync" request_size="420 B" request_body=b"{\"conn_id\":\"room-list\",\"txn_id\":\"ec7022eeceea4f709c53dd22075ed086\",\"lists\":{\"all_rooms\":{\"ranges\":[[0,0]]},\"visible_rooms\":{\"ranges\":[[0,19]],\"sort\":[\"by_recency\",\"by_name\"],\"required_state\":[[\"m.room.encryption\",\"\"],[\"m.room.member\",\"$LAZY\"]],\"timeline_limit\":20,\"include_heroes\":true,\"filters\":{\"is_tombstoned\":false,\"not_room_types\":[\"m.space\"]},\"bump_event_types\":[\"m.room.message\",\"m.room.encrypted\",\"m.sticker\"]}}}" status=200 response_size="1.2 kiB"}
2024-06-04T12:01:27.294852Z DEBUG matrix_sdk::sliding_sync: Received response | crates/matrix-sdk/src/sliding_sync/mod.rs:662 | spans: sync_once{pos="1"}
2024-06-04T12:01:27.294945Z DEBUG matrix_sdk::sliding_sync: Start handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:678 | spans: sync_once{pos="1"}
2024-06-04T12:01:27.294995Z DEBUG matrix_sdk::sliding_sync: Update position markers delta_token=None | crates/matrix-sdk/src/sliding_sync/mod.rs:294 | spans: sync_once{pos="1"} > handle_response

got response to encrypted room check, this is the last line which has send_raw in the span.
NB: we do not see "Sending encrypted event because the room is encrypted" anywhere here:
2024-06-04T12:01:27.296228Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:192 | spans: send_raw{room_id="!GtBcsajlajtqiFKSSY:hs1" transaction_id="8232fdcf818d424da8ce3ef774741505"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 30s } request_id="REQ-12" method=GET uri="http://127.0.0.1:32794/_matrix/client/v3/rooms/!GtBcsajlajtqiFKSSY:hs1/state/m.room.encryption/" status=200 response_size="36 B"}
2024-06-04T12:01:27.300082Z DEBUG matrix_sdk_base::store::memory_store: Saved changes in 12.593µs | crates/matrix-sdk-base/src/store/memory_store.rs:477 | spans: next_sync_with_lock > sync_once > handle_response
2024-06-04T12:01:27.300196Z DEBUG matrix_sdk::sync: Ran event handlers in 17.663µs | crates/matrix-sdk/src/sync.rs:228 | spans: next_sync_with_lock > sync_once > handle_response > call_sync_response_handlers
2024-06-04T12:01:27.300209Z  WARN matrix_sdk_crypto::backups: Trying to backup room keys but no backup key was found | crates/matrix-sdk-crypto/src/backups/mod.rs:520
2024-06-04T12:01:27.300238Z DEBUG matrix_sdk::sync: Ran notification handlers in 7.995µs | crates/matrix-sdk/src/sync.rs:253 | spans: next_sync_with_lock > sync_once > handle_response > call_sync_response_handlers
2024-06-04T12:01:27.300298Z DEBUG matrix_sdk::sliding_sync: Sliding Sync response has been handled by the client sync_response=SyncResponse { rooms: Rooms { leave: {}, join: {}, invite: {} }, account_data: [], to_device: [], notifications: {}, .. } | crates/matrix-sdk/src/sliding_sync/mod.rs:355 | spans: next_sync_with_lock > sync_once > handle_response
2024-06-04T12:01:27.300354Z DEBUG matrix_sdk::sliding_sync: Update lists lists={} | crates/matrix-sdk/src/sliding_sync/mod.rs:420 | spans: next_sync_with_lock > sync_once > handle_response
2024-06-04T12:01:27.300448Z DEBUG matrix_sdk::sliding_sync: Done handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:704 | spans: next_sync_with_lock > sync_once
2024-06-04T12:01:27.300575Z DEBUG matrix_sdk::sliding_sync: Sync stream is running | crates/matrix-sdk/src/sliding_sync/mod.rs:729 | spans: next_sync_with_lock
2024-06-04T12:01:27.300665Z DEBUG matrix_sdk::sliding_sync::cache: loading sliding sync sliding_sync_store::encryption::@user-24-alice:hs1 state from DB finished in 0ms | crates/matrix-sdk/src/sliding_sync/cache.rs:200 | spans: next_sync_with_lock > sync_once
2024-06-04T12:01:27.300709Z DEBUG matrix_sdk::sliding_sync: Sending request | crates/matrix-sdk/src/sliding_sync/mod.rs:597 | spans: next_sync_with_lock > sync_once{pos="1"}
2024-06-04T12:01:27.300890Z DEBUG matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:186 | spans: next_sync_with_lock > sync_once{pos="1"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s } request_id="REQ-13" method=POST uri="http://127.0.0.1:32792/_matrix/client/unstable/org.matrix.msc3575/sync" request_size="65 B" request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"0\"}}}"}
2024-06-04T12:01:27.301180Z DEBUG matrix_sdk_crypto::session_manager::sessions: Collected user/device pairs that are missing an Olm session missing_session_devices_by_user={} timed_out_devices_by_user={} | crates/matrix-sdk-crypto/src/session_manager/sessions.rs:301 | spans: get_missing_sessions
2024-06-04T12:01:27.301863Z DEBUG matrix_sdk::encryption: Uploading public encryption keys device_keys=false one_time_key_count=0 | crates/matrix-sdk/src/encryption/mod.rs:446 | spans: keys_upload{request_id="2980871ad43647e88d5e7e900aff89cc"}
2024-06-04T12:01:27.302058Z DEBUG matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:186 | spans: keys_upload{request_id="2980871ad43647e88d5e7e900aff89cc"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 30s } request_id="REQ-14" method=POST uri="http://127.0.0.1:32794/_matrix/client/v3/keys/upload" request_size="269 B"}
2024-06-04T12:01:27.303197Z DEBUG matrix_sdk_base::read_receipts: Starting. read_receipts=RoomReadReceipts { num_unread: 0, num_notifications: 0, num_mentions: 0, latest_active: Some(LatestReadReceipt { event_id: "$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M" }), pending: RingBuffer { inner: [] } } | crates/matrix-sdk-base/src/read_receipts.rs:463 | spans: sync_once{pos="1"} > handle_response > compute_unread_counts{room_id=!GtBcsajlajtqiFKSSY:hs1}
2024-06-04T12:01:27.303391Z DEBUG matrix_sdk_base::read_receipts: saving better | crates/matrix-sdk-base/src/read_receipts.rs:335 | spans: sync_once{pos="1"} > handle_response > compute_unread_counts{room_id=!GtBcsajlajtqiFKSSY:hs1} > try_match_implicit > try_select_later{event_id="$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M" event_pos=0 prev_pos=Some(0) prev_receipt=None}
2024-06-04T12:01:27.303502Z DEBUG matrix_sdk_base::read_receipts: after finding a better receipt read_receipts=RoomReadReceipts { num_unread: 0, num_notifications: 0, num_mentions: 0, latest_active: Some(LatestReadReceipt { event_id: "$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M" }), pending: RingBuffer { inner: [] } } | crates/matrix-sdk-base/src/read_receipts.rs:510 | spans: sync_once{pos="1"} > handle_response > compute_unread_counts{room_id=!GtBcsajlajtqiFKSSY:hs1}
2024-06-04T12:01:27.303736Z DEBUG matrix_sdk_base::store::memory_store: Saved changes in 143.069µs | crates/matrix-sdk-base/src/store/memory_store.rs:477 | spans: sync_once{pos="1"} > handle_response
2024-06-04T12:01:27.304635Z DEBUG matrix_sdk::sync: Ran event handlers in 765.124µs | crates/matrix-sdk/src/sync.rs:228 | spans: sync_once{pos="1"} > handle_response > call_sync_response_handlers
2024-06-04T12:01:27.304674Z DEBUG matrix_sdk::sync: Ran notification handlers in 8.706µs | crates/matrix-sdk/src/sync.rs:253 | spans: sync_once{pos="1"} > handle_response > call_sync_response_handlers
2024-06-04T12:01:27.304718Z DEBUG matrix_sdk::sliding_sync: Sliding Sync response has been handled by the client sync_response=SyncResponse { rooms: Rooms { leave: {}, join: {"!GtBcsajlajtqiFKSSY:hs1": JoinedRoom { unread_notifications: UnreadNotificationsCount { highlight_count: 0, notification_count: 0 }, timeline: Timeline { limited: false, prev_batch: Some("t7-165_32_0_1_34_1_17_111_0_1"), events: [SyncTimelineEvent { event: RawEvent { event_id: "$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M", event_type: "m.room.name", .. } }] }, state: [RawEvent { event_id: "$ridkUXUuqyRKn3I0lf8H1eXxzZxi9X2uv2wbuMZnCYM", event_type: "m.room.member", .. }, RawEvent { event_id: "$BokDim-BS45fEQFB-M3T8EBBcj67bhi9E-1fL3SGexQ", event_type: "m.room.encryption", .. }, RawEvent { event_id: "$QUbAnbhviGjhGbtISSqQdxY0XYDZ3RH5sGHlJUnVP1M", event_type: "m.room.name", .. }], account_data: [], ephemeral: [], ambiguity_changes: {} }}, invite: {} }, account_data: [], to_device: [], notifications: {}, .. } | crates/matrix-sdk/src/sliding_sync/mod.rs:355 | spans: sync_once{pos="1"} > handle_response
2024-06-04T12:01:27.304900Z DEBUG matrix_sdk::sliding_sync: Update lists lists={"all_rooms": SyncList { ops: [], count: 1 }, "visible_rooms": SyncList { ops: [SyncOp { op: Sync, range: Some((0, 0)), index: None, room_ids: ["!GtBcsajlajtqiFKSSY:hs1"], room_id: None }], count: 1 }} | crates/matrix-sdk/src/sliding_sync/mod.rs:420 | spans: sync_once{pos="1"} > handle_response
2024-06-04T12:01:27.305343Z DEBUG matrix_sdk::sliding_sync: Done handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:704 | spans: sync_once{pos="1"}
2024-06-04T12:01:27.305484Z DEBUG matrix_sdk::sliding_sync: Sync stream is running | crates/matrix-sdk/src/sliding_sync/mod.rs:729
2024-06-04T12:01:27.305733Z DEBUG matrix_sdk::sliding_sync: Sending request | crates/matrix-sdk/src/sliding_sync/mod.rs:597 | spans: sync_once{pos="2"}
2024-06-04T12:01:27.306048Z DEBUG matrix_sdk::http_client: Sending request | crates/matrix-sdk/src/http_client/mod.rs:186 | spans: sync_once{pos="2"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s } request_id="REQ-15" method=POST uri="http://127.0.0.1:32792/_matrix/client/unstable/org.matrix.msc3575/sync" request_size="100 B" request_body=b"{\"conn_id\":\"room-list\",\"lists\":{\"all_rooms\":{\"ranges\":[[0,0]]},\"visible_rooms\":{\"ranges\":[[0,19]]}}}"}
2024-06-04T12:01:27.309572Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:192 | spans: next_sync_with_lock > sync_once{pos="1"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s } request_id="REQ-13" method=POST uri="http://127.0.0.1:32792/_matrix/client/unstable/org.matrix.msc3575/sync" request_size="65 B" request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"0\"}}}" status=200 response_size="181 B"}
2024-06-04T12:01:27.310776Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:192 | spans: keys_upload{request_id="2980871ad43647e88d5e7e900aff89cc"} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 30s } request_id="REQ-14" method=POST uri="http://127.0.0.1:32794/_matrix/client/v3/keys/upload" request_size="269 B" status=200 response_size="48 B"}
2024-06-04T12:01:27.310856Z DEBUG matrix_sdk_crypto::olm::account: Marking one-time keys as published | crates/matrix-sdk-crypto/src/olm/account.rs:1203 | spans: keys_upload{request_id="2980871ad43647e88d5e7e900aff89cc"}
2024-06-04T12:01:27.316778Z DEBUG matrix_sdk::sliding_sync: Received response | crates/matrix-sdk/src/sliding_sync/mod.rs:662 | spans: next_sync_with_lock > sync_once{pos="1"}
2024-06-04T12:01:27.316838Z DEBUG matrix_sdk::sliding_sync: Start handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:678 | spans: next_sync_with_lock > sync_once{pos="1"}
2024-06-04T12:01:27.316871Z DEBUG matrix_sdk::sliding_sync: Update position markers delta_token=None | crates/matrix-sdk/src/sliding_sync/mod.rs:294 | spans: next_sync_with_lock > sync_once{pos="1"} > handle_response

no logs here for 11s as things time out

2024-06-04T12:01:38.287481Z DEBUG log: stop    
2024-06-04T12:01:38.288200Z DEBUG matrix_sdk::sliding_sync: Sync stream has received an internal message internal_message=Ok(SyncLoopStop) | crates/matrix-sdk/src/sliding_sync/mod.rs:737
2024-06-04T12:01:38.288237Z DEBUG matrix_sdk::sliding_sync: Sync stream has exited. | crates/matrix-sdk/src/sliding_sync/mod.rs:778
2024-06-04T12:01:38.288406Z DEBUG matrix_sdk::sliding_sync: Sync stream has received an internal message internal_message=Ok(SyncLoopStop) | crates/matrix-sdk/src/sliding_sync/mod.rs:737 | spans: next_sync_with_lock
2024-06-04T12:01:38.288450Z DEBUG matrix_sdk::sliding_sync: Sync stream has exited. | crates/matrix-sdk/src/sliding_sync/mod.rs:778 | spans: next_sync_with_lock
2024-06-04T12:01:38.288704Z DEBUG log: log_event    
2024-06-04T12:01:38.288749Z  INFO TestBackupWrongRecoveryKeyFails/{rust_hs1}|{rust_hs1}: [@user-24-alice:hs1](rust) Close | rust.go:0

Just before the encryption event response is returned, https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk/src/sliding_sync/mod.rs#L294 is logged, which later on acquires the offending lock https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk/src/sliding_sync/mod.rs#L331 so I'm fairly sure this is still blocked and is blocking send_raw from making forward progress somehow. It does seem to release the lock given we see https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk/src/sliding_sync/mod.rs#L355 later in the logs though :S

@kegsay kegsay changed the title TestNSEReceive flake rust: MemoryStore wedging? Jun 14, 2024
@kegsay
Copy link
Member Author

kegsay commented Jun 14, 2024

There was another flake in https://github.com/matrix-org/matrix-rust-sdk/actions/runs/9503969571/job/26195597830?pr=3539 - this time we got past the encryption state and see "Sending encrypted event because the room is encrypted" for Charlie's message send, but then wedge after getting the /members response. The code after getting the response calls receive_all_members which does lots of things and ends with:

        let _sync_lock = self.sync_lock().lock().await;
        let mut room_info = room.clone_info();
        room_info.mark_members_synced();
        changes.add_room(room_info);

        self.store.save_changes(&changes).await?;
        self.apply_changes(&changes, false);

..once again implicating either the sync lock or the data store (via save_changes). Further investigation on previous flakey tests show log lines in the sync code appearing which only appear after releasing the sync lock which should have allowed forward progress on the send event task, which makes myself (and @poljar ) suspect that the sync lock is blameless in this.

@kegsay
Copy link
Member Author

kegsay commented Jun 14, 2024

The sync lock seems to be consistently held throughout calls to save_changes afaict. This means it isn't possible to have 2 concurrent tasks inside save_changes at the same time, meaning there must be some other cause here.

@kegsay
Copy link
Member Author

kegsay commented Jun 27, 2024

2024-06-26T17:50:15.679813Z DEBUG matrix_sdk::sliding_sync: Start handling response | crates/matrix-sdk/src/sliding_sync/mod.rs:678 | spans: sync_once{pos="1"}
2024-06-26T17:50:15.679858Z DEBUG matrix_sdk::sliding_sync: Update position markers delta_token=None | crates/matrix-sdk/src/sliding_sync/mod.rs:294 | spans: sync_once{pos="1"} > handle_response
2024-06-26T17:50:15.681704Z DEBUG matrix_sdk::http_client: Got response | crates/matrix-sdk/src/http_client/mod.rs:193 | spans: sending_task{room_id=!QHAxGOXvXFVwmJzIiE:hs1} > send_raw{room_id="!QHAxGOXvXFVwmJzIiE:hs1" transaction_id="50da7ffd3d3042f29397de91d214848f" is_room_encrypted=true} > send{server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6] config=RequestConfig { timeout: 60s, retry_limit: 3 } request_id="REQ-12" method=GET uri="http://127.0.0.1:32781/_matrix/client/v3/rooms/!QHAxGOXvXFVwmJzIiE:hs1/members" status=200 response_size="982 B"}
2024-06-26T17:50:15.682925Z DEBUG matrix_sdk_base::store::memory_store: Saved changes in 13.846µs | crates/matrix-sdk-base/src/store/memory_store.rs:519 | spans: next_sync_with_lock > sync_once > handle_response

For https://github.com/matrix-org/matrix-rust-sdk/actions/runs/9684285011/job/26721675685 - same thing, wedging after /members response.

@kegsay
Copy link
Member Author

kegsay commented Jun 27, 2024

kegsay added a commit to matrix-org/matrix-rust-sdk that referenced this issue Jul 2, 2024
bnjbvr pushed a commit to matrix-org/matrix-rust-sdk that referenced this issue Jul 2, 2024
* Add trace logging to memory store

To help debug matrix-org/complement-crypto#77

* Missing import
* Review comments
@kegsay
Copy link
Member Author

kegsay commented Jul 5, 2024

It reproduced with trace logging. It deadlocks somewhere in this code block

@kegsay
Copy link
Member Author

kegsay commented Jul 9, 2024

2024-07-09T08:00:29.544568Z TRACE matrix_sdk_base::store::memory_store: starting | crates/matrix-sdk-base/src/store/memory_store.rs:289 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544609Z TRACE matrix_sdk_base::store::memory_store: profiles | crates/matrix-sdk-base/src/store/memory_store.rs:296 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544659Z TRACE matrix_sdk_base::store::memory_store: ambiguity maps | crates/matrix-sdk-base/src/store/memory_store.rs:319 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544685Z TRACE matrix_sdk_ui::timeline::builder: Waiting for an event. | crates/matrix-sdk-ui/src/timeline/builder.rs:188 | spans: room_update_handler{room_id="!ZdajZoYbBHcccwoqPi:hs1"}
2024-07-09T08:00:29.544709Z TRACE matrix_sdk_base::store::memory_store: account data | crates/matrix-sdk-base/src/store/memory_store.rs:331 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544739Z TRACE matrix_sdk_base::store::memory_store: room account data | crates/matrix-sdk-base/src/store/memory_store.rs:339 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544743Z TRACE matrix_sdk_ui::timeline::inner: Forcing update of sender profiles: {"@user-38-bob:hs1"} | crates/matrix-sdk-ui/src/timeline/inner/mod.rs:1062 | spans: room_update_handler{room_id="!ZdajZoYbBHcccwoqPi:hs1"}
2024-07-09T08:00:29.544763Z TRACE matrix_sdk_base::store::memory_store: room state | crates/matrix-sdk-base/src/store/memory_store.rs:352 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes
2024-07-09T08:00:29.544784Z TRACE matrix_sdk_base::store::memory_store: room state: got room_state lock | crates/matrix-sdk-base/src/store/memory_store.rs:355 | spans: sending_task{room_id=!ZdajZoYbBHcccwoqPi:hs1} > send_raw{room_id="!ZdajZoYbBHcccwoqPi:hs1" transaction_id="9af92432710442b985c20292c0c4fe4a" is_room_encrypted=true} > receive_all_members{room_id="!ZdajZoYbBHcccwoqPi:hs1"} > save_changes

..and then it wedges. So it blocks on let mut stripped_room_state = self.stripped_room_state.write().unwrap();

@kegsay
Copy link
Member Author

kegsay commented Jul 9, 2024

This implicates this block of code:

    async fn get_state_events(
        &self,
        room_id: &RoomId,
        event_type: StateEventType,
    ) -> Result<Vec<RawAnySyncOrStrippedState>> {
        fn get_events<T>(
            state_map: &HashMap<OwnedRoomId, HashMap<StateEventType, HashMap<String, Raw<T>>>>,
            room_id: &RoomId,
            event_type: &StateEventType,
            to_enum: fn(Raw<T>) -> RawAnySyncOrStrippedState,
        ) -> Option<Vec<RawAnySyncOrStrippedState>> {
            let state_events = state_map.get(room_id)?.get(event_type)?;
            Some(state_events.values().cloned().map(to_enum).collect())
        }

        Ok(get_events(
            &self.stripped_room_state.read().unwrap(),
            room_id,
            &event_type,
            RawAnySyncOrStrippedState::Stripped,
        )
        .or_else(|| {
            get_events(
                &self.room_state.read().unwrap(),
                room_id,
                &event_type,
                RawAnySyncOrStrippedState::Sync,
            )
        })
        .unwrap_or_default())
    }

because this block acquires a read lock on stripped_room_state then potentially tries to acquire a read lock on room_state which it can't do - it's an ABBA deadlock.

I need confirmation from rust folks if this is how the locking works.

@kegsay
Copy link
Member Author

kegsay commented Jul 9, 2024

@poljar confirms that the stripped_room_state lock is still held when the room_state lock is acquired. As such, this would cause a deadlock.

@kegsay
Copy link
Member Author

kegsay commented Jul 9, 2024

#115 moves us away from using the memory store
matrix-org/matrix-rust-sdk#3668 fixes the memory store.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant