Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Can't log in anymore (Initial Sync) #9216

Closed
yz55 opened this issue Jan 25, 2021 · 8 comments
Closed

Can't log in anymore (Initial Sync) #9216

yz55 opened this issue Jan 25, 2021 · 8 comments
Labels
T-Other Questions, user support, anything else.

Comments

@yz55
Copy link

yz55 commented Jan 25, 2021

Description

Server was running fine for about 6 months, today one user could not start the application anymore. We reinstalled the application and tried to login again, but still no luck. Tried with element-web, but it stucks on "Syncing.." or "Unable to connect to Homeserver. Retrying.."

Found a similar issue that is "solved", but the solution was not working on my side. #7349

Steps to reproduce

  • write messages (encrypted), delete them
  • reload session on device?

Logs

2021-01-24 22:54:11,415 - synapse.http.server - 85 - ERROR - GET-868 - Failed handle request via 'SyncRestServlet': <XForwardedForRequest at 0x7f038d5e5f28 method='GET' uri='/_matrix/client/r0/sync?filter=2&timeout=0&_cacheBuster=1611525253133' clientproto='HTTP/1.0' site=8008>
Traceback (most recent call last):
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/handlers/sync.py", line 307, in _wait_for_sync_for_user
    sync_config, since_token, full_state=full_state
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/handlers/sync.py", line 338, in current_sync_for_user
    return await self.generate_sync_result(sync_config, since_token, full_state)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/handlers/sync.py", line 989, in generate_sync_result
    sync_result_builder, account_data_by_room
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/handlers/sync.py", line 1420, in _generate_sync_entry_for_rooms
    await concurrently_execute(handle_room_entries, room_entries, 10)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/util/async_helpers.py", line 172, in _concurrently_execute_inner
    await maybe_awaitable(func(next(it)))
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/handlers/sync.py", line 1415, in handle_room_entries
    always_include=sync_result_builder.full_state,
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/handlers/sync.py", line 1812, in _generate_room_entry
    newly_joined_room=newly_joined,
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/handlers/sync.py", line 492, in _load_filtered_recents
    events
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/api/filtering.py", line 223, in filter_room_timeline
    return self._room_timeline_filter.filter(self._room_filter.filter(events))
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/api/filtering.py", line 368, in filter
    return list(filter(self.check, events))
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/api/filtering.py", line 311, in check
    contains_url = isinstance(content.get("url"), str)
AttributeError: 'str' object has no attribute 'get'

Because I could not find any event or room id, I changed the log level to "DEBUG". This happens everytime before the error above:

2021-01-25 00:39:50,703 - synapse.state - 406 - DEBUG - GET-596 - resolve_state_groups event_ids ['$nQnDpvWiQNr209I1UVIT4sD2hqGl47-Cr1fDeEhVTy8']
2021-01-25 00:39:50,710 - synapse.storage.txn - 585 - DEBUG - GET-596 - [TXN END] {get_tags_for_room-e9d} 0.008573 sec
2021-01-25 00:39:50,714 - synapse.state - 210 - DEBUG - GET-596 - calling resolve_state_groups from get_current_state_ids
2021-01-25 00:39:50,714 - synapse.state - 406 - DEBUG - GET-596 - resolve_state_groups event_ids ['$K6AR7a1YU1LWrY5PXwRNz5dwl5CDJlQdBOn-hp5f3K8']
2021-01-25 00:39:50,716 - synapse.storage.txn - 498 - DEBUG - GET-596 - [TXN START] {get_tags_for_room-e9e}
2021-01-25 00:39:50,723 - synapse.storage.txn - 498 - DEBUG - GET-596 - [TXN START] {get_tags_for_room-e9f}
2021-01-25 00:39:50,723 - synapse.storage.txn - 585 - DEBUG - GET-596 - [TXN END] {get_tags_for_room-e9e} 0.007724 sec
2021-01-25 00:39:50,724 - synapse.metrics - 574 - DEBUG - None - Collecting gc 0
2021-01-25 00:39:50,725 - synapse.storage.txn - 585 - DEBUG - GET-596 - [TXN END] {get_tags_for_room-e9f} 0.001848 sec
2021-01-25 00:39:50,739 - synapse.storage.txn - 498 - DEBUG - GET-596 - [TXN START] {get_tags_for_room-ea0}
2021-01-25 00:39:50,743 - synapse.http.server - 83 - ERROR - GET-596 - Failed handle request via 'SyncRestServlet': <XForwardedForRequest at 0x7f2e9c7472e8 method='GET' uri='/_matrix/client/r0/sync?filter=0&timeout=0' clientproto='HTTP/1.0' site='8008'>

Version information

My Homeserver was running on 1.21.1, I tried to upgrade to 1.25, but still the same problem.

  • Install method:
    Debian package manager

  • Platform:

  • Ubuntu 18.04.5 LTS (Linux 19997 4.15.0-132-generic Config restructuring. #136-Ubuntu SMP Tue Jan 12 14:58:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux)

  • Python 3.6.9

  • psql (PostgreSQL) 10.15

@anoadragon453
Copy link
Member

anoadragon453 commented Jan 25, 2021

@yz55 Thanks for your report. Looks like generating sync responses is choking on an event which has a string value for its content, instead of a dictionary. This broken event should only affect users who are in the room that the event is in.

The following SQL query should reveal any events that content a str-based content object (though unfortunately will take some time to run):

select * from event_json where json_typeof(json::json->'content') = 'string';

I would be surprised if this event were created by redacting an event. In any case, we should find which room it is in in order to:

  • Figure out how this event was created.
  • Redact it/have the user leave the room in order to allow them to complete /sync's again.

And then work out a mitigation to prevent this in the future.

@anoadragon453 anoadragon453 added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter labels Jan 25, 2021
@yz55
Copy link
Author

yz55 commented Jan 25, 2021

@anoadragon453 Thank you. I got 10596 results with the SQL query, that's a bit too much.

Now that you mention the event content, I remember that I executed this script #4206 for redacting events. A closer look in the script shows that the content is indeed cleared with
UPDATE events SET content = \'{}\' WHERE event_id=\'%s\'

One of the 10596 results:
$q5FvEFpd0b8yqyy1MLo5msRIvNLvJVyr8tO75hYThCc | !GCZtDipYHXuFmSdprz:xxx.com | {"token_id":17,"txn_id":"!GCZtDipYHXuFmSdprz:xxx.com-1602811824548","stream_ordering":1503} | {"auth_events": ["$PMAKRWV8F-S349nIGysyzc96ZL2dTKpLMHPF_irJ8nk", "$Tc4aIrRqS8T6x-wBAA2gFv6Ri_e5tb67FeKcTdR0wcs", "$YnxGrSYzMN1_bWmwHWhh-l7QMcA1V9FNsn-A8cFo_oM"], "prev_events": ["$j0_ShSvZ5zLz8KqUaUGdGwyO5VW0Ub3gwWTQD3ButIg"], "type": "m.room.encrypted", "room_id": "!GCZtDipYHXuFmSdprz:xxx.com", "sender": "@user1:xxx.com", "content": "", "depth": 693, "prev_state": [], "origin": "xxx.com", "origin_server_ts": 1602811870402, "hashes": {"sha256": "Lam/6oW78RS1hYe+q/fllnJMXE6YD3Trn84yilcaFYs"}, "signatures": {"xxx.com": {"ed25519:a_SmcS": "HoBFGsEl9ng78tlBs42SAAS4w72QEVwjla6Ci/cCC/KDTM0bgFVM9NgxHHqZzv7tF6YBCI88S2HH1xDoVSi+AQ"}}, "unsigned": {"age_ts": 1602811870402}}

I don't understand why the syncing problem is not affecting every user from the SQL result.

@yz55
Copy link
Author

yz55 commented Jan 26, 2021

I did some testing with my testing server and uploaded the live database. The user was still not able to login.

To see if it's because of some event, I tried to delete the whole "event_json" table. Log in still crashing.

Log in was only possible after truncating the "events" table brutally (CASCADE).

I think the script broke some event. The problem is now, how can I find the broken event and how could I fix this?

@anoadragon453
Copy link
Member

I don't understand why the syncing problem is not affecting every user from the SQL result.

Even when initial syncing, only the most recent events for each room are returned. If the affected events are old enough (aka if you only purged old events), then they won't be recalled. Users that are simply syncing (instead of initial syncing) will also only get new events.

Now that you mention the event content, I remember that I executed this script #4206 for redacting events. A closer look in the script shows that the content is indeed cleared with
UPDATE events SET content = \'{}\' WHERE event_id=\'%s\'

'{}' is correct. The problem is that content has been set to an empty string in some events.

Looking at the script, which I've copied to a gist for linkability, it seems as though the troubling lines are here:

            if "content" in obj and obj["content"] != "":
                obj["content"] = ""  <-- set content to an empty string
                #print obj
                json_str = json.dumps(obj)
                sql = 'UPDATE event_json set json = \'%s\' where event_id=\'%s\'' % (json_str, msg)
                c.execute(sql)

In general I'd recommend not running random scripts from the internet on your synapse database - at least not without taking a backup first.

To see if it's because of some event, I tried to delete the whole "event_json" table. Log in still crashing.

This was done on the testing server, right? :)


To rectify this, you should just be able to replace any empty event content with {}. I highly recommend making a database backup/trying on your testing server first.

First try this query, which sets the top-level content of a json object to {}, on a single, affected event:

update event_json
  set json = jsonb_set(
    json::jsonb,
    '{content}',
    jsonb '{}',
    false
  ) #>> '{}' where json::jsonb->'content' = '' and event_id = '$xxx.yyy';

Check that the json of the event is what you expect (look for "content": {}):

select json from event_json where event_id = '$xxx.yyy';

If it looks good, then we can expand the query to all affected events:

update event_json
  set json = jsonb_set(
    json::jsonb,
    '{content}',
    jsonb '{}',
    false
  ) #>> '{}' where json::jsonb->'content' = '';

@yz55
Copy link
Author

yz55 commented Jan 26, 2021

@anoadragon453 Thank you for investigating, appreciate this!

I tried the first query on the test database:

update event_json set json = jsonb_set(json::jsonb,'{content}',jsonb '{}',false) #>> '{}' where json::jsonb->'content' = '' and event_id = '$gfKAKGM8XEg2CBOFurU5bSPadAKb26tRil7GmvasTqg';

results in:

bb75cdcb0e809e7bc9179aa4623d7faa

@anoadragon453
Copy link
Member

Whoops, of course empty space is not valid JSON. Please replace:

where json::jsonb->'content' = ''

with

where json::jsonb->'content' = '""'

@anoadragon453 anoadragon453 added T-Other Questions, user support, anything else. and removed S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter labels Jan 26, 2021
@yz55
Copy link
Author

yz55 commented Jan 27, 2021

That fixed it, awesome. Thank you!

Never run scripts without testing them on a testserver - lesson learned.

@anoadragon453
Copy link
Member

Glad to hear it worked 🙂

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Other Questions, user support, anything else.
Projects
None yet
Development

No branches or pull requests

2 participants