Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Background migration to schema version 11 failing and getting stuck permanently #6984

Open
2 tasks done
Prime-Rose opened this issue Aug 17, 2024 · 0 comments
Open
2 tasks done
Labels

Comments

@Prime-Rose
Copy link

Using a supported version?

  • I have searched searched open and closed issues for duplicates.
  • I am using Signal-Desktop as provided by the Signal team, not a 3rd-party package.

Overall summary

I would like to bring the following issue to your attention, which I have been following for months now and is affecting various Signal Desktop installations of friends and family.
This report got a bit longer, but this analysis will (hopefully) provide the information needed to pinpoint and fix this bug, saving you some cumbersome work ;-)

It seems there is a "Background Migration" task running whenever the application is idle that should convert the message entries to schema version 11.

This task, however, can get stuck when it encounters an error during the migration of a single message (in a batch of 25 messages), as it will subsequently retry migrating this batch ad infinitum.
As a consequence the migration to schema version 11 will never complete (not even for the other unproblematic messages) and that may cause unpredictable errors in future releases.

I have seen this problem on various versions of Signal Desktop all running Ubuntu-based Linux - for the first time February 2024 on version 6.47.1, over to version 7.15 this summer, to the most recent version now.
It is reasonable to assume that the same problem exists in Signal Desktop on MacOS and Windows.
Currently the only symptom is a slightly strange activity of Signal Desktop's processes when it should be idle. This made me check the logs and I found the error entries there.
Usually, the migration proceeds successfully until it hits a message with a "bad" json field (more details on "bad" below) and then gets stuck there iterating every 10 seconds on this message instead of continuing/fixing/dropping it.

  • Successful log entries (in app.log) look like:
{"level":30,"time":"REDACTED","msg":"Start idle detector"}
{"level":40,"time":"REDACTED","msg":"idleDetector/idle: fetching at most 25 for migration"}
{"level":30,"time":"REDACTED","msg":"idleDetector/idle: Upgraded messages: {\"done\":false,\"numProcessed\":25,\"fetchDuration\":10,\"upgradeDuration\":37,\"saveDuration\":22,\"totalDuration\":69}"}
{"level":30,"time":"REDACTED","msg":"Stop idle detector"}
{"level":30,"time":"REDACTED","msg":"idleDetector/idle: Background migration not complete. Pausing for 10000ms."}
  • But once the error has occurred we get:
{"level":30,"time":"REDACTED","msg":"Start idle detector"}
{"level":40,"time":"REDACTED","msg":"idleDetector/idle: fetching at most 25 for migration"}
{"level":30,"time":"REDACTED","msg":"Stop idle detector"}
{"level":30,"time":"REDACTED","msg":"idleDetector/idle: Background migration not complete. Pausing for 10000ms."}
{"level":50,"time":"REDACTED","msg":"Top-level unhandled promise rejection: Error: Error invoking remote method 'sql-channel': Error: TypeError: SQLite3 can only bind numbers, strings, bigints, buffers, and null\n    at saveMessageSync ([REDACTED]/resources/app.asar/ts/sql/Server.js:1881:7)\n    at [REDACTED]/resources/app.asar/ts/sql/Server.js:1970:9\n    at sqliteTransaction ([REDACTED]/resources/app.asar/node_modules/@signalapp/better-sqlite3/lib/methods/transaction.js:65:24)\n    at Object.saveMessages ([REDACTED]/resources/app.asar/ts/sql/Server.js:1974:5)\n    at async MessagePort.<anonymous> ([REDACTED]/resources/app.asar/ts/sql/mainWorker.js:108:22)\n    at IpcRenderer.invoke (node:electron/js2c/renderer_init:2:6995)\n    at async [REDACTED]/resources/app.asar/preload.bundle.js:88:33411\n    at async run ([REDACTED]/resources/app.asar/preload.bundle.js:76:37812)\n    at async [REDACTED]/resources/app.asar/preload.bundle.js:76:37845\n    at async saveMessages ([REDACTED]/resources/app.asar/preload.bundle.js:88:39658)\n    at async migrateMessageData ([REDACTED]/resources/app.asar/preload.bundle.js:126:247617)\n    at async Bee.<anonymous> ([REDACTED]/resources/app.asar/preload.bundle.js:164:15945)"}

In the main.log for each failed attempt there is a line:
{"level":50,"time":"REDACTED","msg":"Error occurred in handler for 'sql-channel': {}"}

Now regarding the "bad" messages: The messages seem to have been created by older versions of Signal Desktop (in my samples between January and August 2018) and the problematic messages are always not "user-content" messages, but relate to keyexchange or alike.
Several ones are of type "incoming", but have "Missing Signed PreKey for PreKeyWhisperMessage" in its json 'error' entry (the field 'body' of the tuple is NULL, though). A much smaller fraction is of type "keychange", where the 'source' field is NULL, too.

Here are two (redacted) samples:

On the checked systems there are only a couple of hundred of such messages present - certainly less than 1% of all messages.

The problematic tuples in table 'messages' contain in the json data of field 'json' a unusual sent-at timestamp, which consists of three entries ("high", "low", and "unsigned") instead of an INTEGER number.
So, it looks like "sent_at": { "high": DDDDD, "low": -DDDDD, "unsigned": true }
The entry in the 'sent-at' field in the tuple, however, is NULL. Sounds a bit like it got accidentally converted to 64-bit integer?!
It makes sense that SQLite does not tolerate having this string garbage written into the sent-at field of the tuple, whereas in the json data this was ok, of course.

I guess these "bad" messages have been introduced by a bug in an earlier version or during a faulty migration.

Deleting these "bad" messages allows the migration to proceed and complete eventually - I tested this on the recent versions of Signal Desktop.

So this problem could be adressed by:

  1. Making the migration code more robust to simply continue with the remaining messages. This would, however, leave non-migrated messages with schema version 10 in the table.

  2. Fixing these "bad" entries by replacing the sent-at timestamp in the json data field 'json' of the respective tuple by somehow reasonable data. Maybe its 'received-at' field as 'sent-at' is unfortunately NULL? (This is a bit "Frankenstein-style", but could work.) If it is indeed a 64-bit integer, one could also try to convert it back and recover the original value of the entry. (Sorry, but I did not look deeper into this.)

  3. Deleting messages with such abnormal sent-at entries (or moving them to a lost+found table) and logging this of course. Personally, I do not like deleting things behind a user's back, but this is still an option.

Hope this analysis saves you some time and helps to fix this annoying bug - your work is very much appreciated.

Finally, please note that I did not check the iOS/Android versions of Signal although it is likely that the same bug is present there.

Steps to reproduce

Start up Signal Desktop and wait for a few minutes until it becomes idle. If "bad" messages (as detailed above) are present in the database, the background migration will eventually get stuck and produce an error every 10 seconds.

Expected result

The background migration should successfully migrate all messages to schema version 11 and complete without errors.

Actual result

After a number of successful migrations, the migration fails with errors of the form:

{"level":30,"time":"REDACTED","msg":"Start idle detector"}
{"level":40,"time":"REDACTED","msg":"idleDetector/idle: fetching at most 25 for migration"}
{"level":30,"time":"REDACTED","msg":"Stop idle detector"}
{"level":30,"time":"REDACTED","msg":"idleDetector/idle: Background migration not complete. Pausing for 10000ms."}
{"level":50,"time":"REDACTED","msg":"Top-level unhandled promise rejection: Error: Error invoking remote method 'sql-channel': Error: TypeError: SQLite3 can only bind numbers, strings, bigints, buffers, and null\n    at saveMessageSync ([REDACTED]/resources/app.asar/ts/sql/Server.js:1881:7)\n    at [REDACTED]/resources/app.asar/ts/sql/Server.js:1970:9\n    at sqliteTransaction ([REDACTED]/resources/app.asar/node_modules/@signalapp/better-sqlite3/lib/methods/transaction.js:65:24)\n    at Object.saveMessages ([REDACTED]/resources/app.asar/ts/sql/Server.js:1974:5)\n    at async MessagePort.<anonymous> ([REDACTED]/resources/app.asar/ts/sql/mainWorker.js:108:22)\n    at IpcRenderer.invoke (node:electron/js2c/renderer_init:2:6995)\n    at async [REDACTED]/resources/app.asar/preload.bundle.js:88:33411\n    at async run ([REDACTED]/resources/app.asar/preload.bundle.js:76:37812)\n    at async [REDACTED]/resources/app.asar/preload.bundle.js:76:37845\n    at async saveMessages ([REDACTED]/resources/app.asar/preload.bundle.js:88:39658)\n    at async migrateMessageData ([REDACTED]/resources/app.asar/preload.bundle.js:126:247617)\n    at async Bee.<anonymous> ([REDACTED]/resources/app.asar/preload.bundle.js:164:15945)"}

This repeats every 10 seconds if Signal Desktop is idle.

Screenshots

No response

Signal version

Any version I tried since 6.47.1 up to recent

Operating system

Ubuntu-based Linux (jammy/Virginia)

Version of Signal on your phone

Various, currently 7.20 (iOS)

Link to debug log

No response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

2 participants