On un-sleeping laptop, element desktop sometimes goes entirely white (because EW has OOMed during the night) #680
Comments
I am currently getting this almost every morning.
Yup, I've had this every single time I've unslept my laptop over the last week.
See also #672 and #824 and electron/electron#2357. I am getting this literally every day I open my M1 Mac. @novocaine says here that:
Just for some additional diagnostics, the pstree of what I have running while it's stuck in this state is:
although fwiw Instruments (having sampled via the Activity Monitor instrument) also claims there's MTLCompilerService and VTDecoderXPCService and VTEncoderXPCService with a "responsible process" of 33570 (the main Element Nightly).
Meanwhile, taking a thread dump of the three main processes via the Activity Monitor app (not Instruments) gives: nightly-sample.txt
Ah - I just found the crash log for the renderer process, which apparently exited with SIGTRAP (EXC_BREAKPOINT?!) at 02:43:08, when I was asleep: nightly-helper-renderer-crashdump.txt
For the sake of completeness, here are the last few other crashdumps from the same in the middle of the night: Element Nightly Helper (Renderer)-2022-05-04-043838.ips.txt
Looking at the JSON stacktraces in these thread dumps, it looks like thread ID 1248168 is the one which crashed, due to v8 running out of memory (while trying to GC): {
"triggered": true,
"id": 1248168,
"name": "ThreadPoolForegroundWorker",
"threadState": {
"x": [
{
"value": 4732437384
},
{
"value": 105553245181312
},
{
"value": 1024
},
{
"value": 6272652880
},
{
"value": 6272652032
},
{
"value": 6272651744
},
{
"value": 729647118309407400
},
{
"value": 1280
},
{
"value": 0
},
{
"value": 0
},
{
"value": 0
},
{
"value": 2
},
{
"value": 2
},
{
"value": 0
},
{
"value": 0
},
{
"value": 0
},
{
"value": 7241649484,
"symbolLocation": 0,
"symbol": "pthread_mutex_unlock"
},
{
"value": 6272692224
},
{
"value": 0
},
{
"value": 6272653200
},
{
"value": 6272653216
},
{
"value": 6272653224
},
{
"value": 6272651880
},
{
"value": 6272688863
},
{
"value": 4731625472,
"symbolLocation": 15664,
"symbol": "v8_inspector::protocol::Debugger::API::Paused::ReasonEnum::XHR"
},
{
"value": 56
},
{
"value": 8683873544,
"symbolLocation": 0,
"symbol": "__stderrp"
},
{
"value": 534836150272
},
{
"value": 134227373
}
],
"flavor": "ARM_THREAD_STATE64",
"lr": {
"value": 4651545376
},
"cpsr": {
"value": 1610616832
},
"fp": {
"value": 6272653184
},
"sp": {
"value": 6272651856
},
"esr": {
"value": 4060086272,
"description": "(Breakpoint) brk 0"
},
"pc": {
"value": 4651545904,
"matchesCrashFrame": 1
},
"far": {
"value": 6272653624
}
},
"frames": [
{
"imageOffset": 42005808,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 8767684,
"imageIndex": 3
},
{
"imageOffset": 42005280,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 8767156,
"imageIndex": 3
},
{
"imageOffset": 60872656,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 27634532,
"imageIndex": 3
},
{
"imageOffset": 13763320,
"symbol": "v8::internal::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*)",
"symbolLocation": 724,
"imageIndex": 3
},
{
"imageOffset": 13763112,
"symbol": "v8::internal::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*)",
"symbolLocation": 516,
"imageIndex": 3
},
{
"imageOffset": 15541168,
"symbol": "v8::internal::Heap::StartIncrementalMarking(int, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags)",
"symbolLocation": 1108,
"imageIndex": 3
},
{
"imageOffset": 16367628,
"symbol": "v8::internal::GlobalSafepointScope::~GlobalSafepointScope()",
"symbolLocation": 241132,
"imageIndex": 3
},
{
"imageOffset": 16183292,
"symbol": "v8::internal::GlobalSafepointScope::~GlobalSafepointScope()",
"symbolLocation": 56796,
"imageIndex": 3
},
{
"imageOffset": 16128432,
"symbol": "v8::internal::GlobalSafepointScope::~GlobalSafepointScope()",
"symbolLocation": 1936,
"imageIndex": 3
},
{
"imageOffset": 16127796,
"symbol": "v8::internal::GlobalSafepointScope::~GlobalSafepointScope()",
"symbolLocation": 1300,
"imageIndex": 3
},
{
"imageOffset": 60775440,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 27537316,
"imageIndex": 3
},
{
"imageOffset": 60770400,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 27532276,
"imageIndex": 3
},
{
"imageOffset": 42241136,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 9003012,
"imageIndex": 3
},
{
"imageOffset": 42376056,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 9137932,
"imageIndex": 3
},
{
"imageOffset": 42373568,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 9135444,
"imageIndex": 3
},
{
"imageOffset": 42559652,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 9321528,
"imageIndex": 3
},
{
"imageOffset": 42373028,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 9134904,
"imageIndex": 3
},
{
"imageOffset": 42432288,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 9194164,
"imageIndex": 3
},
{
"imageOffset": 42431332,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 9193208,
"imageIndex": 3
},
{
"imageOffset": 42560896,
"symbol": "node::GetEnvironmentIsolateData(node::Environment*)",
"symbolLocation": 9322772,
"imageIndex": 3
},
{
"imageOffset": 29292,
"symbol": "_pthread_start",
"symbolLocation": 148,
"imageIndex": 6
},
{
"imageOffset": 8332,
"symbol": "thread_start",
"symbolLocation": 8,
"imageIndex": 6
}
]
},
So: what could be causing an explosion of v8 memory utilisation on the renderer process, and how do I stop it?
So I just got this again, except this time the inspector was open. This didn't help much, given the inspector disconnected from the app when the renderer crashed; however, I think I have an accurate memory heap size estimate from the point of the crash (701MB)... ...although this probably doesn't reflect the sudden memory spike which killed it.
The crash happened at 05:27:15 GMT+1 according to the crash log: Element Nightly Helper (Renderer)-2022-05-10-052716.ips.txt
The last thing on the console was:
...so significantly before the crash.
Options for debugging this further:
Also discussed at https://matrix.to/#/!zKcwXhRfcrDOBEYsgu:matrix.org/$fHG1BhGvNW4-Z-LvRjVowzVHrelDewTWzJs7J7_0j6A?via=matrix.org&via=element.io&via=vector.modular.im
@ara4n Please document any debugging steps to get any of those options working. I'm extremely interested in how to get information for scenarios like these as well.
So it crashed again last night (at 03:22:35), but it didn't dump a core (despite following @justkwin's instructions at https://developer.apple.com/forums/thread/694233). However, I did get another datapoint simply from having run Electron from the commandline, which is that its dying words were:
In other words, the v8 heap spiked up to 3.1GB before it died (up from the ~600MB it was when it went to sleep). I guess I'll try to find out why it didn't dump a core, or failing that try to run it under lldb, or failing that do the SIGSTOP hack to try to catch it in the act.
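As an aside (not something tried in this thread): if you just want more v8 heap headroom while chasing the leak, Electron lets the main process pass flags straight to v8. A minimal sketch, where the file it lives in and the 8GB figure are arbitrary assumptions:
```ts
// Hypothetical snippet for an Electron main process: raise v8's old-space
// limit so the renderer has more headroom before FatalProcessOutOfMemory,
// and to help confirm the crashes really are heap exhaustion.
import { app } from "electron";

// Must run before the app 'ready' event fires.
app.commandLine.appendSwitch("js-flags", "--max-old-space-size=8192");
```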
So I can get naive core dumps okay following Apple's instructions:
So it looks like the reason the core dumps failed for the renderer process is that I failed to set the entitlements correctly:
and "code object is not signed at all" is a cryptic error implying that it couldn't re-sign it for some reason. The binary absolutely is signed though;
Perhaps it needs to be re-signed with a real identity rather than an adhoc identity ("-"). Either way, after running
So, either I fight against the signing system or I disable SIP...
Having poked further at
...and signing it with a real identity (just because I have one easily to hand, thanks to Xcode, and because I don't want to get bitten by weird & wonderful adhoc codesigning quirks):
...looks much more plausible. So, setting it running again with
(of course, having done this, it didn't crash overnight; v8 heap is 700MB in the morning)
Good news: it crashed overnight (2022-05-13 02:48:28.7125 +0100). Bad news: it still didn't dump core. Perhaps the rlimits got reset on the child process, or perhaps there's something daft like
Testing with a new crash test jig (infinite loop), signed with the same entitlements as used on the renderer process, it dumps core as expected when killed with signal 5:
So I can only assume Electron is setting its own rlimits to stop itself from dumping core somehow. Next stop: disable SIP; attach lldb.
Right, off we go:
etc. However,
Meanwhile, I just realised that I could have tested and debugged getting a coredump out of the renderer process just by hitting it with
So I tried with llnode (
...which corresponds to: ...in the Chrome debugger. Without symbols, EDIT: they do eventually complete after a few minutes:
...suggests that llnode is just failing outright. On the plus side,
Just in case I can't attach the v8 debugger successfully (because it's busy tightlooping and OOMing to death), I had a quick look around the stacks of the backtrace above, when it's breakpointed normally. I suspect there's a better way to do this with lldb, but for a poor man's frame inspector:
for (0..84) {
say "frame select $_";
for (0..8) {
say 'x "*(uint64_t *)($fp + ' . ($_*8) . ')"';
}
}
...tries to dereference the first few addresses on each frame in the stack and hexdump the first 32 bytes, thus sniffing for strings (C-style or UCS2). You run the generated commands in lldb with
for the initial frame, complete with
...and so you can see that this looks likely to be an attempt to call
(of course, it didn't crash overnight this time)
Other postmortem v8 debuggers are https://github.com/tjfontaine/lldb-v8 (python 2, doesn't work after porting it to python 3, thanks to
Playing around with a corefile (via
So, I caught it in the act this morning, running under lldb (not using the debug electron I built yesterday). Turns out my poor man's trick of "look for string references on the stack" might be good enough for at least pointing at the problematic part of the world. Irritatingly, neither
So, the backtrace of the exploding thread is:
...and so I hit it with my poor-man's stack inspector, to get the first 32 bytes of each of the first 16 addresses on the stack for each stack frame:
for (0..81) {
say "frame select $_";
for (0..16) {
say 'x "*(uint64_t *)($fp + ' . ($_*8) . ')"';
}
}
I'm not going to post the whole dump as it's massive and probably contains quite a lot of personal data and random collateral heap contents, but it's enough to get an idea of what's going on. The bottommost frame with relevant human-readable stringy data is frame 58:
...oh hello, that looks like a sync token...
Ah ha! That's both a sync token and also the "Persisting sync data" log line from indexeddb-local-backend. Moving on up...
...right, that looks like it could be doing an IndexedDB 'put' operation inside a general v8/node event loop...
Oh, that's really interesting - that looks like the parameters to the
Okay, so that looks like it could be
...and then it actually comes out trying to set a property on an object:
...and finally it OOMs while trying to do a minor GC whilst allocating heap in order to set the property:
So, this looks anecdotally an awful lot like it's trying to clone the sync accumulator response while trying to store it in indexeddb. There's nothing very special about any of the room/user/event IDs flying around the stack; it looks to literally have been caught in the act walking over a sync response while iterating over the room read receipts in the ephemeral data for the mautrix whatsapp room, and finally exploded trying to allocate heap for one of the components of the read receipts.
To dig a bit more, I re-ran the stack sniffer but grabbing more data (512 bytes), looking only at offset $fp+72, given empirically that's where most of the strings seem to be (especially for the v8 interpreted frames). I was particularly interested in grabbing the sync token that it was trying to persist in
for (0..81) {
say "frame select $_";
say 'x -c 512 "*(uint64_t *)($fp + 72)"';
}
This produced much juicier stuff like:
...and so we can see that it died while trying to persist sync token:
Now, looking at the server logs for my account:
...are the last incremental syncs before it exploded, and it's somewhat interesting that the final sync request was from this access token. Meanwhile the console logging in the app in this timeframe was (with some irrelevant reqs removed):
...which matches the idea that it exploded while trying to persist the sync data. So, I think next steps are:
The sync accumulator approach is horrid anyway and about to be killed with sliding sync, so this problem might go away - but there is a chance that persistence exploding is just a side-effect of a memory leak somewhere else (given persistsync is always going to spike RAM by a few hundred MB), so it's worth investigating this a little more to get to the bottom of it.
Hmm. indexeddb.ts has:
...which is all very well, but I don't think that it guards against multi-writes - it looks like it just stops writes happening more frequently than every 5 minutes. But if the write from 5 minutes ago is still running...
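A minimal sketch of what an overlap guard could look like, layered on top of the existing time-based throttle. The class and method names here are invented for illustration; only `persistSyncData` echoes the js-sdk method quoted later in this thread, and this is not the actual implementation:
```ts
// Sketch only: serialise persists so a new one never starts while the
// previous one is still in flight, in addition to the 5-minute throttle.
class ThrottledSyncPersister {
    private lastPersistAt = 0;
    private inFlight: Promise<void> | null = null;

    constructor(
        // Injected async function standing in for the real persist call.
        private persistSyncData: () => Promise<void>,
        private minIntervalMs = 5 * 60 * 1000,
    ) {}

    public maybePersist(): Promise<void> {
        if (this.inFlight) {
            // A persist is still running: reuse it rather than stacking
            // another multi-hundred-MB serialisation on top of it.
            return this.inFlight;
        }
        const now = Date.now();
        if (now - this.lastPersistAt < this.minIntervalMs) {
            return Promise.resolve();
        }
        this.lastPersistAt = now;
        this.inFlight = this.persistSyncData().finally(() => {
            this.inFlight = null;
        });
        return this.inFlight;
    }
}
```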
Another datapoint: I set Nightly running again, breakpointed in persistSyncData, and
An argument against overlapping persists is the fact that the memory seems to explode very rapidly (e.g. before the inspector can notice, assuming it's updating every second, as one would expect).
So, my debug electron build is borked somehow - the Renderer immediately crashes with:
So I've set a debug element-web running in normal electron, with the following js-sdk patch to try to track what's happening when persisting sync:
diff --git a/src/store/indexeddb-local-backend.ts b/src/store/indexeddb-local-backend.ts
index a7a7e2574..22f079e91 100644
--- a/src/store/indexeddb-local-backend.ts
+++ b/src/store/indexeddb-local-backend.ts
@@ -418,7 +418,7 @@ export class LocalIndexedDBStoreBackend implements IIndexedDBBackend {
nextBatch: string,
roomsData: ISyncResponse["rooms"],
): Promise<void> {
- logger.log("Persisting sync data up to", nextBatch);
+ logger.log("Persisting sync data (" + JSON.stringify(roomsData).length + " bytes) up to", nextBatch);
return utils.promiseTry<void>(() => {
const txn = this.db.transaction(["sync"], "readwrite");
const store = txn.objectStore("sync");
@@ -427,7 +427,9 @@ export class LocalIndexedDBStoreBackend implements IIndexedDBBackend {
nextBatch,
roomsData,
}); // put == UPSERT
- return txnAsPromise(txn).then();
+ return txnAsPromise(txn).then(()=>{
+ logger.log("Persisted sync data up to", nextBatch);
+ });
});
}
So far it seems to be running nicely, with:
etc. Let's see what it logs when it explodes tonight...
(It didn't crash overnight. Worth noting that the sync accumulator size has shrunk to 118103281 from 143988497 thanks to it being a fresh login - which might be enough of a difference to avoid problems. Also, the overnight syncs are all very boring:
No sign of any so far taking more than 4 secs.
Are you sure this is the crashing thread? This thread is waiting on a pthread lock, judging by
I'd expect a crashed thread to be inside some function, not waiting for a lock. In your other crash dumps, the crashing threads are inside
These JSON-format dumps are really annoying. It contains
also
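For what it's worth, the .ips reports are just JSON (apparently a one-line summary followed by the report body), so a small Node script can pull out the triggered thread instead of eyeballing the raw dump. A sketch, assuming that layout and the field names visible in the dump quoted above (`threads` and `usedImages` are assumed to exist at the top level of the report):
```ts
// Sketch: print the frames of whichever thread the .ips report marks as
// "triggered", using the field names seen in the crash dumps in this issue.
import { readFileSync } from "fs";

const raw = readFileSync(process.argv[2], "utf8");
// Assumption: first line is the summary JSON, the remainder is the report.
const report = JSON.parse(raw.slice(raw.indexOf("\n") + 1));

const crashed = (report.threads ?? []).find((t: any) => t.triggered);
if (!crashed) {
    console.error("no thread marked as triggered");
    process.exit(1);
}

console.log(`thread ${crashed.id} (${crashed.name ?? "unnamed"}) crashed; frames:`);
for (const frame of crashed.frames) {
    const image = report.usedImages?.[frame.imageIndex]?.name ?? `image#${frame.imageIndex}`;
    const where = frame.symbol
        ? `${frame.symbol} + ${frame.symbolLocation}`
        : `+0x${frame.imageOffset.toString(16)}`;
    console.log(`  ${image}  ${where}`);
}
```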
It appears v8 shares GC work between threads, so if the OOM is occurring during GC, the crashing thread might not be the thread making the large number of allocations that led to memory exhaustion. Did I misunderstand what you meant by "exploding thread" and you actually meant "the thread with the absurdly deep number of frames" rather than the crashing thread? I guess that would indicate excessive stack usage but not necessarily which one is allocating too much heap memory, although if it's infinitely recursing, maybe fair enough.
Also, looking again at thread 16 - V8 is a JIT compiler, not an interpreter - so I think the unannotated frames 54 - 61 might correspond to the actual JS frames, having been compiled to machine code, and the later frames correspond to execution of v8 syscalls from that JS. Trying to understand exactly what's on the stack for the JS frames might be impossible - although maybe there's some consistently placed debugging info like the name of the JS function at a consistent offset? For calls like
matrix-org/matrix-js-sdk#2569 may help solve this; now it's merged, let's see...
Nope, still happening.
How about using an interval to check whether the content of the window has become "blank"?
This has gone back to happening every night for the last week or so.
The reason the window goes blank is that the browser process that powers Electron runs out of memory and crashes, so I don't think an interval check is going to help us. We need to find the v8/native memory leak and fix it, alas (which in our case might be a matter of finally switching to Sliding Sync, to avoid having so much data flying around).
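For completeness: if the goal were merely to notice the crash and recover the window rather than fix the leak, Electron's `render-process-gone` event would be a better hook than polling the DOM. A rough sketch for the main process, purely as detection (it does nothing about the underlying OOM):
```ts
// Sketch: reload (or show an error page) when the renderer dies, instead of
// leaving the white window up. Detection only, not a fix for the leak.
import { BrowserWindow } from "electron";

function watchForRendererDeath(win: BrowserWindow): void {
    win.webContents.on("render-process-gone", (_event, details) => {
        console.error(`renderer gone: ${details.reason} (exit code ${details.exitCode})`);
        if (details.reason === "oom" || details.reason === "crashed") {
            win.webContents.reload();
        }
    });
}
```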
fwiw, last week I was seeing crashes ~daily, and this week it's every 5-6 hours regardless of usage. All cases look like OOMs (white screen when I get back to it). My machine doesn't sleep, but the process is still crashed by the time I get to it in the morning, then crashes twice more through the day.
Just chiming in. Had this issue with the most recent Nightly build. It was only running for 30 minutes prior to the crash. This is the Nightly desktop edition, running on Arch Linux. This is definitely not a new issue by any means. I was having this issue 6 months ago with the Flatpak version of Element as well.
Both the Arch package and the Flatpak are community maintained and have additional crashes specific to them.
Which ones? I'm running Nightly pulled directly from https://packages.element.io/debian/pool/main/e/element-nightly/element-nightly_2023032401_amd64.deb
@NeonMinnen if you're running a deb on Arch then that's not exactly supported. The dependencies won't match exactly, etc.
I updated element-hq/element-web#25120 (though 'closed') with more information. I can also add that if your kernel is configured for the OOM killer (i.e. there is a /proc//oom-adjust and /proc//oom-score), then the reason you get a white page is that the kernel has killed it off.
I did another deep dive on this today. My theory is that OOMs happen most often when laptops are asleep (but not plugged in), meaning that v8 runs very infrequently and something stacks up and runs out of RAM (perhaps serialising the sync accumulator for postmessage, or handing it to indexeddb, or handing it to sqlite). Logging heap usage in the console with the ugly hack of:
while (true) {
await new Promise(r=>setTimeout(r, 1000));
s = performance.memory.usedJSHeapSize;
console.warn(s);
if (s > 1.5 * 1024 * 1024 * 1024)
debugger ;
}
...produces a trace like this when it goes bang:
The spike at 14:26 catches a persist in the act, so if the same thing happened at 16:00 but starting from 1.3GB (modulo quantising) then it might be close enough to the 1.5GB limit where it explodes.
So, in terms of solutions: I ended up doing option 3 at element-hq/element-web#25264. The workaround is that if your Element Desktop starts OOMing lots, log out and log in again (being sure to save your encryption keys). Meanwhile, option 4 is the preferred solution.
facebook/react#18116 looks like the upstream issue about React holding onto detached DOM if a key changes. Upgrading to React 18 may alleviate this as it allows for more aggressive GC. Found an event emitter handler leak in ThreadView.
fwiw, I found this was happening on a timescale of minutes (making OOM fairly unlikely). Letting Element run normally, then having the machine go to sleep/turn off on its own, then a couple of minutes later unsleeping the laptop only to be met with a blank screen in Element. It was not 100% reproducible, but for the 2 hours I had to force kill and restart Element every 10 minutes it was a tad annoying :p
Crashes are now happening every 4 hours for me, ftr. Curiously, it's only happening while I'm working - rarely on weekends/evenings/overnight. Since nearly all of our internal rooms are encrypted, I'm curious if it's related to that at all? I unfortunately can't get a memory profile in any reasonable sense, as memory profiling takes enough memory to crash the process :|
This is getting worse: crashes are now happening every 2-3 hours when I'm trying to send messages in encrypted rooms. If I don't use the client for anything, it's fine.
@ara4n / @turt2live: is this still a thing, or can it be closed?
I'm finding OOMs from clicking around rooms every few hours, but my client which stayed active for 7 days didn't crash at all. I think this can be considered resolved for now.
Steps to reproduce
The app quits quite cleanly via File: Quit, however.
Will send logs at next launch.
Outcome
What did you expect?
No white
What happened instead?
All the white
Operating system
macOS 12.3 on M1
Application version
Nightly
How did you install the app?
nightly
Homeserver
matrix.org
Will you send logs?
Yes