
Segfault on Node 10.x doesn't occur in 6.x #27107

Closed
jvictor85 opened this issue Apr 5, 2019 · 41 comments
Labels
v8 engine Issues and PRs related to the V8 dependency.

Comments

@jvictor85

  • Version: v10.15.2 (also happens with other v10.x versions)
  • Platform: Ubuntu 18.04.1 LTS (GNU/Linux 4.15.0 GCP x86_64)

Since last year I've been having issues switching from 6.x to 10.x because of rare crashes that give no error. I have several production servers with hundreds of users, and it's been too difficult to reproduce with a local server so far because it seems to only happen under load and is fairly rare.

There's a large amount of code that I can't really reduce for testing without affecting my users. Since I haven't had the time or ability to debug this properly I was hoping someone could point me in the right direction, or that this would be helpful to the developers if this is a bug in 10.x or v8.

There are no errors to report, but here are some of the recent segfaults in various syslogs that only happen when a server is on node 10:

Feb 12 21:56:29 EU1 kernel: [22262355.042532] node[20512]: segfault at 3d77ac406468 ip 0000000000efade2 sp 00007f263e259c10 error 4 in node[400000+1e90000]

Mar 13 19:39:09 SEA1 kernel: [2908566.872204] node[4145]: segfault at 23e4891a2fb0 ip 0000000000efa272 sp 00007fbbf4dd0c10 error 4 in node[400000+1e8f000]

Mar 16 08:59:55 SEA1 kernel: [3129393.630360] node[14805]: segfault at 1ac3c9db79c8 ip 0000000000efa272 sp 00007f10629f0c10 error 4 in node[400000+1e8f000]

Mar 16 20:25:29 USW1 kernel: [3173535.851715] node[31823]: segfault at 13aa4aac9a78 ip 0000000000efa272 sp 00007fdb85380c10 error 4 in node[400000+1e8f000]

Mar 17 00:26:56 USE2 kernel: [25489067.874929] node[17011]: segfault at 93788ef0108 ip 0000000000efade2 sp 00007fc14bffec10 error 4 in node[400000+1e90000]

Mar 19 22:10:11 USW1 kernel: [3438995.257871] node[11791]: segfault at 7b0370d05c8 ip 0000000000efa272 sp 00007f88d3403c10 error 4 in node[400000+1e8f000]

Mar 21 11:46:28 USW1 kernel: [3574361.032453] node[18756]: segfault at 10f0170f9b8 ip 0000000000efa272 sp 00007fdb9e281c10 error 4 in node (deleted)[400000+1e8f000]

Mar 27 18:55:30 USE2 kernel: [26419545.476970] node[21011]: segfault at 706d2e8f0b8 ip 0000000000efade2 sp 00007f2b59fadc10 error 4 in node[400000+1e90000]

Apr 1 20:39:24 SEA1 kernel: [319450.383166] node[8710]: segfault at 16f9cfd832b0 ip 0000000000efa272 sp 00007f7850aa2c10 error 4 in node[400000+1e8f000]

Apr 1 20:23:53 USE1 kernel: [26742046.491931] node[4466]: segfault at 3253e97d4310 ip 0000000000efa272 sp 00007f1b2fffec10 error 4 in node[400000+1e8f000]

Apr 2 14:58:52 EU1 kernel: [26470558.192840] node[27273]: segfault at 1bbcf96ba358 ip 0000000000efade2 sp 00007f1e95a03c10 error 4 in node[400000+1e90000]

The module versions have been different throughout the last year and on different servers but here is what is currently being used on USW1 as an example:

  • mongodb@3.1.0-beta4
  • ws@5.2.0

I haven't done much testing in node 8.x due to some major garbage collection delay which is another issue altogether. Similarly I haven't had success on version 11.x. I was able to reproduce the crash on my local windows machine once upon a time after days of experimentation on node 10, but not reliably, and I can't seem to get it to occur anymore. It's possible some newer changes have reduced the frequency of this crash because I remember it happening more often in 2018.

I've been mostly using the N version switcher to test different versions. Let me know if there is any other info I can provide or some way to look up these segfault addresses to narrow things down. Thanks!

@sam-github
Contributor

Have you collected core files? They can be used to get more information about the crash.

Can you add https://github.com/nodejs/node-report to your app? It will give you loggable text output on a crash, which will help a lot. It's built into node now, but for 10.x you'll have to add it to your environment; it's not hard: https://github.com/nodejs/node-report#usage

@addaleax
Member

addaleax commented Apr 5, 2019

Also, do you use native addons, and if so, which? (find node_modules -name '*.node' might help). As far as I can tell, the instruction pointers at the time of the crash refer to two places in the garbage collection code.

To me, that makes it seem like an issue where a core dump would be more important than node-report output, although even the former might be hard to work with because it’s likely to be some sort of memory corruption issue where it’s hard to figure out the cause from the crash site.

@jvictor85
Author

Thanks for the help! For now I tried enabling core dumps on one of the 10.x servers with ulimit... assuming I did it right, it could take days or weeks to see the crash, though.
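For reference, a minimal sketch of what enabling core dumps involves on Linux (the exact steps are a generic sketch, not something from this thread; a PM2-managed process also needs the limit raised in the environment that spawns it):

```shell
# Raise the core-file size limit for this shell and its children
ulimit -c unlimited

# Confirm the new limit; should print: unlimited
ulimit -c

# Check where the kernel writes core files; a plain pattern such as
# "core" drops them in the crashing process's working directory
cat /proc/sys/kernel/core_pattern
```

Note that the ulimit setting only affects processes started from that shell, which is why a long-running service often needs it configured where the supervisor launches it.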

I forgot to mention that I use PM2 to restart the server process whenever it goes down, hopefully that doesn't matter.

As for native addons, I'm not sure... I ran that find command and it just shows:
node_modules/v8-profiler/build/profiler/v5.7.0/node-v48-linux-x64/profiler.node

That module isn't loaded by default, though, so I'm assuming it isn't related to the crash. I have to load it manually at run-time, and I have only used it in rare situations where I needed to profile for CPU bottlenecks. The basic modules I use other than ws and mongodb are http, https, os, and fs.

I'm not familiar with memory corruption issues, but let me know if there's anything else I can do that might help. I wondered if it might be something in the garbage collector... I noticed that the GC in 10.x will have a rare delay spike of 2000-3000ms out of nowhere. On 6.x the sweep usually only takes 300ms or so, but sweeps happen more often.

The current flags used (on all node versions):
--max-old-space-size=2400 --nouse-idle-notification --expose-gc

@jvictor85
Author

jvictor85 commented Apr 7, 2019

I got one pretty quick! A core file from this crash:

Apr 6 16:35:05 USE1 kernel: [27160290.897495] node[5357]: segfault at 1f6b91e12128 ip 0000000000efa272 sp 00007fea19e45c10 error 4 in node[400000+1e8f000]

It's about 1 GB in size (90 MB compressed). I'd love for someone to take a look at it, though I'm not sure of the best way to send it, and I'd like to avoid posting it publicly if possible. I added an email to my profile if someone can get in touch :)
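As an aside, one common way to move a file that size is to compress it and split the archive into smaller chunks; a sketch (the file names here are illustrative, not from the thread):

```shell
# Compress the core file, leaving the original untouched
gzip -c core.5357 > core.5357.gz

# Split the archive into 25 MB pieces: core.part.aa, core.part.ab, ...
split -b 25M core.5357.gz core.part.

# On the receiving end, reassemble and decompress:
cat core.part.* > core.5357.gz
gunzip core.5357.gz            # recreates core.5357
```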

@bnoordhuis
Member

@jvictor85 Can you get the stack trace? gdb /path/to/node /path/to/core, then bt.

@jvictor85
Author

Thanks! I tried that, and it said it loaded symbols from node, though it just shows these addresses:

Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0  0x00001b50e646b619 in ?? ()
[Current thread is 1 (LWP 32374)]
(gdb) bt
#0  0x00001b50e646b619 in ?? ()
#1  0x00000005000001ba in ?? ()
#2  0x0000000000000000 in ?? ()

@jvictor85
Author

Apparently I had some core files from before and didn't even realize it. Here is the stack trace from the Feb 12 segfault above. It seems to be one of two different types of crashes I get; this one is at address 0000000000efade2 and has some better symbols in it:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7f263e25a700 (LWP 20512))]
(gdb) bt
#0  0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1  0x0000000000ee266b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2  0x0000000000edfd0f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3  0x0000000000bc7166 in v8::internal::CancelableTask::Run() ()
#4  0x00000000009625b9 in node::BackgroundRunner(void*) ()
#5  0x00007f263ee546db in start_thread (arg=0x7f263e25a700) at pthread_create.c:463
#6  0x00007f263eb7d88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This EU server is running v10.15.0.

@bnoordhuis
Member

Okay, so it's a crash during GC. That probably means memory corruption, but unfortunately the real bug is usually not anywhere near the site of the crash, in either time or space.

What does info maps print?

@jvictor85
Author

Sorry, I'm not familiar with gdb, but when I run that it says Undefined info command: "maps".

If it's memory corruption during GC, I'm guessing this isn't anything I can fix in my own code then... If it helps, I remember trying to isolate this last year, and removing the check if(typeof x === "object") in a distance calculation function reduced the frequency of crashes by a lot. It made no sense to me, but it went from crashing every 10-60 minutes to how it is now, about once a week and only under load. I'm not sure I can even reproduce that frequent crash anymore though; it was so unpredictable that simply adding a line to output something to the console would prevent the crash from happening in some cases.

@bnoordhuis
Member

Sorry, I was thinking of info proc mappings but that doesn't work with core files, only with live images. It'd be useful to check if the backtrace is always the same, i.e., if the crash site is consistent.

Aside: --nouse-idle-notification is a bit of lore that survived from Ye Olden Days, but it's a no-op nowadays; you don't need it.
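With --nouse-idle-notification dropped, a startup line using the flags quoted earlier reduces to something like this (the script name is illustrative):

```shell
# --nouse-idle-notification removed; the remaining flags still apply
node --max-old-space-size=2400 --expose-gc server.js
```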

@jvictor85
Author

jvictor85 commented Apr 12, 2019

Thanks, I'll phase out that flag going forward. I tracked down some more core files and collected the backtraces below, along with the node version and date. For the SEA server I wasn't sure of the exact version used, so I included backtraces for both possibilities.

SA1 (Node v10.2.1) May 27 2018

Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0  0x0000275f5cc74c60 in ?? ()
[Current thread is 1 (Thread 0x7f1966f83740 (LWP 2979))]
(gdb) bt
#0  0x0000275f5cc74c60 in ?? ()
#1  0x00007fff661431a0 in ?? ()
#2  0x0000275f5c9b0d79 in ?? ()
etc...

USW1 (Node v10.4.0) June 9 2018

Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0  0x000017b9ddac0db9 in ?? ()
[Current thread is 1 (Thread 0x7ff9daa14740 (LWP 21738))]
(gdb) bt
#0  0x000017b9ddac0db9 in ?? ()
#1  0x00001bbbc8e822e1 in ?? ()
#2  0x00001bbbc8e822e1 in ?? ()
...

USE2 (Node v10.15.0) March 16 2019

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7fc14bfff700 (LWP 17011))]
(gdb) bt
#0  0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1  0x0000000000ee266b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2  0x0000000000edfd0f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3  0x0000000000bc7166 in v8::internal::CancelableTask::Run() ()
#4  0x00000000009625b9 in node::BackgroundRunner(void*) ()
#5  0x00007fc1519e86db in start_thread (arg=0x7fc14bfff700) at pthread_create.c:463
#6  0x00007fc15171188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

USW2 (Node v10.13.0) November 18 2018

Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0  0x000021a9ad996aa4 in ?? ()
[Current thread is 1 (Thread 0x7f83a4de3740 (LWP 2306))]
(gdb) bt
#0  0x000021a9ad996aa4 in ?? ()
#1  0x00000000032b6ce0 in ?? ()
#2  0x0000000000000001 in ?? ()
#3  0x0000000000000000 in ?? ()

SEA1 (Node v10.15.?) February 17 2019

Node v10.15.0:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7f13fa7fc700 (LWP 10353))]
(gdb) bt
#0  0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1  0x0000000000ee266b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2  0x0000000000edfd0f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3  0x0000000000bc7166 in v8::internal::CancelableTask::Run() ()
#4  0x00000000009625b9 in node::BackgroundRunner(void*) ()
#5  0x00007f14010796db in start_thread (arg=0x7f13fa7fc700) at pthread_create.c:463
#6  0x00007f1400da288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Node v10.15.1:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000efade2 in void v8::internal::BodyDescriptorApply<v8::internal::CallIterateBody, void, v8::internal::Map*, v8::internal::HeapObject*, int, v8::internal::PointersUpdatingVisitor*>(v8::internal::InstanceType, v8::internal::Map*, v8::internal::HeapObject*, int, v8::internal::PointersUpdatingVisitor*) [clone .constprop.755] ()
[Current thread is 1 (LWP 10353)]
(gdb) bt
#0  0x0000000000efade2 in void v8::internal::BodyDescriptorApply<v8::internal::CallIterateBody, void, v8::internal::Map*, v8::internal::HeapObject*, int, v8::internal::PointersUpdatingVisitor*>(v8::internal::InstanceType, v8::internal::Map*, v8::internal::HeapObject*, int, v8::internal::PointersUpdatingVisitor*) [clone .constprop.755] ()
#1  0x0000000000ee266b in v8::internal::MarkCompactCollector::~MarkCompactCollector() ()
#2  0x0000000000edfd0f in v8::internal::ItemParallelJob::Run(std::shared_ptr<v8::internal::Counters>) ()
#3  0x0000000000bc7166 in v8::internal::Cancelable::~Cancelable() ()
#4  0x00000000009625b9 in node::os::Initialize(v8::Local<v8::Object>, v8::Local<v8::Value>, v8::Local<v8::Context>) ()
#5  0x00007f14010796db in ?? ()
#6  0x0000000000000000 in ?? ()

UST1 (Node v10.15.0) February 28 2019

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7ff5a77fe700 (LWP 19333))]
(gdb) bt
#0  0x0000000000efade2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1  0x0000000000ee266b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2  0x0000000000edfd0f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3  0x0000000000bc7166 in v8::internal::CancelableTask::Run() ()
#4  0x00000000009625b9 in node::BackgroundRunner(void*) ()
#5  0x00007ff5ad9de6db in start_thread (arg=0x7ff5a77fe700) at pthread_create.c:463
#6  0x00007ff5ad70788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Hopefully someone can make sense of these or find a pattern that leads to something. We would love to finally be able to switch to node 10 for good; performance is critical for us. Let me know if there's anything else I can do. Thanks!

@jvictor85
Author

Also just got a new one today:

USE1 (Node v10.15.2) April 12 2019

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7f4bd59e4700 (LWP 2001))]
(gdb) bt
#0  0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1  0x0000000000ee1afb in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2  0x0000000000edf19f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3  0x0000000000bc65f6 in v8::internal::CancelableTask::Run() ()
#4  0x0000000000962bb9 in node::BackgroundRunner(void*) ()
#5  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd59e4700) at pthread_create.c:463
#6  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Many of them seem to point to this same function, though on this server and a couple others it's at the efa272 address rather than efade2.

@gireeshpunathil
Member

One thing worth mentioning is the presence of TRACE_EVENT0 in this routine, which internally involves accessing the trace agent; in the past we have seen an exit race between the main thread and the helpers, wherein the trace agent was cleaned up while other threads still had pending work to do with it.

To confirm, could you please dump the stacks of all the threads?

(gdb) info thr

then switch to each thread with thr 1, thr 2, etc., and issue bt
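The same per-thread walk can be scripted; assuming a reasonably recent gdb, a command file along these lines (the file name is illustrative) captures every thread's backtrace in one pass:

```
# allbt.gdb -- run as: gdb -x allbt.gdb /path/to/node /path/to/core
set pagination off
info threads
thread apply all bt
quit
```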

@gireeshpunathil
Member

If the main thread is seen on the exit path, with some progress into the cleanup actions, then we are looking at what I suspect.

@jvictor85
Author

Sure thing, here are all 11 threads for the latest core dump from above:

(gdb) info thr
  Id   Target Id         Frame
* 1    Thread 0x7f4bd59e4700 (LWP 2001) 0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
  2    Thread 0x7f4bd69e6700 (LWP 1999) 0x00007f4bd6b089ee in __GI_epoll_pwait (epfd=14, events=events@entry=0x7f4bd69e2e00,
    maxevents=maxevents@entry=1024, timeout=timeout@entry=-1, set=set@entry=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
  3    Thread 0x7f4bd7f68700 (LWP 2004) 0x00007f4bd6de86d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0,
    futex_word=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  4    Thread 0x7f4bbffff700 (LWP 2005) 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x24a7fc8 <cond+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  5    Thread 0x7f4bbf7fe700 (LWP 2006) 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x24a7fcc <cond+44>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  6    Thread 0x7f4bbeffd700 (LWP 2007) 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x24a7fc8 <cond+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  7    Thread 0x7f4bbe7fc700 (LWP 2008) 0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0,
    futex_word=0x24a7fc8 <cond+40>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  8    Thread 0x7f4bd61e5700 (LWP 2000) __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  9    Thread 0x7f4bd7f5b740 (LWP 1998) 0x0000000000efa07a in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
  10   Thread 0x7f4bd51e3700 (LWP 2002) 0x00007f4bd6de5b0f in futex_wake (private=<optimized out>, processes_to_wake=2147483647,
    futex_word=0x3b13d94) at ../sysdeps/unix/sysv/linux/futex-internal.h:231
  11   Thread 0x7f4bd49e2700 (LWP 2003) __GI___libc_free (mem=0x7f4bc81275c0) at malloc.c:3101
[Switching to thread 1 (Thread 0x7f4bd59e4700 (LWP 2001))]
#0  0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
(gdb) bt
#0  0x0000000000efa272 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1  0x0000000000ee1afb in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2  0x0000000000edf19f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3  0x0000000000bc65f6 in v8::internal::CancelableTask::Run() ()
#4  0x0000000000962bb9 in node::BackgroundRunner(void*) ()
#5  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd59e4700) at pthread_create.c:463
#6  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 2 (Thread 0x7f4bd69e6700 (LWP 1999))]
#0  0x00007f4bd6b089ee in __GI_epoll_pwait (epfd=14, events=events@entry=0x7f4bd69e2e00, maxevents=maxevents@entry=1024,
    timeout=timeout@entry=-1, set=set@entry=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
42      ../sysdeps/unix/sysv/linux/epoll_pwait.c: No such file or directory.
(gdb) bt
#0  0x00007f4bd6b089ee in __GI_epoll_pwait (epfd=14, events=events@entry=0x7f4bd69e2e00, maxevents=maxevents@entry=1024,
    timeout=timeout@entry=-1, set=set@entry=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
#1  0x0000000000a56b6f in uv__io_poll (loop=loop@entry=0x3b143e8, timeout=-1) at ../deps/uv/src/unix/linux-core.c:275
#2  0x0000000000a4691b in uv_run (loop=0x3b143e8, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:370
#3  0x0000000000966508 in node::BackgroundTaskRunner::DelayedTaskScheduler::Start()::{lambda(void*)#1}::_FUN(void*) ()
#4  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd69e6700) at pthread_create.c:463
#5  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 3 (Thread 0x7f4bd7f68700 (LWP 2004))]
#0  0x00007f4bd6de86d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0,
    futex_word=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205     ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0  0x00007f4bd6de86d6 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0,
    futex_word=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>, abstime=0x0)
    at sem_waitcommon.c:111
#2  0x00007f4bd6de87c8 in __new_sem_wait_slow (
    sem=sem@entry=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007f4bd6de8839 in __new_sem_wait (sem=sem@entry=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>)
    at sem_wait.c:42
#4  0x0000000000a53dc2 in uv__sem_wait (sem=<optimized out>) at ../deps/uv/src/unix/thread.c:579
#5  uv_sem_wait (sem=0x24a71c0 <node::inspector::(anonymous namespace)::start_io_thread_semaphore>) at ../deps/uv/src/unix/thread.c:635
#6  0x00000000009c420a in node::inspector::(anonymous namespace)::StartIoThreadMain(void*) ()
#7  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd7f68700) at pthread_create.c:463
#8  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 4 (Thread 0x7f4bbffff700 (LWP 2005))]
#0  0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0  0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x24a7f60 <mutex>, cond=0x24a7fa0 <cond>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>) at pthread_cond_wait.c:655
#3  0x0000000000a53fe9 in uv_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>)
    at ../deps/uv/src/unix/thread.c:735
#4  0x0000000000a4171e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:76
#5  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bbffff700) at pthread_create.c:463
#6  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 5 (Thread 0x7f4bbf7fe700 (LWP 2006))]
#0  0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fcc <cond+44>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0  0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fcc <cond+44>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x24a7f60 <mutex>, cond=0x24a7fa0 <cond>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>) at pthread_cond_wait.c:655
#3  0x0000000000a53fe9 in uv_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>)
    at ../deps/uv/src/unix/thread.c:735
#4  0x0000000000a4171e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:76
#5  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bbf7fe700) at pthread_create.c:463
#6  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 6 (Thread 0x7f4bbeffd700 (LWP 2007))]
#0  0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0  0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x24a7f60 <mutex>, cond=0x24a7fa0 <cond>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>) at pthread_cond_wait.c:655
#3  0x0000000000a53fe9 in uv_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>)
    at ../deps/uv/src/unix/thread.c:735
#4  0x0000000000a4171e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:76
#5  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bbeffd700) at pthread_create.c:463
#6  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 7 (Thread 0x7f4bbe7fc700 (LWP 2008))]
#0  0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0  0x00007f4bd6de59f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x24a7fc8 <cond+40>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x24a7f60 <mutex>, cond=0x24a7fa0 <cond>) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>) at pthread_cond_wait.c:655
#3  0x0000000000a53fe9 in uv_cond_wait (cond=cond@entry=0x24a7fa0 <cond>, mutex=mutex@entry=0x24a7f60 <mutex>)
    at ../deps/uv/src/unix/thread.c:735
#4  0x0000000000a4171e in worker (arg=0x0) at ../deps/uv/src/threadpool.c:76
#5  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bbe7fc700) at pthread_create.c:463
#6  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 8 (Thread 0x7f4bd61e5700 (LWP 2000))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f4bd6de2023 in __GI___pthread_mutex_lock (mutex=0x3b13d58) at ../nptl/pthread_mutex_lock.c:78
#2  0x0000000000a539c9 in uv_mutex_lock (mutex=<optimized out>) at ../deps/uv/src/unix/thread.c:285
#3  0x0000000000962bc1 in node::BackgroundRunner(void*) ()
#4  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd61e5700) at pthread_create.c:463
#5  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 9 (Thread 0x7f4bd7f5b740 (LWP 1998))]
#0  0x0000000000efa07a in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
(gdb) bt
#0  0x0000000000efa07a in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1  0x0000000000ee1afb in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2  0x0000000000edf19f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3  0x0000000000edfcd7 in v8::internal::ItemParallelJob::Run(std::shared_ptr<v8::internal::Counters>) ()
#4  0x0000000000eebe68 in v8::internal::MarkCompactCollector::UpdatePointersAfterEvacuation() ()
#5  0x0000000000ef68d4 in v8::internal::MarkCompactCollector::Evacuate() ()
#6  0x0000000000ef7342 in v8::internal::MarkCompactCollector::CollectGarbage() ()
#7  0x0000000000ecf191 in v8::internal::Heap::MarkCompact() ()
#8  0x0000000000ecf881 in v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) ()
#9  0x0000000000ed0484 in v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) ()
#10 0x0000000000ed18bf in v8::internal::Heap::HandleGCRequest() ()
#11 0x0000000000e70c3a in v8::internal::StackGuard::HandleInterrupts() ()
#12 0x000000000113bde5 in v8::internal::Runtime_StackGuard(int, v8::internal::Object**, v8::internal::Isolate*) ()
#13 0x000025b8b33dbe1d in ?? ()
#14 0x0000214054f03dc1 in ?? ()
#15 0x000025b8b33dbd81 in ?? ()
#16 0x00007ffdbf6ff590 in ?? ()
#17 0x0000000000000006 in ?? ()
#18 0x00007ffdbf6ff6a8 in ?? ()
#19 0x000025b8b66b6767 in ?? ()
#20 0x4032000000000000 in ?? ()
#21 0x402a000000000000 in ?? ()
#22 0x0000000000000000 in ?? ()
[Switching to thread 10 (Thread 0x7f4bd51e3700 (LWP 2002))]
#0  0x00007f4bd6de5b0f in futex_wake (private=<optimized out>, processes_to_wake=2147483647, futex_word=0x3b13d94)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:231
231     ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0  0x00007f4bd6de5b0f in futex_wake (private=<optimized out>, processes_to_wake=2147483647, futex_word=0x3b13d94)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:231
#1  __condvar_dec_grefs (private=<optimized out>, g=1, cond=0x3b13d80) at pthread_cond_wait.c:160
#2  __pthread_cond_wait_common (abstime=0x0, mutex=0x3b13d58, cond=0x3b13d80) at pthread_cond_wait.c:559
#3  __pthread_cond_wait (cond=0x3b13d80, mutex=0x3b13d58) at pthread_cond_wait.c:655
#4  0x0000000000a53fe9 in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>) at ../deps/uv/src/unix/thread.c:735
#5  0x0000000000962b64 in node::BackgroundRunner(void*) ()
#6  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd51e3700) at pthread_create.c:463
#7  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[Switching to thread 11 (Thread 0x7f4bd49e2700 (LWP 2003))]
#0  __GI___libc_free (mem=0x7f4bc81275c0) at malloc.c:3101
3101    malloc.c: No such file or directory.
(gdb) bt
#0  __GI___libc_free (mem=0x7f4bc81275c0) at malloc.c:3101
#1  0x0000000000f15ab8 in void v8::internal::MemoryChunk::ReleaseSlotSet<(v8::internal::RememberedSetType)1>() ()
#2  0x0000000000efa3dc in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#3  0x0000000000ee1afb in v8::internal::PointersUpdatingTask::RunInParallel() ()
#4  0x0000000000edf19f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#5  0x0000000000bc65f6 in v8::internal::CancelableTask::Run() ()
#6  0x0000000000962bb9 in node::BackgroundRunner(void*) ()
#7  0x00007f4bd6ddf6db in start_thread (arg=0x7f4bd49e2700) at pthread_create.c:463
#8  0x00007f4bd6b0888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Pretty neat. I wonder if there is anything I can change about my code or environment that would encourage this crash, or avoid it. Or maybe this is just something in v8, like the exit race thing you mentioned? I appreciate the help!

@gireeshpunathil
Member

Thanks for the quick revert! We don't see the main thread in the list! Except for thread 9, whose bottom stack is truncated (probably due to JS code), everything else is a helper thread.

in summary, this is not conclusive evidence of what I am suspecting.

@gireeshpunathil
Copy link
Member

looking for ways to get hold of the core dump. Can you send it to my email directly? Or upload it in a private gist? (not sure whether the mailbox / gist allows 90 MB or not)

@jvictor85
Copy link
Author

Ah too bad it's not what you suspected.. but yes I'm happy to email it. I'll just upload it to my site and send the link shortly.

@gireeshpunathil
Copy link
Member

@jvictor85 - while gdb's interpretation of some instructions and the missing runtime libraries are posing challenges in interpreting the failing context, I see this sequence in my local build:

(gdb) x/30i 0xf4c900
   0xf4c900 <v8::internal::RememberedSetUpdatingItem::Process+1456>:	add    BYTE PTR [rdi],cl
   0xf4c902 <v8::internal::RememberedSetUpdatingItem::Process+1458>:	test   DWORD PTR [rcx-0x2],ecx
   0xf4c905 <v8::internal::RememberedSetUpdatingItem::Process+1461>:	(bad)  
   0xf4c906 <v8::internal::RememberedSetUpdatingItem::Process+1462>:	jmp    <internal disassembler error>
   0xf4c908 <v8::internal::RememberedSetUpdatingItem::Process+1464>:	cdq    
   0xf4c909 <v8::internal::RememberedSetUpdatingItem::Process+1465>:	(bad)  
   0xf4c90a <v8::internal::RememberedSetUpdatingItem::Process+1466>:	(bad)  
   0xf4c90b <v8::internal::RememberedSetUpdatingItem::Process+1467>:	jmp    <internal disassembler error>
   0xf4c90d <v8::internal::RememberedSetUpdatingItem::Process+1469>:	(bad)  
   0xf4c90e <v8::internal::RememberedSetUpdatingItem::Process+1470>:	fldcw  WORD PTR [rsi]
   0xf4c910 <v8::internal::RememberedSetUpdatingItem::Process+1472>:	add    BYTE PTR [rax-0x75],cl
   0xf4c913 <v8::internal::RememberedSetUpdatingItem::Process+1475>:	adc    BYTE PTR [rcx-0x44],al
   0xf4c916 <v8::internal::RememberedSetUpdatingItem::Process+1478>:	pop    rsp
   0xf4c917 <v8::internal::RememberedSetUpdatingItem::Process+1479>:	in     al,dx
   0xf4c918 <v8::internal::RememberedSetUpdatingItem::Process+1480>:	push   rcx
   0xf4c919 <v8::internal::RememberedSetUpdatingItem::Process+1481>:	add    cl,BYTE PTR [rax-0x75]
   0xf4c91c <v8::internal::RememberedSetUpdatingItem::Process+1484>:	push   rdx
   0xf4c91d <v8::internal::RememberedSetUpdatingItem::Process+1485>:	adc    BYTE PTR [rax-0x7f],cl
   0xf4c920 <v8::internal::RememberedSetUpdatingItem::Process+1488>:	cli    
   0xf4c921 <v8::internal::RememberedSetUpdatingItem::Process+1489>:	push   rax
   0xf4c922 <v8::internal::RememberedSetUpdatingItem::Process+1490>:	or     BYTE PTR [rsi-0x41f28c00],ch
   0xf4c928 <v8::internal::RememberedSetUpdatingItem::Process+1496>:	nop
   0xf4c929 <v8::internal::RememberedSetUpdatingItem::Process+1497>:	pushf  
   0xf4c92a <v8::internal::RememberedSetUpdatingItem::Process+1498>:	mov    ch,0x1
   0xf4c92c <v8::internal::RememberedSetUpdatingItem::Process+1500>:	mov    rdi,rax
   0xf4c92f <v8::internal::RememberedSetUpdatingItem::Process+1503>:	call   rdx
   0xf4c931 <v8::internal::RememberedSetUpdatingItem::Process+1505>:	mov    r12,rax

the call to rdx corresponds to a call to v8::platform::tracing::TracingController::GetCategoryGroupEnabled with the tracing controller as the receiving object (obtained through v8::internal::tracing::TraceEventHelper::GetTracingController()). this is a very good match with what I described earlier (regarding the exit race)

that has a fix in v10.15.3

So while I debug further, can you please try with v10.15.3 and see if these crashes still occur?

@jvictor85
Copy link
Author

Alright that sounds promising! I'll be pleasantly surprised if it's such an easy fix. Thanks so much for spending the time - I'll start transitioning all the servers over today and let you know if I experience any further crashes within the next week or two.

@jvictor85
Copy link
Author

Got another crash, on v10.15.3:

Apr 17 13:01:52 SEA1 kernel: [1674284.139116] node[20416]: segfault at 1959ec36ea00 ip 0000000000efe4f2 sp 00007f4373ffec10 error 4 in node[400000+1e96000]

Unfortunately it didn't write the core file... I thought I had set ulimit to allow that, but maybe it was a temporary setting. I'll try to get that sorted out and deliver the next one as soon as possible.
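For reference, this is roughly what I'll be checking (a sketch, assuming Ubuntu defaults; `ulimit -c` only applies to the shell that sets it, which may be why it seemed temporary):

```shell
# Sketch, assuming a typical Ubuntu setup: ulimit -c only affects the
# current shell and the processes it starts, so it has to be set in the
# same shell (or init script) that launches node.
ulimit -c unlimited
ulimit -c                          # should now print "unlimited"

# Where the kernel writes core files; on Ubuntu this is often a pipe to
# apport, which can quietly handle core dumps instead of writing a file.
cat /proc/sys/kernel/core_pattern

# One option for a persistent soft limit is /etc/security/limits.conf:
#   *   soft   core   unlimited
```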

@bnoordhuis
Copy link
Member

Another avenue you could try is a debug build. Debug builds have a lot more checks so there's a pretty good chance they'll catch the bug closer to the source. Caveat emptor though: they're also a fair bit slower than release builds.

To compile a debug build, run ./configure --debug && make -j8; out/Debug/node is the binary.

@jvictor85
Copy link
Author

Thanks, I'll keep the debug build in mind, though I don't believe we can afford the performance hit.

A couple of crashes later and I've finally grabbed a core file again. Here are the log and the trace; I'll send the file shortly. This is from today, on 10.15.3:

Apr 21 00:19:55 SEA1 kernel: [1974141.941690] node[3940]: segfault at 109dfd061b98 ip 0000000000efe4f2 sp 00007fc7ebd6ec10 error 4 in node[400000+1e96000]

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000efe4f2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7fc7ebd6f700 (LWP 3940))]
(gdb) bt
#0  0x0000000000efe4f2 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1  0x0000000000ee5d7b in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2  0x0000000000ee341f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3  0x0000000000bca866 in v8::internal::CancelableTask::Run() ()
#4  0x0000000000963d89 in node::BackgroundRunner(void*) ()
#5  0x00007fc7ec9696db in start_thread (arg=0x7fc7ebd6f700) at pthread_create.c:463
#6  0x00007fc7ec69288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
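As an aside, the `error 4` in these kernel log lines can be decoded from the x86 page-fault error-code bits; a small sketch (bit meanings per the Linux/x86 page-fault handler):

```javascript
// Sketch: decode the x86 page-fault "error N" code from kernel segfault
// log lines. Bit meanings follow the Linux/x86 page-fault handler:
// bit 0 = protection fault vs. page not present, bit 1 = write access,
// bit 2 = user mode, bit 3 = reserved bit set, bit 4 = instruction fetch.
function decodeSegfaultError(code) {
  const parts = [
    code & 1 ? "protection fault" : "page not present",
    code & 2 ? "write access" : "read access",
    code & 4 ? "user mode" : "kernel mode",
  ];
  if (code & 8) parts.push("reserved bit set");
  if (code & 16) parts.push("instruction fetch");
  return parts.join(", ");
}

console.log(decodeSegfaultError(4));
// -> page not present, read access, user mode
```

So `error 4` is a user-mode read of an unmapped address, which is consistent with dereferencing a garbage pointer.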

@gireeshpunathil
Copy link
Member

surprising to note that the crash is in the same method; it makes sense to believe that the upgrade to v10.15.3 did not make any difference.

@gireeshpunathil
Copy link
Member

@jvictor85 sent in another core file with v10.15.3, but the failing sequence is exactly the same:
0xefe4f2 <_ZN2v88internal25RememberedSetUpdatingItemINS0_26MajorNonAtomicMarkingStateEE7ProcessEv+1506>: mov rax,QWORD PTR [rax-0x1]

so it looks like the best way forward is to try a debug build.

@jvictor85 :

though I don't believe we can afford the performance hit.

this is only for problem isolation and determination; I'm definitely not recommending it for production.

@jvictor85
Copy link
Author

Thanks for your help!

definitely not recommending for production.

I can try to get a debug build going on a test server; it's just that in my experience this crash tends to show up under load with many real clients connecting. It has happened during testing, but it can take several months to see it occur once, while in production it happens once or twice a week (across many servers).

Haven't had much luck simulating load to reproduce it more often. Though it does seem to occur more on particular servers in certain conditions, so I'll try grabbing database backups from before a crash and see if I can get something to happen. I also haven't tested v12 yet so I'll tinker with that to see if there is a difference.

I remember seeing some issues similar to mine (like #24955) but it seems to be rare.. not sure if that helps narrow things down.

@jvictor85
Copy link
Author

Just to follow up - I've had a couple servers on v12.1.0 for over a week, and recently started moving a few others over as well and so far they haven't experienced this crash at all.

The servers on v10.x still continue to crash. So I'm wondering if something in the v8 code in version 10 (or it could be anything between versions 6 and 12) is causing this.

I'll post an update if the crash ever happens on 12, and will be keeping an eye on updates to 10. I would prefer to use that when possible since I've noticed a bit better GC performance with 10 compared to 12 so far.

@gireeshpunathil
Copy link
Member

@jvictor85 - thanks for the update. Given that we don't have any further clues and debug build dump is unavailable, your action plan looks good to me.

@jvictor85
Copy link
Author

Tested with 10.16.0 and it's still crashing in the same manner. In version 12 it hasn't crashed in a month, so it seems to be fixed there. Unfortunately, performance is noticeably worse in 12 (same issue with 6), so I try to stick with 10 when possible so the users don't have anything to complain about aside from the crashes.

May 30 17:48:24 SEA1 kernel: [480538.493040] node[889]: segfault at 4687fcf0af0 ip 0000000000f2aa52 sp 00007fbbc6202c10 error 4 in node[400000+2040000]

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000f2aa52 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
[Current thread is 1 (Thread 0x7fbbc6203700 (LWP 889))]
(gdb) bt
#0  0x0000000000f2aa52 in v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process() ()
#1  0x0000000000f122db in v8::internal::PointersUpdatingTask::RunInParallel() ()
#2  0x0000000000f0f97f in v8::internal::ItemParallelJob::Task::RunInternal() ()
#3  0x0000000000bf5d36 in v8::internal::CancelableTask::Run() ()
#4  0x000000000097fea9 in node::BackgroundRunner(void*) ()
#5  0x00007fbbc7dff6db in start_thread (arg=0x7fbbc6203700) at pthread_create.c:463
#6  0x00007fbbc7b2888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

I was hoping this was a known issue in v8 that was fixed and could be backported to 10, but maybe it was resolved unintentionally through some of the newer changes.

If I can find the time I'll see if I can reproduce this on a test server again, but I'm not too hopeful that I can simulate the right conditions. Maybe someone familiar with the v8 version changes can identify where the fix may have happened?

@jvictor85
Copy link
Author

Alright, maybe I found something - I accidentally stumbled on the right conditions for a crash and was able to reproduce it consistently. I reduced the code down to something that can be worked with. I'm not sure if this is the exact same crash as on my production servers because the syslog reports it differently, but it behaves in a similar way: it quits without an error and only occurs on Node 10 and 11. I tested this with Node 8 and 12 and don't see the crash. Here's the code:

console.log("Initializing on Node "+process.version);
var seed = 2, clients = [];

var distance = function(x,y,sx,sy) {
	let dx,dy;
	if(typeof x === "object") {
		dx = x.x - y.x;
		dy = x.y - y.y;
	}
	else {
		dx = sx - x;
		dy = sy - y;
	}
	return Math.sqrt(dx*dx + dy*dy);
}

var rand = function(from,to) {
	let x = Math.sin(seed++) * 10000;
	return Math.floor((x - Math.floor(x)) * (to - from + 1)) + from;
}

var make_object = function() {
	var obj = { x:rand(0,100), y:rand(0,100) };
	obj.update = function() {
		for(let i in clients)
			distance(this,clients[i]);
		distance(this.x,this.y,this.dx,this.dy);
	}
	return obj;
}

for(var i=0;i<50;i++)
	clients.push(make_object());

setInterval(function() {
	console.log('Up:'+process.uptime()+' Mem:'+process.memoryUsage().heapUsed);
	for(var i=0;i<clients.length;i++)
		clients[i].update();
},100);

It looks a little odd because it's a remnant of several thousand lines of code, but I left in the lines that cause a quick crash. If you comment out either of the distance calls, the program will run fine, even though they don't really do anything.

Tested on Windows and Linux. I'm hoping this is at least related to the production crash I'm experiencing, but let me know if I'm missing something obvious here.

@gireeshpunathil
Copy link
Member

ping @jvictor85 - is this still an outstanding issue for you?

@jvictor85
Copy link
Author

Yes, unfortunately this remaining crash is still very much an issue for me and my users. The performance in version 12 is still slow enough that I end up bouncing between versions 10 and 12, depending on whether I want to go for performance or stability.

Many of my servers with lower user populations almost never experience the crash (maybe every few weeks or months), so I keep them on version 10. More specifically, v10.16.x still has the crash; though I haven't tested with 10.17, I expect it is probably the same. I have 2 higher-population servers in particular that experience the crash every couple of days, so I try to keep them on v12.

Sorry I wasn't able to dedicate the time required to bisect this on a production server. I understand this is a rare crash that not many people experience, so currently I'm waiting on new versions of 12 (or 13) to solve some of the performance regressions so I can fully switch over. I'll post when that time comes; let me know if there is any other information I can provide. It's up to you whether to close this for now or keep it open. Thanks for all the help up to now!

@gireeshpunathil
Copy link
Member

is this still happening? should this remain open?

@jvictor85
Copy link
Author

My performance issues were improved enough by version 12.16.2 (and all versions of 14) that I moved every server over to those. Because of that I never got the chance to test later versions of 10, but the crash was still happening when I switched over. I'm not sure if anyone else is experiencing this crash or can reproduce it, but I shouldn't be seeing it anymore, so I'll close this for now. Thanks again!
