bestpractices: add crash pre-requisites

This covers the preparation of crash diagnostics.

Refs: #254
nodejs · Sep 25, 2019 · 2640f91
1 parent baa43e9
Showing 1 changed file with 130 additions and 0 deletions.
diff --git a/documentation/crash/crash_setup.md b/documentation/crash/crash_setup.md
@@ -0,0 +1,130 @@
+
+## Node.js application crash diagnostics: Best Practices series #1
+
+This is the first in a series of best practices and useful tips for
+running Node.js in large-scale production systems.
+
+## Introduction
+
+Typical production systems lack many of the conveniences of development
+and staging systems:
+
+ - they are isolated from the public internet
+ - they are not loaded with development and debug tools
+ - they are configured with the most robust and secure
+   configurations possible at the OS level
+ - in certain deployment scenarios (such as cloud) they
+   operate in a headless mode (no ssh)
+ - in certain deployment scenarios (such as cloud) they
+   operate in a stateless mode (no persistent disk)
+
+The net effect of these constraints is that your production systems
+need to be manually prepared in advance, so that crash diagnostic
+data is generated on the very first failure without losing vital data.
+The rest of this document describes these preparation steps.
+
+## Available disk space
+Ensure that there is enough disk space available for the core file
+to be written:
+
+ - A maximum of 4GB for a 32-bit process.
+ - Much larger for a 64-bit process (the common case). To know the precise
+   requirement, measure the peak-load memory usage of your application
+   and add 10% to accommodate core metadata. If you are using
+   common monitoring tools, one of the graphs should reveal the peak
+   memory. If not, you can measure it directly on the system.
+
+In Linux variants, you can use `top -p <pid>` to see the instantaneous
+memory usage of the process:
+
+```
+$ top -p 106916
+
+   PID    USER    PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                            
+106916 user       20   0  600404  54500  15572 R 109.7  0.0  81098:54 node    
+```
+
+On other platforms:
+
+ - In Darwin, the flag is `-pid` (`top -pid <pid>`).
+ - In AIX, the command is `topas`; there is no flag to show
+   per-process details.
+ - In FreeBSD, the command is `top`; there is no flag to show
+   per-process details.
+ - In Windows, you can use the Task Manager window and view the
+   process attributes visually.
+
+Insufficient file system space will result in truncated core files,
+and can severely hamper the ability to diagnose the problem.
+
+To find out how much free space is available in the file system,
+use `df -k`, which works across UNIX platforms.
+In Windows, Windows Explorer, when pointed at a disk partition,
+provides a view of the available space in that partition.
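+
+The sizing rule above (peak memory plus 10%) can be compared against the
+`df -k` output in a short script. This is a minimal sketch, not a
+definitive tool: the helper names `required_kb` and `free_kb` are
+hypothetical, the `54500` figure is just the `RES` value from the sample
+`top` output and should be replaced with your measured peak, and `.`
+assumes that core files are written to the current working directory.
+
+```shell
+#!/bin/sh
+# Hypothetical helper: estimated core size in KB = peak memory + 10% metadata.
+required_kb() {
+    echo $(( $1 + $1 / 10 ))
+}
+
+# Hypothetical helper: free space in KB on the file system holding directory $1.
+# -P requests the portable format, one line per file system.
+free_kb() {
+    df -Pk "$1" | awk 'NR == 2 { print $4 }'
+}
+
+peak_kb=54500                  # placeholder: substitute your measured peak (KB)
+need_kb=$(required_kb "$peak_kb")
+have_kb=$(free_kb .)           # assumption: cores land in the working directory
+
+if [ "$have_kb" -ge "$need_kb" ]; then
+    echo "ok: need ${need_kb} KB, have ${have_kb} KB"
+else
+    echo "insufficient: need ${need_kb} KB, have ${have_kb} KB"
+fi
+```
+
+If core files are redirected to another directory, pass that directory
+to `free_kb` instead of `.`.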
+
+## Core file location and name
+
+By default, a core file is generated on a crash event. In most
+UNIX variants it is written to the current working directory, that is,
+the location from where the node process was started.
+In Darwin, it appears in the `/cores` location.
+
+The default core file names for node processes are:
+
+ - Linux: `core` or `core.<pid>`, where `<pid>` is the node process id.
+ - AIX and Darwin: `core`.
+ - FreeBSD: `%N.core`, where `%N` is the name of the crashed process.
+
+However, the superuser (root) can control and change these defaults.
+
+In Linux, `sysctl kernel.core_pattern` shows the current core file pattern.
+
+Modify the pattern using `sysctl -w kernel.core_pattern=pattern` as root.
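+
+Before changing the pattern on Linux, it helps to know where the current
+one sends core files. The sketch below reads the pattern and classifies
+it: a leading `|` means the kernel pipes the core to a helper program, a
+leading `/` means an absolute path, and anything else is relative to the
+working directory of the crashed process. The function name
+`classify_core_pattern` and the `/var/cores` directory in the final
+comment are assumptions made for illustration.
+
+```shell
+#!/bin/sh
+# Hypothetical helper: classify a kernel.core_pattern value.
+classify_core_pattern() {
+    case "$1" in
+        '|'*) echo "piped" ;;      # core is piped to a helper program
+        /*)   echo "absolute" ;;   # core is written to a fixed directory
+        *)    echo "relative" ;;   # core is written relative to the process cwd
+    esac
+}
+
+# Read the current pattern; fall back to "core" where the file is absent.
+pattern=$(cat /proc/sys/kernel/core_pattern 2>/dev/null || echo core)
+echo "core_pattern: $pattern ($(classify_core_pattern "$pattern"))"
+
+# As root, a pattern with executable name (%e), pid (%p) and timestamp (%t)
+# could be set like this; /var/cores is an assumed, pre-created directory:
+#   sysctl -w kernel.core_pattern=/var/cores/core.%e.%p.%t
+```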
+
+In AIX, `lscore` shows the current core file pattern.
+
+Enable full core dump generation using `chdev -l sys0 -a fullcore=true`.
+
+Modify the current pattern using `chcore -p on -n on -l /path/to/coredumps`.
+
+In Darwin and FreeBSD, `sysctl kern.corefile` shows the current core
+file pattern.
+
+Modify the current pattern using `sysctl -w kern.corefile=newpattern` as root.
+
+To obtain full core files, set the following ulimit options, across UNIX variants:
+
+ - `ulimit -c unlimited` - turn on core file generation capability with
+   unlimited size
+ - `ulimit -d unlimited` - set the user data limit to unlimited
+ - `ulimit -f unlimited` - set the file limit to unlimited
+
+The current ulimit settings can be displayed using:
+
+`ulimit -a`
+
+However, these are the `soft` limits, and they are enforced per user, per
+shell environment. These values are themselves constrained by the
+system-wide `hard` limits set by the system administrator. System
+administrators (with superuser privileges) can display, set or change
+the hard limits by adding the `-H` flag to the standard set of ulimit
+commands.
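+
+The relation between the soft and hard limits can be checked from the
+shell that will start the node process. The following is a minimal
+sketch that only touches the core file size limit (`-c`): it prints both
+values and raises the soft limit as far as the hard limit allows, which
+does not require superuser privileges.
+
+```shell
+#!/bin/sh
+# Show the soft and hard core file size limits for this shell.
+soft=$(ulimit -S -c)
+hard=$(ulimit -H -c)
+echo "core size limit: soft=$soft hard=$hard"
+
+# Raise the soft limit up to the hard limit; going beyond the hard
+# limit would need an administrator to change it first.
+if [ "$soft" != "$hard" ]; then
+    ulimit -S -c "$hard"
+fi
+echo "core size limit now: soft=$(ulimit -S -c)"
+```
+
+The change applies to the current shell and to processes started from
+it, so it has to run in the same shell session that launches node.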
+
+## Manual dump generation
+
+If you need to collect a core manually from a running process,
+use the command for your platform:
+
+ - In Linux, use `gcore [-a] [-o filename] pid`, where `-a`
+   specifies to dump everything.
+ - In AIX, use `gencore [pid] [filename]`.
+ - In FreeBSD and Darwin, use `gcore [-s] [executable] pid`.
+ - In Windows, open the `Task Manager` window, right click on the
+   node process and select the `create dump` option.
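+
+On the UNIX platforms, the choice of command can be scripted from the
+output of `uname -s`. The sketch below only builds and prints the
+command line, using the argument forms listed above; the function name
+`dump_command`, the pid `1234` and the output path `/tmp/node-core` are
+placeholders, not values taken from a real system.
+
+```shell
+#!/bin/sh
+# Hypothetical helper: print the manual dump command for an OS name
+# (as reported by `uname -s`), a process id and an output file name.
+dump_command() {
+    os=$1; pid=$2; out=$3
+    case "$os" in
+        Linux)          echo "gcore -o $out $pid" ;;
+        AIX)            echo "gencore $pid $out" ;;
+        FreeBSD|Darwin) echo "gcore $pid" ;;
+        *)              echo "unsupported platform: $os" >&2; return 1 ;;
+    esac
+}
+
+# Print the command for this machine; review it before running it.
+dump_command "$(uname -s)" 1234 /tmp/node-core
+```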
+
+Special note on Ubuntu systems with the `Yama` hardened kernel:
+
+The Yama security policy inhibits a second process from collecting a dump,
+practically rendering `gcore` unusable. A workaround is to grant `gdb`
+the ptrace capability, as root:
+
+`setcap cap_sys_ptrace=+ep $(which gdb)`
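+
+Whether Yama is restricting ptrace on a given machine can be checked
+before attempting a dump. This is a minimal sketch, assuming a Linux
+kernel that exposes `/proc/sys/kernel/yama/ptrace_scope`; the function
+name `describe_ptrace_scope` is hypothetical, and the labels are a short
+paraphrase of the levels in the kernel's Yama documentation.
+
+```shell
+#!/bin/sh
+# Hypothetical helper: describe a Yama ptrace_scope value.
+describe_ptrace_scope() {
+    case "$1" in
+        0) echo "unrestricted" ;;               # classic ptrace permissions
+        1) echo "restricted to descendants" ;;  # attaching to arbitrary pids is blocked
+        2) echo "admin only" ;;                 # needs CAP_SYS_PTRACE
+        3) echo "disabled" ;;                   # no attach at all
+        *) echo "unknown" ;;
+    esac
+}
+
+scope_file=/proc/sys/kernel/yama/ptrace_scope
+if [ -r "$scope_file" ]; then
+    scope=$(cat "$scope_file")
+    echo "ptrace_scope=$scope ($(describe_ptrace_scope "$scope"))"
+else
+    echo "Yama ptrace_scope is not available on this system"
+fi
+```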
+
+
+These steps make sure that when your Node.js application crashes in
+production, a valid, full core dump is generated at a known location.
+The dump can then be loaded into debuggers that understand Node.js
+internals to diagnose the issue. The next article in this series will
+focus on that part.