bestpractices: add crash pre-requisites
This covers the preparation of crash diagnostics.
Refs: #254
1 parent baa43e9, commit 2640f91
Showing 1 changed file with 130 additions and 0 deletions.
@@ -0,0 +1,130 @@

## Node.js application crash diagnostics: Best Practices series #1

This is the first in a series of best practices and useful tips for
running Node.js in large-scale production systems.

## Introduction

Typical production systems do not enjoy the benefits of development
and staging systems in many respects:

- they are isolated from the public internet
- they are not loaded with development and debug tools
- they are configured with the most robust and secure
  configurations possible at the OS level
- in certain deployment scenarios (such as Cloud) they
  operate in a headless mode [ no ssh ]
- in certain deployment scenarios (such as Cloud) they
  operate in a stateless mode [ no persistent disk ]

The net effect of these constraints is that your production systems
need to be manually `prepared` in advance to enable crash diagnostic
data generation on the first failure itself, without losing vital data.
The rest of the document illustrates these preparation steps.

## Available disk space

Ensure that there is enough disk space available for the core file
to be written:

- A maximum of 4GB for a 32-bit process.
- Much larger for a 64-bit process (the common case). To know the precise
  requirement, measure the peak-load memory usage of your application
  and add 10% to that to accommodate core metadata. If you are using
  common monitoring tools, one of the graphs should reveal the peak
  memory. If not, you can measure it directly on the system.

On Linux variants, you can use `top -p <pid>` to see the instantaneous
memory usage of the process:

```
$ top -p 106916
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
106916 user 20 0 600404 54500 15572 R 109.7 0.0 81098:54 node
```
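
Since `top` reports only an instantaneous value, the peak can be missed. On Linux, the kernel also records the peak resident set size of a process in the `VmHWM` field of `/proc/<pid>/status`. The following is a minimal sketch, assuming a Linux `/proc` file system; the helper name `peak_rss_kb` is hypothetical, and its optional second argument exists only so the parsing can be tried on a saved copy of a status file. Treat the result as a starting point for the estimate rather than an exact core size.

```shell
# Print the peak resident set size (VmHWM, in kB) of a process.
# Assumption: Linux /proc layout. peak_rss_kb is a hypothetical helper name.
peak_rss_kb() {
    pid="$1"
    # Optional second argument: path to a status file (defaults to the live one).
    status_file="${2:-/proc/$pid/status}"
    awk '/^VmHWM:/ { print $2 }' "$status_file"
}

# Example: peak_rss_kb 106916
```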

On Darwin, the flag is `-pid`. On AIX, the command is `topas`.
On FreeBSD, the command is `top`. On both AIX and FreeBSD, there is no
flag to show per-process details. On Windows, you can use the Task
Manager window and view the process attributes visually.

Insufficient file system space will result in truncated core files,
which can severely hamper the ability to diagnose the problem.

Figure out how much free space is available in the file system:
`df -k` works consistently across UNIX platforms.
On Windows, Windows Explorer, when pointed at a disk partition,
shows the available space in that partition.
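
The sizing rule (peak memory plus 10%) and the `df -k` check described above can be combined into a quick pre-flight test. This is only a sketch under stated assumptions: memory is given in kB, the function names `estimate_core_kb`, `free_kb` and `has_room_for_core` are made up for illustration, and the `df` parsing assumes a file system name without spaces.

```shell
# Estimate the core file size: peak memory (kB) plus 10% for core metadata.
# estimate_core_kb is a hypothetical helper name.
estimate_core_kb() {
    echo $(( $1 + $1 / 10 ))
}

# Print the free space (kB) of the file system that holds a directory.
# -P forces the POSIX one-line-per-file-system format; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if the directory has room for the estimated core file.
# Arguments: peak memory in kB, target directory.
has_room_for_core() {
    [ "$(free_kb "$2")" -ge "$(estimate_core_kb "$1")" ]
}

# Example: has_room_for_core 54500 /var/tmp && echo "enough space"
```

The directory to test is the one where the core file will be written, which depends on the core file pattern in effect.
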

## Core file location and name

By default, a core file is generated on a crash event and is
written to the current working directory (the location from
which the node process was started) on most UNIX variants.
On Darwin, it appears in the `/cores` directory.

By default, core files from node processes on Linux are named
`core` or `core.<pid>`, where `<pid>` is the node process ID.
By default, core files from node processes on AIX and Darwin are
named `core`.
By default, core files from node processes on FreeBSD are named
`%N.core`, where `%N` is the name of the crashed process.

However, the superuser (root) can control and change these defaults.

On Linux, `sysctl kernel.core_pattern` shows the current core file pattern.

Modify the pattern using `sysctl -w kernel.core_pattern=pattern` as root.
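
On some Linux distributions the pattern is not a plain file name. Per the core(5) manual page, a pattern that starts with `|` pipes the core to a handler program instead of writing a file, and a pattern that starts with `/` is an absolute path rather than a name in the working directory. The sketch below classifies a pattern string so you know where to look for the core; the function name `core_pattern_kind` is hypothetical.

```shell
# Classify a Linux core_pattern value.
# core_pattern_kind is a hypothetical helper name.
core_pattern_kind() {
    case "$1" in
        '|'*) echo "pipe" ;;      # core is piped to a handler program
        /*)   echo "absolute" ;;  # core is written to a fixed directory
        *)    echo "relative" ;;  # core lands relative to the working directory
    esac
}

# Example: core_pattern_kind "$(sysctl -n kernel.core_pattern)"
```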

On AIX, `lscore` shows the current core file pattern.

Enable full core dump generation using `chdev -l sys0 -a fullcore=true`.
Modify the current pattern using `chcore -p on -n on -l /path/to/coredumps`.

On Darwin and FreeBSD, `sysctl kern.corefile` shows the current core file pattern.

Modify the current pattern using `sysctl -w kern.corefile=newpattern` as root.

To obtain full core files, set the following ulimit options across UNIX variants:

- `ulimit -c unlimited` - turn on core file generation capability with unlimited size
- `ulimit -d unlimited` - set the user data limit to unlimited
- `ulimit -f unlimited` - set the file limit to unlimited

The current ulimit settings can be displayed using:

`ulimit -a`

However, these are the `soft` limits and are enforced per user, per
shell environment. Note that these values are themselves
constrained by the system-wide `hard` limits set by the
system administrator. System administrators (with superuser privileges)
may display, set or change the hard limits by adding the `-H` flag to
the standard set of ulimit commands.
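
Because the soft limits apply per shell environment, one option is to raise them in the script that launches the application. The following is a minimal sketch, assuming a shell whose `ulimit` builtin accepts `-S`, `-H` and `-c` (bash and dash do); `raise_core_limit` and `app.js` are placeholder names.

```shell
# Raise the soft core-file limit to whatever the hard limit allows.
# -S changes only the soft limit, so the hard limit is left untouched.
raise_core_limit() {
    ulimit -S -c "$(ulimit -H -c)"
    ulimit -S -c    # print the resulting soft limit
}

# Example launcher usage (app.js is a placeholder):
#   raise_core_limit
#   exec node app.js
```

If the printed value is not `unlimited`, the hard limit is the constraint and only a superuser can raise it.
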

## Manual dump generation

If you want to collect a core manually, follow these steps:

- On Linux, use `gcore [-a] [-o filename] pid`, where `-a`
  specifies dumping everything.
- On AIX, use `gencore [pid] [filename]`.
- On FreeBSD and Darwin, use `gcore [-s] [executable] pid`.
- On Windows, open the Task Manager window, right-click the
  node process and select the `create dump` option.
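
For scripted collection on the UNIX platforms, the choice of command can be keyed off `uname -s`. This sketch only prints the command in the forms listed above and does not run it; `manual_dump_command` is a hypothetical helper, and the output file argument is used only where the listed syntax accepts one. Note that GDB's `gcore` on Linux treats the `-o` value as a prefix and appends the process ID to it.

```shell
# Print the manual dump command for a given platform.
# Arguments: OS name as printed by `uname -s`, process ID, output file name.
# manual_dump_command is a hypothetical helper; it prints, it does not execute.
manual_dump_command() {
    os="$1"; pid="$2"; out="$3"
    case "$os" in
        Linux)          echo "gcore -o $out $pid" ;;
        AIX)            echo "gencore $pid $out" ;;
        FreeBSD|Darwin) echo "gcore $pid" ;;
        *)              echo "unsupported platform: $os" >&2; return 1 ;;
    esac
}

# Example: manual_dump_command "$(uname -s)" 106916 /tmp/node-core
```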

Special note on Ubuntu systems with the `Yama hardened kernel`:

The Yama security policy inhibits a second process from collecting a dump,
practically rendering `gcore` unusable. As root, granting the ptrace
capability to `gdb` works around this:

`` setcap cap_sys_ptrace=+ep `which gdb` ``
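
Whether Yama is restricting attachment can be checked through the `kernel.yama.ptrace_scope` setting. The sketch below maps the values described in the Linux kernel's Yama documentation to a short description; `yama_scope_meaning` is a hypothetical helper name, and the `/proc` file exists only on kernels built with Yama.

```shell
# Describe a kernel.yama.ptrace_scope value.
# yama_scope_meaning is a hypothetical helper name.
yama_scope_meaning() {
    case "$1" in
        0) echo "classic ptrace permissions" ;;
        1) echo "restricted: only descendants may be traced" ;;
        2) echo "admin-only: CAP_SYS_PTRACE required" ;;
        3) echo "no attach: ptrace attach disabled" ;;
        *) echo "unknown value: $1" >&2; return 1 ;;
    esac
}

# Example:
#   yama_scope_meaning "$(cat /proc/sys/kernel/yama/ptrace_scope)"
```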

These steps make sure that when your Node.js application crashes in
production, a valid, full core dump is generated at a known location and
can be loaded into debuggers that understand Node.js internals to
diagnose the issue. The next article in this series will focus on that part.