nodejs · trevnorris · Apr 15, 2016 · Apr 11, 2016 · benjamingr · Apr 12, 2016
diff --git a/doc/topics/domain-postmortem.md b/doc/topics/domain-postmortem.md
@@ -0,0 +1,299 @@
+# Domain Module Postmortem
+
+## Usability Issues
+
+### Implicit Behavior
+
+It's possible for a developer to create a new domain and then simply run
+`domain.enter()`. Which then acts as a catch-all for any exception in the
+future that couldn't be observed by the thrower. Allowing a module author to
+intercept the exceptions of unrelated code in a different module. Preventing
+the originator of the code from knowing about its own exceptions.
+
+Here's an example of how one indirectly linked modules can affect another:
+
+```js
+// module a.js
+const b = require('./b');
+const c = require('./c');
+
+
+// module b.js
+const d = require('domain').create();
+d.on('error', () => { /* silence everything */ });
+d.enter();
+
+
+// module c.js
+const dep = require('some-dep');
+dep.method();  // Uh-oh! This method doesn't actually exist.
+```
+
+Since module `b` enters the domain but never exits any uncaught exception will
+be swallowed. Leaving module `c` in the dark as to why it didn't run the entire
+script. Leaving a potentially partially populated `module.exports`. Doing this
+is not the same as listening for `'uncaughtException'`. As the latter is
+explicitly meant to globally catch errors. The other issue is that domains are
+processed prior to any `'uncaughtException'` handlers, and prevent them from
+running.
+
+Another issue is that domains route errors automatically if no `'error'`
+handler was set on the event emitter. There is no opt-in mechanism for this,
+and automatically propagates across the entire asynchronous chain. This may
+seem useful at first, but once asynchronous calls are two or more modules deep
+and one of them doesn't include an error handler the creator of the domain will
+suddenly be catching unexpected exceptions, and the thrower's exception will go
+unnoticed by the author.
+
+The following is a simple example of how a missing `'error'` handler allows
+the active domain to hijack the error:
+
+```js
+const domain = require('domain');
+const net = require('net');
+const d = domain.create();
+d.on('error', (err) => console.error(err.message));
+
+d.run(() => net.createServer((c) => {
+  c.end();
+  c.write('bye');
+}).listen(8000));
+```
+
+Even manually removing the connection via `d.remove(c)` does not prevent the
+connection's error from being automatically intercepted.
+
+Failures that plagues both error routing and exception handling are the
+inconsistencies in how errors are bubbled. The following is an example of how
+nested domains will and won't bubble the exception based on when they happen:
+
+```js
+const domain = require('domain');
+const net = require('net');
+const d = domain.create();
+d.on('error', () => console.error('d intercepted an error'));
+
+d.run(() => {
+  const server = net.createServer((c) => {
+    const e = domain.create();  // No 'error' handler being set.
+    e.run(() => {
+      // This will not be caught by d's error handler.
+      setImmediate(() => {
+        throw new Error('thrown from setImmediate');
+      });
+      // Though this one will bubble to d's error handler.
+      throw new Error('immediately thrown');
+    });
+  }).listen(8080);
+});
+```
+
+It may be expected that nested domains always remain nested, and will always
+propagate the exception up the domain stack. Or that exceptions will never
+automatically bubble. Unfortunately both these situations occur, leading to
+potentially confusing behavior that may even be prone to difficult to debug
+timing conflicts.
+
+
+### API Gaps
+
+While APIs based on using `EventEmitter` can use `bind()` and errback style
+callbacks can use `intercept()`, alternative APIs that implicitly bind to the
+active domain must be executed inside of `run()`. Meaning if module authors
+wanted to support domains using a mechanism alternative to those mentioned they
+must manually implement domain support themselves. Instead of being able to
+leverage the implicit mechanisms already in place.
+
+
+### Error Propagation
+
+Propagating errors across nested domains is not straight forward, if even
+possible. Existing documentation shows a simple example of how to `close()` an
+`http` server if there is an error in the request handler. What it does not
+explain is how to close the server if the request handler creates another
+domain instance for another async request. Using the following as a simple
+example of the failing of error propagation:
+
+```js
+const d1 = domain.create();
+d1.foo = true;  // custom member to make more visible in console
+d1.on('error', (er) => { /* handle error */ });
+
+d1.run(() => setTimeout(() => {
+  const d2 = domain.create();
+  d2.bar = 43;
+  d2.on('error', (er) => console.error(er.message, domain._stack));
+  d2.run(() => {
+    setTimeout(() => {
+      setTimeout(() => {
+        throw new Error('outer');
+      });
+      throw new Error('inner')
+    });
+  });
+}));
+```
+
+Even in the case that the domain instances are being used for local storage so
+access to resources are made available there is still no way to allow the error
+to continue propagating from `d2` back to `d1`. Quick inspection may tell us
+that simply throwing from `d2`'s domain `'error'` handler would allow `d1` to
+then catch the exception and execute its own error handler. Though that is not
+the case. Upon inspection of `domain._stack` you'll see that the stack only
+contains `d2`.
+
+This may be considered a failing of the API, but even if it did operate in this
+way there is still the issue of transmitting the fact that a branch in the
+asynchronous execution has failed, and that all further operations in that
+branch must cease. In the example of the http request handler, if we fire off
+several asynchronous requests and each one then `write()`'s data back to the
+client many more errors will arise from attempting to `write()` to a closed
+handle. More on this in _Resource Cleanup on Exception_.
+
+
+### Resource Cleanup on Exception
+
+The script [`domain-resource-cleanup.js`](domain-resource-cleanup.js)
+contains a more complex example of properly cleaning up in a small resource
+dependency tree in the case that an exception occurs in a given connection or
+any of its dependencies. Breaking down the script into its basic operations:
+
+- When a new connection happens, concurrently:
+  - Open a file on the file system
+  - Open Pipe to unique socket
+- Read a chunk of the file asynchronously
+- Write chunk to both the TCP connection and any listening sockets
+- If any of these resources error, notify all other attached resources that
+  they need to clean up and shutdown
+
+As we can see from this example a lot more must be done to properly clean up
+resources when something fails than what can be done strictly through the
+domain API. All that domains offer is an exception aggregation mechanism. Even
+the potentially useful ability to propagate data with the domain is easily
+countered, in this example, by passing the needed resources as a function
+argument.
+
+One problem domains perpetuated was the supposed simplicity of being able to
+continue execution, contrary to what the documentation stated, of the
+application despite an unexpected exception. This example demonstrates the
+fallacy behind that idea.
+
+Attempting proper resource cleanup on unexpected exception becomes more complex
+as the application itself grows in complexity. This example only has 3 basic
+resources in play, and all of them with a clear dependency path. If an
+application uses something like shared resources or resource reuse the ability
+to cleanup, and properly test that cleanup has been done, grows greatly.
+
+In the end, in terms of handling errors, domains aren't much more than a
+glorified `'uncaughtException'` handler. Except with more implicit and
+unobservable behavior by third-parties.
+
+
+### Resource Propagation
+
+Another use case for domains was to use it to propagate data along asynchronous
+data paths. One problematic point is the ambiguity of when to expect the
+correct domain when there are multiple in the stack (which must be assumed if
+the async stack works with other modules). Also the conflict between being
+able to depend on a domain for error handling while also having it available to
+retrieve the necessary data.
+
+The following is a involved example demonstrating the failing using domains to
+propagate data along asynchronous stacks:
+
+```js
+const domain = require('domain');
+const net = require('net');
+
+const server = net.createServer((c) => {
+  // Use a domain to propagate data across events within the
+  // connection so that we don't have to pass arguments
+  // everywhere.
+  const d = domain.create();
+  d.data = { connection: c };
+  d.add(c);
+  // Mock class that does some useless async data transformation
+  // for demonstration purposes.
+  const ds = new DataStream(dataTransformed);
+  c.on('data', (chunk) => ds.data(chunk));
+}).listen(8080, () => console.log(`listening on 8080`));
+
+function dataTransformed(chunk) {
+  // FAIL! Because the DataStream instance also created a
+  // domain we have now lost the active domain we had
+  // hoped to use.
+  domain.active.data.connection.write(chunk);
+}
+
+function DataStream(cb) {
+  this.cb = cb;
+  // DataStream wants to use domains for data propagation too!
+  // Unfortunately this will conflict with any domain that
+  // already exists.
+  this.domain = domain.create();
+  this.domain.data = { inst: this };
+}
+
+DataStream.prototype.data = function data(chunk) {
+  // This code is self contained, but pretend it's a complex
+  // operation that crosses at least one other module. So
+  // passing along "this", etc., is not easy.
+  this.domain.run(function() {
+    // Simulate an async operation that does the data transform.
+    setImmediate(() => {
+      for (var i = 0; i < chunk.length; i++)
+        chunk[i] = ((chunk[i] + Math.random() * 100) % 96) + 33;
+      // Grab the instance from the active domain and use that
+      // to call the user's callback.
+      const self = domain.active.data.inst;
+      self.cb.call(self, chunk);
+    });
+  });
+};
+```
+
+The above shows that it is difficult to have more than one asynchronous API
+attempt to use domains to propagate data. This example could possibly be fixed
+by assigning `parent: domain.active` in the `DataStream` constructor.  Then
+restoring it via `domain.active = domain.active.data.parent` just before the
+user's callback is called. Also the instantiation of `DataStream` in the
+`'connection'` callback must be run inside `d.run()`, instead of simply using
+`d.add(c)`, otherwise there will be no active domain.
+
+In short, for this to have a prayer of a chance usage would need to strictly
+adhere to a set of guidelines that would be difficult to enforce or test.
+
+
+## Performance Issues
+
+A significant deterrent from using domains is the overhead. Using node's
+built-in http benchmark, `http_simple.js`, without domains it can handle over
+22,000 requests/second. Whereas if it's run with `NODE_USE_DOMAINS=1` that
+number drops down to under 17,000 requests/second. In this case there is only
+a single global domain. If we edit the benchmark so the http request callback
+creates a new domain instance performance drops further to 15,000
+requests/second.
+
+While this probably wouldn't affect a server only serving a few hundred or even
+a thousand requests per second, the amount of overhead is directly proportional
+to the number of asynchronous requests made. So if a single connection needs to
+connect to several other services all of those will contribute to the overall
+latency of delivering the final product to the client.
+
+Using `AsyncWrap` and tracking the number of times
+`init`/`pre`/`post`/`destroy` are called in the mentioned benchmark we find
+that the sum of all events called is over 170,000 times per second. This means
+even adding 1 microsecond overhead per call for any type of setup or tear down
+will result in a 17% performance loss. Granted, this is for the optimized
+scenario of the benchmark, but I believe this demonstrates the necessity for a
+mechanism such as domain to be as cheap to run as possible.
+
+
+## Looking Ahead
+
+The domain module has been soft deprecated since Dec 2014, but has not yet been
+removed because node offers no alternative functionality at the moment. As of
+this writing there is ongoing work building out the `AsyncWrap` API and a
+proposal for Zones being prepared for the TC39. At such time there is suitable
+functionality to replace domains it will undergo the full deprecation cycle and
+eventually be removed from core.