Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

13.5 broke tap's error handling #31142

Closed
isaacs opened this issue Dec 31, 2019 · 17 comments
Closed

13.5 broke tap's error handling #31142

isaacs opened this issue Dec 31, 2019 · 17 comments
Labels
async_hooks Issues and PRs related to the async hooks subsystem. confirmed-bug Issues with confirmed bugs.

Comments

@isaacs
Copy link
Contributor

isaacs commented Dec 31, 2019

  • Version: 13.5.0
  • Platform: Darwin tau.local 19.0.0 Darwin Kernel Version 19.0.0: Thu Oct 17 16:17:15 PDT 2019; root:xnu-6153.41.3~29/RELEASE_X86_64 x86_64 (but irrelevant to this issue)
  • Subsystem: async hooks

The change in #30965 means that Node now relies on process._fatalException calling emitAfter(asyncId);

However, this cannot be done in userland code, because the emitAfter method is not exposed.

With the deprecation of domains in Node.js version 4, as far as I've been able to tell, patching process._fatalException and tracking the process flow through async hooks is the only way to reliably tie an error back to a chain of continuations. (And it doesn't even work 100% reliably, because Promises break async hooks.)

For node-tap, in order to fail the appropriate test when an error is thrown or promise is rejected (most of the time, at least), I'm using async-hook-domain. I used to use domains, but was scared off by the noisy warnings.

Now, a test like this:

const t = require('tap')
t.test('failure', t => {
  throw new Error('nope')
})

which should output this:

$ node fail.js
TAP version 13
# Subtest: failure
    not ok 1 - nope
      ---
      stack: |
        Test.<anonymous> (fail.js:3:9)
      at:
        line: 3
        column: 9
        file: fail.js
        function: Test.<anonymous>
      tapCaught: testFunctionThrow
      test: failure
      source: |
        t.test('failure', t => {
          throw new Error('nope')
        --------^
        })
      ...

    1..1
    # failed 1 test
not ok 1 - failure # time=18.611ms

instead (as of 503900b) outputs this:

$ node fail.js
TAP version 13
# Subtest: failure
    not ok 1 - nope
      ---
      stack: |
        Test.<anonymous> (fail.js:3:9)
      at:
        line: 3
        column: 9
        file: fail.js
        function: Test.<anonymous>
      tapCaught: testFunctionThrow
      test: failure
      source: |
        t.test('failure', t => {
          throw new Error('nope')
        --------^
        })
      ...

    1..1
    # failed 1 test
not ok 1 - failure # time=18.611ms

Error: async hook stack has become corrupted (actual: 18, expected: 1)
 1: 0x10000b8d9 node::AsyncHooks::pop_async_id(double) [/usr/local/bin/node]
 2: 0x10000189a node::InternalCallbackScope::Close() [/usr/local/bin/node]
 3: 0x10000149a node::InternalCallbackScope::~InternalCallbackScope() [/usr/local/bin/node]
 4: 0x1000b3f45 node::NodeMainInstance::Run() [/usr/local/bin/node]
 5: 0x10005e46f node::Start(int, char**) [/usr/local/bin/node]
 6: 0x7fff657932e5 start [/usr/lib/system/libdyld.dylib]
 7: 0x2

What has to happen to get error handling and continuation tracking shipped in core? The pieces are all there, they're just not exposed in any kind of useful way. I'm happy to help.

Failing that, I'd suggest reverting 503900b.

@Trott
Copy link
Member

Trott commented Dec 31, 2019

@apapirovski

@Trott
Copy link
Member

Trott commented Dec 31, 2019

@nodejs/async_hooks

@devsnek devsnek added async_hooks Issues and PRs related to the async hooks subsystem. confirmed-bug Issues with confirmed bugs. labels Dec 31, 2019
@devsnek
Copy link
Member

devsnek commented Dec 31, 2019

i can repro this

@apapirovski
Copy link
Member

Hrmmm... this is problematic. Patching _fatalException in this way also means any errors in nextTick & such will result in the same async stack corruption.

The problem is that the expectation was ever set that this is a valid behaviour, since it just isn't.

@apapirovski
Copy link
Member

apapirovski commented Dec 31, 2019

A couple of options:

  1. Revert, but then the reproduction instead becomes having your test throw in a nextTick. It's still wrong and broken.

  2. Rip out async hooks resolution out of _fatalException and instead have a wrapper that isn't monkey patchable and calls _fatalException. That will preserve the current monkey-patchability and let async hooks behavior resolve as intended without interference. That's ultimately likely the more correct solution.

If there are no objections for 2, I'll get something coded up tomorrow.

@Trott
Copy link
Member

Trott commented Dec 31, 2019

FWIW, having tap in CITGM probably would have caught this problem. (npm ci && npm test on a fresh checkout of the node-tap repo passes with 13.4 and fails with 13.5.) @nodejs/citgm

@isaacs
Copy link
Contributor Author

isaacs commented Dec 31, 2019

Hrmmm... this is problematic. Patching _fatalException in this way also means any errors in nextTick & such will result in the same async stack corruption.

This is easy to work around: https://github.com/tapjs/async-hook-domain/blob/master/index.js#L153-L159

And it works today: https://gist.github.com/isaacs/3cc8394c351ead9d208748037b338764

Maybe I'm not sure I understand what you mean?

The problem is that the expectation was ever set that this is a valid behaviour, since it just isn't.

"Valid" is a value judgement, and I will freely admit that async-hook-domain is a dirty ugly gross hack that belongs in core rather than userland. If you want me to feel bad about it, fine, I feel bad. But unfortunately, there is no other way to track errors through continuations without triggering deprecation warnings or catching all errors (even those that should not be caught). The state of error handling hasn't changed since I brought this up 2 years ago.

There is a comment from @joyeecheung about maintaining the patchability of process._fatalException, and noting that investigation should be done to determine whether process._fatalException can be deprecated. To my knowledge, that investigation has not been done, and until there is some other way to track errors through continuations, I think it's clear that no, it can't be deprecated.

Rip out async hooks resolution out of _fatalException and instead have a wrapper that isn't monkey patchable and calls _fatalException. That will preserve the current monkey-patchability and let async hooks behavior resolve as intended without interference. That's ultimately likely the more correct solution.

If you ping me on the PR or post a link here, I'd be happy to review and make sure it un-breaks async-hook-domain and tap.

@Trott what has to happen to get tap in citgm?

Also: can we pull async-hook-domain or something like it into core? Who needs to be in that conversation?

@isaacs
Copy link
Contributor Author

isaacs commented Dec 31, 2019

Just to be super clear: I'd truly love to put require('domains') and process._fatalException and async-hook-domain in the trash bin forever, but not until there's something to replace them with and a reasonable transition plan to get there.

@sam-github
Copy link
Contributor

what has to happen to get tap in citgm?

PRed into https://github.com/nodejs/citgm/blob/master/lib/lookup.json

Perhaps async-hook-domain should be added in as well?

@sam-github
Copy link
Contributor

@nodejs/citgm

@isaacs
Copy link
Contributor Author

isaacs commented Jan 2, 2020

@sam-github I think adding ahd would be more annoying than useful. Its tests are pretty frequently broken by perfectly innocuous wording changes in error messages, but it's usually pretty easy to fix. Testing tap is a better integration canary anyway.

@apapirovski
Copy link
Member

apapirovski commented Jan 8, 2020

Haven't forgotten about this btw. Slightly bigger change but should have it done soon.

This is easy to work around

Sure, but that workaround was required because that also caused a similar async stack corruption.

@isaacs
Copy link
Contributor Author

isaacs commented Feb 11, 2020

@apapirovski Any update on this? It's really making it hard to run my tests on node v13, because thrown errors are not handled properly.

@coreyfarrell
Copy link
Member

libtap is being split from tap. This specific issue is caught by running the libtap tests against node.js 13.5.0+, once tap uses libtap many tests will be removed (moved to libtap). I spoke with @isaacs, we think that it would be useful to also include tap itself as this could catch issues with process spawning and/or stdio handling.

I haven't had a chance to actually look at CITGM, hopefully I'll be able to make time soon.

@addaleax
Copy link
Member

@isaacs How are you doing with the latest versions of Node.js 12 & 13? #31784 basically reverted the PR you mentioned above.

@coreyfarrell
Copy link
Member

@addaleax I just retested libtap and node-tap in latest node.js 12 and 13, neither have this crash. Note I never actually saw the crash in any version of node.js 12 (maybe the bug was in 12.16.0 only and I missed it).

@isaacs
Copy link
Contributor Author

isaacs commented May 7, 2020

It looks like this is fixed in the latest v14 release! Thanks! 🥳

If there's some other reason why this ticket is left open other than just forgetting to close it, feel free to re-open. But I'm satisfied, thank you very much for getting this done before the stable release :)

@isaacs isaacs closed this as completed May 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
async_hooks Issues and PRs related to the async hooks subsystem. confirmed-bug Issues with confirmed bugs.
Projects
None yet
Development

No branches or pull requests

7 participants