module: warn of potential for deadlock with hooks worker #51035
base: main
Fast-track has been requested by @JakobJingleheimer. Please 👍 to approve.
I feel like we should have a minimal reproduction before we document this. It also doesn't need fast-track; docs don't get updated until the next release, so it doesn't matter much how quickly this lands.
We have a minimal repro. Re fast-track: I guess, but the sooner it lands, the sooner it stops taking up my capacity.
doc/api/module.md
> example, you have 2 modules, A and B. "A" is registered first and sets up a
> message channel, which it uses in its `resolve` hook. "B" uses `register` to
> register its own loader. Resolving "B"'s own loader will go through "A"'s
> `resolve`, which will try to communicate with the Module Worker. The Module
Suggested change:
> `resolve`, which will try to communicate with the Module Worker. The Module
> `resolve`, which will try to communicate with the thread that the hooks are running on. The Module
No one knows what Module Worker means. I'm not even sure what it means.
Out of curiosity, who is "no one"?
doc/api/module.md
> message channel, which it uses in its `resolve` hook. "B" uses `register` to
> register its own loader. Resolving "B"'s own loader will go through "A"'s
> `resolve`, which will try to communicate with the Module Worker. The Module
> Worker is currently busy trying to register "B"'s loader, thus resulting in a
A Worker that the user created, or our hooks thread?
Our hooks thread. I believe it is named "module worker" in Node's source code. Could be wrong; I haven't touched it in a while.
Branch updated from 984d6ce to d9f61cb.
If we do have a minimal repro (do we?), let's add it to `test/known_issues`.
> **Warning** When setting up a `MessageChannel` to communicate with hooks,
> beware that this can lead to a deadlock. For example, you have 2 modules,
> A and B. "A" is registered first and sets up a message channel, which it uses
> in its `resolve` hook. After "A" is registered, "B" is registered. Resolving
> "B"'s specifier will go through "A"'s `resolve` hook, which will try to
> communicate with a locked thread that is busy trying to register "B"'s hooks.
> Since registering "B" depends on resolving "B"'s specifier, and resolving
> "B"'s specifier is blocked by "A"'s communication request that is itself
> blocked by the pending registration that started the chain, the application
> becomes deadlocked.
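To make the circular wait in this warning explicit, here is a small conceptual model (a sketch, not Node.js code): each pending step points at the step it is waiting on, and a depth-first search finds the cycle. It assumes, as the later suggestion in this thread states, that the main thread is "asleep" while the registration is pending. The step labels and `findCycle` are illustrative names, not Node.js internals.

```javascript
'use strict';

// Conceptual wait-for graph for the warning above: each key is a pending step,
// and its array lists the steps it is waiting on. Labels are illustrative only.
const waitsFor = {
  // register() keeps the main thread blocked until "B" is set up.
  'main thread: register(B)': ['hooks thread: resolve B specifier'],
  // Resolving "B"'s specifier runs through the already-registered loader "A".
  'hooks thread: resolve B specifier': ['hooks thread: A resolve hook'],
  // "A" posts a message and awaits the main thread's reply.
  'hooks thread: A resolve hook': ['main thread: reply on A port'],
  // The main thread cannot reply until register(B) returns.
  'main thread: reply on A port': ['main thread: register(B)'],
};

// Depth-first search; returns the first cycle found as a list of steps, or null.
function findCycle(graph) {
  const state = new Map(); // step -> 'visiting' | 'done'
  const path = [];

  function visit(step) {
    if (state.get(step) === 'visiting') {
      // Back edge: the cycle is the part of the current path starting at `step`.
      return [...path.slice(path.indexOf(step)), step];
    }
    if (state.get(step) === 'done') return null;
    state.set(step, 'visiting');
    path.push(step);
    for (const next of graph[step] ?? []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    path.pop();
    state.set(step, 'done');
    return null;
  }

  for (const step of Object.keys(graph)) {
    const cycle = visit(step);
    if (cycle) return cycle;
  }
  return null;
}

console.log(findCycle(waitsFor).join('\n  -> '));
```

With only "A" registered, the main thread is not blocked in `register`, so the "reply" step waits on nothing and the search finds no cycle.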
I think this needs a TLDR; in particular, it needs an introductory sentence explaining what not to do, because, let's be honest, I don't think anyone will be interested in the particular details unless they are running into that specific issue.
Maybe we can also tone down the details a lot; a vague explanation might be preferable.
TLDR: if there are multiple loaders and at least one uses MessageChannel, you will probably footgun
Could we keep it at that?
Suggested change:
> **Warning** When setting up a `MessageChannel` to communicate with hooks,
> beware that this can lead to a deadlock. For example, you have 2 modules,
> A and B. "A" is registered first and sets up a message channel, which it uses
> in its `resolve` hook. After "A" is registered, "B" is registered. Resolving
> "B"'s specifier will go through "A"'s `resolve` hook, which will try to
> communicate with a locked thread that is busy trying to register "B"'s hooks.
> Since registering "B" depends on resolving "B"'s specifier, and resolving
> "B"'s specifier is blocked by "A"'s communication request that is itself
> blocked by the pending registration that started the chain, the application
> becomes deadlocked.

> **Warning** If a `resolve` or `load` is left pending on a response from a
> `MessageChannel`, that will cause a deadlock when the main thread is
> "asleep" waiting for a response from the loader thread. To avoid that,
> always set a timeout when dealing with cross thread communication
> inside those hooks.
"waiting for a response from the hooks" rather than "loader thread". In this case just "hooks" is better than "hooks thread" because there could be multiple hooks threads (for now).
> To avoid that, always set a timeout when dealing with cross thread communication inside those hooks.
This is never going to be a suitable solution, it would be better not to suggest it.
Either the timeout will be too short, and will cancel requests that would not be deadlocked, or it's too long, and will incur a startup penalty equal to the length of the timeout when they are deadlocked. There's no reliable way to tune it to only cancel when it would be deadlocked, without significant perf penalties.
Furthermore, just canceling the request may actually not be what you want. Consider a transpiler loader that converts TypeScript into JavaScript, but has to talk to a service on the main thread to know how to do that correctly. If `Module.register()` is called after this loader is registered, and the second loader is written in TypeScript, it's going to deadlock until the timer expires, then... what? Throw an error? Serve TypeScript to V8 uncompiled?
Yeah, I’d throw an error. After, say, one second of idling, it’s time to give up. It’s not a perf penalty; it’s a DX thing. Querying the main thread is kind of like making a network call: you should always handle the case where you never get a response, so you don’t keep your users waiting indefinitely.
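A minimal sketch of the "throw after a timeout" approach, assuming the hooks talk to the main thread over a `MessagePort` from `node:worker_threads` and that the other side answers each request with exactly one message. `requestWithTimeout` is a hypothetical helper, not a Node.js API. As noted earlier in the thread, this does not remove the deadlock; it turns an indefinite hang into an error after `ms` milliseconds, and the choice of `ms` trades spurious failures against startup delay.

```javascript
'use strict';
const { MessageChannel } = require('node:worker_threads');

// Hypothetical helper: post `message` on `port` and resolve with the next
// message received on that port, or reject if nothing arrives within `ms`.
function requestWithTimeout(port, message, ms) {
  return new Promise((resolve, reject) => {
    const onMessage = (reply) => {
      clearTimeout(timer);
      port.off('message', onMessage);
      resolve(reply);
    };
    // Give up (and stop listening) if no reply arrives in time.
    const timer = setTimeout(() => {
      port.off('message', onMessage);
      reject(new Error(`No response on MessagePort after ${ms} ms`));
    }, ms);
    port.on('message', onMessage);
    port.postMessage(message);
  });
}

// Usage: one end echoes requests back; the other end makes a bounded request.
const { port1, port2 } = new MessageChannel();
port2.on('message', (msg) => port2.postMessage({ echo: msg }));
requestWithTimeout(port1, 'ping', 1000)
  .then((reply) => console.log(reply)) // { echo: 'ping' }
  .finally(() => port1.close()); // closing one end closes the pair
```

Inside a `resolve` hook, a rejection from this helper would surface as a thrown error rather than a hang, which is the behaviour argued for in this comment.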
Did not know this existed 😅 I'll add it tomorrow.
Does the issue require multiple user threads? Like, can it happen regardless of whether the user has run
@GeoffreyBooth No
Branch updated from 74e0826 to 60eb33f.
I added a test case (which isn't quite complete for the issue reported: it still needs the piece on main to respond to the hooks-worker's request). But the incomplete test reveals that we actually have another problem first: a hook returning a never-settling promise causes a deadlock. @aduh95 I thought we specifically handled that in our original off-thread implementation? 🤔
Branch updated from 60eb33f to 7068d97.
Yep, we even have tests for that:
FYI, a test that hasn't completed will time out after 2 minutes (lines 1364 to 1365 in 1b74aa3).
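On the never-settling-promise question, a general illustration of Node.js event-loop behaviour (not a claim about how the hooks thread handles it): Node can only report an unsettled top-level `await` (exit code 13) once nothing else keeps the event loop alive. Any active handle, such as the interval below or a `MessagePort` with a `'message'` listener, keeps the loop alive, so the same unsettled `await` just waits. `runModule` is a hypothetical helper; the `timeout` option is there only so the hanging case terminates.

```javascript
'use strict';
const { spawnSync } = require('node:child_process');

// Hypothetical helper: evaluate `source` as an ES module in a child process
// and report how the child ended. `timeout` kills the child if it hangs.
function runModule(source, timeoutMs) {
  const { status, signal } = spawnSync(
    process.execPath,
    ['--input-type=module', '--eval', source],
    { timeout: timeoutMs, encoding: 'utf8' },
  );
  return { status, signal };
}

// Nothing else is pending: Node detects the unsettled await and exits with 13.
console.log(runModule('await new Promise(() => {});', 5000));

// An active interval keeps the event loop alive, so the same await just waits
// until the parent's timeout kills the child with SIGTERM.
console.log(runModule('setInterval(() => {}, 1000); await new Promise(() => {});', 500));
```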
Could you reduce this into a minimal reproduction? Like something that could become a test in the Node codebase (assuming there’s a fix for it). Or maybe make a branch from the Node repo and use the existing fixtures to create a test that shows the issue.
let stderr = '';
let stdout = '';
// ! Do NOT use spawnSync here: it will deadlock.
const child = spawn(execPath, [
You want to write the test as you'd like Node.js to work, i.e., if we ever fix the bug, we should just have to run `git mv test/known_issues/test-hooks-deadlock.js test/es-module`.
Suggested change:
const child = spawn(execPath, [
const result = await spawnPromisified(execPath, [
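For comparison, a minimal promise-returning wrapper around `spawn` from `node:child_process`. It is a hypothetical stand-in named `spawnAsync`, not the `spawnPromisified` helper from Node's test utilities, whose exact signature and result shape are not shown here. Like the original snippet, it collects `stdout` and `stderr` as strings; the promise settles only when the child exits, so a deadlocked child leaves the awaiting test pending as well.

```javascript
'use strict';
const { spawn } = require('node:child_process');

// Hypothetical stand-in: spawn a child and resolve with its exit code, signal
// and collected output once it closes. Rejects if the process cannot start.
function spawnAsync(command, args, options) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, options);
    let stdout = '';
    let stderr = '';
    child.stdout.setEncoding('utf8');
    child.stderr.setEncoding('utf8');
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.stderr.on('data', (chunk) => { stderr += chunk; });
    child.on('error', reject);
    // 'close' fires after the process exits and its stdio streams are closed.
    child.on('close', (code, signal) => resolve({ code, signal, stdout, stderr }));
  });
}

// Usage: await the child's result instead of blocking the parent with spawnSync.
spawnAsync(process.execPath, ['-e', 'console.log(42)']).then((result) => {
  console.log(result.code, JSON.stringify(result.stdout));
});
```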
How do I then verify it is indeed "broken"? `spawnPromisified` will cause the process to hang.
I would call that broken, isn’t that good enough?
No, I mean it would break CI. The test as currently written demonstrates that the targeted behaviour is broken. Do these tests work differently?
Yeah, `known_issues` are for tests that are not passing (but we'd like them to). E.g. `out/Release/node test/known_issues/test-vm-ownkeys.js` exits with a non-zero code, and `tools/test.py test/known_issues/test-vm-ownkeys.js` shows "All tests passed".
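A simplified model of the behaviour described in this comment, as a hypothetical `runKnownIssue` helper (not the actual logic of `tools/test.py`): a `known_issues` test asserts the desired behaviour, so while the bug exists it exits non-zero and that is treated as the expected outcome. Because a deadlocked child never exits on its own, the sketch passes `spawnSync`'s `timeout` option and reports a hang separately, rather than assuming how the real runner classifies timeouts.

```javascript
'use strict';
const { spawnSync } = require('node:child_process');

// Hypothetical helper: run a test in a child Node.js process and interpret
// the outcome the way known_issues tests are described above.
function runKnownIssue(args, timeoutMs) {
  const result = spawnSync(process.execPath, args, {
    timeout: timeoutMs, // kill the child (SIGTERM by default) if it hangs
    encoding: 'utf8',
  });
  if (result.error && result.error.code === 'ETIMEDOUT') return 'timed out';
  // A failing known_issues test is the expected state while the bug exists.
  if (result.status !== 0) return 'expected failure';
  // Passing means the bug is fixed; the test can move to its regular directory.
  return 'unexpected pass';
}

// Usage with inline scripts standing in for test files:
console.log(runKnownIssue(['-e', 'process.exit(1)'], 5000)); // expected failure
```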
How the heck is the hung promise in the test I've added here deadlocking, then? 😵
Because of chaining?
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
This is about as minimal as I could figure out how to make it. What would you suggest can be removed to make it simpler and still trigger the issue? It seems to require, at minimum:
I asked that before the PR with the failing test was created (or before I noticed it). I assume the failing test is about as minimal as we can get.
This issue was identified in #50948. Until we can provide a mitigation or proper solution, we should at least warn users of the danger.