module: warn of potential for deadlock with hooks worker #51035

Draft
wants to merge 7 commits into
base: main
from

Conversation

JakobJingleheimer
Member

This issue was identified in #50948. Until we can provide a mitigation or proper solution, we should at least warn users of the danger.

@JakobJingleheimer JakobJingleheimer added the doc, module, esm, and fast-track labels Dec 3, 2023
@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/loaders

Contributor

github-actions bot commented Dec 3, 2023

Fast-track has been requested by @JakobJingleheimer. Please 👍 to approve.

@JakobJingleheimer JakobJingleheimer marked this pull request as ready for review December 3, 2023 20:34
@GeoffreyBooth
Member

I feel like we should have a minimal reproduction before we document this.

It also doesn't need fast-track; docs don't get updated until the next release, so it doesn't matter much how quickly this lands.

@JakobJingleheimer
Member Author

JakobJingleheimer commented Dec 3, 2023

We have a minimal repro.

RE fast-track: I guess; but the sooner it lands, the sooner it stops taking up my capacity.

doc/api/module.md (outdated)
> example, you have 2 modules, A and B. "A" is registered first and sets up a
> message channel,which it uses in its `resolve` hook. "B" uses `register` to
> register its own loader. Resolving "B"'s own loader will go through "A"'s
> `resolve`, which will try to communicate with the Module Worker. The Module
Member

Suggested change
- > `resolve`, which will try to communicate with the Module Worker. The Module
+ > `resolve`, which will try to communicate with the thread that the hooks are running on. The Module

No one knows what Module Worker means. I'm not even sure what it means.

Member Author

Out of curiosity, who is "no one"?

> message channel,which it uses in its `resolve` hook. "B" uses `register` to
> register its own loader. Resolving "B"'s own loader will go through "A"'s
> `resolve`, which will try to communicate with the Module Worker. The Module
> Worker is currently busy trying to register "B"'s loader, thus resulting in a
Member

A Worker that the user created, or our hooks thread?

Member Author

@JakobJingleheimer JakobJingleheimer Dec 3, 2023

Our hooks thread. I believe it is named "module worker" in node's source code. Could be wrong; I haven't touched it in a while.

@JakobJingleheimer JakobJingleheimer changed the title from "module: warn of potential for deadlock with module worker" to "module: warn of potential for deadlock with hooks worker" Dec 3, 2023
@aduh95 aduh95 removed the fast-track label Dec 3, 2023
Contributor

@aduh95 aduh95 left a comment

If we do have a minimal repro (do we?), let's add it to test/known_issues

Comment on lines +114 to +123
> **Warning** When setting up a `MessageChannel` to communicate with hooks,
> beware that this can lead to a deadlock. For example, you have 2 modules,
> A and B. "A" is registered first and sets up a message channel, which it uses
> in its `resolve` hook. After "A" is registered, "B" is registered. Resolving
> "B"'s specifier will go through "A"'s `resolve` hook, which will try to
> communicate with a locked thread that is busy trying to register "B"'s hooks.
> Since registering "B" depends on resolving "B"'s specifier, and resolving
> "B"'s specifier is blocked by "A"'s communication request that is itself
> blocked by the pending registration that started the chain, the application
> becomes deadlocked.
Contributor

@aduh95 aduh95 Dec 3, 2023

I think this needs a TLDR; in particular, it needs an introductory sentence explaining what not to do, because, let's be honest, I don't think anyone will be interested in the particular details unless they are running into that specific issue.
Maybe we can also tone down the details a lot; a vague explanation might be preferable.

Member Author

TLDR: if there are multiple loaders and at least one uses MessageChannel, you will probably footgun

Contributor

@aduh95 aduh95 Dec 3, 2023

Could we keep it as that?

Suggested change
- > **Warning** When setting up a `MessageChannel` to communicate with hooks,
- > beware that this can lead to a deadlock. For example, you have 2 modules,
- > A and B. "A" is registered first and sets up a message channel, which it uses
- > in its `resolve` hook. After "A" is registered, "B" is registered. Resolving
- > "B"'s specifier will go through "A"'s `resolve` hook, which will try to
- > communicate with a locked thread that is busy trying to register "B"'s hooks.
- > Since registering "B" depends on resolving "B"'s specifier, and resolving
- > "B"'s specifier is blocked by "A"'s communication request that is itself
- > blocked by the pending registration that started the chain, the application
- > becomes deadlocked.
+ > **Warning** If a `resolve` or `load` is left pending on a response from a
+ > `MessageChannel`, that will cause a deadlock when the main thread is
+ > "asleep" waiting for a response from the loader thread. To avoid that,
+ > always set a timeout when dealing with cross thread communication
+ > inside those hooks.

Member

"waiting for a response from the hooks" rather than "loader thread". In this case just "hooks" is better than "hooks thread" because there could be multiple hooks threads (for now).

Contributor

> To avoid that, always set a timeout when dealing with cross thread communication inside those hooks.

This is never going to be a suitable solution, it would be better not to suggest it.

Either the timeout will be too short, and will cancel requests that would not be deadlocked, or it's too long, and will incur a startup penalty equal to the length of the timeout when they are deadlocked. There's no reliable way to tune it to only cancel when it would be deadlocked, without significant perf penalties.

Furthermore, just canceling the request may actually not be what you want. Consider a transpiler loader that converts TypeScript into JavaScript, but has to talk to a service on the main thread to know how to do that correctly. If any Module.register() is called after this loader is registered, and the second loader is written in TypeScript, it's going to deadlock until the timer expires, then... what? Throw an error? Serve TypeScript to v8 uncompiled?

Contributor

Yeah, I’d throw an error. After, say, one second of idling, it’s time to give up. It’s not a perf penalty, it’s a DX thing: querying the main thread is kind of like making a network call, and you should always handle the case where you never get a response so that you don't keep your users waiting indefinitely.
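
For readers following this thread, here is a minimal sketch of the timeout idea being debated. It is not code from this PR or from Node.js itself. `requestWithTimeout` is a hypothetical helper name, and the sketch assumes that the first message received on the port is the reply to the request just sent. The trade-offs raised above (how long to wait, and what to do once the timer fires) still apply.

```javascript
// Hypothetical helper, not part of Node.js: post `message` on a MessagePort
// and wait for the first reply, rejecting if none arrives within `ms`.
function requestWithTimeout(port, message, ms) {
  return new Promise((resolve, reject) => {
    const onMessage = (reply) => {
      clearTimeout(timer); // a reply arrived in time: cancel the timer
      resolve(reply);
    };
    const timer = setTimeout(() => {
      port.off('message', onMessage); // stop waiting for a late reply
      reject(new Error(`no reply within ${ms}ms`));
    }, ms);
    port.once('message', onMessage);
    port.postMessage(message);
  });
}
```

Inside a `resolve` or `load` hook, the rejection would surface as a thrown error instead of an indefinite hang, which is the behaviour argued for in the comment above.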

@JakobJingleheimer
Member Author

> If we do have a minimal repro (do we?), let's add it to test/known_issues

Did not know this existed 😅 I'll add it tomorrow.

@GeoffreyBooth
Member

Does the issue require multiple user threads? Like can it happen regardless of whether the user has run new Worker?

@isaacs
Contributor

isaacs commented Dec 4, 2023

@GeoffreyBooth No new Worker involved. Just loaders being registered with Module.register(), and sending messages back to the main thread to decide how to return from the load/resolve hooks.

https://github.com/isaacs/node-21-import-deadlock

@JakobJingleheimer JakobJingleheimer marked this pull request as draft December 4, 2023 09:01
@JakobJingleheimer

This comment was marked as outdated.

@JakobJingleheimer
Member Author

JakobJingleheimer commented Dec 4, 2023

I added a test-case (which isn't quite complete for the issue reported: it still needs the piece on main to respond to the hooks-worker's request). But the incomplete test reveals that we actually have another problem first: a hook returning a never-settling promise causes a deadlock. @aduh95 I thought we specifically handled that in our original off-thread implementation? 🤔

@aduh95
Contributor

aduh95 commented Dec 4, 2023

> I thought we specifically handled that in our original off-thread implementation? 🤔

Yep, we even have tests for that:

describe('should handle never-settling hooks in ESM files', { concurrency: true }, () => {

I put this back to draft to avoid triggering test runs in CI (which will hang)

FYI, a test that hasn't completed will time out after 2 minutes:

node/tools/test.py

Lines 1364 to 1365 in 1b74aa3

result.add_option("-t", "--timeout", help="Timeout in seconds",
default=120, type="int")

@GeoffreyBooth
Member

GeoffreyBooth commented Dec 4, 2023

> isaacs/node-21-import-deadlock

Could you reduce this into a minimal reproduction? Like something that could become a test in the Node codebase (assuming there’s a fix for it). Or maybe make a branch from the Node repo and use the existing fixtures to create a test that shows the issue.

let stderr = '';
let stdout = '';
// ! Do NOT use spawnSync here: it will deadlock.
const child = spawn(execPath, [
Contributor

@aduh95 aduh95 Dec 4, 2023

You want to write the test as you'd like Node.js to work, i.e. if we ever fix the bug, we should just have to git mv test/known_issues/test-hooks-deadlock.js test/es-module

Suggested change
- const child = spawn(execPath, [
+ const result = await spawnPromisified(execPath, [

Member Author

How do I then verify it is indeed "broken"? spawnPromisified will cause the process to hang.

Contributor

I would call that broken, isn’t that good enough?

Member Author

No, I mean it would break CI. The test as currently written demonstrates that the targeted behaviour is broken. Do these tests work differently?

Contributor

Yeah, known_issues are for tests that are not passing (but we'd like them to). E.g. out/Release/node test/known_issues/test-vm-ownkeys.js exits with non-zero code, and tools/test.py test/known_issues/test-vm-ownkeys.js shows "All tests passed".

@JakobJingleheimer
Member Author

> I thought we specifically handled that in our original off-thread implementation? 🤔
>
> Yep, we even have tests for that:

How the heck is the hung promise in the test I've added here deadlocking then 😵

@GeoffreyBooth
Member

> How the heck is the hung promise in the test I’ve added here deadlocking then 😵

Because of chaining?

Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
@isaacs
Contributor

isaacs commented Dec 9, 2023

> Could you reduce this into a minimal reproduction? Like something that could become a test in the Node codebase (assuming there’s a fix for it). Or maybe make a branch from the Node repo and use the existing fixtures to create a test that shows the issue.

This is about as minimal as I could figure out how to make it. What would you suggest can be removed to make it simpler and still trigger the issue? It seems to require, at minimum:

  • two loaders
  • the first of which is using a MessageChannel and only returns from its async resolve hook after getting a response from the main thread
  • registered serially using Module.register() (either using the serialized --import args behavior that landed recently, or by explicitly loading both import scripts one after the other.)
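
The second bullet can be sketched roughly as follows. This is an illustration of the shape described above, not the code from the linked reproduction. `mainPort` is a hypothetical variable name, the reply format (`{ specifier }`) is an assumption, and in a real hooks module `initialize` and `resolve` would be exported; they are plain functions here so the snippet runs on its own.

```javascript
// Port to the main thread, handed over at registration time. On the main
// side this would typically be passed along the lines of
// register('./hooks.mjs', { parentURL: import.meta.url,
//                           data: { port }, transferList: [port] }).
let mainPort;

// Receives the `data` value that was passed to `register()`.
function initialize({ port }) {
  mainPort = port;
}

// Asks the main thread about the specifier and continues only once the
// main thread has answered.
async function resolve(specifier, context, nextResolve) {
  const answer = await new Promise((done) => {
    mainPort.once('message', done);
    mainPort.postMessage({ specifier });
  });
  // If the main thread is blocked, for example while a second `register()`
  // call is still in progress, `answer` never arrives and this hook never
  // returns.
  return nextResolve(answer.specifier ?? specifier, context);
}
```

With a second loader registered afterwards, resolving that loader's specifier passes through this `resolve` hook, which is where the chain described in this PR gets stuck.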

@GeoffreyBooth
Member

> What would you suggest can be removed to make it simpler and still trigger the issue?

I asked that before the PR with the failing test was created (or before I noticed it). I assume the failing test is about as minimal as we can get.

Labels
doc, esm, module

5 participants