Module customization hooks and worker threads #1566

Open
GeoffreyBooth opened this issue May 30, 2024 · 35 comments

@GeoffreyBooth commented May 30, 2024

With regard to these issues and PRs discussed in yesterday’s TSC meeting:

It appears that we have a solution in nodejs/node#53200. I spoke to @dygabo today (the author of that PR) and he thinks that he’s fixed the bug and satisfied all the notes that people have left, other than @mcollina’s request (which I’ll get to in a minute).

In particular, nodejs/node#53200 makes the prior PR, nodejs/node#52706, work as intended: calling register from new Worker execArgv no longer causes a hang, but instead registers hooks on the one hooks thread that nodejs/node#52706 created (replacing the prior architecture of a separate hooks thread per worker thread). This has been the intended architecture since we first moved the hooks off-thread, and has been the plan since 2022. The issue I opened last year about the bug where there are multiple hooks threads, nodejs/node#50752, has 22 👍s, and this architecture was discussed in loaders meetings with stakeholders, so I do think it’s the design favored by the majority of the community, though clearly some prefer the alternative approach and consider moving to a single thread to be a regression.

@mcollina’s objection to nodejs/node#53200 is that it doesn’t revert the change intended by nodejs/node#52706: he wants multiple hooks threads, though he didn’t object to nodejs/node#52706 or in the previous threads where this was discussed.

What’s on main right now is what no one wants; we should either fix the single-hooks-thread implementation or revert it. On the one hand, I sympathize with users like Matteo who were affected by this change, as even the bugfixed version will still require them to update their code to compensate for the new architecture; the Angular team has already done so in angular/angular-cli#27721, which fixes their app even on Node 22.2.0 with the hanging bug. That PR demonstrates a workaround for the new behavior for people who really want scoped hooks: use child processes instead of worker threads. I assume this approach is probably slower, but it works. And it doesn’t need to be the end state; we could very well ship a new way to provide scoped hooks, and a discussion has begun in nodejs/node#53195 to brainstorm options.
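For illustration, here is a minimal sketch of that child-process workaround (the file names and hook contents are hypothetical, not taken from the Angular PR): the parent forks a child and passes --import via execArgv so the hooks are registered only inside that child process.

// main.js (hypothetical): scope hooks to a child process instead of a worker thread
import { fork } from 'node:child_process'
import { fileURLToPath } from 'node:url'

const child = fork('./task.js', {
  // --import runs register-hooks.js before task.js, so the hooks
  // apply only to this child process, not to the parent.
  execArgv: ['--import', fileURLToPath(new URL('./register-hooks.js', import.meta.url))],
})

child.on('exit', (code) => console.log(`task finished with code ${code}`))

// register-hooks.js (hypothetical)
import { register } from 'node:module'

register(new URL('./hooks.js', import.meta.url))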

The option I don’t think we’re likely to choose is the prior state of things in 22.1.0, where every worker thread got its own hooks thread, regardless of whether that’s what the user wanted. Many of the options we’re discussing involve scoping the hooks in a more limited way, where they can behave differently per thread but still run on a single hooks thread rather than N hooks threads. So I’m hesitant to revert to the 22.1.0 behavior (which, to be clear, was buggy and unintentional) when I expect it will just change again. My preference would be to land nodejs/node#53200 to fix the current implementation, land at least some documentation explaining the child-process workaround, and continue to evaluate options for the scoped-hooks feature request, potentially landing something to satisfy those use cases once we have a solution that seems good to everyone. This would minimize churn, as the majority of users who aren’t trying to get scoped hooks won’t be affected by future additions to support scoped hooks.

Alternatively we could revert back to 22.1.0 and then eventually land the change that moves to a single hooks thread alongside some new API that provides scoped hooks. This would involve more churn for users, and probably push out the overall API becoming stable for quite a while.

cc @nodejs/loaders @nodejs/tsc

@jasnell commented May 30, 2024

Let's revert the release branch to deal with the reported breakage there and get a new release out, even if that means continuing with the original bug that this change was meant to fix.

On main we have a couple of possible paths forward; I think @mcollina has an alternative approach in mind, and we're just waiting for the PR to be opened.

@mcollina commented May 30, 2024

I think when unexpected breakage of this magnitude happens, it's best to revert and reapply later. This situation has brought up a few use cases that were covered by unit tests but not on the radar of the loaders team. It's better to regroup and reassess the situation.

Telling people who were using threads to use processes instead is a step 10 years backwards in the history of Node.js. I think we are past that point. If this is the recommendation, we should remove threads. For all intents and purposes, I cannot endorse that as a technical priority for the project.

@GeoffreyBooth

Something that was expressed to me about the idea of hooks threads that correspond to loader pools is that such an architecture is very complicated both to build and to reason about, and would entail significant additional infrastructure: we’d need to provide APIs for spawning a defined number of hooks threads, for determining when to terminate the hooks threads (should they survive the pool they were meant to serve?) and so on. The intended design of a single hooks thread that all process threads share is much simpler.

Consider the meeting where this was decided, back in 2022:

Quick update from our recent team meeting: We invited Gil Tayar (author/maintainer of several pertinent libraries) to discuss spawning a dedicated loaders thread per user-land thread, and he noted that will add enormous complexity to library authors (on top of the extra complexity on node’s side). We decided for an initial implementation, it would be better to use a single loaders thread shared by all user-land threads and add caveats to the relevant sections of the docs. If there is sufficient appetite, we can subsequently add a configuration option (perhaps to the Worker constructor) to spawn dedicated loaders threads.

It’s more likely that a hooks author will want to share hooks state between all threads of the process than to have separate hooks state per application thread. A transpilation hook will want to transpile a file once and keep the result in memory, not transpile it again and again for each thread. A testing framework might want to create a mock that applies to all threads. Instrumentation libraries would want only one copy of their library running, not one copy per thread. I feel pretty sure that a single hooks thread should be our default behavior, and something more sophisticated can come later (or not at all; I’m genuinely concerned about our ability to maintain even the level of complexity that we already have). I think we could add workerData to the context object passed into each hook, so that the hooks can know which thread each module being processed is coming from and act accordingly (to give a particular mock to a particular thread, say), and that should suffice for most use cases.
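As a sketch of that shared-state argument (a hypothetical transpilation hook, not code from any of the PRs above; the .ts check is simplistic and transpileTypeScript is a stand-in for a real transpiler): with a single hooks thread, one module-level cache serves every application thread that imports the same file.

// hooks.js (hypothetical): a load hook that transpiles once and caches the result
const transpiledCache = new Map()

// Placeholder: a real hook would call an actual transpiler here.
const transpileTypeScript = (source) => source

export async function load(url, context, nextLoad) {
  if (!url.endsWith('.ts')) {
    return nextLoad(url, context)
  }
  if (!transpiledCache.has(url)) {
    const { source } = await nextLoad(url, { ...context, format: 'module' })
    transpiledCache.set(url, transpileTypeScript(source.toString()))
  }
  // On a single hooks thread this cache is shared by all application threads.
  return { format: 'module', source: transpiledCache.get(url), shortCircuit: true }
}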

Ultimately this is about how people conceive of threads. Are they part of an existing app, spreading CPU-intensive work across multiple cores, or are they basically pseudo-child processes, where you can launch mini-Nodes in different ways to do various things? The desire to have different hooks for different threads relates to the latter concept, which is why child processes are a workable alternative. We only have this issue in the first place because of new Worker execArgv, which is a Node invention.

@jasnell commented May 30, 2024

Let's separate the concerns here.

Issue 1: Reverting the change in the release branch to unbreak the folks this change broke. This should be our immediate first action.

Issue 2: What is the long term correct fix and what steps do we take to get there that (a) accomplishes the goal AND (b) minimizes breaks in the ecosystem. There is a path forward here where we are working with the ecosystem and not in spite of them.

Before we tackle issue 2, let's deal with issue 1 which is the more pressing matter.

@mcollina commented May 30, 2024

The current list of broken things:

  1. mocking tools (esmock)
  2. frontend frameworks (Angular)
  3. testing tools (ava)

The last one is quite important, because it's built around the concept of using worker threads for isolation. The community is telling us this is a valid use case for threads.

(Note that I'm 100% supportive of sharing one loader thread, I think this implementation landed too early).

I consider not restoring this functionality as soon as possible to be disrespectful to our users. How many wasted hours of dev time should we push onto module authors?

@GeoffreyBooth

The current list of broken things:

Can you share the issues for esmock and ava? I’ve only seen the issue for Angular.

@mcollina

There aren’t public issues, I’ve just tried them.

@GeoffreyBooth

Issue 1: Reverting the change in the release branch to unbreak the folks this change broke. This should be our immediate first action.

Issue 2: What is the long term correct fix and what steps do we take to get there that (a) accomplishes the goal AND (b) minimizes breaks in the ecosystem. There is a path forward here where we are working with the ecosystem and not in spite of them.

So say we do step 1 here, of reverting on the release branch. Then presumably we land nodejs/node#53200 on main, and both it and the predecessor nodejs/node#52706 get tagged as don’t-backport to any release line. Then what?

We can’t both restore the 22.1.0 behavior and migrate to a single hooks thread. So even after the revert in 22.3.0 changes us back to the 22.1.0 behavior, eventually we’re going to ship another intentional change that some people will be forced to react to. So now we cause changes twice: in 22.3.0, and again in some future version when we try again to ship the “one hooks thread” design, possibly with something in addition to opt into scoped hooks. Maybe that’s better than just shipping the bugfix? I dunno. All I know is that I’ve heard very consistent feedback that library authors want us to minimize the number of changes they need to react to.

With regard to Ava, it already has a --no-worker-threads option to tell it to run tests in child processes rather than worker threads: https://github.com/avajs/ava/blob/main/docs/08-common-pitfalls.md#nodejs-command-line-options-child-processes-and-worker-threads. So it’s already been shipping the same workaround that Angular built.

I don’t know what issue esmock could be having. It only added a reference to worker_threads two weeks ago: iambumblehead/esmock@0a2a93a. There are no references to new Worker in its codebase.

Lastly, it may be that I was wrong to assume that worker threads are faster than child processes; apparently it’s the reverse: https://github.com/orgs/nodejs/discussions/44264. So encouraging people to migrate from new Worker to child_process.fork might actually be a good recommendation, and perhaps there’s not much reason for us to build complicated hooks-thread orchestration APIs when child_process.fork already exists and is probably faster. And anyone using worker threads to get scoped loaders today who migrates to child processes won’t need to do another migration later on, regardless of how hooks and threads may evolve.
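To make the comparison concrete, here is a rough sketch of the two spawning styles (file names hypothetical); both new Worker and fork accept execArgv, which is what makes fork a workable substitute for this scoped-hooks use case.

import { Worker } from 'node:worker_threads'
import { fork } from 'node:child_process'
import { fileURLToPath } from 'node:url'

const hookArgs = ['--import', fileURLToPath(new URL('./register-hooks.js', import.meta.url))]

// Worker thread: under the single-hooks-thread design discussed above, hooks
// registered this way end up on the process-wide hooks thread rather than
// being scoped to this worker.
new Worker('./task.js', { execArgv: hookArgs })

// Child process: the hooks stay scoped to the forked process.
fork('./task.js', { execArgv: hookArgs })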

@dygabo commented May 31, 2024

Please consider my two cents.

Note: if no change in behaviour compared to 22.1.0 is acceptable, then the only short-term way forward is the revert.

Having said that, I started working actively on this after some time of following the discussions (on and off). From my perspective, we needed a solution to reduce the number of threads from 2*N to N+1 in order to reduce resource consumption, etc.

This kind of change is intrusive by definition, and IMO it is expected that changes in behaviour will occur. We unfortunately introduced the hanging-process bug with that PR (unintentionally, needless to say). #53200 tries to fix that. But one more decision of #52706 (obviously not an easy one) was that module.register can only be called by the main thread, and we postponed the work on a solution for that.

Supporting module.register from an arbitrary thread introduces one more issue. If we do that, we need to ensure that the thread calling module.register has ownership of the hook inside the hook chain. Otherwise the hook chain might grow indefinitely. This might have effects on scenarios like "a whole thread pool shares a hook". If these are discussed and agreed upon, we should also put that in the documentation as a limitation or something that hooks implementors have to think about.

Personally, I would opt for a solution based on nodejs/node#53200 (comment). But #53200 needs some work after the last discussions, and it would be good to be able to do that work without the pressure of next week's release.

I also think that the API openness of both worker_threads and loader hooks (which is a good thing to begin with) can create a few runtime scenarios that are hard to diagnose when troubleshooting. These should be clearly documented.

And then there's the concern of hooks polluting state on the hooks thread in a way that impacts other threads. I think our open interface makes room for this kind of bug (which the isolation of worker threads tries to solve).

@mcollina

I personally think this is fixable, and a design that can cater to most use cases can be done. Even if not, it should be clearly communicated ahead of time and circulated widely. What we need to do now is to restore a situation that did not break users, and then design something that can work for everybody.

@aduh95 commented May 31, 2024

Let's land the revert now, and then let's open a "revert revert" PR that can also land immediately with dont-land- labels. That way, 22.x gets back the old behavior while we can keep iterating on main until we're happy with the behavior, at which point we can remove the labels.

@ShogunPanda

@aduh95 When you talk about a revert-revert, does that basically mean reapplying the original nodejs/node#52706 to main? If that's so, wouldn't it be easier to just land nodejs/node#53183 on v22.x only?

@mcollina

@ShogunPanda I think the double revert makes the job easier for the releasers.

@ShogunPanda

Oh, I see. That makes total sense. :)

@GeoffreyBooth

Let's land the revert now, and then let's open a "revert revert" PR that can also land immediately with dont-land- labels.

I'm okay with this if the latter PR can't get blocked. Otherwise we can just revert on the release branch.

@mcollina

I have no problem iterating on the feature on main, or working on a new implementation that doesn't have these issues in a fresh PR. I'm okay as long as we don't break constituents (or we break them consciously).

@GeoffreyBooth

Personally, I would opt for a solution based on nodejs/node#53200 (comment). But #53200 needs some work after the last discussions, and it would be good to be able to do that work without the pressure of next week's release.

I agree. How about this:

  1. We get module: have a single hooks thread for all workers node#52706 off of v22 while keeping it on main, however it makes the most sense to do so (just revert on v22, revert-revert, whatever).
  2. We land module: allow module.register from workers node#53200 on main to fix the hanging bug.
  3. We land a new PR that adds workerData and/or threadId to the context passed into the resolve and load hooks. This workerData and/or threadId would correspond with the thread of the module being imported, that the hooks are customizing. This would let us have a single hooks thread for the process while still allowing hooks to behave differently for the modules on each thread.

Step 2 fixes the cases that @ShogunPanda provided in his reproduction, and step 3 would fix the Angular case. Angular’s old code wouldn’t just start working again, but adapting to this change would be very minor: instead of importing workerData from node:worker_threads within the hooks thread, the code would retrieve it from context.workerData.
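A rough before/after sketch of that migration (hypothetical hook code; the scope field is an invented example, and context.workerData is the proposed addition, not an existing API at the time of writing):

// hooks.js (before): imports workerData from node:worker_threads within the
// hooks thread, as Angular's old code did.
import { workerData } from 'node:worker_threads'

export function resolve(url, context, next) {
  if (workerData?.scope === 'build') {
    // thread-specific behaviour
  }
  return next(url, context)
}

// hooks.js (after, under the proposal): the same information arrives on context.
export function resolve(url, context, next) {
  if (context.workerData?.scope === 'build') {
    // thread-specific behaviour
  }
  return next(url, context)
}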

This would fix the “more than one hooks thread getting spawned” bug while preserving the ability for hooks to behave differently based on the thread that modules are on. The other advantage of step 3 is that it’s small in scale; there’s a chance it could be done before the 22.3.0 release, which would minimize churn.

@ShogunPanda

About #3: I'm currently working on https://github.com/ShogunPanda/node/tree/multiple-chains (based off nodejs/node#53200), which will address the threadId and enable new use cases.

@aduh95 commented May 31, 2024

  1. We get module: have a single hooks thread for all workers node#52706 off of v22 while keeping it on main, however it makes the most sense to do so (just revert on v22, revert-revert, whatever).

That's not going to work well with the current way we're cutting releases; let's just fast-track the revert and the revert-revert, no one is going to block it. That way, 22.x gets back the old behavior while we can keep iterating on main until we're happy with the behavior.

@GeoffreyBooth

let’s just fast-track the revert and the revert-revert, no one is going to block it.

As long as it can’t be blocked, that’s fine, I’m happy to do whatever method makes releases easier.

@GeoffreyBooth

So just to build on my suggested way forward in #1566 (comment): let’s do whatever reverts and revert-reverts, or whatever the proper procedure is, right after the next TSC meeting; at the TSC meeting we can confirm that everyone there is on board with that plan, and we can coordinate who’s doing what and ensure that we’ll have the proper approvals and no blocks. Does that work, @aduh95?

Then I think the priorities need to be:

  1. Fix the bug where calling register sometimes causes a hang. This is already fixed by module: allow module.register from workers node#53200.
  2. Add workerData and/or threadId to context so that hooks have the ability to behave differently based on the thread that a module is imported on. This should be a quick follow-up, or maybe even included in module: allow module.register from workers node#53200. This will fix the Angular case.
  3. Add a way to define which hooks run for which threads (the separate chains idea by @ShogunPanda). As this is a new idea and a new API, we should do some outreach to hooks authors to get their feedback before landing this.

I think we have a good chance of landing the first two before 22.3.0, in which case we can release nodejs/node#52706, nodejs/node#53200 and the context addition all in 22.3.0; this would both unbreak everyone we know we’ve broken and minimize churn, because we don’t ship a release that reverts to the 22.1.0 behavior only to change it yet again in the following release. I think this is the ideal outcome.

I’m not aware of anyone currently relying on the ability of execArgv to define different hooks for different threads. Even if there are such users out there, context.workerData should still unbreak them, as a single hook can always run for every thread and simply vary its behavior per thread; but I can see how it might be preferable to give threads the ability to register hooks that don’t affect other threads. (Child processes are another option that can already unbreak such a use case.) But some concerns have already been raised about the “separate chains” idea, so I don’t think it’s something we should try to rush out for 22.3.0. We could revert to the 22.1.0 behavior until we sort out all the issues around separate chains, but since we don’t know of anyone broken by the lack of this isolation and since there are two workarounds, I’m inclined to ship what I’m proposing here as the 22.3.0 behavior (current 22.2.0 plus items 1 and 2 above), and then “separate chains” or some other solution for isolation can land whenever it’s ready (or we decide that child processes are the best solution for that use case). Thoughts?

@aduh95 commented Jun 1, 2024

What's the point of waiting for the next TSC meeting? The revert is going to miss the next release on Tuesday, which is a disservice to our users; if we all agree it should be reverted on 22.x, let's start with that.

@GeoffreyBooth commented Jun 1, 2024

If nodejs/node#53200 including context.workerData/context.threadId (what I proposed in #1566 (comment)) is ready by the 22.3.0 cutoff, and no one blocks it, then we don’t need to do a revert at all. So that’s the question: if that’s ready in time, does anyone have any objections to releasing that as 22.3.0?

In short: for 22.3.0 we unbreak everyone who we know was broken by 22.2.0. In future releases we handle additional desired use cases.

@mcollina commented Jun 2, 2024

I have a feeling that the timeline is rushed, and at least one PR would need to be fast-tracked. The safest bet is to revert, and take a bit of time to verify that the solution works across all designs and constraints. The proposed fix introduces the possibility for one thread to alter the behavior of a sibling thread by issuing a register(). That'd be a nightmare for users to debug.

I think the source of the confusion is that Node.js supports having user threads with different responsibilities. The current implementation of "one loader chain" breaks this assumption without a way for the user to control what is happening. The solution @GeoffreyBooth is proposing is to "use child processes" if isolation is required. I don't think this is a good path for Node.js.

As an example, pino spawns a thread for distributing logs to destinations (elasticsearch and the like) off the main thread. This has advantages for the end users, as it guarantees better performance as well as the ability to deliver the logs inside the 'exit' event by blocking the main thread. We recommended using external processes for doing this for years, until it was possible to use threads.

I don't think this is something we want to rush, as there are important decisions to be made.

@ShogunPanda

I agree. While @dygabo is doing a wonderful job, our time constraint for 22.3 doesn't give us enough time to consider all possible edge cases.
Let's revert and work at ease, aiming for 22.4.

@GeoffreyBooth

The solution @GeoffreyBooth is proposing is to "use child processes" if isolation is required.

I don't think this is quite accurate. If each hook has access to each thread's workerData or ID, the hook can run code specific to that thread simply by branching.

@mcollina commented Jun 2, 2024

I don't think this is quite accurate. If each hook has access to each thread's workerData or ID, the hook can run code specific to that thread simply by branching.

Sorry I misunderstood the proposal. This can work.

@GeoffreyBooth

Sorry I misunderstood the proposal. This can work.

Okay. I’ve approved the revert, since based on nodejs/node#53195 (comment) it seems like my suggestion of context.workerData probably can’t be ready in time for 22.3.0. If somehow it is, we can still land the revert-revert and the follow-up PR(s) all on the 22.x branch as part of putting together 22.3.0, but that seems unlikely.

Both context.workerData and separate chains share the design that, while all hooks code is isolated from application code (by being on the hooks thread), hooks code is not isolated from other hooks code. This has been the case ever since we allowed chaining of hooks; if one hook sets a global, it is available to other hooks, and it’s the responsibility of hooks authors to be careful if they don’t want to pollute the global scope of the hooks thread. I think this is okay, and there’s a lot more need for things to be shared between hooks (like caches of transpiled files, mocks that should share state between hooks or between modules importing those mocks, etc.) than there is a need for isolation; and authors can always create isolation themselves in various ways.

@mcollina would you be okay with allowing nodejs/node#53200 to land if it includes what it has now plus some way for hooks to branch based on thread, like

export function resolve(url, context, next) {
  if (context.workerData?.translations === 'italian') {
    // Do something specific for the thread with the Italian translations
  }
  return next(url, context)
}

or similar (like maybe it’s threadId)? And then we can release the revert-revert and this expanded nodejs/node#53200?

And then the “separate chains” idea can be a follow-up.

@mcollina commented Jun 3, 2024

@mcollina would you be okay with allowing nodejs/node#53200 to land if it includes what it has now plus some way for hooks to branch based on thread, like

I'm not convinced that it would solve all cases, or that all the deduplication issues would be solved in time. I fear we are going to rush something out that would be differently broken vs the status quo, confusing our users further.

@dygabo commented Jun 3, 2024

I'm not convinced that it would solve all cases, or that all the deduplication issues would be solved in time. I fear we are going to rush something out that would be differently broken vs the status quo, confusing our users further.

fwiw the most difficult issue that I see now with the single-thread approach is that it runs code initiated by different threads. Even if this is solvable at a logical level (offer the hook all the data it needs to decide whether to act or skip, based on threadId+workerData), it cannot be enforced that running a hook function has no side effects on subsequent calls of that function initiated by a different thread. If it cannot be guaranteed that the hook functions are pure (zero side effects on runtime behaviour, return values, global state, etc.), this cannot entirely work in isolation.

I'm not sure if there is agreement on the lifespan of the registered hooks either. If a hook was registered by a call on thread X and that thread is gone, the registered hook must go too. This is implementable but opens up different issues. If we have the requirement that a Node process must start without a hooks thread if nobody needs one, what happens if only one worker needs it? Does the respective worker thread own the hook, or does it also own the HooksThread?

@GeoffreyBooth commented Jun 3, 2024

The way I see it is that module.register is defined as “register these hooks to run for the current Node process, including all worker threads.” So register behaves identically whether called from the main thread or a worker thread. So if a hooks thread is created because register is called from a worker thread, then it stays until the overall process exits. We don’t currently have a way to deregister hooks (or to remove the hooks thread, which would presumably happen when the last hooks are deregistered) but such abilities can be added later.

If a worker thread wants to register hooks that affect only that thread, it can do so by passing data such as that thread’s threadId into the initialize hook, and the hook code knows to act only when that data is present. Something like:

// worker.js
import { register } from 'node:module'
import { threadId } from 'node:worker_threads'

register(new URL('./hooks.js', import.meta.url), {
  data: { threadId }
})


// hooks.js
let registeringThreadId

export function initialize({ threadId }) {
  registeringThreadId = threadId
}

export function resolve(url, context, next) {
  if (context.threadId === registeringThreadId) {
    // Do something only for the thread that registered this hook
  }

  return next(url, context)
}

@mcollina commented Jun 4, 2024

The way I see it is that module.register is defined as “register these hooks to run for the current Node process, including all worker threads.” So register behaves identically whether called from the main thread or a worker thread.

We are not in agreement. I think having register called from a worker thread change the behavior of its parent or another worker thread could lead to situations that are surprising and impossible to debug.

@alan-agius4

For visibility, I share the same sentiment as @mcollina. This approach somewhat breaks the fundamental concept of worker thread isolation, as initialization code in one worker thread can now affect other sibling threads and the parent process.

@ShogunPanda

Hello!
I finally created nodejs/node#53332, which should address all the use cases we discussed here.
