Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support registering worker specific custom ESM loader hooks (Regression from #52706) #53195

Open
alan-agius4 opened this issue May 29, 2024 · 18 comments
Labels
feature request Issues that request new features to be added to Node.js. loaders Issues and PRs related to ES module loaders stale

Comments

@alan-agius4
Copy link
Contributor

alan-agius4 commented May 29, 2024

What is the problem this feature will solve?

Following the implementation of #52706 in Node.js version 22.2, workers can no longer specify custom ESM loader hooks. In the Angular CLI, we utilized this feature as part of our prerendering (SSG) pipeline.

Here’s a summary of our build pipeline and process: In the main process, the application is transpiled and bundled, generating all files in memory. The main process then spawns multiple thread workers (using Piscina) with a customized loader hook to intercept imports and redirect them to the in-memory files when present. As of version 22.2 this is no longer possible.

It's worth noting that in our scenario, multiple threads (e.g., Thread A and Thread B) may be initialized, resulting in varying module resolutions and ESM module hooks across these threads.

In the following example, the workerData contains the files and modules utilized by the ESM loader hooks:

const { Worker } = require("worker_threads");

const filesInMemoryInItalian = {
  'main.mjs': 'export const hello = "ciao"';
};

const filesInMemoryInEnglish = {
  'main.mjs': 'export const hello = "hello"';
};

const workerA = new Worker("./lib/child.js", {
  workerData: { files: filesInMemoryInItalian },
  execArgv: ["--import", "./lib/register.js"],
});

const workerB = new Worker("./lib/child.js", {
  workerData: { files: filesInMemoryInEnglish },
  execArgv: ["--import", "./lib/register.js"],
});

What is the feature you are proposing to solve the problem?

Thread workers should be permitted to specify custom ESM loader hooks.

What alternatives have you considered?

An alternative approach is to spawn a forked process from the main process, specifying the custom loader hook in the forked process. This forked process then spawns the worker threads. As demonstrated in angular/angular-cli#27721.

However, It is important to note that this approach is not backward compatible. In previous versions, the ESM loader specified on the main thread did not propagate to child workers. Consequently, from a maintenance perspective, tools now need two methods to use ESM custom loaders with thread workers and need to support Node.js LTS and Active versions.

@alan-agius4 alan-agius4 added the feature request Issues that request new features to be added to Node.js. label May 29, 2024
@Flarna Flarna added the loaders Issues and PRs related to ES module loaders label May 29, 2024
@ShogunPanda
Copy link
Contributor

Hello!
Me and @dygabo are working on two possible solutions to this problem, which basically fix the regression introduced in #52706.
Will keep you posted.

@GeoffreyBooth
Copy link
Member

@nodejs/loaders

@GeoffreyBooth
Copy link
Member

@alan-agius4 Thanks for this. I’m a little confused by your example, because it has execArgv: ["--import", "./lib/register.js"] sent into both workers. Are you intending to have the same hooks apply to all workers?

Aside from the bug introduced by #52706, the intended design is that hooks registered on the main thread would apply to all worker threads automatically. From your example, I gather that this is the API you want, but somehow you get the differing context of each thread available to the hook? So like I see the differing workerData passed into each thread; would you want that data available within the hook, like on the context property for example?

export async function load(url, context, next) {
  const { files } = context.workerData // Either Italian or English, depending on the thread of the module
  // ...

@alan-agius4
Copy link
Contributor Author

I’m a little confused by your example, because it has execArgv: ["--import", "./lib/register.js"] sent into both workers. Are you intending to have the same hooks apply to all workers?

Yes, we intend to use the same hook logic but with different contexts/data. Before version 22.2, this was feasible as each thread independently registered its own hook with different workerData. The API you described would suit our needs. Currently, we have something similar for older Node.js versions:

register.js

register('./loader-hooks.js', {
  parentURL: pathToFileURL(__filename),
  data: workerData,
});

loader-hooks.js

let outputFiles;

export function initialize(data) {
  outputFiles = data.files;
}

export function resolve(specifier, context, nextResolve) {
  // resolve logic here
}

app.js

const workerA = new Worker("./lib/child.js", {
  workerData: { files: filesInMemoryInItalian },
  execArgv: ["--import", "./register.js"],
});

const workerB = new Worker("./lib/child.js", {
  workerData: { files: filesInMemoryInEnglish },
  execArgv: ["--import", "./register.js"],
});

the intended design is that hooks registered on the main thread would apply to all worker threads automatically

I understand this and it's acceptable. However, I'm unsure if it's a bug or not, but with version 22.2, you can opt-out of this behavior by setting execArgv: [] in the Worker constructor. For example, the following would force the worker not to use the main process loader hooks:

const workerB = new Worker("./lib/child.js", {
  execArgv: [],
});

Questions

  • Will loaders registered in thread workers eventually apply to the main process? I believe they shouldn't.
  • Would a hook registered in a thread worker also apply to sibling workers? For instance, in the provided scenario above, if workerB doesn't specify execArgv: ["--import", "./register.js"], would it inherit hooks from the other worker that were created previously even though no hooks are registered in the main process?

@GeoffreyBooth
Copy link
Member

  • Will loaders registered in thread workers eventually apply to the main process? I believe they shouldn’t.

I don't know; we haven't decided yet.

  • Would a hook registered in a thread worker also apply to sibling workers?

If we decide that calling register from a worker thread adds hooks to the one hooks thread that all threads share, then yes.

I’m unsure if it’s a bug or not, but with version 22.2, you can opt-out of this behavior by setting execArgv: [] in the Worker constructor.

This is essentially the problem introduced by #52706. It wasn’t on our radar that hooks could be applied to (or removed from) worker threads via execArgv; that was unintentional. Currently (and pre-22.2) calling register from a worker thread is a no-op. It shouldn’t be; it should either error, register hooks to apply to the entire process, or somehow register hooks that apply just to that thread. But we were planning on tackling that as a follow-up after landing #52706.

Right now we just want to unbreak people so that we can take some time to think about how to handle hooks for worker threads. The off-thread hooks code is already monstrously complex, so I worry about the maintainability of something so specific as “if register is called from a worker thread, the registered hooks apply only to that thread.” I understand the usefulness of such an ability, but we’re already up against the limits (if not beyond them) of what the Node team can achieve and support in this area. If there’s a simpler solution like the context.workerData idea that can also satisfy the use case, I think we need to pursue the simpler options.

Something else to consider is, what are you using worker threads for here? It seems like their purpose is a way to group sets of modules to be customized in a particular way. You probably don’t have easy access to edit the source code of the files you’re loading, but if you did, you could use import attributes for that purpose: import './file.js' with { translations: 'italian' }. The import attributes from the import statement/expression are already provided in the context argument sent into each hook. Another option would be to put the group marker into the URL somehow, like as a prefix or a query string param: import './file.js?translations=italian'. And so on.

Another option is that we have an old PR for a deregister method. You could call register with a particular set of hooks and data to pass into initialize, then import some modules you want customized by these hooks and settings, and then call deregister to tear it down.

Maybe none of these options are preferable to the isolation that worker threads provide, but my point is just to try to define the problem in a way that doesn’t prescribe a particular solution. As in, I assume that the threads are an implementation detail to whatever you’re ultimately trying to achieve. Maybe they are the best solution, better than child processes or import attributes or query strings or other ideas, but for the purposes of evaluating approaches it would help to define the goal as naively as possible when it comes to how it should be achieved.

@alan-agius4
Copy link
Contributor Author

I can see that the deregister function can be useful and we could work as in our case we could create the workers for each locale sequentially (Which we already do). However, we don't have easy access to modify the source file because, in our use case, the same source file is transpiled and localized into numerous locales as part of the build pipeline. I've created a diagram to help you better understand the workflow.

Screenshot 2024-05-30 at 10 21 49

@mcollina
Copy link
Member

I think there are four cases we need to discuss:

  1. one thread has loaders registered, all worker threads created by it should reuse the loader thread
  2. one thread does not have loaders registered, all worker threads created by it will have their own loader thread
  3. independently of all the status of the running thread, a pool of the worker threads created should use the same loader.
  4. one thread has loaders registered, a worker thread created by it wants to have the same loaders of the main thread.

I think that in order to support those cases, we need to have an option in the Worker class.

@mcollina
Copy link
Member

@alan-agius4 based on your diagram, it would be advantageous for you to have one loader thread for all WorkerPoolEN and one for WorkerPoolIT. Is it correct?

@alan-agius4
Copy link
Contributor Author

@mcollina, correct.

@GeoffreyBooth
Copy link
Member

One thing that’s come up while working on these PRs is that it’s very hard to reason about and debug multiple hooks threads. There’s also the question of how to manage them: do they go away when all threads using them exit, or stick around for potential future worker threads to use? We’d end up needing to build APIs for orchestrating these hooks threads.

What about the API I proposed in #53195 (comment), of passing in workerData on context? I think that would let you have hooks behave differently for modules imported by each thread, while still having only one hooks thread.

@mcollina
Copy link
Member

I'm not sure that would support having workers with different loader hooks.

@ShogunPanda
Copy link
Contributor

@alan-agius4

I'm currently working on https://github.com/ShogunPanda/node/tree/multiple-chains (based off #53200).

Basically the idea is that each thread at any given time has a chainId value. So in the Hooks class the #chains property goes from this:

#chains = {
  resolve: [...],
  load: [...]
}

To this:

#chains = {
  [id]: {
     resolve: [...],
     load: [...]
  }
}

Each application thread can OPTIONALLY have a different chain of hooks (by default it inherits its parent one or starts a new one if none is found)
And they all run in the same, single, application global, hook thread as it is of today.

Every time a thread needs to change the hooks data, it can start a new chain. The chain will be present in initialize, resolve and load context arguments (initialize will gain the second parameter compared to today's situation) so they can store data locally in the global thread without conflicts.

Do you think it will fit your use case?

17:54
The chainId is inherited from the parent thread but can be changed with a new node:module API
17:54
So new chains can be started at any time

@GeoffreyBooth
Copy link
Member

I’m not sure that would support having workers with different loader hooks.

It wouldn’t, but I don’t think we need that complexity. Within the hook itself it can handle the various cases:

export function resolve(specifier, context, next) {
  if (context.workerData?.onThreadA) {
    doWhateverShouldBeDoneForThreadA()

This is no less performant than us doing a similar switch in internal code to determine which hook chain to run.

@dygabo
Copy link
Member

dygabo commented May 31, 2024

I have not yet analyzed the details of the solution but as a first thought on this fwiw I think that having one single chain and calling the hooks with all the needed information for the hook to decide if it does something or calls next would be more open and also easier. The functionality is the same as I think your solution covers but I tend to believe that it composes better. One might want to have different state information in the worker data or simply the threadId as discriminator and one registered hooks set can be part of as many chains it wants without bookkeeping in core. The chain subscription would not be needed as that logic is distributed across the whole chain.

I pushed on #53200 one commit today that starts to implement a solution in this direction. In fact with that one the cases needed by the angular regression should already work.

Tests are missing, it is WIP but the idea should already be visible. We need a way to automatically unregister hooks. Still working on that part.

once again the disclaimer: not advocating, just proposing.

@GeoffreyBooth
Copy link
Member

I pushed on #53200 one commit today that starts to implement a solution in this direction. In fact with that one the cases needed by the angular regression should already work.

@dygabo can you explain this a bit? How would the Angular case be solved by what’s currently on #53200? Presumably their existing code won’t just start working again, so what would their rewritten hooks look like so that they can get the desired behavior of having the Italian translations available as the context data for modules imported from one thread and the English translations available as the context data for modules imported from another thread?

@dygabo
Copy link
Member

dygabo commented Jun 2, 2024

can you explain this a bit?

@GeoffreyBooth that is still WIP and it does not have all the data yet. The idea would be that the hook when it runs, it knows for what it was created (creation time workerData + creator threadId) and for what it is being called (runtime workerData + hook initiator threadId). Then it has all the data to make the decision what runs when called. Parts are missing and that might not work yet but that's WIP. For the purpose of the other implementation, @ShogunPanda please ignore this WIP. If the approach makes progress and covers the cases, it should be used.

Also once that works, tests need to be added and more cases tested. So I don't think it is realistically possible to have a finished #53200 by 22.3.0. It also may be just another failed attempt of only one hooks thread implementation altogether. Even rethinking if a single worker for the hooks is what we want is an option. Hope this makes sense. But covering all the new testcases opens a whole new set of questions.

@alan-agius4
Copy link
Contributor Author

Do you think it will fit your use case?
@ShogunPanda, yes it does look like it will.

Copy link
Contributor

github-actions bot commented Dec 1, 2024

There has been no activity on this feature request for 5 months. To help maintain relevant open issues, please add the never-stale Mark issue so that it is never considered stale label or close this issue if it should be closed. If not, the issue will be automatically closed 6 months after the last non-automated comment.
For more information on how the project manages feature requests, please consult the feature request management document.

@github-actions github-actions bot added the stale label Dec 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Issues that request new features to be added to Node.js. loaders Issues and PRs related to ES module loaders stale
Projects
Status: Awaiting Triage
Development

No branches or pull requests

6 participants