
Long running synchronous iterators block the event loop when fed into Readable.from() #41821

Closed
AncientSwordRage opened this issue Feb 2, 2022 · 14 comments
Labels
doc Issues and PRs related to the documentations. stream Issues and PRs related to the stream subsystem.

Comments

@AncientSwordRage

Version

v16.13.0

Platform

Linux WIN-******* 4.4.0-19041-Microsoft #1237-Microsoft Sat Sep 11 14:32:00 PST 2021 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

streams

What steps will reproduce the bug?

Seen here: https://stackoverflow.com/q/70915042/1075247

Run the code below (with an appropriate subsetPerm function, i.e. something that is a long-running synchronous generator).

const { Readable } = require('stream');
const { intervalToDuration, formatDuration, format } = require('date-fns');
const { subsetPerm } = require('./permutation'); // code from https://stackoverflow.com/a/70839805/1075247

function formatLogs(counter, permStart) {
    const newLocal = new Date();
    const streamTime = formatDuration(intervalToDuration({
        end: newLocal.getTime(),
        start: permStart.getTime()
    }));
    const formattedLogs = `wrote ${counter.toLocaleString()} patterns, after ${streamTime}`;
    return formattedLogs;
}

const ONE_MINUTES_IN_MS = 1 * 60 * 1000;

let progress = 0;
let timerCallCount = 1;
let start = new Date();
const interval = setInterval(() => {
    console.log(formatLogs(progress, start));
}, ONE_MINUTES_IN_MS);

const iterStream = Readable.from(subsetPerm(Object.keys(Array.from({ length: 200 })), 5));

console.log(`Stream started on: ${format(start, 'PPPPpppp')}`)
iterStream.on('data', () => {
    progress++;
    if (new Date().getTime() - start.getTime() >= (ONE_MINUTES_IN_MS * timerCallCount)) {
        console.log(`manual timer: ${formatLogs(progress, start)}`)
        timerCallCount++;
        if (timerCallCount >= 3) iterStream.destroy();
    }
});

iterStream.on('error', err => {
    console.log(err);
    clearInterval(interval);
});

iterStream.on('close', () => {
    console.log(`closed: ${formatLogs(progress, start)}`);
    clearInterval(interval);
})

console.log('done!');

Note that the logs inside the on('data', ...) handler are printed (i.e. the ones prefaced with 'manual timer'), but the ones from the setInterval are not.

How often does it reproduce? Is there a required condition?

This occurs every time

What is the expected behavior?

I would expect the loop that processes the generator (Readable.from) to give way to the event loop at some point, but it does not.

What do you see instead?

No event loop processing happens at all. It doesn't even seem to print anything at the end of the run.

Additional information

Batching seems like the sort of thing that would help here, but I've looked at the code for Readable.from and I couldn't see a way to easily modify it to include that.
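For illustration only (my own sketch, not part of the original report): one way to get this kind of batching without touching Readable.from is to wrap the synchronous generator in an async generator that yields to the event loop every N items. The helper name yieldEvery and the batch size are assumptions, not anything from Node's API:

const { Readable } = require('stream');
const { setImmediate } = require('timers/promises');

// Hypothetical wrapper: re-yield items from a synchronous iterable, but hand
// control back to the event loop after every `batchSize` items.
async function* yieldEvery(syncIterable, batchSize = 1000) {
    let count = 0;
    for (const item of syncIterable) {
        yield item;
        if (++count % batchSize === 0) await setImmediate(); // let timers and pending I/O run
    }
}

// Usage with the repro above (assumed):
// const iterStream = Readable.from(yieldEvery(subsetPerm(Object.keys(Array.from({ length: 200 })), 5)));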

#34207 looks highly related, but even under that proposal I'm not clear whether you'd want to wrap one in the other, or use batching 'under the hood'.

Also, I'm aware this is labelled as a bug, even though the behaviour is quite explainable. I'd argue it's very unintuitive and completely undocumented that the event loop is essentially paused under these circumstances (although I admit long-running synchronous generators used where it matters might be rare in the wild), and it is potentially harmful if any authentication timeouts rely on setTimeout etc.

@AncientSwordRage
Author

I just found https://nodejs.org/en/docs/guides/dont-block-the-event-loop/ (linked from goldbergyoni/nodebestpractices#294), which makes me think even more that this is a bug.

But I'm not clear on what a good fix would be. Each iteration of the generator is a very small task; there are just a lot of them.

@VoltrexKeyva added the stream label Feb 2, 2022
@benjamingr
Member

You are performing work that is not I/O - that defers at most a microtask. Stuff that doesn't do I/O will always happen before stuff that does - this is by design.

You don't need from or async generators for that :) :

const r = new Readable({ read() { this.push(1); }, objectMode: true });
r.on('data', (chunk) => console.log('got data', chunk));
setInterval(() => console.log('interval'));
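For contrast, a sketch of my own (not from the comment above): if the read callback defers each push with setImmediate, the interval does get a turn between chunks, because producing data now goes through the macrotask queue instead of staying in the nextTick/microtask queues:

const { Readable } = require('stream');

// Variant of the snippet above: defer each push to a macrotask so timers interleave.
const r = new Readable({
  objectMode: true,
  read() {
    setImmediate(() => this.push(1)); // give the event loop a turn before producing the next chunk
  }
});
r.on('data', (chunk) => console.log('got data', chunk));
setInterval(() => console.log('interval'), 1000);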

@AncientSwordRage
Author

You are performing work that is not I/O - that defers at most a microtask.

In this example yes, but I'm talking about a general case.

Stuff that doesn't do I/O will always happen before stuff that does - this is by design.

I understand that. It's how the event loop works.

It doesn't change that this is possible with from when it shouldn't be.

What I think is needed is either a change to the documentation (to say: don't use from for long-running synchronous stuff, no matter what it is) or a way for the code in from to optionally pass back to the event loop after a short while.

If you need a more conceptual case, look at this:

const authInterval = setInterval(() => checkAuth(AUTH_KEY), SOME_MILLISECONDS);
const longRunningStream = Readable.from(someImportantSynchronousGenerator);

longRunningStream.on('data', (chunk) => processIfAuthorised(chunk, AUTH_KEY))

or anything similar.

Is this potentially the wrong way to do stuff? Is it dumb? Possibly yes to both!

Should something tell you not to, or make it possible to work around?

I think so.

@devsnek
Member

devsnek commented Feb 2, 2022

(assuming I am reading this issue correctly) This surprises me too. I guess I would've expected that sync iterators are wrapped in async iterators, like for-await.

@benjamingr
Member

@devsnek even if the sync iterator is wrapped in an async iterator - it would still defer at most microtasks since it doesn't do I/O - like a function that does await Promise.resolve() in a loop. It will never yield to I/O.
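A minimal sketch of that point (mine, not from the comment): an async function that only ever awaits already-resolved promises starves a timer in exactly the same way, because each await defers only to the microtask queue:

// Hypothetical demo: microtask-only work never lets the timer callback run.
const timer = setInterval(() => console.log('timer fired'), 100);

async function microtaskOnlyLoop() {
  for (let i = 0; i < 1e7; i++) {
    await Promise.resolve(); // defers to the microtask queue only, never to timers or I/O
  }
  clearInterval(timer);
  console.log('loop finished; the timer never got a chance to fire');
}

microtaskOnlyLoop();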

@benjamingr
Member

@AncientSwordRage

It doesn't change that this is possible with from when it shouldn't be.

What I think is needed is either a change to the documentation (to say, don't use from for long running synchronous stuff - no matter what it is) or a way that the code in from to pass back to the event loop optionally after a short while.

Well, from isn't special here - this applies everywhere: work that doesn't do I/O versus work that does. Nowhere in Node (since 0.8, I believe) will you find I/O getting scheduled ahead of things like async functions.

If you do long-running work without yielding to I/O - you are not yielding to I/O. Whether or not that is appropriate for your use case is entirely up to you.

You can easily yield to the event loop by doing something like:

const { Readable } = require('stream');
const { setImmediate } = require('timers/promises'); // promise-returning setImmediate, so it can be awaited

Readable.from(async function* () {
  let i = 0;
  for (const item of getMySyncGenerator()) { // getMySyncGenerator is a placeholder for your synchronous generator
    yield item;
    if (i++ % 20 === 0) await setImmediate(); // yield to the event loop every 20 items
  }
}());

Or something similar - this isn't unique to from; it's the same issue with a regular synchronous generator being iterated in a for loop, an async function running a for loop, or any form of synchronous (or microtask-only) code in Node.

@AncientSwordRage
Author

@benjamingr I did try something similar and I didn't get it to yield, but this looks like it might work. If it doesn't I'll update you.

Really, the reason this is surprising is that a lot of other streams stuff seems to pass back to the event loop because V8 knows to pass it to an underlying C++ library. I'm not saying .from has to, but it would be good to at least put a reminder in the docs:

Note: Node.js will only pass callbacks involving async I/O and other similar tasks to the event loop, so a long-running synchronous iterator will block the event loop and prevent some tasks from running.

It's too early to be sure those words even make sense, but that sort of thing would have prevented my confusion at least 😅

Also, if it's as easy as your snippet makes out, it would be trivial to implement directly.

@benjamingr
Member

Well, a lot of streams do perform I/O so they yield back to the event loop. If you are doing work that explicitly does not do I/O (like just from with a generator) it will not yield back to the event loop.

There is not a single place streams "pass back to the event loop" - it is always the actual source of the stream (or the destination) where the I/O happens - the stream just pauses waiting for the read/write to complete.

Note that streams aren't special in this regard - this is true for other things with microtick semantics (like if you nest process.nextTick calls).
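A tiny demo of that nesting behaviour (my own sketch, not from the comment), assuming the callback deliberately re-queues itself:

// Hypothetical demo: re-queueing process.nextTick from inside its own callback keeps
// the nextTick queue non-empty, so the zero-delay timeout below cannot fire until it stops.
setTimeout(() => console.log('timeout fired'), 0);

let count = 0;
function reQueue() {
  if (count++ < 1e6) process.nextTick(reQueue); // stays ahead of timers and I/O
  else console.log('stopped re-queueing; only now can the timeout run');
}
process.nextTick(reQueue);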


I am happy with a doc change as a resolution to this but I am not sure where it would go since this behavior isn't specific to streams and is omnipresent in Node APIs.

@AncientSwordRage
Author

Yeah, that's why I said "seems to". Adding it to .from would cover this issue, but I don't know about the general case.

@benjamingr added the doc label Feb 3, 2022
@AncientSwordRage
Author

AncientSwordRage commented Feb 4, 2022

Here's the code that got this working:

function setImmediatePromise() {
    return new Promise(resolve => setImmediate(resolve));
}

const iterStream = Readable.from(async function* () {
    let i = 0
    for await (const item of baseGenerator) {
        yield item;
        i++;
        if (i % 1e5 === 0) await setImmediatePromise();
    }
}());

Based on comments here and this snyk.io article

@benjamingr
Member

Sure, though "working" might be a misnomer, since the fact that microtasks (promises/queueMicrotask/nextTick etc., aka jobs) never yield to I/O is quite intentional and is a strong guarantee on the order of events (you never process another I/O event before you process all microticks - also true in browsers).

Old versions of Node used to warn on nextTick nesting, but that was removed since there are many places that cause microticks now.
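To make that ordering guarantee concrete, a small sketch of my own (not from the comment): both the nextTick queue and the promise microtask queue are fully drained before the next I/O callback runs, even though the I/O was started first:

const fs = require('fs');

// Hypothetical demo of the ordering: I/O started first, but its callback still runs last.
fs.readFile(__filename, () => console.log('3: I/O callback, only after all microticks'));
process.nextTick(() => console.log('1: nextTick queue'));
Promise.resolve().then(() => console.log('2: promise microtask'));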

As a side note, you can simplify your code to:

import { setImmediate } from 'timers/promises'; // that's built in now
import { Readable } from 'stream';

// uses the experimental stream iterator helpers (asIndexedPairs/map)
const iterStream = Readable.from(baseGenerator)
                           .asIndexedPairs()
                           .map(async ([i, x]) => {
                             if (i % 1e5 === 0) await setImmediate();
                             return x;
                           });

@benjamingr
Member

@nodejs/documentation the ask here is to more explicitly state that if you perform actions that don't perform I/O (like promise thens, awaits etc) you will "starve" the event loop.

@AncientSwordRage
Author

I think if you make a conscious choice to yield to the event loop, and the code you wrote does that, you should be OK to say it's working without it being a misnomer.

On the plus side I very much like your updated version.

@benjamingr
Member

Happy you like it: nodejs/nodejs.org#4404
