Piping a stream into slow and fast consumer streams in parallel causes excessive buffering #16706

Closed
RealDolos opened this issue Nov 3, 2017 · 3 comments

@RealDolos

  • Version: v9.0.0, v8.9.0
  • Platform: Windows (but really, all of them)
  • Subsystem: streams

Problem

My use case is the following:
I want to send a file over the network but at the same time calculate a local checksum (like a sha-1 hash) for later use.

I tried the following (abbreviated and pseudo):

const fs = require("fs");
const crypto = require("crypto");

const fileStream = fs.createReadStream(path, {encoding: null});
const checksum = crypto.createHash("sha1");
const network = createSomeSlowSlowNetworkStream(); // placeholder for the real (slow) network stream
fileStream.pipe(checksum);
fileStream.pipe(network);
const res = await network.response(); // resolves once the upload is done
checksum.end();
if (res.checksum !== checksum.read().toString("hex")) {
  throw new Error("dis is bad");
}

And it works... except that piping to checksum - which can consume data a lot faster than a network stream - causes the file stream to emit data at that high rate, and since the network stream is a lot slower, the file stream ends up buffering all that data for it.
In the end (almost) the entire file is buffered in memory this way, which is kinda bad, especially when the file is multiple gigabytes like in my tests.
Inspecting the node process showed thousands of 16KB (the default highWaterMark) buffers in memory, totaling 5GB (essentially the size of the streamed file), waiting to be consumed by the network stream.

Remedies

When a read stream has multiple pipes, it should emit data at the rate of the slowest attached stream, not at the rate of the fastest stream.
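
For what it's worth, this behaviour can already be approximated in userland with a fan-out Writable that only acknowledges a chunk once every destination has accepted it. The class below is my own sketch, not an existing or proposed API:

const stream = require('stream');

// Hypothetical fan-out sink: _write() only calls back once every
// destination has accepted the chunk, so the source is throttled to
// the slowest consumer.
class FanOut extends stream.Writable {
  constructor(destinations) {
    super();
    this.destinations = destinations;
  }
  _write(chunk, encoding, callback) {
    let pending = 0;
    for (const dest of this.destinations) {
      if (!dest.write(chunk)) {
        pending++;
        dest.once('drain', () => { if (--pending === 0) callback(); });
      }
    }
    if (pending === 0) callback();
  }
  _final(callback) {
    for (const dest of this.destinations) dest.end();
    callback();
  }
}

// usage: fileStream.pipe(new FanOut([checksum, network]));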

Workaround

Well, I cheated and implemented a custom Transform that calls checksum.write() in its _transform(). That way, only the backpressure of the network stream is taken into account by the file stream when emitting data.
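
For reference, the workaround looks roughly like this (a sketch of the idea rather than my exact code; it uses hash.update() instead of writing into the Hash stream, which amounts to the same thing):

const crypto = require('crypto');
const stream = require('stream');

// Pass-through Transform that hashes data as it flows by. Only the
// downstream (network) consumer exerts backpressure on the source.
class HashingPassThrough extends stream.Transform {
  constructor(algorithm) {
    super();
    this.hash = crypto.createHash(algorithm);
  }
  _transform(chunk, encoding, callback) {
    this.hash.update(chunk);
    callback(null, chunk);
  }
}

// usage:
//   const hasher = new HashingPassThrough('sha1');
//   fileStream.pipe(hasher).pipe(network);
//   // after the upload: hasher.hash.digest('hex')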

@mscdex mscdex added the stream Issues and PRs related to the stream subsystem. label Nov 3, 2017
@addaleax
Member

addaleax commented Nov 3, 2017

Hi, thanks for the bug report! This seems pretty weird to me, since it sounds like the exact same bug as #5820 (and #5257), and we actually have tests that should make sure this doesn’t happen.

If I modify your example like this to make it runnable, I see a steady memory footprint:

'use strict';
const fs = require('fs');
const crypto = require('crypto');
const stream = require('stream');

const fileStream = fs.createReadStream(process.execPath, {encoding: null});
const checksum = crypto.createHash("sha1");
const network = new class extends stream.Writable {
  _write(data, encoding, cb) {
    // 'Slow' stream
    setTimeout(cb, 10);
  }
}();
fileStream.pipe(checksum);
fileStream.pipe(network);
network.on('finish', () => {
  checksum.end();
  console.log(checksum.read().toString("hex"));
});

My conclusion would be that something might be calling fileStream.resume(); for some reason. Do you think you could try to verify that, e.g. by overriding that method and seeing whether it’s called? Alternatively, do you have a standalone reproduction of the issue?

@Ginden

Ginden commented Nov 3, 2017

When a read stream has multiple pipes, it should emit data at the rate of the slowest attached stream, not at the rate of the fastest stream.

I'm not sure that would be the expected behaviour in all cases (I can think of cases where it's totally fine to use all the available memory). I think it would be better to emit a warning if a stream buffers too much (e.g. more than 100 * highWaterMark).
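
One way such a warning could be approximated in userland is a small helper like the sketch below (my own code, not an existing API; it peeks at the internal _readableState, which is undocumented, purely for illustration):

// Rough sketch: warn when a readable's internal buffer grows far
// beyond its highWaterMark (threshold and interval chosen arbitrarily).
function warnOnExcessiveBuffering(readable, factor = 100) {
  const state = readable._readableState; // internal, for illustration only
  const limit = state.highWaterMark * factor;
  const timer = setInterval(() => {
    if (state.length > limit) {
      console.warn(`stream has ${state.length} bytes buffered (> ${limit})`);
    }
  }, 1000);
  timer.unref();
  readable.once('end', () => clearInterval(timer));
}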

@RealDolos
Author

I just tried to follow your hint regarding stream.resume() by tracing what happens:

// trace calls to resume()/pause() on the readable
const oresume = stream.resume.bind(stream);
stream.resume = (...args) => {
  console.trace("resume");
  return oresume(...args);
};
const opause = stream.pause.bind(stream);
stream.pause = (...args) => {
  console.trace("pause");
  return opause(...args);
};

The calls themselves didn't reveal anything unusual, but the call stacks did.
Turns out it's the combined-stream package (required by form-data), with help from delayed-stream, that deserves the blame here, not the node stdlib.

So, closing this issue for that reason. Sorry for wasting your time.

If you're curious and have too much time, here is a more detailed explanation of what happens.

I was actually using the form-data package, which uses combined-stream, which uses delayed-stream, which buffers up to .maxDataSize as long as the DelayedStream hasn't been released yet... Guess what combined-stream sets this value to? If you guessed Infinity you won (no prize, just mad eight ball skills).

So combined-stream - hidden a few layers deep in abstractions - tells the delayed-stream a layer deeper to buffer all data. But delayed-stream only buffers data while the stream hasn't been released yet.

So without the checksum pipe, the file stream isn't flowing yet; at some later point delayed-stream starts the flow and releases the stream (i.e. stops buffering) at the same time, so nothing gets buffered in the first place and there's no issue there.

But with the checksum, .pipe(checksum) puts the file stream into flowing mode immediately, the not-yet-released delayed-stream happily buffers everything, and the result is me with a sad face, erroneously reporting a bug to the wrong issue tracker. It didn't exactly help that the Chrome memory tools via node --inspect were a tad misleading about who really owns all those pesky buffers.
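
For the curious, here's a stripped-down model of the mechanism (my own simplification, not the actual delayed-stream code):

// Simplified model of what delayed-stream does: buffer every 'data'
// event from the wrapped source until release() is called.
class DelayedStreamModel {
  constructor(source, maxDataSize) { // combined-stream passes Infinity here
    this.bufferedChunks = [];
    this.released = false;
    source.on('data', (chunk) => {
      // with maxDataSize === Infinity this can grow without bound
      // (the real delayed-stream errors out once the limit is exceeded)
      if (!this.released) this.bufferedChunks.push(chunk);
    });
  }
  release() {
    this.released = true;
    // ...flush bufferedChunks to the real consumer from here on...
  }
}

// fileStream.pipe(checksum) switches fileStream into flowing mode, so
// 'data' events start firing long before release() is called, and the
// wrapper ends up holding (almost) the whole file in memory.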
