
fs: partition readFile to avoid threadpool exhaustion #17054

Closed
wants to merge 1 commit into from

Conversation

davisjam
Contributor

@davisjam davisjam commented Nov 15, 2017

Problem
Node implements fs.readFile as a call to stat, followed by a C++ -> libuv request
to read the entire file based on the size reported by stat.

Why is this bad?
The effect is to place on the libuv threadpool a potentially-large read request,
occupying the libuv thread until it completes.
While readFile certainly requires buffering the entire file contents,
it can partition the read into smaller buffers (as is done on other read paths)
along the way to avoid threadpool squatting.

If the file is relatively large or stored on a slow medium,
reading the entire file in one shot seems particularly harmful,
and presents a possible DoS vector.

Downsides to partitioning?

  1. Correctness: I don't think partitioning the read like this raises any additional risk of read-write races on the FS. If the application is concurrently readFile'ing and modifying the file, it will already see funny behavior. Though libuv uses preadv where available, this doesn't guarantee read atomicity in the presence of concurrent writes.

  2. Performance implications:
    a. Downside: Partitioning means that a single large readFile will be broken into many "out and back" requests to libuv, introducing overhead.
    b. Upside: In between each "out and back", other work pending on the threadpool can take a turn. In short, although partitioning will slow down a large request, it will lead to better throughput if the threadpool is handling more than one type of request.

Related
It might be that writeFile has similar behavior. The writeFile path is a bit more complex and I didn't investigate carefully.

Fix approach
Simple -- instead of reading in one shot, partition the read length using kReadFileBufferLength.

Test
I introduced a new test to ensure that fs.readFile works for files smaller and larger than kReadFileBufferLength. It works.

Performance:

  1. Machine details:
    $ uname -a
    Linux jamie-Lenovo-K450e 4.8.0-56-generic #61~16.04.1-Ubuntu SMP Wed Jun 14 11:58:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

  2. Excerpts from lscpu:
    Architecture: x86_64
    CPU(s): 8
    Thread(s) per core: 2
    Core(s) per socket: 4
    Socket(s): 1
    Model name: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
    CPU MHz: 1499.194

  3. benchmark/fs/readfile.js

Summary
Benchmarks using benchmark/fs/readfile.js are unfavorable. I ran three iterations with my change and three with an unmodified version. Performance within a version was similar across the three iterations, so I report only the third iteration for each.

  • comparable performance on the 1KB file
  • significant performance degradation on the 16MB file (4-5x decrease)

With partitioned read:

$ for i in `seq 1 3`; do /tmp/node-part/node benchmark/fs/readfile.js; done
...
fs/readfile.js concurrent=1 len=1024 dur=5: 42,836.45194074361
fs/readfile.js concurrent=10 len=1024 dur=5: 94,170.12611909183
fs/readfile.js concurrent=1 len=16777216 dur=5: 71.79583090225451
fs/readfile.js concurrent=10 len=16777216 dur=5: 163.98033223174818

Without change:

$ for i in `seq 1 3`; do /tmp/node-orig/node benchmark/fs/readfile.js; done
...
fs/readfile.js concurrent=1 len=1024 dur=5: 43,815.347866646596
fs/readfile.js concurrent=10 len=1024 dur=5: 93,783.59180605657
fs/readfile.js concurrent=1 len=16777216 dur=5: 339.77196820103387
fs/readfile.js concurrent=10 len=16777216 dur=5: 592.325183524534

  4. benchmark/fs/readfile-clogging.js

As discussed above, the readfile.js benchmark doesn't tell the whole story. The contention of this PR is that the 16MB reads will clog the threadpool, disadvantaging other work contending for the threadpool. I've introduced a new benchmark to characterize this.

Benchmark summary: I copied readfile.js and added a small asynchronous zlib operation to compete for the threadpool. If a non-partitioned readFile is clogging the threadpool, there will be a relatively small number of zips.

Performance summary:

  • Small file: No difference whether 1 read or 10
  • Large file: With 1 read, some effect (1 thread is always reading, but 3 threads remain for zip). With 10 reads, huge effect (zips get a fair share of the threadpool when partitioned). 61K zips with partitioned, 700 zips without.

Partitioned:

$ for i in `seq 1 3`; do /tmp/node-part/node benchmark/fs/readfile-clogging.js; done
...
bench ended, reads 96464 zips 154582
fs/readfile-clogging.js concurrent=1 len=1024 dur=5: 19,289.8420223229
fs/readfile-clogging.js concurrent=1 len=1024 dur=5: 30,909.421907455828
bench ended, reads 332932 zips 62896
fs/readfile-clogging.js concurrent=10 len=1024 dur=5: 66,572.28049862666
fs/readfile-clogging.js concurrent=10 len=1024 dur=5: 12,575.639939453387
bench ended, reads 149 zips 149574
fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 29.793230608569676
fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 29,905.935378334147
bench ended, reads 623 zips 61745
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 124.57446300744513
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 12,345.553950958118

Non-partitioned:

$ for i in `seq 1 3`; do /tmp/node-orig/node benchmark/fs/readfile-clogging.js; done
...
bench ended, reads 92559 zips 153226
fs/readfile-clogging.js concurrent=1 len=1024 dur=5: 18,510.65052192176
fs/readfile-clogging.js concurrent=1 len=1024 dur=5: 30,641.12621937156
bench ended, reads 332066 zips 62739
fs/readfile-clogging.js concurrent=10 len=1024 dur=5: 66,396.6979771542
fs/readfile-clogging.js concurrent=10 len=1024 dur=5: 12,543.801322137173
bench ended, reads 1554 zips 98886
fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 310.708121371412
fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 19,769.932924561737
bench ended, reads 2759 zips 703
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 550.9968714783075
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 140.38479443398438

Issue:
This commit addresses #17047.

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

fs

@nodejs-github-bot nodejs-github-bot added the fs Issues and PRs related to the fs subsystem / file system. label Nov 15, 2017
@davisjam
Contributor Author

Working on linter errors.

@benjamingr benjamingr changed the title Partition readFile to avoid threadpool exhaustion fs: partition readFile to avoid threadpool exhaustion Nov 16, 2017
@benjamingr
Member

Thanks for following up, pinging @nodejs/fs for review.

@bnoordhuis
Member

I can see how this is a concern in a theoretical sense but I don't remember any bug reports where it was an actual issue. Seems like premature (de)optimization.

@davisjam
Contributor Author

davisjam commented Nov 16, 2017

@bnoordhuis I wouldn't call this a de-optimization. It optimizes the throughput of the threadpool in its entirety by increasing the number of requests that a readFile makes. It's an optimization for throughput, at the cost of the latency of large readFiles.

I think this is in the spirit of Node.js: handle many client requests simultaneously on a small number of threads, and don't do too much work in one shot on any of the threads. This is already the approach taken by a readStream.

The benchmark/fs/readfile-clogging.js demonstrates this:
Reading a 16MB file:

  • 1 thread: partitioning yields 149K zips vs. 98K zips currently
  • 10 threads: partitioning yields 60K zips vs. 700 zips currently

FWIW The latency cost can be largely mitigated with the use of a more reasonably-sized buffer. The 1-thread numbers for readfile.js improve to a 30% degradation and the 10-thread numbers are comparable to the non-partitioned performance.

benchmark/fs/readfile.js

Here's the (rounded) readFile throughput for various read lengths on the 16MB file:

With an 8KB buffer:

fs/readfile.js concurrent=1 len=16777216 dur=5: 71
fs/readfile.js concurrent=10 len=16777216 dur=5: 163

With a 64KB buffer:

fs/readfile.js concurrent=1 len=16777216 dur=5: 222
fs/readfile.js concurrent=10 len=16777216 dur=5: 534

With the full 16MB file in one shot:

fs/readfile.js concurrent=1 len=16777216 dur=5: 339
fs/readfile.js concurrent=10 len=16777216 dur=5: 592

benchmark/fs/readfile-clogging.js

Here's the (rounded) zip throughput for various read sizes:

With an 8KB buffer:

fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 29,905
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 12,345

With a 64KB buffer:

fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 33,201
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 6,995

With the full 16MB file in one shot:

fs/readfile-clogging.js concurrent=1 len=16777216 dur=5: 19,769
fs/readfile-clogging.js concurrent=10 len=16777216 dur=5: 140

Conclusion
If we use a 64KB buffer size for readFile, there will be a 10% increase in readFile latency but a 50x increase in the ability of small concurrent threadpool operations to get a turn.

Admittedly the zip operation I'm using in readfile-clogging.js is tiny, so this is a particularly favorable comparison. I'm happy to try it with a more "reasonable" competing operation if anyone would like to suggest one.

@mscdex mscdex added the performance Issues and PRs related to the performance of Node.js. label Nov 16, 2017
console.log(`bench ended, reads ${reads} zips ${zips}`);
bench_ended = true;
bench.end(reads);
bench.end(zips);
Contributor

Calling this twice does not make sense and will break compare.js which is expecting one bench.end() per benchmark.

Contributor Author

@mscdex Thanks for pointing this out. How ought I report throughput for two separate variables like this?

Contributor

You can't. Perhaps just combine both values for total fulfilled requests per second?

Contributor Author

OK. I'll leave in the console.log then so the distinction between request type is clear.
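The agreed-upon pattern -- one bench.end() per benchmark, reporting the combined total, while keeping the console.log breakdown -- can be sketched as follows (`makeBench` is a hypothetical stand-in for the benchmark harness, and the counts are sample values from the run above):

```javascript
'use strict';

// Hypothetical stand-in for Node's benchmark harness: compare.js expects
// exactly one end() call per benchmark run.
function makeBench() {
  let ended = false;
  return {
    end(ops) {
      if (ended) throw new Error('bench.end() called twice');
      ended = true;
      console.log(`reported ops: ${ops}`);
    }
  };
}

const bench = makeBench();
const reads = 623;   // sample counts from one clogging run
const zips = 61745;

// Keep the log so the per-type breakdown stays visible...
console.log(`bench ended, reads ${reads} zips ${zips}`);
// ...but report a single combined figure; a second end() call would throw.
bench.end(reads + zips);
```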

bench.end(reads);
bench.end(zips);
try { fs.unlinkSync(filename); } catch (e) {}
process.exit(0);
Contributor

This isn't really safe since process.send() used by bench.end() is not synchronous. It's better to just return early in afterRead() and afterZip() when bench_ended === true and let the process exit naturally.

Contributor Author

OK, this process.exit(0) is just a copy/paste from benchmark/fs/readfile.js. I'll fix this in both places.


var reads = 0;
var zips = 0;
var bench_ended = false;
Contributor

Minor nit, but lower camelCase is typically used in JS portions of node core and underscores are typically used in C++ portions.

@bnoordhuis
Member

I wouldn't call this a de-optimization. It optimizes the throughput of the threadpool in its entirety by increasing the number of requests that a readFile makes. It's an optimization for throughput, at the cost of the latency of large readFiles.

I understand that. My point is that no one complained so far - people aren't filing bug reports. To me that suggests it's mostly a theoretical issue.

Meanwhile, the proposed changes will almost certainly regress some workloads and people are bound to take notice of that.

@davisjam
Contributor Author

Meanwhile, the proposed changes will almost certainly regress some workloads and people are bound to take notice of that.

True. But other workloads should be accelerated -- it's the readfile.js vs. readfile-clogging.js tradeoff.

@mscdex
Contributor

mscdex commented Nov 16, 2017

I think I agree with Ben here. Anyone who wants a file read in chunks can just use fs.createReadStream().

@YurySolovyov

maybe add a separate API instead?
.readFile is simpler to use (over streams) because you don't have to manage assembling the final result.

@davisjam
Contributor Author

davisjam commented Nov 16, 2017

Anyone who wants a file read in chunks can just use fs.createReadStream().

And if they want to make giant reads, they can use fs.read().

But if they've opted for the simplicity of fs.readFile(), I think the framework should Do The Right Thing -- namely, optimize threadpool throughput, not single request latency. Presumably some kind of chunking/partitioning is done by crypto and compression as well?

@mscdex
Contributor

mscdex commented Nov 16, 2017

On an unrelated note, please do not @ mention me in commit messages.

@davisjam
Contributor Author

Fixed, sorry.

@refack
Contributor

refack commented Nov 16, 2017

Hello @davisjam and thank you for the contribution 🎩

Sure looks like you did a lot of research and experimentation, and I really appreciate that.

I think the framework should Do The Right Thing -- namely, optimize threadpool throughput, not single request latency

There are several assumptions and rule-of-thumb optimizations around the uv threadpool [addition: based on empirical experience, and feedback]. One of those is that since the pool serves I/O bound operations, a small pool is enough. As such doing multiple interleaved FS operations is an anti-pattern.
As for "doing the right thing", I would go in a different way that has less concurrency, not more. Check that the uv threadpool is not all consumed doing the same operation.

@davisjam
Contributor Author

davisjam commented Nov 16, 2017

One of those is that since the pool serves I/O bound operations, a small pool is enough. As such doing multiple interleaved FS operations is an anti-pattern. As for "doing the right thing", I would go in a different way that has less concurrency, not more. Check that the uv threadpool is not all consumed doing the same operation.

@refack Perhaps I misunderstand you, but this PR does not increase concurrency. With my patch, an fs.readFile results in more requests to the threadpool, but each such request is submitted when the previous one completes, hand-over-hand. [addition: Of course, a server might fs.readFile on behalf of different clients concurrently, but there will be one task per ongoing fs.readFile in the queue.]

I agree that a small pool is good for certain activities, but reading large files in one shot is not one of them. A small pool suffices so long as each task doesn't take too long, but a long-running task on a small pool monopolizes its assigned thread, degrading the task throughput of the pool. Then indeed (one thread in) "the uv threadpool is all consumed doing the same operation." [addition: The trouble is that on Linux, a thread still performs the I/O-bound task synchronously, since approaches like KAIO have been rejected (see here). So if the task is long-running, the thread blocks for a long time.]

If the threadpool is used solely for "large" tasks, there's no problem -- each task takes a long time anyway, and partitioning them just adds overhead. But if the threadpool is used for a mix of larger and smaller tasks (e.g. serving different sized files to different clients, running compression and file I/O concurrently, etc.), then the larger tasks will harm the throughput of the smaller tasks. In my benchmark/fs/readfile-clogging.js benchmark, the small task throughput improves by 50x if you partition the large reads.

@jasnell
Member

jasnell commented Nov 16, 2017

I definitely appreciate the work here, but I think I'm also falling on the -1 side on this. I think a better approach would be to simply increase efforts to warn devs away from using fs.readFile() for large files that cannot be read in a single uv roundtrip. Anything beyond that should be deferred to using either fs.read() or fs.createReadStream(). Using fs.readFile() to read anything larger than that is an anti-pattern that I really do not think we should be encouraging.

@davisjam
Contributor Author

@jasnell Thanks for your input!

  1. I agree that fs.readFile is not a good idea in the server context, though of course it's fine for scripting purposes. I'm planning to include this discussion as part of this proposed guide, if there's interest from the nodejs.org folks.

  2. However, all the documentation in the world won't stop a new developer from making a mistake. If we agree that "anything larger than [small files] should be deferred to fs.read() or fs.createReadStream()", then surely partitioning fs.readFile() in the style of fs.createReadStream() is an appropriate step. I don't think doing so encourages bad developer behavior -- it's just ensuring that if the developer has made a mistake, they won't pay too much for it. Do The Right Thing and so on.

My 64KB benchmark suggests that scripts that use fs.readFile() shouldn't suffer overmuch from partitioning (they still read 8GB's worth of the same 16MB file in 5 seconds), and that this partitioning stands to benefit some kinds of servers.

@davisjam
Contributor Author

warn dev's away from using fs.readFile() for large files that cannot be read in a single uv roundtrip

Right, but the current fs.readFile() behavior is to read any file in one uv roundtrip, regardless of its size. If a dev is reading small files with fs.readFile(), this PR will have no effect on performance. If the dev is reading large files with fs.readFile(), (1) they shouldn't be, but (2) we can still help them out.

@davisjam
Contributor Author

davisjam commented Nov 17, 2017

Found a few minutes for some deeper benchmarking...I've collected measurements across a range of partition sizes to give a better sense of the tradeoffs between degrading readFile performance and improving threadpool throughput.

I looked at the following partition sizes in KB: 4 8 16 32 64 128 256 512 1024 4096 16384. At each stage I doubled the partition size until I reached 1024 KB (1MB), at which point I quadrupled to 4MB and again to 16MB. The final partition size, 16384KB (16MB), is the size of the file being read, so this last size is the baseline, equivalent to the current behavior of Node.

The numbers I'm reporting represent a single run of the benchmarks on the machine described above, which is otherwise idle. Since it's just one run for each partition size, these numbers are just an estimate.

Excerpting the "1 and 10 concurrent readFile's on a 16MB file" performance from benchmark/fs/readfile.js:

4 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 39
fs/readfile.js concurrent=10 len=16777216 dur=5: 88
8 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 67
fs/readfile.js concurrent=10 len=16777216 dur=5: 162
16 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 111
fs/readfile.js concurrent=10 len=16777216 dur=5: 364
32 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 146
fs/readfile.js concurrent=10 len=16777216 dur=5: 514
64 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 214
fs/readfile.js concurrent=10 len=16777216 dur=5: 575
128 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 212
fs/readfile.js concurrent=10 len=16777216 dur=5: 566
256 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 292
fs/readfile.js concurrent=10 len=16777216 dur=5: 538
512 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 205
fs/readfile.js concurrent=10 len=16777216 dur=5: 523
1024 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 246
fs/readfile.js concurrent=10 len=16777216 dur=5: 492
4096 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 274
fs/readfile.js concurrent=10 len=16777216 dur=5: 477
16384 KB
fs/readfile.js concurrent=1 len=16777216 dur=5: 356
fs/readfile.js concurrent=10 len=16777216 dur=5: 577

Excerpting the "10 concurrent readFile's on a 16MB file" performance from benchmark/fs/readfile-clogging.js:

for size in 4 8 16 32 64 128 256 512 1024 4096 16384; do echo "$size KB"; echo; PARTITION_SIZE_KB=$size /tmp/node-part-cfg/node benchmark/fs/readfile-clogging.js | tee /tmp/o_clogging_$size; echo; done
4 KB
bench ended: reads 300, zips 62511, total ops 62811
8 KB
bench ended: reads 600, zips 61582, total ops 62182
16 KB
bench ended: reads 1139, zips 59401, total ops 60540
32 KB
bench ended: reads 1779, zips 47918, total ops 49697
64 KB
bench ended: reads 2258, zips 35383, total ops 37641
128 KB
bench ended: reads 2305, zips 19380, total ops 21685
256 KB
bench ended: reads 2743, zips 11553, total ops 14296
512 KB
bench ended: reads 2409, zips 5553, total ops 7962
1024 KB
bench ended: reads 2268, zips 2889, total ops 5157
4096 KB
bench ended: reads 2298, zips 1053, total ops 3351
16384 KB
bench ended: reads 2767, zips 693, total ops 3460

Summarizing these results:

  1. readfile.js:
    a. The relationship between partition size and read rate in the 1-reader case is unclear. The best performance is at 16MB (356 reads/second at one read per readFile), but other pretty good points were 256KB (292 reads/second) and 4096KB (274 reads/second).
    b. The relationship between partition size and read rate in the 10-reader case is also unclear. The high point was again 16MB (577 reads/second), but 64KB (575 reads/sec) and 128KB (566 reads/sec) were also contenders.
  2. readfile-clogging.js: Unsurprisingly, the number of zips is generally inversely proportional (roughly linearly) with the partition size, or linearly proportional to the number of partitions. The more partitions, the more turns the zip job gets.

Recommendation:

The 1-reader case seems pretty unrealistic, so let's focus on the 10-reader case. It looks to me like if we go with a 64KB partition, for pure readFile we face somewhere between a 10% drop in throughput (reported earlier) and a negligible drop in throughput (in this data). For this we get a 50x improvement in throughput for contending threadpool jobs. For better readFile performance, a larger blocksize could be used while still improving overall threadpool throughput.

Since the patch is a one-liner, nothing fancy, this seems like a pretty good trade to me.

As has been discussed, best practice is certainly not to use fs.readFile for serving files. But for users who are doing so, I think this patch could give them nice performance improvements for free.

Docs:
I agree with @jasnell that urging developers to avoid fs.readFile in server contexts is a good idea. I'm also happy to pursue a docs change and/or a longer guide in this direction.

@davisjam
Contributor Author

davisjam commented Nov 21, 2017

But if they've opted for the simplicity of fs.readFile(), I think the framework should Do The Right Thing -- namely, optimize threadpool throughput, not single request latency. Presumably some kind of chunking/partitioning is done by crypto and compression as well?

Actually, I just checked the crypto module. It does not chunk/partition large requests.

The following example will not print "Short buf finished" until there are no more long requests in the threadpool queue and one of the workers picks up the short request.

const crypto = require('crypto');

const nBytes = 10 * 1024 * 1024; /* 10 MB */
const nLongRequests = 20;

for (let i = 0; i < nLongRequests; i++) {
  crypto.randomBytes(nBytes, (err, buf) => {
    console.log('Long buf finished');
  });
}

crypto.randomBytes(1, (err, buf) => {
  console.log('Short buf finished');
});

console.log('begin');

Thoughts on a similar PR to partition large crypto requests like this, or a doc-change PR like #17154 with a warning?

For the FS there are alternatives to fs.readFile if you are making a large request. I don't see comparable framework alternatives for large crypto requests. Thoughts on a new API for this?

@Trott
Member

Trott commented Nov 22, 2017

@nodejs/crypto (see #17054 (comment))

@bnoordhuis
Member

Same as #17054 (comment) with the addendum that crypto.randomBytes() doesn't do I/O, it's purely CPU-bound. Users can partition requests themselves if they want.

As well, there is hardly ever a reason to request more than a few hundred bytes of randomness at a time. I don't think large requests are a practical concern.

@davisjam
Contributor Author

@addaleax Yes.

@BridgeAR
Member

BridgeAR commented Feb 1, 2018

Landed in 67a4ce1

@BridgeAR BridgeAR closed this Feb 1, 2018
BridgeAR pushed a commit to BridgeAR/node that referenced this pull request Feb 1, 2018
Problem:

Node implements fs.readFile as:
- a call to stat, then
- a C++ -> libuv request to read the entire file using the stat size

Why is this bad?
The effect is to place on the libuv threadpool a potentially-large
read request, occupying the libuv thread until it completes.
While readFile certainly requires buffering the entire file contents,
it can partition the read into smaller buffers
(as is done on other read paths)
along the way to avoid threadpool exhaustion.

If the file is relatively large or stored on a slow medium, reading
the entire file in one shot seems particularly harmful,
and presents a possible DoS vector.

Solution:

Partition the read into multiple smaller requests.

Considerations:

1. Correctness

I don't think partitioning the read like this raises
any additional risk of read-write races on the FS.
If the application is concurrently readFile'ing and modifying the file,
it will already see funny behavior. Though libuv uses preadv where
available, this doesn't guarantee read atomicity in the presence of
concurrent writes.

2. Performance

Downside: Partitioning means that a single large readFile will
  be broken into many "out and back" requests to libuv,
  introducing overhead.
Upside: In between each "out and back", other work pending on the
  threadpool can take a turn.

In short, although partitioning will slow down a large request,
it will lead to better throughput if the threadpool is handling
more than one type of request.

Fixes: nodejs#17047

PR-URL: nodejs#17054
Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
Reviewed-By: Tiancheng "Timothy" Gu <timothygu99@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
@BridgeAR
Member

BridgeAR commented Feb 1, 2018

This broke our CI. I did not realize it right away and landed a couple other commits afterwards, otherwise I would have reverted this. A change landed a few hours before this one that changed the tmpDir behavior and broke the test from this PR.

I am submitting a fix.

@BridgeAR BridgeAR mentioned this pull request Feb 1, 2018
MylesBorins pushed a commit that referenced this pull request Feb 11, 2018
PR-URL: #17610
Refs: #17054 (comment)
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Evan Lucas <evanlucas@me.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Reviewed-By: Jon Moss <me@jonathanmoss.me>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
MylesBorins pushed a commit that referenced this pull request Feb 12, 2018
MylesBorins pushed a commit that referenced this pull request Feb 13, 2018
MayaLekova pushed a commit to MayaLekova/node that referenced this pull request May 8, 2018
@zbjornson
Contributor

zbjornson commented Jan 27, 2019

The comments above noted that this doesn't appear to cause a significant performance regression, but we're seeing a 7.6-13.5x drop in read throughput between 8.x and 10.x in both the readfile benchmark and our real-world benchmarks that heavily exercise fs.readFile. Based on my troubleshooting, I think it's from this change, but it's possible some other change is responsible.

The readfile benchmark (Ubuntu 16):

| Test | v8.15.0 | v10.15.0 | 8 ÷ 10 |
| --- | --- | --- | --- |
| concurrent=1 len=1024 | 6,661 | 7,066 | 1.06x |
| concurrent=10 len=1024 | 23,100 | 21,079 | 0.91x |
| concurrent=1 len=16777216 | 156.6 | 11.6 | 13.5x |
| concurrent=10 len=16777216 | 584 | 76.6 | 7.6x |

From what I can extract from the comments in this PR, either no degradation or a 3.6-4.8x degradation was expected for the len=16M cases.

As for why I think it's because of this change, the benchmark below compares fs.readFile against a simple version of how fs.readFile used to work (one-shot read), measuring time to read the same 16 MB file 50 times.

// npm i async

const fs = require("fs");
const async = require("async");

function chunked(filename, cb) {
	fs.readFile(filename, cb);
}

function oneshot(filename, cb) {
	// shoddy implementation -- leaks fd in case of errors
	fs.open(filename, "r", 0o666, (err, fd) => {
		if (err) return cb(err);
		fs.fstat(fd, (err, stats) => {
			if (err) return cb(err);
			const data = Buffer.allocUnsafe(stats.size);
			fs.read(fd, data, 0, stats.size, 0, (err, bytesRead) => {
				if (err) return cb(err);
				fs.close(fd, err => {
					cb(err, data);
				});
			});
		});
	});
}

fs.writeFileSync("./test.dat", Buffer.alloc(16e6, 'x'));

function bm(method, name, cb) {
	const start = Date.now();
	async.timesSeries(50, (n, next) => {
		method("./test.dat", next);
	}, err => {
		if (err) return cb(err);
		const diff = Date.now() - start;
		console.log(name, diff);
		cb();
	});
}

async.series([
	cb => bm(chunked, "fs.readFile()", cb),
	cb => bm(oneshot, "oneshot", cb)
]);

| Node.js | OS | fs.readFile() (ms) | one-shot (ms) |
| --- | --- | --- | --- |
| v10.15.0 | Ubuntu 16 | 7320 | 370 |
| v8.15.0 | Ubuntu 16 | 693 | 378 |
| v10.15.0 | Win64 | 2972 | 493 |

We've switched to fs.fstat() and then fs.read() (similar to above) as a work-around, but I wouldn't be surprised if this has also negatively impacted other apps/tools. As far as the original justification: web servers aside, other sorts of apps like build tools and compilers (for which DoS attacks are irrelevant) often need to read an entire file as fast as possible, and furthermore aren't typically concerned with concurrency.

Is anyone else able to verify that this degradation exists and/or was expected?

@mcollina
Member

I’ve done some empirical tests, and I saw some degradation. I’d recommend you to open up a new issue based on your data as it is likely to get more attention.

@davisjam
Contributor Author

@zbjornson

I’d recommend you to open up a new issue based on your data as it is likely to get more attention.

And tag me in it! :-)

@Trott
Member

Trott commented Jan 27, 2019

I’d recommend you to open up a new issue based on your data as it is likely to get more attention.

I used "Reference in new issue" to create #25740 from @zbjornson's comment.

Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. fs Issues and PRs related to the fs subsystem / file system. performance Issues and PRs related to the performance of Node.js. semver-major PRs that contain breaking changes and should be released in the next major version.