-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
child_process: measure buffer length in bytes #6764
Conversation
I've added a benchmark to check performance and it shows that this is as performant as the current implementation. However, the benchmark could definitely use a review as it's very easy to do benchmarks wrong and I have no particular expertise. @nodejs/benchmarking |
5765246
to
6060a64
Compare
const dur = +conf.dur; | ||
const len = +conf.len; | ||
|
||
const msg = '"' + Array(len).join('.') + '"'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use '.'.repeat(len)
for this kind of thing nowadays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But also run msg.test(/./)
to flatten the string. Creates more reliable benchmark results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex @trevnorris So I should do this?
const msg = `"${'.'.repeat(len)}"`;
msg.match(/./);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. that looks good. i'm not sure what the internal mechanics of repeat()
do, but if it's similar to doing +=
then should take care of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done! The benchmark shows frighteningly similar results for Node.js 6.2.0 vs. this PR.
6.2.0:
$ node benchmark/child_process/child-process-exec-stdout.js
child_process/child-process-exec-stdout.js len=64 dur=5: 40873.45059
child_process/child-process-exec-stdout.js len=256 dur=5: 40853.93567
child_process/child-process-exec-stdout.js len=1024 dur=5: 40878.88575
child_process/child-process-exec-stdout.js len=4096 dur=5: 40851.36544
child_process/child-process-exec-stdout.js len=32768 dur=5: 40876.38241
$
This PR:
$ ./node benchmark/child_process/child-process-exec-stdout.js
child_process/child-process-exec-stdout.js len=64 dur=5: 40873.08798
child_process/child-process-exec-stdout.js len=256 dur=5: 40860.52815
child_process/child-process-exec-stdout.js len=1024 dur=5: 40878.13205
child_process/child-process-exec-stdout.js len=4096 dur=5: 40869.97254
child_process/child-process-exec-stdout.js len=32768 dur=5: 40863.11006
$
I did some benchmarking outside of node recently that was related to this and IIRC concatenating strings was faster (rather than creating the string at the end)? This would need some double checking though. |
@@ -277,7 +271,7 @@ exports.execFile = function(file /*, args, options, callback*/) { | |||
if (!encoding) | |||
_stderr.push(chunk); | |||
else | |||
_stderr += chunk; | |||
_stderr += chunk.toString(encoding); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this yield incorrect results if a variable-width encoded character is split across two chunks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is true. That's why setEncoding()
was being used initially. This will need to be solved differently I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the solution is to always store the data as Buffers and concat/toString them just before being emitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't that what StringDecoder is for? (didn't look at this in detail so may be missing something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One alternative would be to keep setEncoding()
, etc. and only change the length increment lines to something like:
stderrLen += (this.encoding ? Buffer.byteLength(chunk, this.encoding) : chunk.byteLength);
I haven't benchmarked the difference between that solution and only converting to string at the end though.
EDIT: or another take on this alternative solution would be to add another data
handler first that depends on whether encoding
is set that increments the length appropriately, if you want to avoid doing the encoding
check on every data
event (maybe the performance difference is negligble, I haven't tested):
var onStderrData;
if (encoding) {
child.stderr.setEncoding(encoding);
onStderrData = function(chunk) {
stderrLen += Buffer.byteLength(chunk, this.encoding);
};
} else {
onStderrData = function(chunk) {
stderrLen += chunk.byteLength;
};
}
child.stderr.addListener('data', onStderrData);
child.stderr.addListener('data', function(chunk) {
if (stderrLen > options.maxBuffer) {
// ...
});
EDIT 2: Actually the above solution wouldn't work since the encoding could be changed at any time, so you would need to do what I originally suggested (checking this.encoding
on each chunk).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jasnell The problem this PR is trying to solve is that the units for maxBuffer
are in bytes, but once you set up a string decoder, there is no way to know how many bytes went into making the string that gets emitted on data
. That means you can no longer rely on chunk.length
since you could have multi-byte characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex ... +1 gotcha.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, does it make sense to emit on byte count and not character count if encoding is set? Only because the string passed to 'data'
can vary noticeably if reading in utf8 characters.
By the way, the discussion in #1902 is probably relevant. The PR changed over time and some of the commits seem to be lost, but check out the comments by @piscisaureus. |
@cjihrig I personally don't see a problem with going over by at most 3 bytes if the stream ends on a partial character, but you could always peek into the |
@mscdex While I'm not sure whether we can account for this, there are cases where the buffered character is larger. Here's an exaggerated case: const s = 'ก็็็็็็็็็็็็็็็็็็็็';
console.log(s.length); // 21
console.log(Buffer.from(s)); // 63 |
@trevnorris and If you pass say |
@mscdex Sorry, didn't convey my point. It's that while the character code may be complete it's also possible that the rendered character is incomplete because of missing diacritical marks. But also that this isn't a case we're able to handle. |
For example: Buffer('Á'); // <Buffer c3 81>
Buffer('Á'); // <Buffer 41 cc 81> Appear to be the same character, but the first is actually |
@trevnorris I still don't quite follow. Are you referring to a situation where Now that I think about it, maybe yet another (better) solution is to just manually instantiate a string decoder and pass that the |
That's where I was leading with #6764 (comment) |
@jasnell Ah ok. The only difference performance-wise with that last proposed solution is that it may be still be possible for streams to internally concatenate buffered |
Yeah, not sure either. Would need to benchmark it to be certain
|
@mscdex This isn't about the string decoder. It's about the (very unlikely) possibility of data coming in, then say be logged. Where it would basically be: console.log('A');
console.log('\u0301'); The rendered output would be wrong, even though each character is technically correct. I'm not saying we should consider supporting this properly. Just wanted to make the point that there are cases where even having the full utf8 characters won't render properly. |
If someone was truly worried about that case they could buffer and wait for a new line to be received before calling |
@trevnorris That's a separate issue though and node can't do anything about that anyway. Being able to count the bytes is the issue here, but even that separate issue isn't an issue for the majority of people that use |
@mscdex sure thing. there was another comment that prompted me to make mention of this, and clarify that node doesn't have any intention of handling it. |
I've changed the implementation to only run |
@Trott |
CI seems to be experiencing less heartburn now than 2 days ago. Let's try again. CI: https://ci.nodejs.org/job/node-test-pull-request/2754/ |
|
||
child.stdout.addListener('data', function(chunk) { | ||
stdoutLen += chunk.length; | ||
// If `child.stdout.setEncoding('utf8')` happened in userland, convert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean any setEncoding()
call, or specifically w/ 'utf8'
passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean any setEncoding()
. Will change the comment...
OK... CI looks good... https://ci.nodejs.org/job/node-test-pull-request/2754/ Benchmarks are alarmingly on par:
Do we feel good about this as a solution to the problem? If so, can I get an LGTM or two? If not, what are the deficiencies? |
CI is green, LGTM. |
This change fixes a known issue where `maxBuffer` limits by characters rather than bytes. Benchmark added to confirm no performance regression occurs with this change. PR-URL: nodejs#6764 Fixes: nodejs#1901 Reviewed-By: Brian White <mscdex@mscdex.net>
Landed in c9a5990 |
This change fixes a known issue where `maxBuffer` limits by characters rather than bytes. Benchmark added to confirm no performance regression occurs with this change. PR-URL: nodejs#6764 Fixes: nodejs#1901 Reviewed-By: Brian White <mscdex@mscdex.net>
This change fixes a known issue where `maxBuffer` limits by characters rather than bytes. Benchmark added to confirm no performance regression occurs with this change. This necessarily changes default behavior of `stdout` and `stderr` callbacks such that they receive buffers rather than strings. The alternative would be a performance hit on the defaults. Refs: nodejs#6764 Refs: nodejs#1901
@Trott lts? |
@thealphanerd If it lands cleanly, yes. |
Checklist
child_process test
Affected core subsystem(s)
child_process test
Description of change
This change fixes a known issue where
maxBuffer
limits by charactersrather than bytes.
Fixes: #1901
Probably need to benchmark it against the current implementation. It's entirely possible no one has done it this way because of performance?