-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: test: force garbage collection on crypto timing test #8203
Conversation
Garbage collection interferes with timing statistics gathering so force it before beginning the data collection.
This is mostly what I had in mind, but I was thinking GC would take place before each benchmark rather than before each set of 10000 benchmarks. That might yield more consistent results (although it would cause the test to take awhile). |
@not-an-aardvark Well, the way I did it certainly didn't fix the flakiness, so let's try it again. I've moved the |
Oh my, yes, unsurprisingly, this slows the test waaaay down... times out now... Might have to try to find a happy medium or something.... |
What's the time limit for the tests? We could reduce the number of trials if necessary; I don't think 10000 trials are necessary to get significant results. |
@not-an-aardvark I believe it's 60 seconds (longer on Raspberry Pi devices, though). |
Changing |
Yeah, I just did the same experiment. 8 seconds. Let's try it... |
Still flaky unfortunately. Two of the Linux hosts (so far) have failed... |
I might be looking in the wrong place but it seems like the |
Yeah, I don't look at the widget when running a bunch of tests in rapid succession like this. Here are the tests I'm seeing failing: https://ci.nodejs.org/job/node-test-commit-linux/4741/nodes=centos6-64/ Interestingly, though, those are the only two so far. And that's a much better ratio than we've been seeing. So maybe this is getting close... |
Windows failures are Jenkins/build-related and unrelated to the test...other failures have come in though...not sure the |
Out of interest, what happens if you run the benchmarks in the other order? i.e. swap lines 66 and 68 here I've noticed that the t-values have been postive numbers for the most part; if the issue is caused by memory/other external features unrelated to the buffers, then the t-value should flip its sign because the "unequal" benchmark will be affected in the same way as the "equal" benchmark currently is. |
Swapped the two lines, re-running CI: https://ci.nodejs.org/job/node-test-pull-request/3773/ |
The tests failed again as expected, but this time all the t-values (see here, here, and here) were negative. That's good news; it means that in the flaky cases, the average time is higher for whatever comparison comes first in the |
Could we add this right before the if (Math.abs((equalMean - unequalMean) / standardErr) > T_THRESHOLD && compareFunc === crypto.timingSafeEqual) {
console.log({
equalMean,
unequalMean,
equalLen,
unequalLen,
equalStd: standardDeviation(equalBenches),
unequalStd: standardDeviation(unequalBenches),
combinedStd,
standardErr,
t: (equalMean - unequalMean) / standardErr,
rawEqualBenches: JSON.stringify(rawEqualBenches),
rawUnequalBenches: JSON.stringify(rawUnequalBenches)
});
} This will print out all the raw benchmarks if the test fails. I'm interested to see whether the benchmarks are consistently higher for one set, or whether there are just a few benchmarks that are affecting the stats by being different from all the others. (Sorry about having to keep nagging you to run things on the CI server; this is the type of debugging that would be better done locally if I was able to reproduce the flakiness.) |
Added the logging, new CI: https://ci.nodejs.org/job/node-test-pull-request/3777/
Thank you for your patience with this. |
Got a few failures already with the loggin. Here's one: https://ci.nodejs.org/job/node-test-commit-linux/4748/nodes=centos6-64/console |
Occurs to me that CI results get wiped after a certain period of time, so for posterity:
|
Another possibility: the issue could be because V8 is optimizing the I made a branch to test this on my fork; could we run the CI on that (or pull it into the test branch on this PR)? |
Yeah that's a bit worrying. Since there seems to be a chance that this is a security issue rather than a test issue, should we revert the |
That would be my inclination. Looping in @nodejs/crypto ... |
Ugh. Yeah, I'd agree. That's unfortunate. |
This reverts commit 0fc5e0d. Additional testing indicates that there may still be timing issues with this implementation. Revert in order to give more time for testing before this goes out into a release... Refs: nodejs#8040 Refs: nodejs#8203
See: #8225 |
Closing as revert is imminent.... :-/ |
Minor update: I tried running the test on a Ubuntu 16.04 VM (was using OSX before), but was still unable to reproduce the bug locally. What are the specs of the server that runs the tests? I'm wondering if I would be able to simulate whatever is causing the test to fail, e.g. by restricting memory usage. (I couldn't reproduce the issue with |
@not-an-aardvark Maybe this is useful? Build details for the host (ansible playbook mostly): https://github.com/nodejs/build/tree/master/setup/ubuntu16.04 |
@not-an-aardvark If I'm reading |
This reverts commit 0fc5e0d. Additional testing indicates that there may still be timing issues with this implementation. Revert in order to give more time for testing before this goes out into a release... Refs: #8040 Refs: #8203 PR-URL: #8225 Reviewed-By: Rich Trott <rtrott@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
Commit is reverted... |
I managed to reproduce the issue on a Linux server that I have access to. I'm still investigating, but I think V8 is doing something very strange here. When I run this function (almost the same function from the original PR, but with an additional commented line that contains the letter 'A' a bunch of times): function getTValue(compareFunc) {
// ... function body etc. etc.
// see [here](https://github.com/nodejs/node/blob/0fc5e0dcd9c628b62fc9908daf64bc5db8d503c0/test/sequential/test-crypto-timing-safe-equal.js#L48-L84) for the full function body
// ...
const standardErr = combinedStd * Math.sqrt(1 / equalLen + 1 / unequalLen);
//AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
return (equalMean - unequalMean) / standardErr;
} ...the test fails very reliably. Note that the commented line contains the letter A exactly 100 times, and If I add one more A to the end of the commented line, the test passes reliably. This seems to be dependent on the function's string length; adding some other extraneous character (e.g. an extra semicolon) to function has the same effect as adding an extra A to the commented line. My guess would be that V8 is performing some optimization internally depending on the string length of the |
I just saw this article about optimizations with function length. It seems like setting the |
My uneducated guess is function inlining is the culprit. EDIT: Looks like you hit on the same conclusion as I was typing this. @nodejs/v8 |
I think I understand (extremely vaguely) how inlining works, but I'm still not sure I understand why inlining would cause the two benchmarks in this function to take different amounts of time. In particular, we should figure out which (if any) of these two statements is accurate:
|
@not-an-aardvark Maybe @indutny or @bnoordhuis might have insight (as they're the only people who are on both @nodejs/crypto and @nodejs/v8). |
That sounds plausible. V8 uses JS function length (including comments) as a cheap signal for its inlining heuristics. You should be able to confirm that with |
I can confirm that the |
Inlining is done by the tier 2 compiler (the optimizing compiler), the tier 1 compiler (the baseline compiler) doesn't inline. The job of the tier 1 compiler is to generate code that collects data for the tier 2 compiler to do its job. The tier 2 compiler doesn't start until the hot part of the program has been executing for a while. From an outside perspective it makes total run-time somewhat non-deterministic because you can't really know when and where the optimizing compiler is going to kick in. |
That makes sense, but I'm confused about why we would get benchmarks like the ones that appeared in #8203 (comment). The benchmarks are gathered in alternating order (run one benchmark with equal buffers, then one benchmark with unequal buffers, etc.), so even if inlining happens at some non-deterministic point as the program runs, I don't understand why the values in |
For what it's worth, here's the full output of
For some reason, the test seems to pass as a result of including the |
Looking at the build history on #8040, it seems like the linux builds only started to fail after When we end up getting a consistent build that works on all platforms, we should try to make sure we have a good understanding why these builds failed. I don't know very much about the V8 internals, but it's still unclear to me whether the issue on linux was a test flake or a real security issue that is now masked. |
Created a new PR at #8304. |
Checklist
make -j4 test
(UNIX), orvcbuild test nosign
(Windows) passesAffected core subsystem(s)
test crypto
Description of change
Garbage collection interferes with timing statistics gathering so force
it before beginning the data collection.
Work in progress. Do not merge.