investigate flaky test-crypto-timing-safe-equal-benchmarks #38226
This has now started failing somewhat consistently on LinuxONE in CI. I don't know if that means the test just isn't reliable or if it means there's a timing issue with faster processors or what. @nodejs/crypto
https://ci.nodejs.org/job/node-test-commit-linuxone/nodes=rhel7-s390x/26963/consoleText
Stress tests for bisecting:
Bisecting: 17 revisions left to test after this (roughly 4 steps)
Bisecting: 8 revisions left to test after this (roughly 3 steps)
Bisecting: 4 revisions left to test after this (roughly 2 steps)
Bisecting: 1 revision left to test after this (roughly 1 step)
Bisecting: 0 revisions left to test after this (roughly 0 steps)
7919ced is the first bad commit
:040000 040000 8080a76b162a40209ec5b9397db3945fa36ee143 247e5c50613229df3f9776ce9637237146682dd0 M
I know they're not active in this repo anymore, but ping @not-an-aardvark: Hey, does this look like an issue with the crypto timing-safe-equal test? Or does it look like a bug? The machine it's failing on is very fast, if that helps. Would increasing the number of trials help?
From looking at a few of the failing cases, it seems like the error message is always "timingSafeEqual should not leak information from its execution time (t=<some positive number>)". Assuming that the test is working properly, this would be a signal that crypto.timingSafeEqual really is leaking information through its execution time. The test fails if it determines that a timing difference is statistically significant. In general, increasing the number of trials would reduce the chance that random noise alone produces a result that looks significant.

In this case, it seems like the t-values are routinely 9 and above, which wouldn't plausibly happen by chance. So we can be pretty confident that it's either (a) a legitimate bug in crypto.timingSafeEqual, or (b) a bug in the test setup, which wouldn't be fixed by changing the number of trials.

Incidentally, when we were originally implementing timingSafeEqual, there was some discussion of using a "double HMAC" comparison rather than only a direct constant-time comparison. Since then, there have been significant advances in side-channel attacks (Meltdown/Spectre and others), which has led me to believe that double HMAC could be a good defense-in-depth measure against increasingly powerful compilers and CPU branch prediction that could optimize away the constant-time properties of a plain comparison. It might be worth updating the implementation along those lines.

*I also have some doubts about whether it's really possible to avoid leaking the boolean result, due to branch prediction. This might also be why this test keeps failing. In any case, combining the constant-time comparison with HMAC should make any remaining leak much harder to exploit.
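For readers unfamiliar with the pattern, here is a minimal sketch of an HMAC-then-compare defense of the kind described above. It is illustrative only, not the actual Node.js implementation; the helper name, key size, and hash choice are assumptions:

```js
'use strict';
const crypto = require('crypto');

// Compare two buffers by first MACing both under a fresh random key.
// Even if the digest comparison leaks timing information, an attacker
// only learns something about HMAC outputs they cannot predict.
function doubleHmacEqual(a, b) {
  const key = crypto.randomBytes(32);
  const ha = crypto.createHmac('sha256', key).update(a).digest();
  const hb = crypto.createHmac('sha256', key).update(b).digest();
  // The digests always have the same length, so timingSafeEqual never throws;
  // revealing a length mismatch via the extra check is usually acceptable.
  return crypto.timingSafeEqual(ha, hb) && a.length === b.length;
}
```

Note that crypto.timingSafeEqual itself throws if its inputs have different lengths; comparing fixed-length digests sidesteps that while keeping the constant-time primitive in the loop.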
Bisect results:
The bisect results are certainly surprising. Going to run a bunch more tests to confirm them. Stress tests against 7919ced:
Stress tests on 4af15df (the last change before the one above that is being tentatively blamed for the problem):
When I increase […]
Those sure seem consistent with 7919ced making the problem much worse, presumably as some indirect side effect. I wonder if the next step is to bisect the individual line changes in that commit (since outside of the lint tool changes, they are mostly self-contained) to get to the bottom of which one (if it's indeed only one) is causing the issue.
So, I bisected on individual file changes, and the one that is causing the issue is Trott@845adb7, which is totally puzzling to me, but maybe @not-an-aardvark or @aduh95 would have a reasonable explanation (assuming my results hold up to further scrutiny, which I will definitely be applying to confirm).
Add HMAC to crypto.timingSafeEqual(). This makes things slower but also makes them far more timing safe. Refs: nodejs#38226 (comment) Fixes: nodejs#38226
@Trott should we exclude the failing test until we resolve the issue? Sounds like you are on top of the PRs that might fix it, but it seems like it may still take a little while until there is agreement on the fix. I'd be happy to open a PR to exclude it for now.
What I meant is to mark it as flaky.
Opened a PR to mark the test as flaky: #40065
I tried rewriting the test to avoid branching within the benchmark entirely, to get rid of V8 optimizations etc. It only benchmarks calls that return false:

diff --git a/test/pummel/test-crypto-timing-safe-equal-benchmarks.js b/test/pummel/test-crypto-timing-safe-equal-benchmarks.js
index 3c6e31e829..db8f85169e 100644
--- a/test/pummel/test-crypto-timing-safe-equal-benchmarks.js
+++ b/test/pummel/test-crypto-timing-safe-equal-benchmarks.js
@@ -9,61 +9,60 @@ if (!common.enoughTestMem)
const assert = require('assert');
const crypto = require('crypto');
-function runOneBenchmark(compareFunc, firstBufFill, secondBufFill, bufSize) {
- return eval(`
- const firstBuffer = Buffer.alloc(bufSize, firstBufFill);
- const secondBuffer = Buffer.alloc(bufSize, secondBufFill);
-
- const startTime = process.hrtime();
- const result = compareFunc(firstBuffer, secondBuffer);
- const endTime = process.hrtime(startTime);
-
- // Ensure that the result of the function call gets used, so it doesn't
- // get discarded due to engine optimizations.
- assert.strictEqual(result, firstBufFill === secondBufFill);
-
- endTime[0] * 1e9 + endTime[1];
- `);
-}
+const numTrials = 1e5;
+const bufSize = 16384;
+const readOnlyTestBuffer = crypto.randomBytes(bufSize);
function getTValue(compareFunc) {
- const numTrials = 1e5;
- const bufSize = 10000;
// Perform benchmarks to verify that timingSafeEqual is actually timing-safe.
- const rawEqualBenches = Array(numTrials);
- const rawUnequalBenches = Array(numTrials);
-
- for (let i = 0; i < numTrials; i++) {
- if (Math.random() < 0.5) {
- // First benchmark: comparing two equal buffers
- rawEqualBenches[i] = runOneBenchmark(compareFunc, 'A', 'A', bufSize);
- // Second benchmark: comparing two unequal buffers
- rawUnequalBenches[i] = runOneBenchmark(compareFunc, 'B', 'C', bufSize);
- } else {
- // Flip the order of the benchmarks half of the time.
- rawUnequalBenches[i] = runOneBenchmark(compareFunc, 'B', 'C', bufSize);
- rawEqualBenches[i] = runOneBenchmark(compareFunc, 'A', 'A', bufSize);
- }
+ // Store all results in a single array and separate it later to avoid
+ // branching in the benchmark loop.
+ const measurements = Array(2 * numTrials).fill(0);
+ const testBuffer = Buffer.from(readOnlyTestBuffer);
+
+ // Run the actual benchmark. Avoid all branching (except the loop condition)
+ // to avoid conditional V8 optimizations.
+ let n = 0;
+ for (let i = 0; i < 2 * numTrials; i++) {
+ // Modify either the first or last byte of the copy of the test buffer.
+ const j = (bufSize - 1) * (i % 2);
+ testBuffer[j] ^= 1 | (i & 0xff);
+ // Call the comparison function and coerce the result into a number.
+ const startTime = process.hrtime.bigint();
+ n += compareFunc(testBuffer, readOnlyTestBuffer);
+ const endTime = process.hrtime.bigint();
+ measurements[i] = Number(endTime - startTime);
+ // Restore the original byte.
+ testBuffer[j] = readOnlyTestBuffer[j];
}
- const equalBenches = filterOutliers(rawEqualBenches);
- const unequalBenches = filterOutliers(rawUnequalBenches);
+ // The comparison function should have returned false in every iteration, but
+ // we only check that here to avoid explicit branching above.
+ assert.strictEqual(n, 0);
+
+ // A simple comparison would be fast for even i and slow for odd i.
+ const rawFastBenches = measurements.filter((_, i) => i % 2 === 0);
+ const rawSlowBenches = measurements.filter((_, i) => i % 2 !== 0);
+ const fastBenches = filterOutliers(rawFastBenches);
+ const slowBenches = filterOutliers(rawSlowBenches);
// Use a two-sample t-test to determine whether the timing difference between
// the benchmarks is statistically significant.
// https://wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test
- const equalMean = mean(equalBenches);
- const unequalMean = mean(unequalBenches);
+ const fastMean = mean(fastBenches);
+ const slowMean = mean(slowBenches);
+
+ const fastLen = fastBenches.length;
+ const slowLen = slowBenches.length;
- const equalLen = equalBenches.length;
- const unequalLen = unequalBenches.length;
+ const combinedStd = combinedStandardDeviation(fastBenches, slowBenches);
+ const standardErr = combinedStd * Math.sqrt(1 / fastLen + 1 / slowLen);
- const combinedStd = combinedStandardDeviation(equalBenches, unequalBenches);
- const standardErr = combinedStd * Math.sqrt(1 / equalLen + 1 / unequalLen);
+ console.log(JSON.stringify({ rawFastBenches, rawSlowBenches, fastMean, slowMean, t: (fastMean - slowMean) / standardErr }));
- return (equalMean - unequalMean) / standardErr;
+ return (fastMean - slowMean) / standardErr;
}
 // Returns the mean of an array

The variable names fast and slow refer to what a naive, non-constant-time comparison would do: each iteration flips either the first or the last byte of the test buffer, so a naive comparison would return quickly in the first case and slowly in the second.

Stress test on master: 44.9 % failed (build)
Second stress test on master: 29.0 % failed (build)

So this seems to help, but the failure rate doesn't go down to 0.

I added some logging and re-ran the stress test with the above patch (build). I copied the raw measurements (rawFastBenches and rawSlowBenches) from one of the runs into a CSV file. The test script had reported a statistically significant t-value for that run.

d <- read.csv('fast-slow.csv')
t.test(d$fast, d$slow)
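For reference, the statistic the test itself computes is the pooled two-sample t value described in its comments (R's t.test defaults to the Welch variant, which does not assume equal variances). A minimal standalone sketch in JavaScript, with illustrative helper names rather than the test's exact code:

```js
// Sample mean and (unbiased) sample variance.
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
const variance = (xs) => {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
};

// Pooled two-sample t statistic: (mean1 - mean2) / (s_p * sqrt(1/n1 + 1/n2)).
function tValue(xs, ys) {
  const pooledVariance =
    ((xs.length - 1) * variance(xs) + (ys.length - 1) * variance(ys)) /
    (xs.length + ys.length - 2);
  const standardErr = Math.sqrt(pooledVariance) *
    Math.sqrt(1 / xs.length + 1 / ys.length);
  return (mean(xs) - mean(ys)) / standardErr;
}
```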
The test does include some code to remove "large outliers" from the measurements, that is, values that are at least 50 times larger than the mean.

max(d$fast) / mean(d$fast)
max(d$slow) / mean(d$slow)
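A sketch of the kind of filter described above (assuming the 50x threshold; this mirrors the test's filterOutliers helper but is not a verbatim copy):

```js
// Drop "large outliers": samples that are at least 50 times the mean.
function filterOutliers(measurements) {
  const m = measurements.reduce((a, b) => a + b, 0) / measurements.length;
  return measurements.filter((value) => value / m < 50);
}
```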
So none of the measurements are close to being considered "large outliers" by the test. But there certainly are outliers (which are not removed by the test), and there are certainly also large differences even between successive measurements.

# Relative difference between successive "fast" measurements, as percentages.
rel.diff <- 100 * abs(diff(d$fast)) / d$fast[1:99999]
quantile(rel.diff, probs = c(0.90, 0.99))
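The same successive-difference statistic can be reproduced directly on the logged measurements in JavaScript. This is an illustrative sketch mirroring the R snippet above; the quantile helper uses a simple nearest-rank rule:

```js
// Relative difference between successive measurements, as percentages.
function successiveRelDiff(samples) {
  const diffs = [];
  for (let i = 1; i < samples.length; i++) {
    diffs.push(100 * Math.abs(samples[i] - samples[i - 1]) / samples[i - 1]);
  }
  return diffs;
}

// Nearest-rank quantile, e.g. quantile(diffs, 0.9) or quantile(diffs, 0.99).
function quantile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[index];
}
```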
This means that 1 in 10 measurements differed by at least 11.6 % from the previous measurement, and 1 in 100 measurements differed by at least 41.5 % from the previous measurement.

Plotting all 200,000 data points:

library(ggplot2)
library(reshape2)
library(dplyr)
d$index <- 1:nrow(d)
plottable <- melt(d, id.vars = "index", measure.vars = c("fast", "slow"))
ggplot(plottable, aes(index, value, colour = variable)) + geom_point(alpha = 0.4, size = 1)

It doesn't look that bad. (Most of the […]) There are certainly more outliers in the first part. So let's try the t test again, using the second half of the graph only:

t.test(d$fast[50000:100000], d$slow[50000:100000])
The test uses all of the measurements, though, not just the second half. Comparing the mean of the first half to the mean of the second half, as a percentage difference:

100 * (mean(d$fast[1:50000]) / mean(d$fast[50000:100000]) - 1)
100 * (mean(d$slow[1:50000]) / mean(d$slow[50000:100000]) - 1)

So the two halves of each benchmark differ measurably. Comparing the first and second half of the same ("fast") benchmark:

t.test(d$fast[1:50000], d$fast[50000:100000])
Given that there is a statistically significant difference between the first and second half of the same benchmark, it is not surprising that there is also a statistically significant difference between the two benchmarks.

The graph also hints at another issue. Let's only consider the "fast" measurements:

ggplot(d, aes(index, fast)) +
geom_point(alpha=0.1, size=0.5) +
ylim(mean(d$fast) - sd(d$fast), mean(d$fast) + sd(d$fast)) +
geom_hline(yintercept = mean(d$fast), color = "red")

This does not look like it's close to being a normal distribution; the measurements below the mean are much closer together than those above the mean. This becomes more apparent when considering how many samples fall within 0.5, 1, 2, and 3 standard deviations of the mean. For normally distributed data, we would expect approximately this:

library(purrr)
samples <- rnorm(length(d$fast), mean = mean(d$fast), sd = sd(d$fast))
f <- function(x) length(samples[abs(samples - mean(samples)) <= x * sd(samples)]) / length(samples)
map_dbl(c(0.5, 1, 2, 3), f)
But this is what we see:

samples <- d$fast
map_dbl(c(0.5, 1, 2, 3), f)
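In JavaScript terms, the same check amounts to the following (illustrative sketch). For normally distributed data, the expected fractions within 0.5, 1, 2, and 3 standard deviations of the mean are roughly 0.38, 0.68, 0.95, and 0.997:

```js
// Fraction of samples within k standard deviations of the mean.
function withinSigma(samples, k) {
  const n = samples.length;
  const m = samples.reduce((a, b) => a + b, 0) / n;
  const sd = Math.sqrt(samples.reduce((acc, x) => acc + (x - m) ** 2, 0) / (n - 1));
  return samples.filter((x) => Math.abs(x - m) <= k * sd).length / n;
}

// Hypothetical usage on the logged benchmark timings:
// console.log([0.5, 1, 2, 3].map((k) => withinSigma(measurements, k)));
```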
But the t test assumes that both sets of samples come from normal distributions, and since that isn't true here, the t test is not necessarily meaningful.

My main takeaway from this (at least for the moment) is that the measurements are much noisier and further from the test's statistical assumptions than the test accounts for. This does not explain why the failures are much more frequent on rhel72-ppc64 than on other machines, though. We can probably fine-tune this test to reduce the failure probability enough to re-enable the test, but I am not sure we are addressing the real problem. (cc @Trott)
PR-URL: nodejs#50232 Refs: nodejs#38226 Reviewed-By: Ruy Adorno <ruyadorno@google.com> Reviewed-By: Filip Skokan <panva.ip@gmail.com>
Refs: #52341. Not sure if this change might make it better or worse.
Seems to be failing a lot on macOS now. An example: https://github.com/nodejs/node/actions/runs/8822600805/job/24221172295?pr=52671
PR-URL: #52751 Refs: #38226 Reviewed-By: Marco Ippolito <marcoippolito54@gmail.com> Reviewed-By: Filip Skokan <panva.ip@gmail.com> Reviewed-By: Michaël Zasso <targos@protonmail.com> Reviewed-By: Daniel Lemire <daniel@lemire.me> Reviewed-By: Richard Lau <rlau@redhat.com> Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Rich Trott <rtrott@gmail.com>
b876e00 landed, I'm closing this.
pummel/test-crypto-timing-safe-equal-benchmarks seems to fail frequently on LinuxONE. That's often a sign that a test (or some Node.js internal code) is not prepared to run on very fast CPUs. (In CI, our LinuxONE host is very fast and often finds race conditions of the "oh, I didn't expect it to run that quickly" variety.)

Is this a bug in crypto? A bug in the test? Something else?

@nodejs/crypto @nodejs/testing

There doesn't seem to be a LinuxONE platform team, so pinging a few IBM/Red Hat folks who are active in the Build team: @richardlau @AshCripps @mhdawson. Probably not the exact right people, but they probably know who to loop in.