
Improve accuracy of CPU-bound benchmarks #1428

Closed · wants to merge 3 commits

Conversation

@syduki (Contributor) commented Oct 16, 2023

The rationale behind this change is that running multiple iterations of a benchmark within the same tab can distort the results due to optimization strategies in the browser. This introduces unreliable variance when comparing different kinds of frameworks, whether by nature (compiled vs. runtime) or by behavior (e.g., innerHTML vs. createElement).
This aligns with my observations: certain frameworks lag behind others in some scenarios, even when using manual DOM manipulations.
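To make the two strategies concrete, here is a minimal sketch using Puppeteer purely for illustration; the benchmark's actual driver code differs, and the URL and iteration count are placeholders:

const puppeteer = require('puppeteer');

const runIteration = async (page) => {
    // placeholder URL; a real run would measure a click-to-paint duration here
    await page.goto('http://localhost:8080/frameworks/keyed/vanillajs/', {waitUntil: 'networkidle0'});
};

(async () => {
    const browser = await puppeteer.launch();

    // current approach: one tab, navigate/reload between iterations,
    // so JIT warm-up, caches and GC state can carry over
    const page = await browser.newPage();
    for (let i = 0; i < 10; i++) await runIteration(page);
    await page.close();

    // proposed approach: a fresh tab per iteration, so no state carries over
    for (let i = 0; i < 10; i++) {
        const freshPage = await browser.newPage();
        await runIteration(freshPage);
        await freshPage.close();
    }

    await browser.close();
})();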

To observe the impact of this change, a small experiment was run in the following environment:

HW: HP EliteBook 8470p, i5-3320M × 4, 16.0 GiB
SW: Ubuntu 23.04, Chromium 118.0.5993.70 (Official Build) snap (64-bit)

Given the constraints of low-spec hardware, I opted for a minimal set of well-known (keyed) frameworks. The vanillajs-1 implementation serves as the control. Additionally, I included the karyon framework, which I maintain and whose behavior under the various test scenarios I know well.

The benchmark figures are below:

  • using page reload between iterations (current):

[screenshot: cpu-benchmark-old]

and the results.ts.

  • using new page between iterations (modified):

[screenshot: cpu-benchmark-new]

and the results.ts.

@krausest (Owner)

Thanks. I'll take a look at it. Which results do you find the most striking? Alpine create rows or something else?

@fabiospampinato (Contributor)

Interesting how Karyon seems to be significantly more affected than the others by the change 🤔

@syduki (Contributor, Author) commented Oct 17, 2023

@krausest thanks. I would say that apart from the scenarios involving creation/deletion of rows, all others have better results, although, as you noted, alpine gets a boost in all scenarios.

@fabiospampinato I suppose it is because karyon is based purely on objects, while the other implementations (except vanilla js) use some form of string manipulation. The first thing that comes to mind as a potential factor is the bfcache, which persists all those objects, though this is just an assumption and there may be other browser optimization processes involved under the hood. However, I'm pretty confident that if I had run the full suite of frameworks, we would have observed more instances of them behaving similarly.
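For what it's worth, a page can detect whether it was restored from the bfcache rather than loaded fresh, which would be one way to probe this hypothesis. A minimal sketch using the standard pageshow event (not part of this PR):

// runs inside the benchmark page; event.persisted is true when the page
// was restored from the back/forward cache instead of being loaded fresh
window.addEventListener('pageshow', (event) => {
    if (event.persisted) {
        console.warn('restored from bfcache: JS objects survived the navigation');
    } else {
        console.log('fresh page load, no bfcache restore');
    }
});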

@krausest (Owner)

I took a look at it and I'm not so sure...
We definitely need a metric to determine whether accuracy is higher when a new page is opened per benchmark iteration (your proposal) than when one benchmark iteration is performed in one page (the implemented version).
I suggest that we use the sum of squares as the metric to decide: sum[(duration for benchmark iteration - mean for benchmark and framework)^2].
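A minimal sketch of that metric in JavaScript, assuming durations holds the iteration timings for one framework/benchmark pair (e.g. the RawResult v.total array from results.ts):

// sum of squared deviations from the mean for one framework/benchmark pair
const sumOfSquares = (durations) => {
    const mean = durations.reduce((s, d) => s + d, 0) / durations.length;
    return durations.reduce((s, d) => s + (d - mean) ** 2, 0);
};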

I chose the following frameworks (pretty similar to your choice above):
['alpine-v3.12.0-keyed', 'anansi-v0.14.0-keyed', 'angular-cf-v17.0.0-rc.0-keyed', 'angular-ngfor-v17.0.0-rc.0-keyed', 'inferno-v8.2.2-keyed', 'ivi-v3.0.0-keyed', 'lit-v3.0.0-keyed', 'preact-classes-v10.13.1-keyed', 'react-v18.2.0-keyed', 'react-hooks-v18.2.0-keyed', 'solid-v1.8.0-keyed', 'svelte-v4.0.0-keyed', 'sycamore-v0.9.0-beta.2-keyed', 'uhtml-v3.2.1-keyed', 'vanillajs-keyed', 'voby-v0.48.0-keyed', 'vue-v3.3.4-keyed', 'wasm-bindgen-v0.2.84-keyed']

Sum of squares for the new implementation (new page per iteration):
sum of squares for webdriver-ts/results_sumsquares_new and 01_run1k: 314.92891733333335
sum of squares for webdriver-ts/results_sumsquares_new and 02_replace1k: 551.9059909333336
sum of squares for webdriver-ts/results_sumsquares_new and 03_update10th1k_x16: 205.10238533333342
sum of squares for webdriver-ts/results_sumsquares_new and 04_select1k: 169.03058983999998
sum of squares for webdriver-ts/results_sumsquares_new and 05_swap1k: 606.8691998666666
sum of squares for webdriver-ts/results_sumsquares_new and 06_remove-one-1k: 202.6347252000001
sum of squares for webdriver-ts/results_sumsquares_new and 07_create10k: 14413.777205733328
sum of squares for webdriver-ts/results_sumsquares_new and 08_create1k-after1k_x2: 221.21344626666672
sum of squares for webdriver-ts/results_sumsquares_new and 09_clear1k_x8: 150.18832746666666
Sum of squares total for webdriver-ts/results_sumsquares_new : 16835.650787973325

Sum of squares for the old implementation (new page per benchmark):
sum of squares for webdriver-ts/results_sumsquares_old and 01_run1k: 242.88264213333332
sum of squares for webdriver-ts/results_sumsquares_old and 02_replace1k: 352.53603626666666
sum of squares for webdriver-ts/results_sumsquares_old and 03_update10th1k_x16: 412.1191811999999
sum of squares for webdriver-ts/results_sumsquares_old and 04_select1k: 145.1237332
sum of squares for webdriver-ts/results_sumsquares_old and 05_swap1k: 752.8689664000001
sum of squares for webdriver-ts/results_sumsquares_old and 06_remove-one-1k: 284.18701120000003
sum of squares for webdriver-ts/results_sumsquares_old and 07_create10k: 8304.411788666663
sum of squares for webdriver-ts/results_sumsquares_old and 08_create1k-after1k_x2: 220.26893653333332
sum of squares for webdriver-ts/results_sumsquares_old and 09_clear1k_x8: 162.22621959999998
Sum of squares total for webdriver-ts/results_sumsquares_old : 10876.624515199996

As you can see, the sum of squares is lower for the old implementation. If we look at each benchmark, update 10th stands out.

My current conclusion: except for update 10th row, your proposal is worse for accuracy. It could be interesting to check whether we could indeed improve accuracy by using a new page per iteration for update 10th row.

I can provide some Python scripts if someone wants to investigate.

@syduki (Contributor, Author) commented Oct 22, 2023

To be honest, I hadn't dug so deep into this; I raised it just by staring at the results table and visually comparing the data in the results.ts files (linked in the initial post). Now, looking at your calculations, I must admit they are the opposite of my observations, so to provide some proof I am including my own calculations here.

I agree with your suggestion to use the sum of squares as the metric, but I will go further and suggest the sample variance as a more complete formula for measuring dispersion.
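For reference, the sample variance of n measurements x_1, ..., x_n with mean m is

s^2 = sum[(x_i - m)^2] / (n - 1)

i.e. the sum of squares above normalized by n - 1, which is the estimator implemented in the code below.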

Below are the results (using the RawResult arrays from the above-mentioned results.ts files), where "actual" is the old implementation (new page per benchmark) and "probe" is the new implementation (new page per iteration). The results are weighted by each scenario's CPU throttling factor, for more emphasis.

Results
deviation: {
  "actual": 70,
  "probe": 38,
  "equal": 0
}

global_variance: {
  "actual": 90317.94378434446,
  "probe": 77131.07725527778
}

total_variance_per_scenario: {
  "01_run1k": {
    "actual": 420.8260004444444,
    "probe": 195.72536325555555
  },
  "02_replace1k": {
    "actual": 425.1723292111113,
    "probe": 415.9624710555551
  },
  "03_update10th1k_x16": {
    "actual": 53264.40420355556,
    "probe": 39700.66003022223
  },
  "04_select1k": {
    "actual": 21912.573555022227,
    "probe": 20733.772931200005
  },
  "05_swap1k": {
    "actual": 1982.985489422223,
    "probe": 1085.9812504000015
  },
  "06_remove-one-1k": {
    "actual": 1323.9257832888882,
    "probe": 649.5034381333339
  },
  "07_create10k": {
    "actual": 5061.0647073555665,
    "probe": 4479.431361922221
  },
  "08_create1k-after1k_x2": {
    "actual": 2082.4557004,
    "probe": 1952.440614511111
  },
  "09_clear1k_x8": {
    "actual": 3844.5360156444463,
    "probe": 7917.599794577779
  }
}

The code used to obtain these results follows.

Code
const compare = (actual, probe) => {
    const weight = {
        '01_run1k': 1,
        '02_replace1k': 1,
        '03_update10th1k_x16': 16,
        '04_select1k': 16,
        '05_swap1k': 4,
        '06_remove-one-1k': 4,
        '07_create10k': 1,
        '08_create1k-after1k_x2': 2,
        '09_clear1k_x8': 8
    };
    
    // https://en.wikipedia.org/wiki/Sample_variance
    const variance = sample => {
        const l = sample.length;
        const m = sample.reduce((s, i) => i + s, 0) / l;
        return sample.reduce((s, i) => Math.pow(i - m, 2) + s, 0) / (l - 1);
    };
    
    // framework variance per scenario (weighted, lower is better)
    const items = {};
    Object.entries({actual, probe}).forEach(([kind, values]) => {
        values.forEach(i => {
            if (i?.v?.total) {
                const key = `${i.f}-${i.b}`;
                items[key] ??= {type: i.b};
                items[key][kind] = variance(i.v.total) * weight[i.b];
            }
        });
    });
    
    // per framework/scenario pair, count which implementation shows the higher variance (lower count is better)
    const count = Object.entries(items).reduce((a, [, v]) => {
        a.actual += v.actual > v.probe ? 1 : 0;
        a.probe += v.actual < v.probe ? 1 : 0;
        a.equal += v.actual === v.probe ? 1 : 0;
        return a;
    }, {actual: 0, probe: 0, equal: 0});
    
    // total variance per scenario (lower is better)
    const total = {};
    Object.entries(items).forEach(([, {type, actual, probe}]) => {
        total[type] ??= {actual: 0, probe: 0};
        total[type].actual += actual;
        total[type].probe += probe;
    });
    
    // global variance (lower is better)
    const global = {actual: 0, probe: 0};
    Object.entries(total).forEach(([, {actual, probe}]) => {
        global.actual += actual;
        global.probe += probe;
    });
    
    console.log('total_variance_per_scenario:', total);
    console.log('global_variance:', global);
    console.log('deviation:', count);
};
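For completeness, a hypothetical invocation; resultsOld and resultsNew are placeholders for the RawResult arrays exported by the two results.ts files linked in the initial post:

// placeholders for the RawResult arrays from the old and new results.ts
compare(resultsOld, resultsNew);
// logs total_variance_per_scenario, global_variance and deviation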

From the results above it can be observed that the new implementation shows worse variance only in the 09_clear1k_x8 scenario.
Anyway, this contradiction between my results and yours suggests that the browser is not the only variable here and that the environment (HW/SW) also plays its role in the benchmark.

@leeoniya (Contributor) commented Oct 23, 2023

...suggests that the browser is not the only variable here and the environment (HW/SW) also plays it's role in the benchmark.

fwiw, this is true of every benchmark ever written.

@syduki (Contributor, Author) commented Oct 23, 2023

@leeoniya I get your point, but that statement was about the contradiction in the variance results, not the performance of the benchmark itself, i.e. we are witnessing worse variance values in a more performant environment (assuming it was really the same as in the official benchmark, i.e. MacBook Pro 14 (32 GB RAM, 8/14 cores, OSX 14.0)).

@krausest (Owner) commented Nov 6, 2023

I tried it with the Chrome 119 numbers (actual = same tab, probe = new tab):

total_variance_per_scenario: {
  '01_run1k': { actual: 288.8817823142857, probe: 243.49750913333344 },
  '02_replace1k': { actual: 319.02374380952386, probe: 278.0641292952382 },
  '03_update10th1k_x16': { actual: 957.3278442857146, probe: 519.5212925714286 },
  '04_select1k': { actual: 126.82963652666672, probe: 92.45575704333334 },
  '05_swap1k': { actual: 957.6544103238094, probe: 659.5528565047621 },
  '06_remove-one-1k': { actual: 247.7073512476191, probe: 126.67321120000004 },
  '07_create10k': { actual: 39693.03699246666, probe: 37827.02893817143 },
  '08_create1k-after1k_x2': { actual: 499.0531258571429, probe: 408.69639880952377 },
  '09_clear1k_x8': { actual: 177.88484760952383, probe: 122.45087591428567 },
}
global_variance: { actual: 43267.39973444094, probe: 40277.940968643336 }
deviation: { actual: 769, probe: 410, equal: 0 }

So in this run your suggestion indeed looked better in all cases!

There's one caveat: I'm currently seeing a few errors where the trace is mostly empty (the error says that no click event is included in the trace). I had 5 such errors with the same-tab approach and 35 with the new-tab approach. I currently have no idea how to mitigate this error. If I find something, we might switch to the new-tab approach.

@syduki (Contributor, Author) commented Nov 6, 2023

Good results indeed; now everything is in its place and the experiment aligns with the theory, i.e. we got what was expected: a sizeable improvement for the CPU-throttled scenarios, the ones most "environment-sensitive".

As for the empty traces and errors, I see these are not specifically related to this change, only magnified by it. To me they look more like page tracing issues, so maybe it would be better to have a separate thread for that.

@krausest (Owner)

As per #1493 I'm closing this here; similar functionality was integrated into master.

krausest closed this Nov 12, 2023