
A benchmark's runtime can depend on the presence/absence of other benchmarks #60

Open
mikeizbicki opened this issue Sep 2, 2014 · 5 comments

@mikeizbicki

The following code compares two functions for summing over a vector:

{-# LANGUAGE BangPatterns #-}

import Control.DeepSeq
import Criterion
import Criterion.Main
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Generic as VG
import qualified Data.Vector.Fusion.Stream as Stream

-- Sum a vector by folding directly over its fusion stream.
sumV :: (VG.Vector v a, Num a) => v a -> a
sumV = Stream.foldl' (+) 0 . VG.stream

main :: IO ()
main = do
    let v10 = VU.fromList [0..9] :: VU.Vector Double
    deepseq v10 $ return ()

    defaultMain
        [ bench "sumV"   $ nf sumV v10
        ]

But suppose I change the last few lines to the following:

    defaultMain 
        [ bench "sumV"   $ nf sumV v10
        , bench "VU.sum" $ nf VU.sum v10    -- Added this line
        ]

This, surprisingly, affects the runtime of the sumV benchmark: it becomes about 20% faster. Similarly, if we remove the sumV benchmark and keep only the VU.sum benchmark, VU.sum becomes about 20% slower. Tests were run with the patched criterion-1.0.0.2 I sent, on ghc-7.8.3 with the -O2 -fllvm flags.

What's going on is that different Core is generated for the sumV and VU.sum benchmarks depending on whether the other benchmark is present. Essentially, the common bits are factored out into a shared function that both benchmarks call, and this happens to make both benchmarks faster.
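
A rough way to picture it (this is only an illustration, not the actual Core GHC emits): with both benchmarks in the module, the two entries end up calling one shared, specialised worker loop, roughly as if the code had been written like this:

-- Purely illustrative sketch; sumWorker and the bench* names are made up.
-- With both benchmarks present, GHC effectively shares one loop
-- specialised to VU.Vector Double between them.
sumWorker :: VU.Vector Double -> Double
sumWorker = VU.foldl' (+) 0

benchSumV, benchVUSum :: VU.Vector Double -> Double
benchSumV  = sumWorker   -- what the "sumV" benchmark reduces to
benchVUSum = sumWorker   -- what the "VU.sum" benchmark reduces to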

I'm not sure if this should be considered a "proper bug," but it confused me for an hour or so. It's something that criterion users (especially those benchmarking very small functions) should probably be aware of.

@harendra-kumar

I am observing similar behavior. When I add a second benchmark, the first one takes longer. The two pieces of code I am benchmarking are completely unrelated. I have also tried this with many combinations and consistently see the same result.

The measurement should cover only the function under test and should be independent of the surrounding code. I do not think the generated code of the two benchmarked functions is itself different in the two cases; they are completely unrelated.

It's hard to trust the benchmarking results because of this.

Here is a simplified test:

import qualified Data.Text.ICU             as TI
...
main :: IO ()
main = do
    str <- readFile "data/English.txt"
    let str' = take 1000000 (cycle str)
        txt = T.pack str'
    str' `deepseq` txt `deepseq` defaultMain
        [ bgroup "text-icu" $ [bench "1" (nf (TI.normalize TI.NFD) txt)]
        , bgroup "just-count" $ [bench "1" (nf (show . length) str')]
        ]

The first benchmark measures the text-icu normalize function. When I run it with only the first benchmark present, it reports:

benchmarking text-icu/1
time                 2.830 ms   (2.777 ms .. 2.913 ms)

When I add the second one it becomes:

cueball:/vol/hosts/cueball/workspace/play/criterion$ ./Benchmark1 
benchmarking text-icu/1
time                 3.709 ms   (3.570 ms .. 3.846 ms)

benchmarking just-count/1
time                 2.677 ms   (2.516 ms .. 2.872 ms)

A 30% degradation by just adding a line. The difference is even more marked in several other cases.

This problem is forcing me to run criterion with only one benchmark at a time. Also note that the result is wrong even when only one benchmark is selected out of many on the command line: just the presence of another benchmark in the executable is enough, irrespective of the runtime selection.

I am running criterion-1.1.1.0 and ghc-7.10.3.

@harendra-kumar

It seems my problem was due to sharing the input data across benchmarks, which caused undue memory pressure on the later benchmarks. The problem was resolved by using env. I rewrote the above code like this:

setup :: IO String
setup = fmap (take 1000000 . cycle) (readFile "data/English1.txt")

main :: IO ()
main = defaultMain
    [ bgroup "text-icu"
        [ env (fmap T.pack setup) $ \txt ->
            bench "1" (nf (TI.normalize TI.NFD) txt)
        ]
    , bgroup "just-count"
        [ env setup $ \str ->
            bench "1" (nf (show . length) str)
        ]
    ]

One possible enhancement could be to strongly recommend env in the documentation whenever multiple benchmarks are used, or, even better, to detect the case where env is not being used and issue a warning at runtime.

@rrnewton
Member

rrnewton commented Aug 16, 2016

Yes, being cognizant of the working set is hard with Haskell's lazy semantics and GHC's optimizations. I don't know what would be detected here for a warning, though -- what would the check for a linter be?

The original issue is a tricky one of compilation units. You can always put benchmarks in separate modules but that's a pain. I'm not sure what would be a good solution to alleviate this pain. TH doesn't seem sufficient.

It kind of seems like you'd want something like fragnix to create minimal compilation units that isolate your benchmarks.
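
For reference, the separate-modules workaround mentioned above looks roughly like this; the Bench.* module names and their benchmarks exports are made up for illustration:

-- Sketch of the separate-modules workaround. Each benchmark's code lives
-- in its own compilation unit, so the two are much less likely to be
-- optimised together when one is added or removed.
module Main (main) where

import Criterion.Main
import qualified Bench.SumV  as SumV    -- hypothetical: exports benchmarks :: [Benchmark]
import qualified Bench.VUSum as VUSum   -- hypothetical: exports benchmarks :: [Benchmark]

main :: IO ()
main = defaultMain (SumV.benchmarks ++ VUSum.benchmarks)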

@RyanGlScott
Member

FWIW, the hspec test suite has a tool called hspec-discover that automates the process of discovering modules in a directory that contain tests (modules that end with the suffix Spec). If isolating benchmarks into separate modules is the recommended approach to solving this particular issue, we could consider implementing hspec-discover-style functionality to automate discovery of benchmarks in other modules.
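
For context, hspec-discover is driven by a one-line stub module that runs the tool as a GHC preprocessor; a criterion analogue would presumably be wired up the same way (the criterion-discover name below is made up):

-- Spec.hs, as hspec-discover is used today: the preprocessor generates a
-- main that imports and runs every module ending in "Spec" in the directory.
{-# OPTIONS_GHC -F -pgmF hspec-discover #-}

-- A hypothetical criterion counterpart could follow the same pattern:
-- {-# OPTIONS_GHC -F -pgmF criterion-discover #-}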

@RyanGlScott
Member

See #166 for another example.
