
A benchmark's runtime can depend on the presence/absence of other benchmarks #60

Open
mikeizbicki opened this issue Sep 2, 2014 · 5 comments

@mikeizbicki

The following code compares two functions for summing over a vector:

{-# LANGUAGE BangPatterns #-}

import Control.DeepSeq
import Criterion
import Criterion.Main
import qualified Data.Vector.Unboxed as VU
import qualified Data.Vector.Generic as VG
import qualified Data.Vector.Fusion.Stream as Stream

-- Sum a vector by folding directly over its fusion stream.
sumV :: (VG.Vector v a, Num a) => v a -> a
sumV = Stream.foldl' (+) 0 . VG.stream

main :: IO ()
main = do
    let v10 = VU.fromList [0..9] :: VU.Vector Double
    deepseq v10 $ return ()

    defaultMain
        [ bench "sumV"   $ nf sumV v10
        ]

But suppose I change the last few lines to the following:

    defaultMain 
        [ bench "sumV"   $ nf sumV v10
        , bench "VU.sum" $ nf VU.sum v10    -- Added this line
        ]

This, surprisingly, affects the runtime of the sumV benchmark: it becomes about 20% faster. Similarly, if we remove the sumV benchmark and keep only the VU.sum benchmark, VU.sum becomes about 20% slower. Tests were run with the patched criterion-1.0.0.2 I sent, on ghc-7.8.3 with the -O2 -fllvm flags.

What's going on is that different Core is generated for the sumV and VU.sum benchmarks depending on whether the other benchmark is present. Essentially, the common bits are factored out into a shared function that both benchmarks call, and this happens to make both benchmarks faster.
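
A rough way to picture it (this is only an illustration, not the actual Core GHC emits): with both benchmarks in the module, the two entries end up calling one shared, specialised worker loop, roughly as if the code had been written like this:

-- Purely illustrative sketch; sumWorker and the bench* names are made up.
-- With both benchmarks present, GHC effectively shares one loop
-- specialised to VU.Vector Double between them.
sumWorker :: VU.Vector Double -> Double
sumWorker = VU.foldl' (+) 0

benchSumV, benchVUSum :: VU.Vector Double -> Double
benchSumV  = sumWorker   -- what the "sumV" benchmark reduces to
benchVUSum = sumWorker   -- what the "VU.sum" benchmark reduces to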

I'm not sure if this should be considered a "proper bug," but it confused me for an hour or so. It's something that criterion users (especially those benchmarking very small functions) should probably be aware of.

@harendra-kumar

I am observing similar behavior. When I add a second benchmark, the first one takes longer. The two pieces of code I am benchmarking are completely unrelated. I have also tried this with many combinations and consistently see the same result.

The measurement should cover only the function under test and should be independent of the surrounding code. I do not think the generated code of the two benchmarked functions is itself different in the two cases; they are completely unrelated.

It's hard to trust the benchmarking results because of this.

Here is a simplified test:

import qualified Data.Text.ICU             as TI
...
main :: IO ()
main = do
    str <- readFile "data/English.txt"
    let str' = take 1000000 (cycle str)
        txt = T.pack str'
    str' `deepseq` txt `deepseq` defaultMain
        [ bgroup "text-icu" $ [bench "1" (nf (TI.normalize TI.NFD) txt)]
        , bgroup "just-count" $ [bench "1" (nf (show . length) str')]
        ]

The first benchmark measures the text-icu normalize function. When I run it with only the first benchmark present, it reports:

benchmarking text-icu/1
time                 2.830 ms   (2.777 ms .. 2.913 ms)

When I add the second one it becomes:

cueball:/vol/hosts/cueball/workspace/play/criterion$ ./Benchmark1 
benchmarking text-icu/1
time                 3.709 ms   (3.570 ms .. 3.846 ms)

benchmarking just-count/1
time                 2.677 ms   (2.516 ms .. 2.872 ms)

A 30% degradation by just adding a line. The difference is even more marked in several other cases.

This problem is forcing me to run criterion with only one benchmark at a time. Also note that the result is wrong even when only one benchmark is selected out of many on the command line: just the presence of another benchmark in the executable is enough, irrespective of the runtime selection.

I am running criterion-1.1.1.0 and ghc-7.10.3.

@harendra-kumar

It seems my problem was due to sharing the input data across benchmarks, which caused undue memory pressure on the later benchmarks. The problem was resolved by using env. I rewrote the above code like this:

setup :: IO String
setup = fmap (take 1000000 . cycle) (readFile "data/English1.txt")

main :: IO ()
main = defaultMain
    [ bgroup "text-icu"
        [ env (fmap T.pack setup) $ \txt ->
            bench "1" (nf (TI.normalize TI.NFD) txt)
        ]
    , bgroup "just-count"
        [ env setup $ \str ->
            bench "1" (nf (show . length) str)
        ]
    ]

One possible enhancement could be to strongly recommend env in the documentation whenever multiple benchmarks are used, or, even better, to detect the case where env is not being used and issue a warning at runtime.

@rrnewton
Member

rrnewton commented Aug 16, 2016

Yes, being cognizant of the working set is hard with Haskell's lazy semantics and GHC's optimizations. I don't know what would be detected here for a warning, though -- what would the check for a linter be?

The original issue is a tricky one of compilation units. You can always put benchmarks in separate modules but that's a pain. I'm not sure what would be a good solution to alleviate this pain. TH doesn't seem sufficient.

It kind of seems like you'd want something like fragnix to create minimal compilation units that isolate your benchmarks.
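
For reference, the separate-modules workaround mentioned above looks roughly like this; the Bench.* module names and their benchmarks exports are made up for illustration:

-- Sketch of the separate-modules workaround. Each benchmark's code lives
-- in its own compilation unit, so the two are much less likely to be
-- optimised together when one is added or removed.
module Main (main) where

import Criterion.Main
import qualified Bench.SumV  as SumV    -- hypothetical: exports benchmarks :: [Benchmark]
import qualified Bench.VUSum as VUSum   -- hypothetical: exports benchmarks :: [Benchmark]

main :: IO ()
main = defaultMain (SumV.benchmarks ++ VUSum.benchmarks)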

@RyanGlScott
Member

FWIW, the hspec test suite has a tool called hspec-discover that automates the process of discovering modules in a directory that contain tests (modules that end with the suffix Spec). If isolating benchmarks into separate modules is the recommended approach to solving this particular issue, we could consider implementing hspec-discover-style functionality to automate discovery of benchmarks in other modules.
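
For context, hspec-discover is driven by a one-line stub module that runs the tool as a GHC preprocessor; a criterion analogue would presumably be wired up the same way (the criterion-discover name below is made up):

-- Spec.hs, as hspec-discover is used today: the preprocessor generates a
-- main that imports and runs every module ending in "Spec" in the directory.
{-# OPTIONS_GHC -F -pgmF hspec-discover #-}

-- A hypothetical criterion counterpart could follow the same pattern:
-- {-# OPTIONS_GHC -F -pgmF criterion-discover #-}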

@RyanGlScott
Member

See #166 for another example.
