This repository has been archived by the owner on Sep 20, 2023. It is now read-only.

When using multiple benchmarks earlier ones affect the ones coming later #2

Closed
harendra-kumar opened this issue Nov 4, 2017 · 7 comments

Comments

@harendra-kumar
Collaborator

harendra-kumar commented Nov 4, 2017

I reported this against criterion and am reporting it here as well. I have tested it with gauge and, as expected, the same problem persists.

I have the following benchmarks in a group:

        bgroup "map"
          [ bench "machines" $ whnf drainM (M.mapping (+1))
          , bench "streaming" $ whnf drainS (S.map (+1))
          , bench "pipes" $ whnf drainP (P.map (+1))
          , bench "conduit" $ whnf drainC (C.map (+1))
          , bench "list-transformer" $ whnf drainL (lift . return . (+1))
          ]

The last two benchmarks take significantly more time when I run all these benchmarks in one go using stack bench --benchmark-arguments "-m glob ops/map/*".

$ stack bench --benchmark-arguments "-m glob ops/map/*"

benchmarking ops/map/machines
time                 30.23 ms   (29.22 ms .. 31.04 ms)

benchmarking ops/map/streaming
time                 17.91 ms   (17.48 ms .. 18.37 ms)

benchmarking ops/map/pipes
time                 29.30 ms   (28.12 ms .. 30.03 ms)

benchmarking ops/map/conduit
time                 36.69 ms   (35.73 ms .. 37.58 ms)

benchmarking ops/map/list-transformer
time                 84.06 ms   (75.02 ms .. 90.34 ms)

However, when I run individual benchmarks, the results are different:

$ stack bench --benchmark-arguments "-m glob ops/map/conduit"

benchmarking ops/map/conduit
time                 31.64 ms   (31.30 ms .. 31.86 ms)

$ stack bench --benchmark-arguments "-m glob ops/map/list-transformer"

benchmarking ops/map/list-transformer
time                 68.67 ms   (66.84 ms .. 70.96 ms)

To reproduce the issue, just run those commands in this repo. The repo works with gauge.

I cannot figure out what the problem is here. I tried using "env" to run the benchmarks, putting a "threadDelay" of a few seconds and a "performGC" in it, but nothing helps.
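
Roughly what I tried looks like the following sketch (with a placeholder benchmark body instead of the repo's drain functions, and assuming gauge's Gauge.Main mirrors Criterion.Main):

    import Control.Concurrent (threadDelay)
    import System.Mem (performGC)
    import Gauge.Main (bench, bgroup, defaultMain, env, whnf)

    -- Force a major GC and pause for a few seconds before the benchmark
    -- runs, hoping to let leftover work from earlier benchmarks finish.
    settle :: IO ()
    settle = performGC >> threadDelay 3000000

    main :: IO ()
    main = defaultMain
      [ bgroup "map"
          [ env settle $ \_ ->
              bench "placeholder" $ whnf (sum :: [Int] -> Int) [1 .. 100000]
          ]
      ]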

I am now resorting to always running each benchmark individually in a separate process. Maybe criterion itself could support running each benchmark in a separate process, to guarantee isolation of benchmarks; I have seen this sort of problem too often, and I am now always skeptical of the results criterion produces.

@harendra-kumar
Collaborator Author

There are two ways to solve the problem:

  1. Figure out the cause of interference among benchmarks and fix that.
  2. Provide an option to run benchmarks in a non-interfering, isolated manner, i.e. run each benchmark in a separate process.

I am not able to figure out the real cause of the interference, but I have a solution along the lines of the second option above. It solves the problem generically and with more confidence, irrespective of how the Haskell runtime or garbage collector behaves. The isolated process must run only the minimal function being benchmarked and nothing else, so that nothing else can interfere with it.

When more than one benchmark is to be run, we invoke gauge again as a separate process with just one benchmark selected. The report generated by that run is written to a file; the file is created by the parent process, and its name is passed to the child on invocation. The parent process then consumes this output file to generate a combined report or to show the output to the user.

There are a few things to take care of when invoking a child process (a rough sketch follows the list):

  1. We need to identify how we were invoked and invoke the child using the same executable and arguments.
  2. We need a way for the child to know that it is the child process so that it does not recurse again; this will need a command-line flag (perhaps hidden from users).
  3. The command-line arguments for the child will have to be edited, replacing the argument that identifies the set of benchmarks to run with a single benchmark.
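
A minimal sketch of the parent-side invocation, just to make the plan concrete (the --run-one and --report-file flags are placeholders rather than gauge's actual interface, and withSystemTempFile comes from the temporary package):

    import System.Environment (getExecutablePath)
    import System.IO (hClose)
    import System.IO.Temp (withSystemTempFile)
    import System.Process (callProcess)

    -- Re-invoke the current executable for a single benchmark, tell the
    -- child where to write its report, and read the report back.
    runIsolated :: [String] -> String -> IO String
    runIsolated commonArgs benchName = do
      exe <- getExecutablePath
      withSystemTempFile "gauge-report" $ \path h -> do
        hClose h
        -- commonArgs are the parent's options minus its benchmark selection;
        -- the two placeholder flags mark the child so that it runs exactly one
        -- benchmark and writes its report to the given file instead of stdout.
        callProcess exe (commonArgs ++ ["--run-one", benchName, "--report-file", path])
        contents <- readFile path
        length contents `seq` pure contents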

Let me know if this sounds reasonable or whether it should be fixed outside the library. If we fix it in a wrapper, the core executable will need to generate a list of all the benchmarks the user requested, which the wrapper can then ask it to execute one at a time. We will also need a way for the wrapper to pass on the core library's arguments.

Another point to consider is whether we want this as the default behavior in the future, to make the tool more reliable out of the box. In that case it should perhaps be implemented in the core library. Note that this adds a dependency on at least unix, and at most process, for process creation.

I want to emphasize that this is important to fix: the first thing required of a benchmarking library is reliability; other features do not matter as much.

@harendra-kumar
Collaborator Author

I am modularizing the code to separate the running and analysis parts as much as possible. This will make running in a separate process easier. I will send a PR soon; let me know if this is okay or if it interferes with any pending work you may have.

One cause I suspect for the variance in later benchmarks is the GC overhead generated by the previous benchmarks, as well as by the analysis of the previous benchmark. A performGC is done after each iteration, but I guess that only triggers a GC and does not guarantee that all pending work finishes, so some overhead may be carried over to the next benchmark.
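
As a rough way to check this (my own probe, not gauge code), one could compare the RTS GC counters around a benchmark batch; this requires running the program with +RTS -T:

    import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
    import System.Mem (performGC)

    -- Report how many major GCs ran and how long the collector spent
    -- while the given action executed.
    withGCReport :: String -> IO a -> IO a
    withGCReport label act = do
      enabled <- getRTSStatsEnabled
      if not enabled
        then act
        else do
          before <- getRTSStats
          r <- act
          performGC
          after <- getRTSStats
          putStrLn $ label ++ ": major GCs = "
                  ++ show (major_gcs after - major_gcs before)
                  ++ ", GC elapsed ns = "
                  ++ show (gc_elapsed_ns after - gc_elapsed_ns before)
          pure r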

Also, even when running in a separate process, we should perhaps always throw away the first measurement: it starts with zero GC overhead while the rest of the iterations start with potential GC overhead, making the first one an outlier.

@vincenthz
Owner

Related to the GC point, I just made a similar mental note about the GC issues. I'm currently adding some debugging prints to see what the behavior looks like.

I'm trying to modernize and lower the cost of the measurement part at the moment, so if possible keep that part as is; it would make things easier.

One thing I'm planning to do soon is to dump all the data as measured, without analysing anything. I think it will be easier to process the data or to understand the behavior we are seeing if the raw data is conveniently available.
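
One possible shape for such a dump, purely as an illustration (not necessarily the format gauge will use): one row per raw sample, written out with no analysis applied.

    import Data.List (intercalate)
    import Data.Word (Word64)

    -- A raw sample: how many iterations were run and what was observed.
    data Measurement = Measurement
      { mIters     :: Word64   -- iterations in this sample
      , mElapsedNs :: Word64   -- wall-clock time for the sample, in nanoseconds
      , mAllocated :: Word64   -- bytes allocated during the sample
      }

    dumpCSV :: FilePath -> String -> [Measurement] -> IO ()
    dumpCSV path name ms =
      writeFile path . unlines $
        "benchmark,iters,elapsed_ns,allocated_bytes"
        : [ intercalate "," [name, show (mIters m), show (mElapsedNs m), show (mAllocated m)]
          | m <- ms ]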

@harendra-kumar
Collaborator Author

It seems the code is already doing what I was suggesting, i.e. discarding the first measurement:

runBenchmark bm timeLimit = do
  runBenchmarkable_ bm 1
  start <- performGC >> getTime

You wrote: "One thing I'm planning to do soon is to dump all the data as measured, without analysing anything."

That's sort of what I did too! Let me push the code as a PR; you can take a look to see whether it tallies with what you have in mind, or build on it if you have not already made your changes.

@vincenthz
Owner

vincenthz commented Nov 5, 2017

Sure, that sounds good. I've got distracted with converting the time measurements from Double to Word64 to avoid Double issues, and with removing some of the unrealistic assumptions about time measurement (no need to mention femtoseconds anywhere :P): measuring in nanoseconds and doing our "floating" calculations in hundreds or tens of picoseconds.
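
Something like the following captures the representational change being described (my sketch, not the actual gauge types): keep raw times as integral nanoseconds and convert to a fractional value only at the edges, e.g. for display.

    import Data.Word (Word64)

    -- Raw measurements stay integral, so no Double rounding error accumulates.
    newtype NanoSeconds = NanoSeconds Word64
      deriving (Eq, Ord, Show)

    -- Convert only when a fractional value is actually needed, e.g. for reporting.
    toSeconds :: NanoSeconds -> Double
    toSeconds (NanoSeconds ns) = fromIntegral ns / 1e9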

@harendra-kumar
Collaborator Author

Added PR #3. The results are pretty good with the change: it drastically reduces variance and standard deviation, and the later benchmark results are no longer inflated.

The --measure-with option is a bit clumsy, but I could not come up with a better portable solution. Let me know if you can think of any improvements.

The change adds dependencies on directory, process, unix and Win32; let me know if that's okay or if you have a way to remove them.

I am open to any improvements/feedback.

@harendra-kumar
Collaborator Author

Fixed by PR #3.
