When using multiple benchmarks earlier ones affect the ones coming later #2
There are two ways to solve the problem:
I am not able to figure out the real cause of the interference, but I have a solution along the lines of the second option mentioned above. It solves the problem generically, and with more confidence, irrespective of how the Haskell runtime or garbage collector behaves. The isolated process must run only the minimal function being benchmarked and nothing else, so that nothing else can interfere with it. When we run more than one benchmark, we invoke gauge again as a separate process with just one benchmark selected. The report generated by that run is kept in a file; the file is created and its name passed in by the parent process when invoking the child. The parent process then consumes this output file to generate a combined report or to show the output to the user. There are a few issues/things to be taken care of when invoking a child process:
Let me know if this sounds reasonable or if this should be fixed outside the library. If we fix it in a wrapper, we will need the core executable to generate a list of all the benchmarks requested by the user, which the wrapper can then ask it to execute one at a time (see the sketch at the end of this comment); we will also need a way to pass the core library's arguments through from the wrapper. Another point to consider is whether we may want this as the default behavior in the future, to make the tool more reliable by default; in that case it should perhaps be implemented in the core library. Also, this will add a dependency on at least

I want to emphasize that this is important to fix, because the first thing required of a benchmarking library is reliability; other features do not matter as much.
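To make the idea concrete, here is a minimal, hypothetical sketch of the parent/wrapper side. It assumes the benchmark executable can print the names of the requested benchmarks one per line and can be told to run a single, exactly matched benchmark and write its report to a given file; the flag names used here (--list, --match, --output) and the report handling are made up for illustration and are not gauge's actual interface.

```haskell
-- Hypothetical wrapper: run every benchmark in its own child process and
-- collect a per-benchmark report file. The flags used here (--list,
-- --match, --output) are assumptions made for illustration only.
import System.Process (callProcess, readProcess)

main :: IO ()
main = do
  let exe = "./my-bench-exe"   -- hypothetical benchmark executable
  -- Ask the executable for the names of all requested benchmarks.
  names <- lines <$> readProcess exe ["--list"] ""
  -- Run each benchmark in an isolated child process, one at a time.
  reports <- mapM (runOne exe) names
  -- A real wrapper would parse and merge these report files into a
  -- combined report; here we only print their names.
  mapM_ putStrLn reports

-- Run a single benchmark in a child process, writing its report to a file
-- whose name is chosen by the parent, and return that file name.
runOne :: FilePath -> String -> IO FilePath
runOne exe name = do
  let outFile = map sanitize name ++ ".report"
  callProcess exe ["--match", "exact", name, "--output", outFile]
  return outFile
  where
    sanitize c = if c == '/' then '_' else c
```

A real implementation would also need to forward the user's remaining command-line options to each child and decide how to merge the per-benchmark reports.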
I am modularizing the code to separate the running and analysis parts as much as possible; this will make running in a separate process easier. I will send a PR soon. Let me know if this is OK, or if it interferes with any pending work you may have. One reason I suspect for the variance in the later benchmarks is the GC overhead generated by the previous benchmarks, as well as by the analysis of the previous benchmark. A performGC is done after each iteration, but I guess that only triggers a GC and does not guarantee that all the pending work is finished, so we may have an overhead carried over to the next benchmark. Also, even when running in a separate process we should perhaps always throw away the first measurement, because it starts with zero GC overhead while the rest of the iterations start with a potential GC overhead, making the first one always an outlier (a rough sketch of this idea follows).
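As a rough illustration of that last point (not gauge's actual measurement code), a measurement loop along these lines would force a GC before every run and throw away the first sample; it assumes the deepseq package and a GHC new enough to provide GHC.Clock:

```haskell
-- Illustrative only (not gauge's code): time an action several times,
-- forcing a GC before each run, and drop the first sample because it
-- starts from a different GC state than the rest.
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Control.Monad (replicateM)
import GHC.Clock (getMonotonicTime)
import System.Mem (performGC)

sample :: NFData a => Int -> IO a -> IO [Double]
sample n act = do
  times <- replicateM (n + 1) (timeOne act)
  -- Treat the first run as a warm-up and discard it.
  return (drop 1 times)

timeOne :: NFData a => IO a -> IO Double
timeOne act = do
  performGC                       -- try to start from a clean heap
  start <- getMonotonicTime
  _ <- evaluate . force =<< act   -- force the result so the work is measured
  end <- getMonotonicTime
  return (end - start)
```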
Related to the GC thing, I just made a similar mental note about the GC issues; I'm currently adding some debugging prints to see what the behavior is like. I'm trying to modernize and lower the cost of the measurement part at the moment, so if possible keep that part as is; it would make things easier. One thing that I'm planning to do soon is to dump all the data as measured, without analysing anything (something along the lines of the sketch below). I think having the raw data conveniently available will make it easier to process the data dump and to understand the behavior we are seeing.
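For the raw-data dump, even something as simple as one CSV row per iteration, written out before any analysis, would be enough; the record and field names below are hypothetical and only sketch the idea:

```haskell
-- Hypothetical raw-measurement dump: one CSV row per iteration, written
-- before any statistical analysis. The field names are illustrative.
import Data.List (intercalate)

data Measurement = Measurement
  { mBenchName :: String   -- which benchmark this sample belongs to
  , mIteration :: Int      -- iteration index within the benchmark
  , mSeconds   :: Double   -- measured time for this iteration, in seconds
  }

dumpRaw :: FilePath -> [Measurement] -> IO ()
dumpRaw path ms = writeFile path (unlines (header : map row ms))
  where
    header = "name,iteration,seconds"
    row m  = intercalate ","
               [mBenchName m, show (mIteration m), show (mSeconds m)]
```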
It seems it is already doing what I was suggesting, i.e. discarding the first measurement:
That's sort of what I did too! Let me push the code as a PR so you can take a look and see whether it tallies with what you have in mind, or you can build on it if you have not already made your changes.
sure that sounds good, I've got distracted with passing time measurement in |
Added PR #3. The results are pretty good with the change: it drastically reduces variance and standard deviation, and the later benchmark results are not inflated anymore. The change adds dependencies on

I am open to any improvements/feedback.
Fixed by PR #3. |
I reported this against criterion and am reporting it here as well; I have tested it using gauge and, as expected, the same problem persists.
I have the following benchmarks in a group:
The last two benchmarks take significantly more time when I run all these benchmarks in one go using:

stack bench --benchmark-arguments "-m glob ops/map/*"

However, when I run the benchmarks individually, the results are different:
To reproduce the issue just run those commands in this repo. The repo works with gauge.
I cannot figure out what the problem is here. I tried using "env" to run the benchmarks, and putting a "threadDelay" of a few seconds and a "performGC" in it (roughly as in the sketch below), but nothing helps.
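For concreteness, the attempted mitigation looked roughly like the following paraphrase, not the actual code from the repo; it assumes gauge exposes the usual criterion-style combinators (env, bench, bgroup, nf, defaultMain) from Gauge.Main, and the benchmark names and data are made up:

```haskell
-- Paraphrase of the attempted mitigation (not the repo's actual code):
-- build the input through env, forcing a GC and waiting a couple of
-- seconds first. This did not remove the interference.
import Control.Concurrent (threadDelay)
import qualified Data.Map.Strict as Map
import Gauge.Main (bench, bgroup, defaultMain, env, nf)
import System.Mem (performGC)

main :: IO ()
main = defaultMain
  [ bgroup "ops/map"
      [ env setup $ \m ->
          bench "lookup" $ nf (Map.lookup (5000 :: Int)) m
      ]
  ]
  where
    setup = do
      performGC               -- try to clear GC debt from earlier benchmarks
      threadDelay 2000000     -- pause for two seconds
      return (Map.fromList (zip [1 .. 10000 :: Int] [1 .. 10000 :: Int]))
```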
I am now resorting to always running each benchmark individually in a separate process. Maybe criterion itself could support running each benchmark in a separate process to guarantee isolation of benchmarks, as I have seen this sort of problem too often. By now I am always skeptical of the results produced by criterion.