When using multiple benchmarks earlier ones affect the ones coming later #166
Thanks for the bug report. This is effectively the same issue as #60, so I'll close this in favor of that issue.
@RyanGlScott I am aware of #60, but this may not be the same issue. The reason is that the binary is the same in this case, so there is no question of the generated code being different. With the same binary, if I select multiple benchmarks vs a single benchmark, the results are different, and therefore it is not the same as #60. Please reopen if you agree with this reasoning, or let me know if I am missing something.
The reason I suspect they're the same issue is that internally, both of those examples are running the same flavor of code. That being said, it's difficult for me to verify this claim, since I can't reproduce the results in #60 (comment) anymore, and the program here relies on a text file that isn't provided.
As I understand from @mikeizbicki's comments in #60 the issue there was code generation being different when the source code was actually changed. The code generated was more efficient in one case than in the other.
However, in this case it is a dynamic issue rather than a static one. For example, it could be due to something done at runtime by the previous tests (e.g. more garbage being generated, which is then collected during the later tests) that affects the later tests. When the other tests are not run, the dynamic issue created by the previous tests is not present. The fundamental causes in the two cases are entirely different (static vs dynamic), even though the symptoms are similar.
I fixed this in gauge: vincenthz/hs-gauge#3. It can be pulled from there. The fix runs each benchmark in a separate process. However, the root cause seems to be space being held up in the benchmark; see vincenthz/hs-gauge#10 (comment). Perhaps the space is not released because other benchmarks are sharing it; I have not investigated yet.
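A minimal sketch of the per-process approach described above: drive the benchmark executable once per benchmark name from a wrapper script, so each run starts with a fresh heap. The binary path and the benchmark names here are placeholders, not taken from the issue; `echo` stands in as the default runner purely for illustration.

```shell
#!/bin/sh
# Sketch of per-process benchmark isolation. BENCH_BIN and the
# benchmark names below are placeholders; substitute your real
# benchmark binary and pattern list.
BENCH_BIN="${BENCH_BIN:-echo}"   # placeholder runner for illustration

for b in ops/map/fromList ops/map/lookup; do
  # Each iteration starts a fresh process, so heap/GC state from one
  # benchmark cannot carry over into the next.
  "$BENCH_BIN" -m glob "$b"
done
```

The trade-off, as noted below, is that a fresh process per benchmark stabilizes results at the cost of measuring each function in an artificially clean heap.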
@harendra-kumar: could it be that while running the benchmarks, the CPU temperature increases, and when the CPU reaches a given temperature it is slowed down by the OS (or by the CPU itself)? Maybe monitoring the CPU temperature in parallel with the tests could help to see if that is the case. I once encountered this kind of behaviour in performance tests: the tests that ran first were faster just because the CPU was cooler at that moment!
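A rough way to check the thermal-throttling hypothesis on Linux, assuming the usual sysfs thermal zones are exposed (paths and availability vary by machine): sample the zones periodically alongside the benchmark run and see whether the temperature climbs as later benchmarks execute.

```shell
#!/bin/sh
# Sample every readable thermal zone once; invoke this in a loop while
# the benchmarks run to watch for a rising temperature trend.
count=0
for zone in /sys/class/thermal/thermal_zone*/temp; do
  if [ -r "$zone" ]; then
    # Values are reported in millidegrees Celsius on most kernels.
    echo "$zone: $(cat "$zone")"
    count=$((count + 1))
  fi
done
echo "sampled $count thermal zone(s)"
```

If no zones are readable (e.g. in a container), the script prints nothing but still exits cleanly, so it is safe to leave running in the background.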
The heap may be another confounding factor. I would imagine (completely untested) that running each benchmark in a separate process would perform something of a hard reset on the heap. While this would make results more stable when changing the set of benchmarks run, I'm not sure that it necessarily makes them more accurate. Real programs won't execute these functions in isolation, so why should the benchmark? I don't have a particular solution, but thought I'd provide my two cents while this is all fresh in my head.
Indeed. It's worth noting that |
I have the following benchmarks in a group:
The last two benchmarks take significantly more time when I run all these benchmarks in one go using
stack bench --benchmark-arguments "-m glob ops/map/*"
However, when I run individual benchmarks the results are different:
To reproduce the issue just run those commands in this repo.
I cannot figure out what the problem is here. I tried using `env` to run the benchmarks, and putting a `threadDelay` of a few seconds and a `performGC` in them, but nothing helped.
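One way to probe the garbage-collection angle is to pass GHC's RTS statistics flag through to the benchmark binary, so each run prints its own GC and heap numbers. This is a sketch: it assumes the benchmark suite is compiled with `-rtsopts`, and reuses the glob pattern from the report; the invocation is only executed if `stack` is actually available.

```shell
#!/bin/sh
# Sketch: print GC/heap statistics per benchmark run via GHC's +RTS -s.
# Assumes the benchmark executable was built with -rtsopts.
cmd='stack bench --benchmark-arguments "-m glob ops/map/* +RTS -s"'
echo "$cmd"
# Only run when a stack toolchain (and a project) is actually present.
if command -v stack >/dev/null 2>&1; then
  eval "$cmd"
fi
```

Comparing the `-s` output of a grouped run against single-benchmark runs would show whether the later benchmarks really inherit a larger live heap.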
I am now resorting to always running each benchmark individually in a separate process. Maybe we can have support for running each benchmark in a separate process in criterion itself to guarantee isolation of benchmarks, as I have seen this sort of problem too often. Now I am always skeptical of the results produced by criterion.