
Add a standardized, consistent benchmark for hardware performance testing. #5354

Closed
wants to merge 1 commit

Conversation

@Sopel97 (Member) commented Jun 4, 2024

This is an idea I need more feedback on. Recently, while testing the new NUMA awareness code, we've discovered multiple issues with the current common ways to test performance.

So far, everyone relies on either search from startpos or bench for performance testing. Both are flawed in one way or another. Searching from startpos is not representative of the overall average performance in games, and bench is tuned for single-threaded execution; at high thread counts its variance reaches ±15%.

The idea is to have a single simple command that by default (with no specific arguments) tests the maximum performance attainable on the target machine in common workloads. The purpose would be presence on popular benchmark sites, like ipmanchess and openbenchmarking, and potentially more in the future.

Replacing bench is not desirable, as it serves its purpose well, so a new command would be introduced. The current working name is benchmark.


Operation outline:

The benchmark has the same operating principle as the current bench: it executes go commands on a preselected set of positions.

Position selection:

Settings selection:

  • 8GB of hash, considering how cheap RAM is and how important it is for longer analysis. This is a minimum that should be satisfied by all reasonable hardware while being high enough to cause some realistic TLB pressure.
  • get_hardware_concurrency() threads, so no performance is left behind. If running with fewer threads is faster, we consider that a hardware configuration issue.
  • a fixed movetime per position, to minimize the effects of nondeterministic multithreaded search as much as possible. Selected as 1000 ms, so the whole run takes a bit less than 5 minutes in total. (A minimal sketch of these defaults follows this list.)
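
A minimal sketch of how those defaults could be resolved, for illustration only (the struct and function names below are made up, and std::thread::hardware_concurrency() stands in for the get_hardware_concurrency() mentioned above):

```cpp
#include <cstddef>
#include <thread>

// Hypothetical container for the proposed defaults (not from the PR).
struct BenchmarkSettings {
    std::size_t threads;     // all logical processors
    std::size_t hashMiB;     // 8 GB of transposition table
    int         movetimeMs;  // fixed time per position
};

BenchmarkSettings default_benchmark_settings() {
    std::size_t threads = std::thread::hardware_concurrency();
    if (threads == 0)  // hardware_concurrency() may return 0 when it cannot be determined
        threads = 1;
    return {threads, 8 * 1024, 1000};
}
```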

Other considerations:

  • Positions from a single game are sent sequentially; ucinewgame is sent before every game
  • Ideally, output is suppressed to minimize the impact of abnormal amounts of IO on performance (currently not implemented)
  • Settings can be overridden to allow more in-depth testing for advanced users, but the defaults should stay good and widely used
  • Only the execution time from go to the end of the search is measured
  • Potentially add a warmup run of a few seconds (a few positions)
  • While this will be usable for testing performance improvements within Stockfish, it is primarily intended as a hardware benchmark (a rough sketch of the measurement loop follows this list)
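
To make the bullets above concrete, here is a rough sketch of the measurement loop with stubbed-out engine calls; the helper names are invented for illustration and are not the PR's implementation. One ucinewgame per game, positions sent in game order, and only the go-to-search-end interval is accumulated.

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Game { std::vector<std::string> fens; };

// Stubs standing in for the real engine interface.
void          send_ucinewgame() {}
std::uint64_t search_for(const std::string&, int /*movetimeMs*/) { return 1000000; }

int main() {
    std::vector<Game> games = {{{"fen1", "fen2"}}, {{"fen3"}}};  // placeholder positions
    const int movetimeMs = 1000;

    std::uint64_t                 nodes = 0;
    std::chrono::duration<double> searchTime{0};

    for (const Game& game : games) {
        send_ucinewgame();  // fresh game state before every game
        for (const std::string& fen : game.fens) {
            auto t0 = std::chrono::steady_clock::now();            // timing starts at "go"
            nodes += search_for(fen, movetimeMs);
            searchTime += std::chrono::steady_clock::now() - t0;   // ...and stops at search end
        }
    }
    std::cout << "nodes " << nodes << ", search time " << searchTime.count() << " s\n";
    // nps would then be nodes / searchTime.count()
}
```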

I need some feedback on this direction, whether it's desired, and if so whether the implementation is in the desired shape, before further testing and tuning.

The choice of positions may need to change; we need to find a set of 5, or at most 6, games that produce minimal variance across runs while providing good coverage of positions. It is also important to avoid positions that lead to search explosions that take a long time to resolve, reach near-cyclic (fortress) setups, or hit maximum depth and terminate early. The current set of positions is preliminary and remains untested.

github-actions bot commented Jun 4, 2024

clang-format 18 needs to be run on this PR.
If you do not have clang-format installed, the maintainer will run it when merging.
For the exact version please see https://packages.ubuntu.com/noble/clang-format-18.

(execution 11084913654 / attempt 1)

@dubslow (Contributor) commented Jun 4, 2024

Overall I'm in favor.

Quibbles:

  1. bench is already short for "benchmark" innit?
  2. not so sure about taking multiple positions from the same game, tho i suppose it's good to also include tt-reuse in the test
  3. 8GB is probably too large a default; people will wind up wanting to run this on laptops or phones or Raspberry Pis no matter how wise or useful that isn't. 1-2 GB would be a much less problematic choice of default hash (even 1 GB is way more than the 16 MB bench default)
  4. search explosions can happen anywhere at any time AFAICT, not sure how to minimize/dodge that issue
  5. warmup is probably necessary; I'm no expert, but I know the speedup testing scripts do that, and with all the frequency boosting these days it's almost certain to improve the reliability of the result on most consumer hardware

Either way I think this will be a big benefit to have

@R-Goc (Contributor) commented Jun 4, 2024

I can see LP cores on new Intel laptop chips being an issue, but it will need some testing.
A reasonable amount of RAM isn't that obvious, as @dubslow mentioned; the solution could be some profiles, because I don't see a single number that fits everyone. 8GB is not much on newer machines, but too much for certain devices.
I do agree that a warmup is needed.

@Disservin (Member) left a comment

Nice idea, but I don't have an opinion yet about the overall concept of a common benchmark. In the past we always said that a user should configure bench correctly.. so what do we actually get from this?

Already possible:

  • threads can be set for bench
  • movetime can be configured
  • hash can be configured
  • custom positions can be configured as well

New:

  • Sends ucinewgame before every game
  • ?

Advertising this as a hardware benchmark? I'm not sure...

  • Regarding the naming, I'm unsure; I think it'd be a bit confusing to have bench and benchmark..
  • Regarding the implementation (can be ignored for now to avoid unnecessary effort), I think it makes more sense to have this completely decoupled from uci.. meaning a new wrapper for the sole purpose of running this benchmark, which has its entrypoint in main.cpp and exits right after.
    Reason: I'd like to avoid leaking settings which were set for the benchmark to the engine, or at least make it only invocable from the cli.

@dubslow (Contributor) commented Jun 4, 2024

in the past we always said that a user should configure bench correctly

The biggest issue with this is the gross lack of suitable "realistic play" conditions in regular bench, namely FENs. It's highly nontrivial for the average user to come up with a set of FENs vaguely resembling real gameplay.

Also, too much configuration = users not able to consistently hit the target -- no useful "standard". If we added the positions in this PR as, say, a realistic book option in bench, then it could be configured as you say for the purpose; however, setting 5 arguments every time you want to run a standard benchmark is a great way to ensure that we don't have a "standard" at all. Users will mess up the arguments and post slightly different tests. If we want a standard "realistic benchmark", it needs to have a sane default so that everyone can trivially refer to the same default for comparisons.

@R-Goc (Contributor) commented Jun 4, 2024

Stockfish is already being used as a hardware benchmark. A standardized way to do it would be nice. However, it doesn't have to be this. A note in the docs about a suggested way to benchmark would do.
If anything, the biggest difference would probably be the lower variance on multi-threaded runs, which, as you said, is currently achievable with bench, but it is so annoying to set up, unstandardized, and in need of testing positions that, as Sopel said, no one does it.

@vondele (Member) commented Jun 4, 2024

Some ideas to be considered

  • naming could be performance (our current bench is also about correctness).
  • For memory, I would rather have some memory proportional to the number of threads e.g. 256MB or 512 MB per thread (picked so we have about 30% hashfull during the game)
  • should the positions be sent as in games (i.e. initial pos plus a sequence of moves) ? (Not sure about this one, bloats things).
  • giving every position the same time is not very representative of game play; in the endgame we clearly have less time per position.

basically, you want to extract/replay the uci commands from a game played (debug log) ?

@Sopel97 (Member, Author) commented Jun 4, 2024

bench is already short for "benchmark" innit?

well, yes, though I'd argue misleadingly

not so sure about taking multiple positions from the same game, tho i suppose it's good to also include tt-reuse in the test

yea, on one hand it's less diversity, on the other it's more realistic. Hard choice under time constraints.

8GB is probably too large a default [...]

I don't consider such hardware relevant to be honest, doesn't matter much if it gets 100k nps or 1M.

In the past we always said that a user should configure bench correctly..

and how did this work out? consider this more about standardization than functionality

Regarding the implementation (can be ignored for now to avoid unnecessary effort), I think it makes more sense to have this completely decoupled from uci..

should be done for bench too, maybe in a subsequent PR

naming could be performance

solid consideration

For memory, I would rather have some memory proportional to the number of threads e.g. 256MB or 512 MB per thread (picked so we have about 30% hashfull during the game)

alright, maybe would be better indeed, will have to see how large it would have to be on typical hardware

should the positions be sent as in games (i.e. initial pos plus a sequence of moves) ? (Not sure about this one, bloats things).

I think the only difference is in the frontend? wouldn't change anything measurable afaik

giving every position the same time is not very representative of game play; in the endgame we clearly have less time per position.

right, possibly explore some time curve

basically, you want to extract/replay the uci commands from a game played (debug log) ?

would be very close, yea, but not sure if exactly matching them is desirable

@dubslow (Contributor) commented Jun 4, 2024

I don't consider such hardware relevant to be honest, doesn't matter much if it gets 100k nps or 1M.

What you consider relevant has nothing to do with the fact that users will run this on such low hardware, and if they have bad experiences with default settings on crap hardware, that still affects their overall opinion of Stockfish. It's something we need to be careful about, no matter how much we don't actually care about such hardware.

That said, a per-thread default is likely a big improvement, probably the best way to do it.

and how did this work out? consider this more about standardization than functionality

exactly, exactly.

naming could be performance

speedtest occurred to me, but I like performance too. (And there's something to be said even for renaming what is presently bench, since it is indeed not much of a benchmark in the traditional sense anyway. But that's probably out of scope and way, way harder to do, considering how much other infrastructure we have based on that name.)

should the positions be sent as in games (i.e. initial pos plus a sequence of moves) ? (Not sure about this one, bloats things).

In my experience ("manually kibitzing" TCEC games), sending just FENs is perfectly fine. Stockfish still reuses the TT with no issue whether or not it has move history.

giving every position the same time is not very representative of game play; in the endgame we clearly have less time per position.

good point, but probably not needed for a first version of this thing. The two major concerns, at present, are 1) standardizing a performance test, and 2) getting a game-realistic spread of positions. The improvements from these two things are much greater than the improvement of adding game-realistic thinking time. This is ultimately a speed-test, not an elo-test, and we can get reliable speed measurements without worrying about thinking time. (Note "reliable" not "accurate", just needs to be comparable rather than having a non-arbitrary meaning)

@Sopel97 (Member, Author) commented Aug 24, 2024

Alright, let's revive this, because otherwise I'm just gonna forget about it.

More diagnostics have been added, so the result looks like this:

Version                    : Stockfish dev-20240824-nogit
Compiled by                : g++ (GNUC) 13.2.0 on MinGW64
Compilation architecture   : x86-64-vnni256
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 13.2.0
Large pages                : no
Thread count               : 16
TT size [MiB]              : 4096
Nodes/second               : 15433372

No easy way to get the system configuration so that's left out.

Renamed to perf, which I think is a good compromise and a common abbreviation.

Silenced all engine output. On a fresh Windows console this made close to a 10% performance difference. With how the buffering is done, this is potentially all variance depending on the setup, which we want to minimize.

Added a small warmup step, currently the first 3 positions.

Split setting the options from running the bench, mostly because the warmup was interfering with the logic here.

Did a rough regression on some fishtest LTC games to figure out a good time curve. Ended up with this:

            // time per move is fit roughly based on LTC games
            // seconds = 50/{ply+15}
            // ms = 50000/{ply+15}
            // with this fit 10th move gets 2000ms
            // adjust for desired 10th move time

This is open to objection. I agree with some earlier comments that it's somewhat desirable. I don't think it matters much which exact curve we use, as long as early and middlegame positions get somewhat higher weight.

The movetimes end up looking like this

1 2343
2 2205
3 2083
4 1973
5 1875
6 1785
7 1704
8 1630
9 1562
10 1500
11 1442
12 1388
13 1339
14 1293
15 1250
16 1209
17 1171
18 1136
19 1102
20 1071
21 1041
22 1013
23 986
24 961
25 937
26 914
27 892
28 872
29 852
30 833
31 815
32 797
33 781
34 765
35 750
36 735
37 721
38 707
39 694
40 681
41 669
42 657
43 646
44 635
45 625
46 614
47 604
48 595
49 585
50 576
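
For reference, a small snippet that reproduces the table above, assuming the quoted fit ms = 50000/(ply+15) scaled so that the 10th move gets 1500 ms instead of the base fit's 2000 ms (i.e. a factor of 0.75):

```cpp
#include <iostream>

int main() {
    const double desiredMs10 = 1500.0;                // desired movetime at move 10
    const double scale       = desiredMs10 / 2000.0;  // the base fit gives 2000 ms at move 10
    for (int ply = 1; ply <= 50; ++ply) {
        int ms = static_cast<int>(scale * 50000.0 / (ply + 15));
        std::cout << ply << ' ' << ms << '\n';        // reproduces the movetimes listed above
    }
}
```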

I have not tested if it's more stable than bench yet. I'll do it once I have some idle time on my machine. Other tests are welcome too. I think the list of positions is final, unless something bad pops up.

@vondele (Member) commented Aug 31, 2024

I have played around with it, and most things look good to me (e.g. positions, scheme to select time etc). I still need to do some benchmarking on larger hardware, but it looks like it gives pretty stable nps.

A few comments for consideration.

  • is it possible not to pass around the bool silent, but instead configure the engine with an engine.set_on_verify_network()? I think that's a pretty flexible way. (A rough sketch of the idea follows this list.)
  • Echo input parameters to the output so we know how the performance test has been run.
  • Also print total runtime, total nodes to the output
  • If possible capture and show maximum hashfull during the run.
  • The fen output is not so useful, I think, since we get no other info associated. Maybe just output a . for each fen on one line (optionally with like 50 dots per line).
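
As referenced in the first item above, here is a rough sketch of the callback idea; the Engine class and the set_on_verify_network() shape below are hypothetical stand-ins, not the actual Stockfish interface.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <utility>

class Engine {
   public:
    using InfoSink = std::function<void(const std::string&)>;

    // Hypothetical setter mirroring the suggested engine.set_on_verify_network().
    void set_on_verify_network(InfoSink sink) { onVerifyNetwork = std::move(sink); }

    void verify_network() {
        // ... the actual network verification would happen here ...
        if (onVerifyNetwork)
            onVerifyNetwork("info string network verified");
    }

   private:
    // Default sink prints to stdout, as the engine normally would.
    InfoSink onVerifyNetwork = [](const std::string& msg) { std::cout << msg << '\n'; };
};

int main() {
    Engine engine;
    // The speedtest driver installs a no-op sink instead of passing a `bool silent` around.
    engine.set_on_verify_network([](const std::string&) {});
    engine.verify_network();  // produces no output
}
```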

The current runtime is ~5min, which is a bit long. I suggest default runtime and hash are reduced 2x. (sites like openbenchmarking run at least 3 times to measure the variance, with additional rounds until the average is well converged).

For the user input right now it is
./stockfish perf ttSizeTotal threads desiredMsPerMoveAtMove10
I think it would be better to change this to
./stockfish perf totalRuntime threads TTperThread

Finally, I personally don't like the perf abbreviation so much. We already have perft in use, and there is the Linux perf tool. I would prefer the full performance, but I won't argue if the current name is important to you.

@vondele (Member) commented Aug 31, 2024

Some benchmark results on 288 threads:

out.perf.1:Nodes/second               : 259902974
out.perf.2:Nodes/second               : 258764326
out.perf.3:Nodes/second               : 258912568
out.perf.4:Nodes/second               : 261002145
out.perf.5:Nodes/second               : 257904984
out.perf.6:Nodes/second               : 257532292
out.perf.7:Nodes/second               : 260163572
out.perf.8:Nodes/second               : 257981736
out.perf.9:Nodes/second               : 258291383

< 1% variance, which I think is very good for a threaded run.

@Sopel97 (Member, Author) commented Sep 3, 2024

Alright, I'll rename it to performance, I don't care too much about the exact name.

is it possible not to pass around the bool silent, but instead configure the engine with an engine.set_on_verify_network()? I think that's a pretty flexible way.

yea, I think that's a nice solution, slightly outside of this PR but probably best to do it now

Echo input parameters to the output so we know how the performance test has been run.

good idea

Also print total runtime, total nodes to the output

I thought that this isn't really important information, but I guess it doesn't hurt to include it, if only as a sanity check.

If possible capture and show maximum hashfull during the run.

The current hashfull calculation is not very useful, because it completely ignores prior searches. We would have to use a custom implementation to get a more meaningful value (ignore generation). It's not ideal, because there are some positions that are no longer reachable, but better than getting 3-5% hashfull just because the current search is small. Would it be reasonable?

The fen output is not so useful, I think, since we get no other info associated. Maybe just output a . for each fen on one line (optionally with like 50 dots per line).

Would \r be a problem? I want to keep the current % progress visible.
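
For illustration, a tiny sketch of the \r approach (not the PR's code; the position count is a placeholder): the progress line is rewritten in place instead of printing one line per position.

```cpp
#include <iostream>

int main() {
    const int total = 258;  // placeholder position count
    for (int i = 1; i <= total; ++i) {
        // \r returns the cursor to the start of the line, so the progress overwrites itself
        std::cout << "\rPosition " << i << '/' << total << std::flush;
        // ... the search for position i would run here ...
    }
    std::cout << '\n';
}
```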

The current runtime is ~5min, which is a bit long. I suggest default runtime and hash are reduced 2x. (sites like openbenchmarking run at least 3 times to measure the variance, with additional rounds until the average is well converged).

alright, should still be fairly reasonable, the last searches would be around 250ms.


is this being considered for sf17? provided I finish it in a timely manner

@vondele (Member) commented Sep 3, 2024

If possible capture and show maximum hashfull during the run.

The current hashfull calculation is not very useful, because it completely ignores prior searches. We would have to use a custom implementation to get a more meaningful value (ignore generation). It's not ideal, because there are some positions that are no longer reachable, but better than getting 3-5% hashfull just because the current search is small. Would it be reasonable?

I don't think we should do something special, but aren't we sending essentially the same uci commands as we would do during a game?

The fen output is not so useful, I think, since we get no other info associated. Maybe just output a . for each fen on one line (optionally with like 50 dots per line).

Would \r be a problem? I want to keep the current % progress visible.

I don't think so, to be tried.

is this being considered for sf17? provided I finish it in a timely manner

While it would have been nice, I'd rather take the conservative approach, and merge it as one of the first things of SF17dev.

@Sopel97 (Member, Author) commented Sep 3, 2024

I don't think we should do something special, but aren't we sending essentially the same uci commands as we would do during a game?

yes, and the hashfull has the same meaning, so imo meaningless. It's fine for single-position analysis, but as soon as you start playing games it's worthless. But sure, can include.

@vondele (Member) commented Sep 3, 2024

I don't think we should do something special, but aren't we sending essentially the same uci commands as we would do during a game?

yes, and the hashfull has the same meaning, so imo meaningless. It's fine for single-position analysis, but as soon as you start playing games it's worthless. But sure, can include.

yeah, but that still means it is fine to compare to https://github.com/official-stockfish/Stockfish/wiki/Useful-data#elo-cost-of-small-hash which was captured during game play.

@vondele (Member) commented Sep 23, 2024

I'd be happy to see this merged sooner rather than later; it would be great if you could implement the last missing pieces.

@robertnurnberg (Contributor) commented

Just to recall here that we discussed on Discord a possible alternative name for this: speedtest.

@R-Goc (Contributor) commented Sep 23, 2024

Personally I don't think speedtest is a good name. Speedtest might imply that what is being tested is the speed of Stockfish itself, while benchmark is a standard name that means testing the performance of the hardware. As benchmark is already used by bench, perf seems more standard to me. It is also shorter.

@Sopel97 (Member, Author) commented Sep 23, 2024

Ok, I think I addressed everything now, leaving the commits for clarity about what changed since then. I will squash for the final merge.

Output looks like this now

C:\dev\stockfish-master\src>stockfish.exe speedtest 4
Stockfish dev-20240923-nogit by the Stockfish developers (see AUTHORS file)
info string Using 4 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20240923-nogit
Compiled by                : g++ (GNUC) 13.2.0 on MinGW64
Compilation architecture   : x86-64-vnni256
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 13.2.0
Large pages                : yes
User invocation            : speedtest 4
Filled invocation          : speedtest 4 512 750
Thread count               : 4
TT size [MiB]              : 512
Nodes/second               : 4576363
Total nodes searched       : 641372719
Total search time [s]      : 140.149
Hash max, avg [per mille]  : 57, 30
Hash max, avg at age <= 2  : 155, 87
Hash max, avg at age <= 4  : 241, 141
Hash max, avg at age <= 8  : 379, 239
Hash max, avg at age <= 16 : 595, 399
Hash max, avg at age <= 32 : 792, 575
Total search time [s]      : 140.149

I felt compelled to include more detailed hashfull information, since the basic hashfull doesn't paint the whole picture for gameplay.

@Torom (Contributor) commented Sep 24, 2024

Is it intended to have Total search time [s] twice in the output?

@Torom (Contributor) commented Sep 24, 2024

I have tested the speedtest on my desktop and Raspberry Pi 5. LGTM

Intel Core i7-6700K CPU @ 4.00GHz

$ ./stockfish.betterbench speedtest 8
Stockfish dev-20240923-73356c5d by the Stockfish developers (see AUTHORS file)
info string Using 8 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20240923-73356c5d
Compiled by                : clang++ 18.1.8 on Linux
Compilation architecture   : x86-64-bmi2
Compilation settings       : 64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : Clang 18.1.8
Large pages                : yes
User invocation            : speedtest 8
Filled invocation          : speedtest 8 1024 750
Thread count               : 8
TT size [MiB]              : 1024
Nodes/second               : 3548381
Total nodes searched       : 497461801
Total search time [s]      : 140.194
Hash max, avg [per mille]  : 26, 12
Hash max, avg at age <= 2  : 72, 36
Hash max, avg at age <= 4  : 105, 58
Hash max, avg at age <= 8  : 178, 99
Hash max, avg at age <= 16 : 295, 170
Hash max, avg at age <= 32 : 391, 257
Total search time [s]      : 140.194

Raspberry Pi 5

$ ./stockfish.betterbench speedtest 4
Stockfish dev-20240923-73356c5d by the Stockfish developers (see AUTHORS file)
info string Using 4 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20240923-73356c5d
Compiled by                : clang++ 18.1.3 on Linux
Compilation architecture   : armv8-dotprod
Compilation settings       : 64bit POPCNT NEON_DOTPROD
Compiler __VERSION__ macro : Ubuntu Clang 18.1.3 (1ubuntu1)
Large pages                : yes
User invocation            : speedtest 4
Filled invocation          : speedtest 4 512 750
Thread count               : 4
TT size [MiB]              : 512
Nodes/second               : 338769
Total nodes searched       : 47814564
Total search time [s]      : 141.142
Hash max, avg [per mille]  : 8, 2
Hash max, avg at age <= 2  : 18, 7
Hash max, avg at age <= 4  : 28, 12
Hash max, avg at age <= 8  : 45, 21
Hash max, avg at age <= 16 : 76, 37
Hash max, avg at age <= 32 : 96, 57
Total search time [s]      : 141.142

@Sopel97 (Member, Author) commented Sep 24, 2024

Is it intended to have Total search time [s] twice in the output?

good catch

@R-Goc (Contributor) commented Sep 24, 2024

What is the default thread count? Maybe for this it would make sense to default to all threads?

@Sopel97 (Member, Author) commented Sep 24, 2024

What is the default thread count? Maybe for this it would make sense to default to all threads?

it does default to all threads


3 runs on an almost idle 7800X3D:

C:\dev\stockfish-master\src>stockfish.exe speedtest
Stockfish dev-20240923-nogit by the Stockfish developers (see AUTHORS file)
info string Using 16 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20240923-nogit
Compiled by                : g++ (GNUC) 13.2.0 on MinGW64
Compilation architecture   : x86-64-vnni256
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 13.2.0
Large pages                : yes
User invocation            : speedtest
Filled invocation          : speedtest 16 2048 750
Thread count               : 16
TT size [MiB]              : 2048
Nodes/second               : 15630014
Total nodes searched       : 2189686830
Total search time [s]      : 140.095
Hash max, avg [per mille]  : 41, 22
Hash max, avg at age <= 2  : 111, 65
Hash max, avg at age <= 4  : 169, 106
Hash max, avg at age <= 8  : 277, 180
Hash max, avg at age <= 16 : 473, 305
Hash max, avg at age <= 32 : 662, 450
Total search time [s]      : 140.095

C:\dev\stockfish-master\src>stockfish.exe speedtest
Stockfish dev-20240923-nogit by the Stockfish developers (see AUTHORS file)
info string Using 16 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20240923-nogit
Compiled by                : g++ (GNUC) 13.2.0 on MinGW64
Compilation architecture   : x86-64-vnni256
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 13.2.0
Large pages                : yes
User invocation            : speedtest
Filled invocation          : speedtest 16 2048 750
Thread count               : 16
TT size [MiB]              : 2048
Nodes/second               : 15643845
Total nodes searched       : 2192140797
Total search time [s]      : 140.128
Hash max, avg [per mille]  : 41, 22
Hash max, avg at age <= 2  : 115, 63
Hash max, avg at age <= 4  : 172, 102
Hash max, avg at age <= 8  : 294, 174
Hash max, avg at age <= 16 : 476, 293
Hash max, avg at age <= 32 : 623, 433
Total search time [s]      : 140.128

C:\dev\stockfish-master\src>stockfish.exe speedtest
Stockfish dev-20240923-nogit by the Stockfish developers (see AUTHORS file)
info string Using 16 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20240923-nogit
Compiled by                : g++ (GNUC) 13.2.0 on MinGW64
Compilation architecture   : x86-64-vnni256
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 13.2.0
Large pages                : yes
User invocation            : speedtest
Filled invocation          : speedtest 16 2048 750
Thread count               : 16
TT size [MiB]              : 2048
Nodes/second               : 15650008
Total nodes searched       : 2192597427
Total search time [s]      : 140.102
Hash max, avg [per mille]  : 41, 22
Hash max, avg at age <= 2  : 109, 65
Hash max, avg at age <= 4  : 173, 105
Hash max, avg at age <= 8  : 287, 179
Hash max, avg at age <= 16 : 481, 303
Hash max, avg at age <= 32 : 649, 445
Total search time [s]      : 140.102

NPS stddev: 8360 (0.05%)
Nodes stddev: 1278108 (0.05%)
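
As a sanity check, the quoted spread can be reproduced from the three NPS readings above, assuming a population standard deviation (a throwaway computation, not part of the PR):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double nps[] = {15630014.0, 15643845.0, 15650008.0};  // the three runs above
    double mean = 0.0, var = 0.0;
    for (double x : nps) mean += x;
    mean /= 3.0;
    for (double x : nps) var += (x - mean) * (x - mean);
    var /= 3.0;  // population variance over the three runs
    std::printf("NPS stddev: %.0f (%.2f%%)\n", std::sqrt(var), 100.0 * std::sqrt(var) / mean);
    // prints approximately: NPS stddev: 8360 (0.05%)
}
```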

@vondele (Member) commented Sep 28, 2024

Some final things from my point of view; if these are fixed, please undraft and it can be merged.

  • I'm not so sure the additional output on hashfull is really useful, the age of the entries is really 'specialist knowledge'... even I don't actually know what to do with this output. So, I would strongly suggest removing it. If you want to keep it nevertheless, can you merge the two versions of 'hashfull' into one function, and is it possible to somehow use the 'relative_age()' function?
  • MS_PER_MOVE_AT_MOVE_10 .. please use the total runtime of the test instead for user input, it can be a simple scaling based on one measurement

@noobpwnftw (Contributor) commented

Thread count as in threads used or available?

@Sopel97 (Member, Author) commented Sep 28, 2024

I'm not so sure the additional output on hashfull is really useful, the age of the entries is really 'specialist knowledge'... even I don't actually know what to do with this output.

I understand that it may be a little too much. Would it perhaps make more sense to just display hashfull as commonly understood, along with total TT occupation (i.e. the number of entries touched during the search)? Otherwise I'll remove that output.

and is it possible to somehow use the 'relative_age()' function ?

that's what I wanted to use initially but it doesn't output a normalized value

MS_PER_MOVE_AT_MOVE_10 .. please use the total runtime of the test instead for user input, it can be a simple scaling based on one measurement

makes sense

Thread count as in threads used or available?

good point. it may be best to actually use the numa output we get normally for more info

@vondele (Member) commented Sep 28, 2024

I'm not so sure the additional output on hashfull is really useful, the age of the entries is really 'specialist knowledge'... even I don't actually know what to do with this output.

I understand that it may be a little too much. Would it perhaps make more sense to just display hashfull as commonly understood, along with total TT occupation (i.e. the number of entries touched during the search)? Otherwise I'll remove that output.

I would just provide the output that our standard hashfull provides, or none at all. Also, for the user experience, I think having nodes/second as the last output line makes sense; it is, in a sense, the final result of the run.

Thread count as in threads used or available?

good point. it may be best to actually use the numa output we get normally for more info

The usual NUMA output is good.

@Sopel97 (Member, Author) commented Sep 28, 2024

Alright, current candidate

C:\dev\stockfish-master\src>stockfish.exe speedtest 4 64 10
Stockfish dev-20240928-nogit by the Stockfish developers (see AUTHORS file)
info string Using 4 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20240928-nogit
Compiled by                : g++ (GNUC) 13.2.0 on MinGW64
Compilation architecture   : x86-64-vnni256
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 13.2.0
Large pages                : yes
User invocation            : speedtest 4 64 10
Filled invocation          : speedtest 4 64 10
Available processors       : 0-15
Thread count               : 4
Thread binding             : none
TT size [MiB]              : 64
Hash max, avg [per mille]  :
    single search          : 42, 22
    single game            : 674, 446
Total nodes searched       : 55849928
Total search time [s]      : 10.129
Nodes/second               : 5513863

If the "single game" hash info is undesirable even in this form, then I'll remove it, and I think we'll be at the final version.

@vondele (Member) commented Sep 28, 2024

Thanks, looks good to me. Please squash into one commit with a good commit message and undraft.

@Sopel97 Sopel97 marked this pull request as ready for review September 28, 2024 15:09
@Sopel97 Sopel97 changed the title from "[RFC] Add a standardized, consistent benchmark for hardware performance testing." to "Add a standardized, consistent benchmark for hardware performance testing." Sep 28, 2024
@vondele vondele added the "to be merged" (Will be merged shortly) label Sep 28, 2024
vondele pushed a commit to vondele/Stockfish that referenced this pull request Sep 28, 2024
`speedtest [threads] [hash_MiB] [time_s]`. `threads` default to system concurrency. `hash_MiB` defaults to `threads*128`. `time_s` defaults to 150.

Intended to be used with default parameters, as a stable hardware benchmark.

Example:
```
C:\dev\stockfish-master\src>stockfish.exe speedtest
Stockfish dev-20240928-nogit by the Stockfish developers (see AUTHORS file)
info string Using 16 threads
Warmup position 3/3
Position 258/258
===========================
Version                    : Stockfish dev-20240928-nogit
Compiled by                : g++ (GNUC) 13.2.0 on MinGW64
Compilation architecture   : x86-64-vnni256
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 13.2.0
Large pages                : yes
User invocation            : speedtest
Filled invocation          : speedtest 16 2048 150
Available processors       : 0-15
Thread count               : 16
Thread binding             : none
TT size [MiB]              : 2048
Hash max, avg [per mille]  :
    single search          : 40, 21
    single game            : 631, 428
Total nodes searched       : 2099917842
Total search time [s]      : 153.937
Nodes/second               : 13641410
```

-------------------------------

Small unrelated tweaks:
 - Network verification output is now handled as a callback.
 - TT hashfull queries allow specifying maximum entry age.

closes official-stockfish#5354

No functional change.
@vondele vondele closed this in 3ac75cd Sep 28, 2024