
Tune PUCT parameter for chess. #435

Merged
merged 1 commit into glinscott:next on Apr 29, 2018

Conversation

@gcp (Contributor) commented Apr 25, 2018

CLOP tuning indicates the optimal values are around 0.4 - 0.9, with an
estimated maximum around 0.6, when measured in games with 2000-5000
playouts per move.

Verifying at faster time controls (600-1600 playouts) confirms this is a
strength gain:

Score of lczero-tuned vs lczero: 188 - 72 - 184 [0.631] 444
Elo difference: 92.93 +/- 24.96
SPRT: llr 2.97, lbound -2.94, ubound 2.94 - H1 was accepted

@jkiliani (Contributor) commented Apr 25, 2018

Well, this is definitely going to interact with the FPU method testing I'm doing in #364. Glad you have some robust data here, thank you for contributing! However, may I ask exactly which binary you used for these tuning tests? Is it the 0.7 master branch, which uses dynamic parent eval with the bugged virtual loss effect in it, or some other branch?

Also, can you please specify the network used for CLOP? My tests are rather conclusive that tuning with different nets produces (somewhat) different results.

Edit: We were planning on doing a thorough combined CLOP run together with FPU reduction, but it looks like simulating virtual loss actually works quite well for FPU reduction as well, so there has been no resolution to this yet.

@gcp (Contributor, Author) commented Apr 25, 2018

Testing was based on master. I did a diff with "next" and there are no relevant changes there.

Network was 182 | 33f9938a.

@gcp (Contributor, Author) commented Apr 25, 2018

I see what you mean:

   // Note: get_eval() here reflects virtual losses applied during descent.
   auto fpu_eval = (cfg_fpu_dynamic_eval ? get_eval(color) : net_eval) - fpu_reduction;

Which is of course broken.

I'll change my local builds to enforce single-threaded mode. But I assume that the current value (or the worth of having fpu_dynamic_eval!) was tuned incorporating this bug.

That would be especially true as virtual loss would still be applied in single-threaded mode.

@jkiliani (Contributor) commented Apr 25, 2018

Yes, that was introduced unintentionally, but when we compared it to "clean" dynamic parent eval, it looked like allowing the bug actually helped strength so we didn't fix it (yet).

I think the virtual losses act like a different method of FPU reduction, but this likely has problems at very low visit counts (and obviously may also have issues with multithreading). I like that its effect at the root is very small, while deeper in the tree it prunes more strongly.
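
To illustrate the mechanism (a minimal sketch, not the actual LZ source; names and the 0.5 default are illustrative):

   // A virtual loss counts a pending playout as a loss, temporarily pulling
   // the node's eval down until the real result is backed up.
   struct Node {
       double eval_sum = 0.0;      // accumulated eval from finished playouts
       int    visits = 0;
       int    virtual_losses = 0;  // playouts still in flight, scored as 0

       double get_eval() const {
           int n = visits + virtual_losses;
           return n > 0 ? eval_sum / n : 0.5;  // 0.5 = no information yet
       }
   };

   // With the bugged dynamic-parent FPU, unvisited children inherit this
   // depressed eval as their first-play urgency, i.e. an extra reduction.
   double first_play_urgency(const Node& parent, double fpu_reduction) {
       return parent.get_eval() - fpu_reduction;
   }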

@gcp (Contributor, Author) commented Apr 25, 2018

So if I have even more invasive changes to search, what should I use as the baseline, if not master/next?

@jkiliani (Contributor) commented:

I wish I had a good answer for you there, but unfortunately I don't. I'm trying to find a good and robust FPU method but am bottlenecked by a slow machine for testing. If you have a good idea what to do with FPU (and the capacity to test on multiple nets), that would be very welcome. Until then, I think we have to take your results as one measurement point simply showing that 0.6 works well with dynamic parent eval bugged by virtual loss.

My current impression is that the search parameter tuning is not extremely high on the todo list of the main project admins, which I understand considering how fast it is currently developing.

@gcp (Contributor, Author) commented Apr 25, 2018

Well, it was never high on mine either, because tuning (in Go) always indicated strength changes of only a few Elo.

I was quite astonished to get 100 Elo for free here.

Was this ever tuned, or just copied from upstream?

@jkiliani (Contributor) commented:

Technical question: could you give me a template of the script you use for CLOP? It's the version implemented in cutechess-cli, correct? We have a few volunteers who have better hardware and would be willing to run tuning in the future, and obviously CLOP would be better than the round-robin tournaments I've used so far.

@jkiliani (Contributor) commented:

Oh, and by the way: did you use a temperature decay schedule in those tuning tests, or Dirichlet noise, or just an external opening book?

@gcp (Contributor, Author) commented Apr 25, 2018

I used an external opening book. A small book, with only openings that occurred a lot in GM play.

@jkiliani (Contributor) commented:

Should have no issue with game variety then. For match games we switched to a fractional temperature decay schedule, since it turned out that the variety from only Dirichlet noise became insufficient to avoid duplicate games: #267

Go doesn't have this problem due to the much larger search space, and the symmetries in NN evals.

@gcp (Contributor, Author) commented Apr 25, 2018

I used this one: https://github.com/cutechess/cutechess/blob/master/tools/clop-cutechess-cli.py

For lczero you need to explicitly set the working dir, and unfortunately fixed visits is broken in master (this is fixed in next, it seems), which is why I used time controls.

@Tilps (Contributor) commented Apr 25, 2018

Odd, I thought the fix for 'go nodes' got merged forward to master.

@gcp (Contributor, Author) commented Apr 25, 2018

'go nodes' works, but -v combined with a specified time control (which cutechess requires) works the wrong way around.

@Tilps (Contributor) commented Apr 25, 2018

I'm surprised that works on next at all; -v is pretty much supposed to be ignored unless go is given without any qualification. But anyway, we're drifting off topic, I guess.

The methodology seems pretty good, and I'm not too surprised that PUCT would be worth retuning after the 'virtual loss bug' feature appeared. But it's probably best to do a custom build and share it with some of the people who run multi-engine gauntlets, to see how it fares outside of self-play. (Because, as you say, 100 Elo is kind of surprising.)

@gcp (Contributor, Author) commented Apr 25, 2018

Puct is configurable via UCI, no custom build required.
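
For reference, a hypothetical cutechess-cli match overriding it could look like this (engine paths, book file, time control, and round count are placeholders; this assumes the UCI option is named Puct, as in the test names below):

   cutechess-cli \
     -engine cmd=lczero dir=/path/to/lczero name=lczero-tuned option.Puct=0.6 proto=uci \
     -engine cmd=lczero dir=/path/to/lczero name=lczero proto=uci \
     -each tc=40/60 -openings file=book.pgn -rounds 200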

@Tilps (Contributor) commented Apr 25, 2018

right, of course... my bad!

@killerducky changed the base branch from master to next on April 25, 2018 13:11
@killerducky (Collaborator) commented:

@gcp - I think testing against next is best, similar to LZGo, right? The plan is for master to get only fixes (like "MSVC doesn't compile") plus changes to the server code (by mistake I pulled in a few commits outside this definition). As you said, right now next and master do not have any changes that impact this PR.

Let's ask a few of the other testers to use this PUCT value in their gauntlet testing.

@jkiliani (Contributor) commented Apr 25, 2018

@killerducky Are you planning to merge this? I'm preparing a new PR right now to restrict the parameter space, and enable a combined puct/FPU tuning with only one FPU parameter. I'm thinking of switching to simulated VL for this and actually removing "normal" FPU reduction so the optimisation problem doesn't become unmanageable, as @gcp rightly remarked in #364.

While I'm impressed with the 100 Elo gain, I think testing on a few more nets would have helped, and we should really do the puct and FPU reduction tuning at the same time.

Edit @killerducky Would you have a problem with me removing the static parent eval for FPU? I think tests showed dynamic eval behaving better as base FPU before reduction, and we don't really need more free parameters to tune if puct needs tuning.

@killerducky (Collaborator) commented:

We can take some more testing time, but this is the best thing we have so far, so you need to at least get close to it.

@jkiliani (Contributor) commented Apr 25, 2018

I agree with you that it sounds good, but the method (only self-play with an external book and no temperature, on only one net (which one?)) is somewhat problematic in my opinion. I had tests as well that showed similar improvement margins for one net; they just didn't hold up to closer scrutiny.

@jjoshua2 (Contributor) commented Apr 26, 2018

All at 40/1 TC, all independent games, from 3 tests (columns: Elo, error, games, score%, draws%):

0 Laser-1_0 4CPU                    11      50     132   51.5%   28.8%
1 lczero v7 id185 fpu 0.1           16      75      66   52.3%   22.7%
2 lczero v7 id185 puct 0.7         -37      69      66   44.7%   34.8%

0 Laser-1_0 4CPU                   -44     130      24   43.8%   20.8%
1 lczero v7 id185 slowmover 120     89     227      12   62.5%    8.3%
2 lczero v7 id185 fpu 0.1 puct .9    0     173      12   50.0%   33.3%

Score of lczero v7 id185 slowmover 120 vs Laser-1_0 4CPU: 20 - 10 - 10 [0.625]
Elo difference: 88.74 +/- 98.10

Nothing beat the default. It appears that against engines which are fast and good at tactics, you cannot do better with more exploiting, because the exploring is necessary to not fall for something eventually...

@killerducky (Collaborator) commented:

@jjoshua2 I'm confused: why didn't you test PUCT=0.6, as in this PR?

@zz4032 commented Apr 26, 2018

A very short test in self-play with Puct=0.6 showed an improvement of about 100 Elo. But vs. other engines it looks different.

Time control was used (not fixed visits) because the setting probably affects calculation time.
LCZero vs. Stockfish 1.01 @ ~1 min/game:

   # PLAYER                        :  RATING  ERROR  LOS(%)  POINTS   GAMES  WON  DRAWN  LOST  DRAWS(%)
   1 Stockfish1.01                 :       0   ----    98.0   123.5     200  108     31    61      15.5
   2 LCZero_Id184_60db             :     -68     65    76.1    40.5     100   31     19    50      19.0
   3 LCZero_Id184_60db_Puct=0.6    :    -101     66     ---    36.0     100   30     12    58      12.0

Calculation time on the first move out of book: ~1.0s (~3500 visits).
High error bars, though.

Gauntlet: Stockfish 1.01 @ ~3s/game vs. LCZero @ ~10s/game (for about equal strength).

   # PLAYER                                 :  RATING  ERROR  LOS(%)  POINTS   GAMES  WON  DRAWN  LOST  DRAWS(%)
   1 Stockfish1.01_3s/game                  :       0   ----   100.0  1016.0    1600  857    318   425      19.9
   2 LCZero_Id184_60db_Puct=0.4_10s/game    :     -53     31    87.6   170.0     400  136     68   196      17.0
   3 LCZero_Id184_60db_Puct=0.6_10s/game    :     -79     31    73.8   155.5     400  111     89   200      22.2
   4 LCZero_Id184_60db_Puct=0.8_10s/game    :     -93     31   100.0   148.0     400  110     76   214      19.0
   5 LCZero_Id184_60db_Puct=1.0_10s/game    :    -169     33     ---   110.5     400   68     85   247      21.2

Calculation time on the first move out of book: ~0.14s (~350 visits).
Now it looks weird; probably time-control sensitive? I'm afraid we're going to need the Leela testing framework for things like this. :)

@jjoshua2 (Contributor) commented Apr 26, 2018

@killerducky Because I tested that and it was bad too; even worse, it turns out. It makes sense that you can't prune as much against a tactical monster like a 4-core alpha-beta engine, so I thought maybe halfway in between would help, but it didn't.

Score of lczero v7 id185 puct 0.6 vs Laser-1_0 4CPU: 6 - 8 - 2 [0.438]
Elo difference: -43.66 +/- 173.99

@remdu commented Apr 26, 2018

Do we really want to tune these parameters for specific outside opponents?

@jkiliani (Contributor) commented:

Based on the tests I posted in #438, I think this could actually be pulled, along with a fix changing the dynamic parent FPU with the virtual loss bug to the fixed dynamic parent eval.

@gcp (Contributor, Author) commented Apr 28, 2018

"you cannot do better with more exploiting, because the exploring is necessary to not fall for something eventually..."

Somewhat unrelated to this patch, but with the UCT formula used in LZ(C) the parameter doesn't really control exploration as much as you would think, and this is a notable difference between DeepMind's formula and the original UCT one.

The problem is that the "exploration term" is still weighted by the policy prior. So no matter what weight you put here, you won't broaden the search to explore low-policy-prior moves much.
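
For reference (my notation, a sketch rather than the LZ source), the two exploration terms being contrasted look roughly like this:

   #include <cmath>

   // DeepMind-style PUCT as used in LZ(C): the bonus is scaled by the policy
   // prior, so a low-prior move gets almost no bonus however large c_puct is.
   double puct_bonus(double c_puct, double prior,
                     int parent_visits, int child_visits) {
       return c_puct * prior * std::sqrt((double)parent_visits)
              / (1.0 + child_visits);
   }

   // Original UCT: the bonus depends only on visit counts, so every move is
   // eventually explored regardless of its prior. (The +1 keeps the term
   // defined for unvisited children in this sketch.)
   double uct_bonus(double c, int parent_visits, int child_visits) {
       return c * std::sqrt(std::log((double)parent_visits)
                            / (1.0 + child_visits));
   }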

However, I did not manage to gain significant strength by "fixing" this and re-tuning, even though performance in many of the currently very bad tactical positions was much better.

@killerducky (Collaborator) commented:

@zz4032

   # PLAYER                        :  RATING  ERROR  LOS(%)  POINTS   GAMES  WON  DRAWN  LOST  DRAWS(%)
   1 Stockfish1.01                 :       0   ----    98.0   123.5     200  108     31    61      15.5
   2 LCZero_Id184_60db             :     -68     65    76.1    40.5     100   31     19    50      19.0
   3 LCZero_Id184_60db_Puct=0.6    :    -101     66     ---    36.0     100   30     12    58      12.0

100 games and showing 33 Elo difference? That seems rather small to me.

@jjoshua2 combining two tests:

0 Laser-1_0 4CPU                -44     130      24   43.8%   20.8%
1 lczero v7 id185 slowmover 120      89     227      12   62.5%    8.3%
Score of lczero v7 id185 puct 0.6 vs Laser-1_0 4CPU: 6 - 8 - 2 [0.438]
Elo difference: -43.66 +/- 173.99

First, your test of the PR was only 16 games; that's clearly not enough. Second, why does one say slowmover 120 and the other doesn't?

@gcp Lc0 is very close to deterministic due to lack of symmetries. What was your test method? The server is using --tempdecay=10 to inject more variety in matches. Other people are using opening books.

@gcp (Contributor, Author) commented Apr 28, 2018

As already stated, I used an opening book.

@killerducky (Collaborator) commented:

My thoughts and plan for this PR:

  1. There are some conflicting measurements for self-play vs. other bots. I think we should go by self-play, because the training feedback loop is based on that. The idea is that if the feedback loop is 100 Elo stronger, it will improve faster. As we improve, I hope this will translate into better play vs. other bots as well. I view tests vs. other bots as sanity checks.

  2. I'd like to see a test for this PR in self-play for at least one more net.

@killerducky changed the base branch from next to master on April 29, 2018 15:55
@killerducky changed the base branch from master to next on April 29, 2018 15:56
@killerducky (Collaborator) commented:

There are more self-play tests in #466, all showing puct=0.6 is good.

@killerducky merged commit aac8099 into glinscott:next on Apr 29, 2018
@glinscott (Owner) commented:

Thanks for this @gcp!!

@zz4032 commented May 13, 2018

@killerducky I've set up CLOP tuning to try to reproduce the results and to find out 1) whether the Puct setting is indeed time-control dependent and 2) what the best values of Puct and FPU are against other engines (with regard to the next TCEC games).
@jjoshua2 suggested testing directly vs. alpha-beta engines of similar strength, so that's what I did.

First, LCZero at very short time control (visits on opening moves ~400, 192x15 net).
For the alpha-beta engines, the time control was further cut in half, since otherwise they are too strong at that TC and I wanted the winrate to be close to 0.5 for tuning.
Tuning was done on the two parameters simultaneously, so possible interaction between the parameters is taken into account.
[CLOP tuning plot: tuning02]
Best values: Puct=0.60 and FPU=0.18.

Next, same tuning on 4x longer time control (visits on opening moves ~ 1200):
[CLOP tuning plot: tuning02]

Again, Puct=0.60, so probably not TC dependent, but lower FPU=0.10. One further test at 4min/game would be interesting.

Not sure how precise CLOP results are, to be honest, but I think it makes sense so far.
The jumps of the tuned values look a bit strange in the graph; maybe that's how CLOP finds new optima. :)

And one more tuning in self-play (different net and lczero version). It should have been running a little longer, but is still valuable:
[CLOP tuning plot: tuning_self]
Puct=0.55, FPU=0.19, so not much difference from the gauntlet results.

Unrelated CLOP question: can anyone familiar with CLOP confirm that it does not support UCI option names with spaces? I had to recompile Leela with the UCI option 'FPU Reduction' renamed without the space character to get it working.

@zz4032 commented May 15, 2018

Ran a tuning at ~4 min/game time control (visits on opening moves ~4000).
[CLOP tuning plot: tuning03]
Puct=0.68, FPU=0.02
There is a clear trend visible for both parameters now. Not sure, however, whether a logarithmic trend is applicable in this case and whether negative FPU values make sense.
[CLOP tuning plot: tuning_sum]

@mooskagh (Contributor) commented:

A negative FPU reduction value does make sense.
It compensates for a bad policy head, similarly to how a positive FPU reduction amplifies a good policy head.
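
A minimal sketch of that reading (my illustration; only the sqrt scaling is from the code discussed below):

   #include <cmath>

   // cfg_fpu_reduction > 0: unvisited moves start below the parent's eval,
   // trusting the policy head and pruning low-prior moves harder.
   // cfg_fpu_reduction < 0: unvisited moves start above the parent's eval,
   // re-checking moves a weak policy head may have underrated.
   double fpu_eval(double parent_eval, double cfg_fpu_reduction,
                   double total_visited_policy) {
       double fpu_reduction =
           cfg_fpu_reduction * std::sqrt(total_visited_policy);
       return parent_eval - fpu_reduction;
   }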

@remdu commented May 15, 2018

Very interesting. This means that ideally, FPU reduction should fade away as the number of visits rises.

@jjoshua2 (Contributor) commented:

Doesn't that mean that instead of:
   fpu_reduction = cfg_fpu_reduction * std::sqrt(total_visited_policy);
maybe it should be more like:
   fpu_reduction = cfg_fpu_reduction * log(total_visited_policy);
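
Worth noting (my observation, assuming total_visited_policy is a sum of visited children's priors in [0, 1]): on that range sqrt stays in [0, 1], while a bare log is zero or negative and diverges at 0, which matters for the test below.

   #include <cmath>
   #include <cstdio>

   int main() {
       // Compare the scalings over plausible values of total_visited_policy.
       for (double p : {0.0, 0.25, 0.5, 1.0}) {
           std::printf("p=%.2f  sqrt=%.3f  log=%.3f  log1p=%.3f\n",
                       p, std::sqrt(p), std::log(p), std::log1p(p));
       }
       // p=0.00: sqrt=0.000, log=-inf (hence a +1 is needed), log1p=0.000
       // p=1.00: sqrt=1.000, log=0.000, log1p=0.693
       return 0;
   }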

@Videodr0me commented:
I had some thoughts on this issue here: https://github.com/glinscott/leela-chess/issues/606

I think we should never change puct during training, regardless of whether "tuning" the parameter yields some very short-term self-play Elo. The problem is that learning will readjust the policy head's spread and over the long term quasi-nullify any changes we made. This leads to learning chasing a moving target, and even worse, the adjustment in learning can run into the softmax sum-to-one constraint and/or regularization constraints. I believe that this (in a small part) amplified our current oversampling problems of the value head. There is some room for using the parameter very gently and very gradually, maybe if learning stagnates or generalization deteriorates.

Also, MCTS will never (!) find the required tactics (especially in "trappy" chess) no matter how we "tweak" it, without the policy already putting a substantial mass of probability on the tactical moves. This might seem counterintuitive at first, but I think that in order for tactics to work, we should strengthen the policy. If a higher puct shifts the balance more towards searching along the policy's lines, the search will, at the end of those deep lines, see its errors even with a "worse" value head, and make the policy better. Which in turn makes the value head better (as game outcomes tend to correlate better with the positions). So one can make a case that the current regression in Leela's tactics (chess) is exactly what to expect if we shift search towards an oversampled value head.

@zz4032 commented May 17, 2018

@Videodr0me The main reason for my tunings is having optimal values for matches; you're probably right about training.

I have tested the proposal
   fpu_reduction = cfg_fpu_reduction * log(total_visited_policy+1);
with @jjoshua2's support. The +1 avoids log(0), which otherwise crashes Leela.

cfg_fpu_reduction and the Puct coefficient were tuned again (with a restart at 320 games for smaller CLOP bounds; it looks like that wasn't necessary). The final tuned values, Puct=0.81 and FPU=0.36, can be considered as well tuned as those from CLOP tuning #1 above at the same TC. Interestingly, Puct changed as much as FPU did.
[CLOP tuning plot: tuning-log]
Match results at TC used for tuning (~1/4 min/game):

   # PLAYER                         :  RATING  ERROR  POINTS   GAMES  DRAWS(%)
   1 LCZero_v0.10_Id251_c4c2        :       0   ----   126.5     200      49.5
   2 LCZero_v0.10_Id251_c4c2_log    :     -95     35    73.5     200      49.5

Match results at longer TC (~1 min/game):

   # PLAYER                         :  RATING  ERROR  POINTS   GAMES  DRAWS(%)
   1 LCZero_v0.10_Id251_c4c2        :       0   ----   110.0     200      61.0
   2 LCZero_v0.10_Id251_c4c2_log    :     -35     35    90.0     200      61.0

Unfortunately no improvement. Maybe a linear function is now worth a try.

Edit:
   fpu_reduction = cfg_fpu_reduction * total_visited_policy;
TC ~1/4 min/game:

   # PLAYER                            :  RATING  ERROR  POINTS   GAMES  DRAWS(%)
   1 LCZero_v0.10_Id251_c4c2           :       0   ----    65.5     100      53.0
   2 LCZero_v0.10_Id251_c4c2_linear    :    -113     54    34.5     100      53.0
