Tune PUCT parameter for chess. #435
Conversation
CLOP tuning indicates the optimal values are around 0.4 - 0.9, with an estimated maximum around 0.6, when measured in games with 2000-5000 playouts per move. Verifying at faster time controls (600-1600 playouts) confirms this is a strength gain:
Score of lczero-tuned vs lczero: 188 - 72 - 184 [0.631] 444
Elo difference: 92.93 +/- 24.96
SPRT: llr 2.97, lbound -2.94, ubound 2.94 - H1 was accepted
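For reference, the score, Elo, and SPRT numbers quoted above follow from the standard formulas; a minimal Python sketch (assuming the usual alpha = beta = 0.05, which matches the +/-2.94 bounds shown):

```python
import math

# Match result quoted above: 188 wins, 72 losses, 184 draws in 444 games.
wins, losses, draws, games = 188, 72, 184, 444
score = (wins + 0.5 * draws) / games            # ~0.631

# Logistic Elo from the score fraction (~ +93 Elo).
elo = -400 * math.log10(1.0 / score - 1.0)

# SPRT acceptance bounds for alpha = beta = 0.05:
# accept H1 when LLR >= ln((1 - beta) / alpha), H0 when LLR <= ln(beta / (1 - alpha)).
alpha, beta = 0.05, 0.05
upper = math.log((1 - beta) / alpha)            # ~ +2.94
lower = math.log(beta / (1 - alpha))            # ~ -2.94

print(round(score, 3), round(elo, 2), round(lower, 2), round(upper, 2))
```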
Well this is definitely going to interact with the FPU method testing I'm doing in #364. Glad you have some robust data here, thank you for contributing! However, may I ask exactly which binary you used for these tuning tests? Is it the 0.7 master branch, which uses dynamic parent eval with the bugged virtual loss effect in it, or some other branch? Also, can you please specify the network used for CLOP? My tests are rather conclusive that tuning with different nets produces (somewhat) different results. Edit: We were planning on doing a thorough combined CLOP run together with FPU reduction, but it looks like simulating virtual loss actually works quite well for FPU reduction as well, so there has been no resolution to this yet. |
Testing was based on master. I did a diff with "next" and there are no relevant changes there. Network was 182 | 33f9938a. |
I see what you mean:
Which is of course broken. I'll change my local builds to enforce single-threaded mode. But I assume that the current value (or the worth of having fpu_dynamic_eval!) was tuned incorporating this bug. That would be especially true as virtual loss would still be applied in single-threaded mode. |
Yes, that was introduced unintentionally, but when we compared it to "clean" dynamic parent eval, it looked like allowing the bug actually helped strength, so we didn't fix it (yet). I think the virtual losses act like a different method of FPU reduction, but this likely has problems at very low visit counts (and may also have issues with multithreading, obviously). I like that its effects at the root level are very small, while deeper into the tree it prunes more strongly. |
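To make the "virtual loss as FPU reduction" effect concrete, here is a generic sketch (not the actual lczero code; the W/N node layout is assumed purely for illustration): counting a virtual loss as a zero-scoring visit pulls a node's Q down, so when an unvisited child's first-play value is taken from the parent's eval, pending visits effectively act as an extra reduction.

```python
class Node:
    """Toy node for illustration only: accumulated value W and visit count N."""
    def __init__(self, w=0.0, n=0):
        self.w = w                  # sum of backed-up values (side to move's view)
        self.n = n                  # real visit count
        self.virtual_losses = 0     # pending descents counted as losses

    def add_virtual_loss(self):
        self.virtual_losses += 1

    def q(self):
        # A virtual loss is a visit that scored 0, so it lowers Q.
        total = self.n + self.virtual_losses
        return self.w / total if total else 0.0

parent = Node(w=6.0, n=10)
print(parent.q())            # 0.60 without virtual losses
parent.add_virtual_loss()
print(parent.q())            # ~0.55: an FPU taken from this eval is effectively reduced
```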
So if I have even more invasive changes to search, what should I use as the baseline, if not master/next? |
I wish I had a good answer for you there, but unfortunately I don't. I'm trying to find a good and robust FPU method but am bottlenecked by a slow machine to test them on. If you have a good idea what to do with FPU (and the capacity to test on multiple nets), that would be very welcome. Until then, I think we have to take your results as one measurement point, simply showing that 0.6 works well with dynamic parent eval bugged by virtual loss. My current impression is that search parameter tuning is not extremely high on the todo list of the main project admins, which I understand considering how fast it is currently developing. |
Well, it was never high on mine either, because tuning (in Go) always indicated strength changes of only a few Elo. I was quite astonished to get 100 Elo for free here. Was this ever tuned, or just copied from upstream? |
Technical question, could you give me a template of the script you use for CLOP? It is the version implemented in cutechess-cli, correct? We have a few volunteers who have better hardware and would be willing to run tuning in the future, and obviously CLOP would be better than the round-robin tournaments I used so far. |
Oh, and by the way, did you use a temperature decay schedule in those tuning tests, or Dirichlet noise, or just an external opening book? |
I used an external opening book. Small book, only openings that happened a lot in GM play. |
Should have no issue with game variety then. For match games we switched to a fractional temperature decay schedule, since it turned out that the variety from only Dirichlet noise became insufficient to avoid duplicate games: #267 Go doesn't have this problem due to the much larger search space, and the symmetries in NN evals. |
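As a rough illustration of the temperature approach mentioned above, here is a hedged sketch of sampling the move proportionally to visit counts raised to 1/T, with T decaying by ply; the constants are made up and are not the project's actual schedule.

```python
import random

def pick_move(visit_counts, ply, t0=1.0, decay=0.9, t_min=0.05):
    # Temperature decays with the ply number (illustrative constants only).
    t = max(t_min, t0 * decay ** ply)
    moves = list(visit_counts)
    weights = [visit_counts[m] ** (1.0 / t) for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]

# Early in the game the choice is fairly varied; later it collapses to the top move.
print(pick_move({"e2e4": 800, "d2d4": 150, "c2c4": 50}, ply=2))
print(pick_move({"e2e4": 800, "d2d4": 150, "c2c4": 50}, ply=30))
```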
I used this one: https://github.com/cutechess/cutechess/blob/master/tools/clop-cutechess-cli.py For lczero you need to explicitly set the working dir, and unfortunately in master fixed visits is broken (this is fixed in next, it seems), which is why I used time controls. |
odd, I thought the fix for 'go nodes' got merged forward to master |
go nodes works, but combining -v with a specified time control (which cutechess requires) works the wrong way around. |
I'm surprised that works on next either: -v is pretty much supposed to be ignored unless go is given without any qualification. But anyway, we're drifting off topic I guess. The methodology seems pretty good, and I'm not too surprised that puct would be worth tuning again after the 'virtual loss bug' feature appeared. But it's probably best to do a custom build and share it with some of the people who run multi-engine gauntlets to see how it fares outside of self-play. (Because, as you say, 100 Elo is kind of surprising.) |
Puct is configurable via UCI, no custom build required. |
right, of course... my bad! |
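Since Puct is exposed over UCI as noted above, a rebuild isn't needed to test it; here is a minimal sketch of driving the engine from Python (the binary path and weights file are placeholders, and the exact option name should be checked against your build's `uci` output):

```python
import subprocess

engine = subprocess.Popen(
    ["./lczero", "-w", "weights.txt"],          # placeholder command line
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

def send(cmd):
    engine.stdin.write(cmd + "\n")
    engine.stdin.flush()

send("uci")
while "uciok" not in engine.stdout.readline():  # wait for the option list to finish
    pass
send("setoption name Puct value 0.6")           # option name as used in this thread
send("isready")
print(engine.stdout.readline().strip())         # expect "readyok"
```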
@gcp - I think testing against next is best, similar to LZGo, right? The plan is for master to have only fixes like "MSVC doesn't compile", plus changes to the server code (by mistake I pulled a few commits outside this definition). As you said, right now next and master do not have any changes that impact this PR. Let's ask a few of the other testers to use this PUCT value in their gauntlet testing. |
@killerducky Are you planning to merge this? I'm preparing a new PR right now to restrict the parameter space, and enable a combined puct/FPU tuning with only one FPU parameter. I'm thinking of switching to simulated VL for this and actually removing "normal" FPU reduction so the optimisation problem doesn't become unmanageable, as @gcp rightly remarked in #364. While I'm impressed with the 100 Elo gain, I think testing on a few more nets would have helped, and we should really do the puct and FPU reduction tuning at the same time. Edit @killerducky Would you have a problem with me removing the static parent eval for FPU? I think tests showed dynamic eval behaving better as base FPU before reduction, and we don't really need more free parameters to tune if puct needs tuning. |
We can take some more testing time, but this is the best thing we have so far, so you need to at least get close to it. |
I agree with you that it sounds good, but the method (self-play only, with an external book and no temperature, and only one net - which one?) is somewhat problematic in my opinion. I had tests as well that showed similar improvement margins for one net; they just didn't hold up to closer scrutiny. |
All at 40/1 TC, all independent games from 3 tests:
Nothing beat the default. It appears that against engines which are fast and good at tactics you cannot do better with more exploitation, because the exploration is necessary to not eventually fall for something... |
@jjoshua2 I'm confused why you didn't test PUCT=0.6 as in this PR? |
A very short test in self-play with Puct=0.6 showed an improvement of about 100 Elo. But vs. other engines it looks different. Time control was used (not fixed visits) because the setting probably affects calculation time.
Calculation time on first move out of book: ~1.0s (~3500 visits) Gauntlet Stockfish 1.01 @ ~3s/game vs. LCZero @ ~10s/game (for about equal strength).
Calculation time on first move out of book: ~0.14s (~350 visits) |
@killerducky Because I tested that and it was bad too - even worse, as it turns out. It makes sense that you can't prune as much against a tactical monster like a 4-core alpha-beta engine, so I thought maybe halfway in between would help, but it didn't.
|
Do we really want to tune these parameters for specific outside opponents ? |
Based on the tests I posted in #438, I think this could actually be pulled along with a fix to change the dynamic parent FPU with virtual loss bug to the fixed, dynamic parent eval. |
Somewhat unrelated to this patch, but with the UCT formula used in LZ(C) the parameter doesn't really control exploration as much as you would think, and this is a notable difference between DeepMind's formula and the original UCT one. The problem is that the "exploration term" is still weighted by the policy prior. So no matter what weight you put here, you won't broaden the search to explore low-policy-prior moves much. However, I did not manage to gain significant strength by "fixing" this and re-tuning, even though performance in many of the currently very bad tactical positions was much better. |
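For clarity, here is a sketch of the two selection formulas being contrasted (standard AlphaZero/UCT textbook forms, not copied from the lczero source): in the PUCT term the exploration bonus is multiplied by the policy prior P, so a low-prior move gets almost no bonus regardless of the constant, whereas the original UCT bonus depends only on visit counts.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=0.6):
    # AlphaZero-style PUCT: exploration term is scaled by the prior.
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

def uct_score(q, parent_visits, child_visits, c=1.4):
    # Original UCT: prior-free bonus; unvisited moves are tried unconditionally.
    if child_visits == 0:
        return float("inf")
    return q + c * math.sqrt(math.log(parent_visits) / child_visits)

# A move with prior 0.01 stays buried under PUCT even with zero visits,
# while classic UCT would insist on exploring it at least once.
print(puct_score(q=0.0, prior=0.01, parent_visits=1000, child_visits=0))  # ~0.19
print(uct_score(q=0.0, parent_visits=1000, child_visits=0))               # inf
```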
100 games and showing 33 Elo difference? That seems rather small to me. @jjoshua2 combining two tests:
First, your test of the PR was only 16 games; that's clearly not enough. Second, why does one say slowmover 120 and the other doesn't? @gcp Lc0 is very close to deterministic due to lack of symmetries. What was your test method? The server is using
As already stated, I used an opening book. |
My thoughts and plan for this PR:
|
There are more self-play tests in #466, all showing puct=0.6 is good. |
Thanks for this @gcp!! |
@killerducky I've set up CLOP tuning trying to reproduce the results and to find out 1) whether the Puct setting is indeed time-control dependent and 2) what the best values of Puct and FPU are against other engines (with regard to the next TCEC games).

First, LCZero at very short time control (visits on opening moves ~400, 192x15 net). Next, the same tuning at a 4x longer time control (visits on opening moves ~1200): again Puct=0.60, so probably not TC dependent, but a lower FPU=0.10. One further test at 4 min/game would be interesting. Not sure how precise CLOP results are, to be honest, but I think it makes sense so far.

And one more tuning in self-play (different net and lczero version). It should have been running a little longer, but is still valuable.

Unrelated CLOP question: can anyone familiar with CLOP confirm that CLOP does not support UCI option names with spaces? I had to recompile Leela with the UCI option 'FPU Reduction' renamed without the space character to get it working.
Negative FPU reduction value does make sense. |
Very interesting. This means that ideally, FPU reduction should fade away as the number of visits rises. |
Doesn't that mean instead of: |
I had some thoughts on this issue here: https://github.com/glinscott/leela-chess/issues/606

I think we should never change puct during training, regardless of whether "tuning" the parameter yields some very short-term self-play Elo. The problem is that learning will readjust the policy head spread and over the long term quasi-nullify any changes we made. This leads to learning chasing a moving target - and even worse, the adjustment in learning can run into the softmax sum-to-one constraint and/or regularization constraints. I believe that this (in a small part) amplified our current oversampling problems of the value head. There is some room for using the parameter very gently and very gradually, maybe if learning stagnates or generalization deteriorates.

Also, MCTS will never (!) find the required tactics (especially in "trappy" chess) no matter how we "tweak" it, without policy already putting a substantial mass of probability on the tactical moves. This might seem counterintuitive at first, but I think that in order for tactics to work, we should strengthen policy. If a higher puct shifts the balance more towards searching along policy's lines, it will, at the end of the deep lines, see its errors even with a "worse" value head, and that makes policy better. Which in turn makes the value head better (as game outcomes tend to correlate better with the positions). So one can make a case that the current regression in Leela's tactics (chess) is exactly what to expect if we shift search towards an oversampled value head. |
@Videodr0me The main reason for my tunings is having optimal values for matches; you're probably right about training. I have tested the proposal
Match results at longer TC (~1 min/game):
Unfortunately no improvement. Maybe a linear function is now worth a try. Edit:
|