Value Repair method
Even with advanced large nets such as T60, a small fraction of positional evaluations are badly inaccurate. This may contribute to misguided search in normal play. [Note: for practical hardware reasons I use SF12 to find bad positional evaluations, but direct testing shows that doing the same with Leela after search gives very similar results.]
To find such positions, I extract random positions from human games (lichess, Caissabase, ICCF) or from Leela training PGNs. Each position is evaluated by Leela (1 node, no search) and by either SF12 (currently 1M nodes) or Leela (1k nodes). When the no-search evaluation disagrees with the evaluation after search, the position is a candidate for value repair training. In the graph below, this corresponds to points that are far off the main diagonal. Direct testing of such instances shows that, for the large majority, the search evaluation (in this case SF12 at 1M nodes) is more accurate than the Leela 1-node evaluation. Occasionally the reverse is true, but those cases won't substantially hurt training, since they simply reinforce an evaluation the net already gets right.
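As a rough illustration of this step (not the actual extraction pipeline), the two evaluations could be gathered over UCI with python-chess. The engine paths and node limits are placeholders, and using Leela's UCI centipawn output at 1 node as a stand-in for its raw Q is an approximation; the centipawn scores are turned into Q-like values with the logistic function given further below.

# Sketch: collect the no-search and after-search evaluations for one position.
# Engine paths and node limits are assumptions for illustration only.
import chess
import chess.engine

def eval_cp(engine, board, nodes):
    # centipawn score from White's point of view
    info = engine.analyse(board, chess.engine.Limit(nodes=nodes))
    return info["score"].white().score(mate_score=32000)

with chess.engine.SimpleEngine.popen_uci("./stockfish12") as sf, \
     chess.engine.SimpleEngine.popen_uci("./lc0") as lc0:
    board = chess.Board()  # in practice, a position sampled from a game
    leela_cp  = eval_cp(lc0, board, nodes=1)          # no-search evaluation
    search_cp = eval_cp(sf,  board, nodes=1_000_000)  # evaluation after search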
The value repair training itself simply plays out a self-play game starting from the value repair candidate position, using the book option for Leela self-play. Currently the self-play uses a recent T60 net with parameters similar to the T60 main run, except that temperature is 0, so as to make the game outcome (Z) as accurate as possible.
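One possible way to prepare such a book (an illustrative sketch, not necessarily the tooling actually used) is to write each candidate position as a PGN "game" consisting only of a starting FEN, which self-play can then pick up through its book/openings option:

# Sketch: write candidate FENs as a PGN book of starting positions for self-play.
import chess
import chess.pgn

def write_book(candidate_fens, path="value_repair_book.pgn"):
    with open(path, "w") as f:
        for fen in candidate_fens:
            game = chess.pgn.Game()
            game.setup(chess.Board(fen))  # adds the FEN/SetUp headers
            print(game, file=f, end="\n\n")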
Value repair self-play games are mixed with normal training games for network training. Currently I am using about 40% value repair games, though each such game contributes fewer positions to training because book moves are excluded.
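A toy sketch of the mixing step, assuming the two sets of games are simply sampled at a fixed ratio (the pool names and sampling scheme are illustrative only):

import random

def sample_training_games(normal_games, repair_games, n, repair_fraction=0.4):
    # draw each game from the value-repair pool with ~40% probability
    mixed = []
    for _ in range(n):
        pool = repair_games if random.random() < repair_fraction else normal_games
        mixed.append(random.choice(pool))
    return mixed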
Because I have a large excess of under-utilized CPU, I am currently using SF12dev for evaluation. The centipawn score is converted to a Q-like value using a logistic function with output in the range [-1.0, 1.0]. Positions where Leela's no-search evaluation (Q) differs from this Q-like value by 0.55 or more are candidates for value repair training. Currently about 2-3% of positions meet this criterion.
import math

# logistic function adjusted to range [-1, +1]
def logistic(cp, alpha):
    return 2 / (1 + math.exp(-cp / alpha)) - 1
(alpha=170 currently, chosen to minimize the difference between Leela and SF evaluations after conversion)
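Putting the pieces together, the candidate test might look like the sketch below (variable names are mine; leela_q is Leela's no-search Q and sf_cp is the SF12 centipawn score):

ALPHA = 170
CANDIDATE_THRESHOLD = 0.55

def is_value_repair_candidate(leela_q, sf_cp, alpha=ALPHA, threshold=CANDIDATE_THRESHOLD):
    # convert the SF centipawn score to a Q-like value and compare with Leela's Q
    sf_q = logistic(sf_cp, alpha)
    return abs(leela_q - sf_q) >= threshold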
Efforts are underway to incorporate the method into future training runs by weighting training positions according to the disagreement between the evaluation before and after search for each move.
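Purely as an illustration of what such a weighting could look like (the functional form and cap below are my assumptions, not the planned implementation):

def position_weight(q_before, q_after, scale=2.0, max_weight=4.0):
    # larger before/after-search disagreement -> larger training weight
    disagreement = abs(q_after - q_before)  # both values in [-1, 1]
    return min(1.0 + scale * disagreement, max_weight)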
Graph of 1 million such positions, with a lowess regression line in blue:
Thanks to oscardsmith and dkappe for suggestions and help.