The central focus of this repository is developing strategies (i.e. policies) and analyzing how well they perform. This section of the documentation shows the basic steps to develop an agent strategy by explaining the strategies used by the hand-crafted agents in this repository. In another documentation file, I will explain and explore how to craft deep RL agents.
For an analysis of how well these hand-crafted agents perform, see the documentation discussing multi-player ELO.
Note: my standard import of the agents package is:

```python
from dominoes import agents as da
```

This assumes that all agents are imported in the `__init__.py` file. If you want this to work, make sure you add any agents you code to those imports.
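For example, registering a new agent in `__init__.py` might look like this (a hypothetical sketch; the actual module and class names in the repository may differ):

```python
# dominoes/agents/__init__.py
from .dominoeAgent import dominoeAgent   # the default agent
from .myAgent import myAgent             # hypothetical: add any new agent you write here
```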
Coding a new policy requires overwriting the default dominoes agent functions. For simple agents (examples below), this only requires overwriting the methods documented under "Choosing an option to play" in the agents documentation. For more advanced agents (such as the TD-lambda agents), some methods related to processing the game state will need to be overwritten too.
In general, the only methods that need overwriting are the `optionValue` method, which assigns a value to each legal play, and the `makeChoice` method, which chooses a dominoe based on the option values of each legal play. However, `selectPlay` sometimes needs overwriting, in particular for the TD-lambda agents, which cannot measure option value in parallel as is done by the basic hand-crafted agents.
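For illustration, a new agent can often be written as a small subclass that overwrites just these two methods. The sketch below is hypothetical: the base class name, import path, and method signatures are assumed from the greedy-agent example later in this section.

```python
import numpy as np
from dominoes.agents import dominoeAgent  # assumed import path for the default agent class

class myAgent(dominoeAgent):
    # assign a value to each legal play (locations/dominoes describe the legal options)
    def optionValue(self, locations, dominoes):
        return np.ones(len(dominoes))  # placeholder: treat every legal play equally

    # choose which of the legal plays to make, given their option values
    def makeChoice(self, optionValue):
        return np.argmax(optionValue)
```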
This repository contains several hand-crafted agents that play the game with simple policies. This section of the documentation explains these policies and, in doing so, demonstrates how to develop a policy.
The default dominoe agent (i.e. the top-level class found in the `dominoeAgent.py` file) uses the simplest strategy: it plays a random option out of all legal plays. To accomplish this, it assigns an optionValue of 1 to each legal move, then chooses an option with Thompson sampling. (Since all options have the same value, this means choosing randomly with a uniform distribution.)
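Here, "Thompson sampling" presumably means sampling each play with probability proportional to its option value, which reduces to a uniform choice when all values are equal. A minimal sketch of that choice rule (assuming `optionValue` arrives as a 1-D numpy array of non-negative values, one per legal play):

```python
import numpy as np

# a minimal sketch of a value-proportional choice rule; with all values equal to 1
# this reduces to a uniform random choice over the legal plays
def makeChoice(self, optionValue):
    probabilities = optionValue / np.sum(optionValue)
    return np.random.choice(len(optionValue), p=probabilities)
```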
Greedy agents set the option value of each legal option to the total number of points on the corresponding dominoe. For example, the dominoe (3|4) will be assigned a value of 7. To accomplish this, the greedy agent overwrites the `optionValue` function as follows:
```python
def optionValue(self, locations, dominoes):
    return self.dominoeValue[dominoes]
```
Then, the greedy agent simply plays whichever option has the highest value by overwriting the `makeChoice` function:
```python
def makeChoice(self, optionValue):
    return np.argmax(optionValue)
```
Stupid agents work just like the greedy agent, except they play the dominoe with the least value by using `np.argmin` rather than `np.argmax`.
Double agents assign an infinite value to any double option (because playing a double allows the agent to play again). Any option that isn't a double is assigned the total number of points on the dominoe, just like for greedy agents.
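A sketch of that valuation (assuming, hypothetically, that the agent stores a per-dominoe boolean lookup of doubles called `dominoeDouble`; the repository's actual attribute names may differ):

```python
import numpy as np

def optionValue(self, locations, dominoes):
    # start from the greedy valuation: total points on each legal dominoe
    value = self.dominoeValue[dominoes].astype(float)
    # doubles get infinite value so they are always preferred
    value[self.dominoeDouble[dominoes]] = np.inf
    return value
```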
This repository also contains a slightly more advanced agent that chooses plays in a way that maximizes its ability to play all of its dominoes during a hand based on the way they connect to each other - i.e. where each dominoe represents an edge of a graph. It requires additional functions beyond those used by the default dominoe agent.
Identifying a set of dominoes that form a sequence of legal plays, starting at whatever number is available to play on an agent's own line, is fundamental to success in dominoes. This is a challenging computation, related to the traveling salesman problem. The best-line agent solves this with an abbreviated brute-force search, in which it stops sampling unique sequences after they reach a certain length, and in which it efficiently updates a precomputed list of all possible sequences rather than recomputing it every time.
Once all sequences are computed, it picks the "best one" based on several meta-parameters. To do this, it measures the total discounted value of each sequence based on how many turns it will take to play each dominoe in the sequence, then subtracts the discounted value of each dominoe in the agent's hand that isn't in that sequence.
Finally, the agent assigns the full sequence value to the first dominoe of the "best" sequence, assigns an infinite value to any double, and assigns the standard dominoe value to all other dominoes (i.e. the total number of points on the dominoe). Then, it chooses the play with the highest value. This means that the bestLine agent will always play a double, then will usually play dominoes according to the best sequence on its own line, and only plays a different dominoe if that dominoe has a higher value than all dominoes in its best line.
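A rough sketch of that valuation step (illustrative only; attribute names like `bestSequence`, `sequenceValue`, and `dominoeDouble` are assumptions, not necessarily the repository's actual names):

```python
import numpy as np

def optionValue(self, locations, dominoes):
    # default: the greedy valuation (total points on each legal dominoe)
    value = self.dominoeValue[dominoes].astype(float)
    # the first dominoe of the best sequence inherits the full discounted sequence value
    for idx, dominoe in enumerate(dominoes):
        if len(self.bestSequence) > 0 and dominoe == self.bestSequence[0]:
            value[idx] = self.sequenceValue
    # doubles take priority over everything (infinite value means they are always chosen)
    value[self.dominoeDouble[dominoes]] = np.inf
    return value
```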
This strategy is good (it's the best out of all the hand-crafted agents, see the section on ELO), but is missing several key ideas for strong dominoes play. For example, the best sequence isn't necessarily the one with the highest overall value; it's more important to get rid of as many points as possible before the current hand is over. Suppose the best line agent had the dominoes (0|1)-(1|2)-(2|0)-(0|9)-(9|9) left in its hand. Playing these in order would get all of its dominoes out (assuming it has a 0 available on its line). However, if a different agent is likely to go out in two turns, it should probably skip to (0|9)-(9|9) to get those points out before the hand is over. This is something I hope to analyze in deep RL agents. However, letting the line discount factor depend on the expected number of turns remaining might dramatically improve the agent's performance (I might get to coding this later).

The best line agent's behavior is controlled by the following meta-parameters:
- inLineDiscount: temporal discounting for dominoes in each line
- offLineDiscount: temporal discounting for dominoes not in each line
- lineTemperature: varies how noisy the choice is for the "best line". In practice, this isn't really used because the agent simply picks the line with the highest value rather than Thompson sampling or some other strategy.
- maxLineLength: this is the maximum number of dominoes to sequence together into a possible line. If this is set to a high number, then the computation time can become prohibitively long, especially for numPlayer/highestDominoe combinations that lead to each agent having many dominoes in their line.
- needsLineUpdate: boolean variable indicating whether the agent needs to fully recompute all possible lines it can play (set to True at the beginning of a new hand or if a different agent plays on this agent's line).
- useSmartUpdate: boolean variable indicating whether to use the efficient update of all possible lines (see below).
The best line agent adds an intermediary step to computing option value for its policy: the `dominoeLineValue` method. Line value is composed of two components: the inLineValue and the offLineValue.
- inLineValue: the total number of discounted points in a sequence of dominoes, where the discount factor is applied once for every turn required to get to that dominoe in the sequence (e.g. the first dominoe in a sequence has value = 1*dominoeValue, the second has value = inLineDiscount*dominoeValue, etc.; doubles don't consume a turn, so they don't add a discount step).
- offLineValue: the total number of discounted points of dominoes not in a particular sequence (where the offLineDiscount is applied as many times as there are turns in each sequence). A sketch of both computations is shown below.
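A minimal sketch of how these two values could be computed (illustrative only; the actual `dominoeLineValue` implementation may differ, and the helper below with its argument names is an assumption):

```python
def lineValues(sequence, handValues, isDouble, inLineDiscount, offLineDiscount):
    # sequence: indices (into the hand) of dominoes in one candidate line, in play order
    # handValues: total points on each dominoe in the hand
    # isDouble: boolean flag per dominoe in the hand (doubles let the agent play again)
    turn = 0
    inLineValue = 0.0
    for idx in sequence:
        inLineValue += (inLineDiscount ** turn) * handValues[idx]
        if not isDouble[idx]:
            turn += 1  # a double doesn't consume a turn, so it doesn't add a discount step
    # dominoes left out of the line are discounted by how many turns the line takes to play
    offLineValue = (offLineDiscount ** turn) * sum(
        handValues[i] for i in range(len(handValues)) if i not in sequence
    )
    return inLineValue, offLineValue
```

The value used to rank a candidate sequence would then be inLineValue - offLineValue, matching the description above.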
The best line agent depends on some supporting functions in the `functions.py` file:
- constructLineRecursive: takes as input the list of dominoes in the current set, the list of indices of dominoes in an agent's hand, and whatever value is available on the agent's line, and recursively creates a list of lists of every possible sequence that the agent can play on its own line. It stops at a certain line length if the "maxLineLength" input is specified. (A simplified sketch of this kind of recursion follows this list.)
- updateLine: takes as input the previously computed list of sequences (lineSequence), along with the index of the dominoe that was just played (nextPlay) and a boolean "onOwn" indicating whether the dominoe was played on the agent's own line. It then updates each sequence within lineSequence if it includes the nextPlay.
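A simplified sketch of this kind of recursion (this is not the repository's constructLineRecursive, just an illustration of the idea; dominoes are represented as (value, value) tuples and the hand as a list of indices into the dominoe set):

```python
def constructLines(dominoes, hand, available, maxLineLength=None):
    """Return every sequence of hand indices playable in order, starting from `available`."""
    sequences = [[]]  # the empty sequence (playing nothing more on this line) is always included
    if maxLineLength is not None and maxLineLength <= 0:
        return sequences
    for idx in hand:
        a, b = dominoes[idx]
        if available not in (a, b):
            continue  # this dominoe can't connect to the current end of the line
        nextAvailable = b if a == available else a
        remaining = [h for h in hand if h != idx]
        nextMax = None if maxLineLength is None else maxLineLength - 1
        for tail in constructLines(dominoes, remaining, nextAvailable, nextMax):
            sequences.append([idx] + tail)
    return sequences
```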
The persistent line agent is a slight variation of the best line agent. Instead of choosing the best line on every turn (which requires lots of computation to recalculate all the possible lines, even with the efficient `updateLine` method), it chooses a best line once (at the beginning of a hand) and only updates it if the line is disrupted. This is a little bit faster (~20% faster when highestDominoe=9 and maxLineLength=12) and results in only a small variation in policy. For a comparison of their policy performance, see the experiment where their ELOs are compared in a special league.