
Commit 546f153: minor update

hesic73 committed Mar 2, 2024
1 parent 4087b3a commit 546f153
Showing 14 changed files with 113 additions and 58 deletions.
34 changes: 16 additions & 18 deletions README.md
@@ -1,14 +1,10 @@
# Gomoku RL

**Empirically, Independent RL is enough (and in fact much better than PSRO).** As mentioned in [[1]](#refer-anchor-1), due to Gomoku's asymmetry, it's hard to train a network to play both black and white.
Documentation: https://hesic73.github.io/gomoku_rl/

![](/assets/images/screenshot_0.gif)

## TO DO

- [x] Restructure the code to decouple rollout functionality from `GomokuEnv`.
- [ ] Enhance documentation.
- [ ] Further improvement
[TOC]

## Introduction

@@ -19,6 +15,8 @@
Install *gomoku_rl* with the following command:

```bash
git clone git@github.com:hesic73/gomoku_rl.git
cd gomoku_rl
conda create -n gomoku python=3.11.5
conda activate gomoku
pip install -e .
@@ -32,7 +30,7 @@ I use python 3.11.5, torch 2.1.0 and **torchrl 0.2.1**. Lower versions of python

```bash
# override default settings in cfg/train_InRL.yaml
python scripts/train_InRL.py num_envs=1024 device=cuda epochs=3000 wandb.mode=online
python scripts/train_InRL.py num_envs=1024 device=cuda epochs=500 wandb.mode=online
# or simply:
python scripts/train_InRL.py
```
@@ -49,18 +47,14 @@ python scripts/demo.py device=cpu grid_size=56 piece_radius=24 checkpoint=/model/path
python scripts/demo.py
```

Pretrained models for a $15\times15$ board are available under `pretrained_models/15_15/`. Be aware that using the wrong model for the board size will lead to loading errors due to mismatches in AI architectures. In PPO, when `share_network=True`, the actor and the critic could utilize a shared encoding module. At present, a `PPOPolicy` object with a shared encoder cannot load from a checkpoint without sharing.

## Documentation

https://hesic73.github.io/gomoku_rl/
Pretrained models for a $15\times15$ board are available under `pretrained_models/15_15/`. Be aware that loading a model trained for a different board size fails because the network architectures do not match. In PPO, when `share_network=True`, the actor and the critic share an encoding module; at present, a `PPO` object with a shared encoder cannot load a checkpoint that was saved without sharing.
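
For intuition, here is a minimal PyTorch sketch of why such mismatches surface as loading errors. The toy linear layers below are stand-ins, not the project's actual networks:

```python
import torch
import torch.nn as nn

# Stand-in "policies" for two board sizes (NOT the real architectures).
policy_15x15 = nn.Linear(15 * 15, 15 * 15)
policy_9x9 = nn.Linear(9 * 9, 9 * 9)

torch.save(policy_15x15.state_dict(), "policy_15x15.pt")

try:
    # Loading a 15x15 checkpoint into a 9x9-sized network fails because the
    # parameter shapes do not match; a shared vs. non-shared encoder causes an
    # analogous mismatch in the module layout.
    policy_9x9.load_state_dict(torch.load("policy_15x15.pt"))
except RuntimeError as err:
    print(f"load_state_dict failed: {err}")
```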

## GUI

**Note: for deployment, we opt for `torch.jit.ScriptModule` instead of `torch.nn.Module`.** The `*.pt` files used in `scripts/train_*.py` are state dicts of a `torch.nn.Module` and cannot be directly utilized in this context.
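
A rough sketch of the conversion is shown below. The network definition and checkpoint filename are placeholders (the state dict only loads if the architecture matches the one used in training), not the repository's actual export procedure:

```python
import torch
import torch.nn as nn

# Placeholder policy network -- substitute the architecture actually used in
# scripts/train_*.py so that the state-dict keys and shapes line up.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 15 * 15, 15 * 15),
)
net.load_state_dict(torch.load("black_final.pt"))  # hypothetical state-dict checkpoint
net.eval()

# torch.jit.script produces a ScriptModule; the saved file can be loaded from
# C++ via torch::jit::load, independent of the Python class definition.
torch.jit.script(net).save("black_final_scripted.pt")
```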


In addition to `scripts/demo.py`, there is a standalone C++ GUI application. To compile the source code, make sure to have Qt5, Libtorch and cmake installed. Refer to [https://pytorch.org/cppdocs/installing.html](https://pytorch.org/cppdocs/installing.html) for instructions on how to install C++ distributions of Pytorch.
In addition to `scripts/demo.py`, there is a standalone C++ GUI application. To compile it, make sure Qt, LibTorch, and CMake are installed. Refer to [https://pytorch.org/cppdocs/installing.html](https://pytorch.org/cppdocs/installing.html) for instructions on installing the C++ distribution of PyTorch.

Here are the commands to build the executable:

@@ -83,19 +77,23 @@ cmake --build . --config Release
**PS**: If CMake cannot find Torch, try `set(Torch_DIR /absolute/path/to/libtorch/share/cmake/Torch)`.


## Supported Algorithms
## Algorithms

The framework currently implements PPO and DQN and is designed to make it straightforward to incorporate additional RL methods. For multi-agent training, it supports Independent RL and PSRO.

- PPO
- DQN
Notably, Independent RL has proved considerably more effective than PSRO. As mentioned in [[1]](#refer-anchor-1), because Gomoku is asymmetric, it is hard to train a single network to play both black and white.

(Maybe I need to tune hyperparameters for PSRO.)
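
To make the PSRO side a bit more concrete, below is a toy sketch of how a meta-solver can turn a population of past checkpoints into an opponent-sampling distribution. The `uniform` and `last_2` names mirror keys in `cfg/train_psro_sp.yaml`, but their exact semantics here are an assumption, not the project's implementation:

```python
import random

def opponent_distribution(population: list[str], solver: str) -> list[float]:
    """Toy meta-solvers over a population of checkpoint paths."""
    if solver == "uniform":
        # Every past policy is an equally likely opponent.
        return [1.0 / len(population)] * len(population)
    if solver == "last_2":
        # Assumed reading: only the two most recent policies get any mass.
        weights = [0.0] * len(population)
        for i in range(max(0, len(population) - 2), len(population)):
            weights[i] = 1.0
        return [w / sum(weights) for w in weights]
    raise ValueError(f"unknown meta solver: {solver}")

population = ["0.pt", "1.pt", "2.pt", "3.pt"]  # hypothetical checkpoint files
probs = opponent_distribution(population, "last_2")
opponent = random.choices(population, weights=probs, k=1)[0]
print(probs, opponent)  # e.g. [0.0, 0.0, 0.5, 0.5] 3.pt
```

A full PSRO setup would instead derive this distribution from an empirical payoff matrix (e.g. a Nash or uniform average over best responses); the sketch only shows where the meta-solver plugs in.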

## Details

Free-style Gomoku is a two-player zero-sum extensive-form game. The players alternately place black and white stones on the board, and the first to form an unbroken line of five or more stones of their own color wins. In the Multi-Agent Reinforcement Learning (MARL) setting, two agents learn competitively in the environment. On each turn, an agent's observation is the (encoded) current board state, and its action is the position at which to place a stone. We use action masking to prevent illegal moves. Winning yields a reward of +1, while losing incurs a penalty of -1.
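
The action-masking step can be illustrated with a short, self-contained sketch (an illustration only, not the environment's actual code; the encoding of 0 = empty cell is an assumption):

```python
import torch
from torch.distributions import Categorical

def sample_masked_action(logits: torch.Tensor, board: torch.Tensor) -> torch.Tensor:
    """Sample only legal moves for a batch of boards.

    logits: (B, N * N) raw policy logits over flattened board positions.
    board:  (B, N, N) with 0 for empty cells, non-zero for occupied cells.
    Returns a (B,) tensor of flat board indices.
    """
    legal = (board == 0).flatten(start_dim=1)                 # True where a stone may be placed
    masked_logits = logits.masked_fill(~legal, float("-inf"))  # illegal moves get probability 0
    return Categorical(logits=masked_logits).sample()

# Usage on a batch of 4 parallel 15x15 games with the center already occupied.
B, N = 4, 15
board = torch.zeros(B, N, N)
board[:, 7, 7] = 1.0
actions = sample_masked_action(torch.randn(B, N * N), board)
print(actions.shape)  # torch.Size([4])
```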

## Limitations
## TO DO

- Constrained to Free-style Gomoku support only.
- [x] Restructure the code to decouple rollout functionality from `GomokuEnv`.
- [ ] Enhance documentation.
- [ ] Further improvement

## References

2 changes: 1 addition & 1 deletion cfg/algo/ppo.yaml
@@ -17,7 +17,7 @@ share_network: true
optimizer:
name: adam
kwargs:
lr: 5e-4
lr: 1e-4


num_channels: 64
6 changes: 3 additions & 3 deletions cfg/train_InRL.yaml
@@ -1,5 +1,5 @@
seed: 0
num_envs: 1024
num_envs: 512
board_size: 15
device: cuda
out_device: cpu
@@ -12,8 +12,8 @@ steps: 128
save_interval: 300


black_checkpoint: black_final.pt # pretrained_models/${board_size}_${board_size}/${algo.name}/0.pt
white_checkpoint: white_final.pt # pretrained_models/${board_size}_${board_size}/${algo.name}/1.pt
black_checkpoint: pretrained_models/${board_size}_${board_size}/${algo.name}/0.pt
white_checkpoint: pretrained_models/${board_size}_${board_size}/${algo.name}/1.pt

wandb:
group: ${board_size}_${board_size}_${algo.name}_InRL
2 changes: 1 addition & 1 deletion cfg/train_psro.yaml
@@ -1,5 +1,5 @@
seed: 0
num_envs: 1024
num_envs: 512
board_size: 15
device: cuda
out_device: cpu
6 changes: 3 additions & 3 deletions cfg/train_psro_sp.yaml
@@ -1,5 +1,5 @@
seed: 0
num_envs: 1024
num_envs: 512
board_size: 15
device: cuda
out_device: cpu
@@ -12,15 +12,15 @@ steps: 128
save_interval: -1

# psro
meta_solver: uniform
meta_solver: last_2
mean_threshold: 0.99
std_threshold: 0.005
min_iter_steps: 30
max_iter_steps: 500

checkpoint: pretrained_models/${board_size}_${board_size}/${algo.name}/0.pt

population_dir: pretrained_models/${board_size}_${board_size}/${algo.name}
# population_dir: pretrained_models/${board_size}_${board_size}/${algo.name}

wandb:
group: ${board_size}_${board_size}_${algo.name}_sp
10 changes: 4 additions & 6 deletions docs/index.rst
@@ -56,14 +56,12 @@ After training, play Gomoku with your model using the `scripts/demo.py` script:
python scripts/demo.py
Pretrained models for a :math:`15\times15` board are available under `pretrained_models/15_15/`. Be aware that using the wrong model for the board size will lead to loading errors due to mismatches in AI architectures. In PPO, when `share_network=True`, the actor and the critic could utilize a shared encoding module. At present, a `PPOPolicy` object with a shared encoder cannot load from a checkpoint without sharing.



Pretrained models for a :math:`15\times15` board are available under `pretrained_models/15_15/`. Be aware that loading a model trained for a different board size fails because the network architectures do not match. In PPO, when `share_network=True`, the actor and the critic share an encoding module; at present, a `PPO` object with a shared encoder cannot load a checkpoint that was saved without sharing.

.. toctree::
:maxdepth: 2
:maxdepth: 4
:caption: Contents:

modules
gomoku_rl


30 changes: 26 additions & 4 deletions gomoku_rl/collector.py
@@ -259,6 +259,11 @@ def rollout(self, steps: int) -> tuple[TensorDict, dict]:
self._t,
) = self_play_step(self._env, self._policy, self._t_minus_1, self._t)

# truncate the last transition
if i == steps-2:
transition["next", "done"] = torch.ones(
transition["next", "done"].shape, dtype=torch.bool, device=transition.device)

if self._augment:
transition = augment_transition(transition)

@@ -310,10 +315,10 @@ def rollout(self, steps: int) -> tuple[TensorDict, TensorDict, dict]:
steps (int): The number of steps to execute in the environment for this rollout. It is adjusted to be an even number to ensure an equal number of actions for both players.
Returns:
tuple: A tuple containing three elements:
- A TensorDict of transitions collected for the black player, with each transition representing a game state before the black player's action, the action taken, and the resulting state.
- A TensorDict of transitions collected for the white player, structured similarly to the black player's transitions. Note that for the first step, the white player does not take an action, so their collection starts from the second step.
- A dictionary containing additional information about the rollout.
tuple: A tuple containing three elements:
- A TensorDict of transitions collected for the black player, with each transition representing a game state before the black player's action, the action taken, and the resulting state.
- A TensorDict of transitions collected for the white player, structured similarly to the black player's transitions. Note that for the first step, the white player does not take an action, so their collection starts from the second step.
- A dictionary containing additional information about the rollout.
"""

@@ -363,6 +368,13 @@ def rollout(self, steps: int) -> tuple[TensorDict, TensorDict, dict]:
self._t,
) = round(self._env, self._policy_black, self._policy_white, self._t_minus_1, self._t)

# truncate the last transition
if i == steps//2-1:
transition_black["next", "done"] = torch.ones(
transition_black["next", "done"].shape, dtype=torch.bool, device=transition_black.device)
transition_white["next", "done"] = torch.ones(
transition_white["next", "done"].shape, dtype=torch.bool, device=transition_white.device)

if self._augment:
transition_black = augment_transition(transition_black)
if i != 0:
@@ -468,6 +480,11 @@ def rollout(self, steps: int) -> tuple[TensorDict, dict]:
self._t,
) = round(self._env, self._policy_black, self._policy_white, self._t_minus_1, self._t, return_black_transitions=True, return_white_transitions=False)

# truncate the last transition
if i == steps//2-1:
transition_black["next", "done"] = torch.ones(
transition_black["next", "done"].shape, dtype=torch.bool, device=transition_black.device)

if self._augment:
transition_black = augment_transition(transition_black)

@@ -569,6 +586,11 @@ def rollout(self, steps: int) -> tuple[TensorDict, dict]:
self._t,
) = round(self._env, self._policy_black, self._policy_white, self._t_minus_1, self._t, return_black_transitions=False, return_white_transitions=True)

# truncate the last transition
if i == steps//2-1:
transition_white["next", "done"] = torch.ones(
transition_white["next", "done"].shape, dtype=torch.bool, device=transition_white.device)

if self._augment:
if i != 0 and len(transition_white) > 0:
transition_white = augment_transition(transition_white)
44 changes: 33 additions & 11 deletions gomoku_rl/policy/__init__.py
@@ -1,10 +1,9 @@
from .base import Policy
from .ppo import PPOPolicy
from .dqn import DQNPolicy
from .ppo import PPO
from .dqn import DQN

from torchrl.data.tensor_specs import DiscreteTensorSpec, TensorSpec
from omegaconf import DictConfig
from torch.cuda import _device_t
import torch


@@ -13,14 +12,23 @@ def get_policy(
cfg: DictConfig,
action_spec: DiscreteTensorSpec,
observation_spec: TensorSpec,
device: _device_t = "cuda",
device="cuda",
) -> Policy:
policies = {
"ppo": PPOPolicy,
"dqn": DQNPolicy,
}
assert name.lower() in policies
cls = policies[name.lower()]
"""
Retrieves a policy object based on the specified policy name, configuration, action and observation specifications, and device.
Args:
name (str): The name of the policy to retrieve, which should match a key in the Policy registry.
cfg (DictConfig): Configuration settings for the policy, typically containing hyperparameters and other policy-specific settings.
action_spec (DiscreteTensorSpec): The specification of the action space, defining the shape, type, and bounds of actions the policy can take.
observation_spec (TensorSpec): The specification of the observation space, defining the shape and type of observations the policy will receive from the environment.
device: The computing device ('cuda' or 'cpu') where the policy computations will be performed. Defaults to "cuda".
Returns:
Policy: An instance of the requested policy class, initialized with the provided configurations, action and observation specifications, and device.
"""

cls = Policy.REGISTRY[name.lower()]
return cls(
cfg=cfg,
action_spec=action_spec,
@@ -35,8 +43,22 @@ def get_pretrained_policy(
action_spec: DiscreteTensorSpec,
observation_spec: TensorSpec,
checkpoint_path: str,
device: _device_t = "cuda",
device="cuda",
) -> Policy:
"""
Initializes and returns a pretrained policy object based on the specified policy name, configuration, action and observation specifications, checkpoint path, and device.
Args:
name (str): The name of the policy to be loaded, corresponding to a key in the Policy registry.
cfg (DictConfig): Configuration settings for the policy, typically containing hyperparameters and other policy-specific settings.
action_spec (DiscreteTensorSpec): The specification of the action space, detailing the shape, type, and bounds of actions the policy can execute.
observation_spec (TensorSpec): The specification of the observation space, detailing the shape and type of observations the policy will receive from the environment.
checkpoint_path (str): The file path to the saved model checkpoint from which the policy's state should be loaded.
device: The computing device ('cuda' or 'cpu') on which the policy computations will be executed. Defaults to "cuda".
Returns:
Policy: An instance of the specified policy class, initialized with the provided configurations, action and observation specifications, and pretrained weights loaded from the given checkpoint path.
"""
policy = get_policy(
name=name,
cfg=cfg,
17 changes: 14 additions & 3 deletions gomoku_rl/policy/base.py
@@ -1,20 +1,31 @@
import abc
from typing import Dict
from typing import Dict, Type
from tensordict import TensorDict
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.tensor_specs import DiscreteTensorSpec, TensorSpec
from omegaconf import DictConfig


class Policy(abc.ABC):

REGISTRY: dict[str, Type["Policy"]] = {}

@classmethod
def __init_subclass__(cls, **kwargs):
if cls.__name__ in Policy.REGISTRY:
raise ValueError
super().__init_subclass__(**kwargs)
Policy.REGISTRY[cls.__name__] = cls
Policy.REGISTRY[cls.__name__.lower()] = cls

@abc.abstractmethod
def __init__(
self,
cfg: DictConfig,
action_spec: DiscreteTensorSpec,
observation_spec: TensorSpec,
device = "cuda",
) -> None:
device="cuda",
):
"""Initializes the policy.
Args:
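
As an aside, a tiny illustration of how the registry added above behaves once `gomoku_rl.policy` is imported (assuming the package is installed; subclasses are registered under both their class name and its lowercase form, which is what `get_policy` looks up):

```python
from gomoku_rl.policy import Policy, PPO, DQN

# Defining the PPO and DQN subclasses registers them automatically via
# Policy.__init_subclass__, so config values like algo.name: ppo resolve
# to the right class without a hand-maintained mapping.
assert Policy.REGISTRY["ppo"] is PPO
assert Policy.REGISTRY["DQN"] is DQN
print(sorted(Policy.REGISTRY))
```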
8 changes: 5 additions & 3 deletions gomoku_rl/policy/common.py
@@ -1,3 +1,6 @@
from typing import Generator


from torch.optim import Optimizer, Adam, AdamW
import torch
import torch.nn as nn
@@ -166,10 +169,9 @@ def make_ppo_ac(
return ActorValueOperator(common_module, policy_module, value_module)


def make_dataset_naive(tensordict: TensorDict, batch_size: int):
tensordict=tensordict.reshape(-1)
assert tensordict.shape[0] >= batch_size
def make_dataset_naive(tensordict: TensorDict, batch_size: int) -> Generator[TensorDict, None, None]:
tensordict = tensordict.reshape(-1)
assert tensordict.shape[0] >= batch_size
perm = torch.randperm(
(tensordict.shape[0] // batch_size) * batch_size,
device=tensordict.device,
2 changes: 1 addition & 1 deletion gomoku_rl/policy/dqn.py
@@ -36,7 +36,7 @@ def get_replay_buffer(
return buffer


class DQNPolicy(Policy):
class DQN(Policy):
def __init__(
self,
cfg: DictConfig,
2 changes: 1 addition & 1 deletion gomoku_rl/policy/ppo.py
@@ -20,7 +20,7 @@
)


class PPOPolicy(Policy):
class PPO(Policy):
def __init__(
self,
cfg: DictConfig,