
Commit 546f153: minor update

hesic73 committed Mar 2, 2024
1 parent 4087b3a commit 546f153
Showing 14 changed files with 113 additions and 58 deletions.
34 changes: 16 additions & 18 deletions README.md
@@ -1,14 +1,10 @@
# Gomoku RL

**Empirically, Independent RL is enough (and in fact much better than PSRO).** As mentioned in [[1]](#refer-anchor-1), due to Gomoku's asymmetry, it's hard to train a network to play both black and white.
Documentation: https://hesic73.github.io/gomoku_rl/

![](/assets/images/screenshot_0.gif)

## TO DO

- [x] Restructure the code to decouple rollout functionality from `GomokuEnv`.
- [ ] Enhance documentation.
- [ ] Further improvement
[TOC]

## Introduction

@@ -19,6 +15,8 @@
Install *gomoku_rl* with the following command:

```bash
git clone git@github.com:hesic73/gomoku_rl.git
cd gomoku_rl
conda create -n gomoku python=3.11.5
conda activate gomoku
pip install -e .
@@ -32,7 +30,7 @@ I use python 3.11.5, torch 2.1.0 and **torchrl 0.2.1**. Lower versions of python

```bash
# override default settings in cfg/train_InRL.yaml
python scripts/train_InRL.py num_envs=1024 device=cuda epochs=3000 wandb.mode=online
python scripts/train_InRL.py num_envs=1024 device=cuda epochs=500 wandb.mode=online
# or simply:
python scripts/train_InRL.py
```
@@ -49,18 +47,14 @@ python scripts/demo.py device=cpu grid_size=56 piece_radius=24 checkpoint=/model/path
python scripts/demo.py
```

Pretrained models for a $15\times15$ board are available under `pretrained_models/15_15/`. Be aware that using the wrong model for the board size will lead to loading errors due to mismatches in AI architectures. In PPO, when `share_network=True`, the actor and the critic could utilize a shared encoding module. At present, a `PPOPolicy` object with a shared encoder cannot load from a checkpoint without sharing.

## Documentation

https://hesic73.github.io/gomoku_rl/
Pretrained models for a $15\times15$ board are available under `pretrained_models/15_15/`. Be aware that loading a model trained for a different board size fails because the network architectures do not match. In PPO, when `share_network=True`, the actor and the critic share an encoding module; at present, a `PPO` object with a shared encoder cannot load a checkpoint that was saved without sharing.
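
For intuition, here is a minimal PyTorch sketch of why such mismatches surface as loading errors. The toy linear layers below are stand-ins, not the project's actual networks:

```python
import torch
import torch.nn as nn

# Stand-in "policies" for two board sizes (NOT the real architectures).
policy_15x15 = nn.Linear(15 * 15, 15 * 15)
policy_9x9 = nn.Linear(9 * 9, 9 * 9)

torch.save(policy_15x15.state_dict(), "policy_15x15.pt")

try:
    # Loading a 15x15 checkpoint into a 9x9-sized network fails because the
    # parameter shapes do not match; a shared vs. non-shared encoder causes an
    # analogous mismatch in the module layout.
    policy_9x9.load_state_dict(torch.load("policy_15x15.pt"))
except RuntimeError as err:
    print(f"load_state_dict failed: {err}")
```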

## GUI

**Note: for deployment, we opt for `torch.jit.ScriptModule` instead of `torch.nn.Module`.** The `*.pt` files used in `scripts/train_*.py` are state dicts of a `torch.nn.Module` and cannot be directly utilized in this context.
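
A rough sketch of the conversion is shown below. The network definition and checkpoint filename are placeholders (the state dict only loads if the architecture matches the one used in training), not the repository's actual export procedure:

```python
import torch
import torch.nn as nn

# Placeholder policy network -- substitute the architecture actually used in
# scripts/train_*.py so that the state-dict keys and shapes line up.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 15 * 15, 15 * 15),
)
net.load_state_dict(torch.load("black_final.pt"))  # hypothetical state-dict checkpoint
net.eval()

# torch.jit.script produces a ScriptModule; the saved file can be loaded from
# C++ via torch::jit::load, independent of the Python class definition.
torch.jit.script(net).save("black_final_scripted.pt")
```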


In addition to `scripts/demo.py`, there is a standalone C++ GUI application. To compile the source code, make sure to have Qt5, Libtorch and cmake installed. Refer to [https://pytorch.org/cppdocs/installing.html](https://pytorch.org/cppdocs/installing.html) for instructions on how to install C++ distributions of Pytorch.
In addition to `scripts/demo.py`, there is a standalone C++ GUI application. To compile it, make sure Qt, LibTorch, and CMake are installed. Refer to [https://pytorch.org/cppdocs/installing.html](https://pytorch.org/cppdocs/installing.html) for instructions on installing the C++ distribution of PyTorch.

Here are the commands to build the executable:

@@ -83,19 +77,23 @@ cmake --build . --config Release
**PS**: If CMake cannot find Torch, try `set(Torch_DIR /absolute/path/to/libtorch/share/cmake/Torch)`.


## Supported Algorithms
## Algorithms

The framework currently implements PPO and DQN and is designed to make it straightforward to incorporate additional RL methods. For multi-agent training, it supports Independent RL and PSRO.

- PPO
- DQN
Notably, Independent RL has proved considerably more effective than PSRO. As mentioned in [[1]](#refer-anchor-1), because Gomoku is asymmetric, it is hard to train a single network to play both black and white.

(Maybe I need to tune hyperparameters for PSRO.)
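
To make the PSRO side a bit more concrete, below is a toy sketch of how a meta-solver can turn a population of past checkpoints into an opponent-sampling distribution. The `uniform` and `last_2` names mirror keys in `cfg/train_psro_sp.yaml`, but their exact semantics here are an assumption, not the project's implementation:

```python
import random

def opponent_distribution(population: list[str], solver: str) -> list[float]:
    """Toy meta-solvers over a population of checkpoint paths."""
    if solver == "uniform":
        # Every past policy is an equally likely opponent.
        return [1.0 / len(population)] * len(population)
    if solver == "last_2":
        # Assumed reading: only the two most recent policies get any mass.
        weights = [0.0] * len(population)
        for i in range(max(0, len(population) - 2), len(population)):
            weights[i] = 1.0
        return [w / sum(weights) for w in weights]
    raise ValueError(f"unknown meta solver: {solver}")

population = ["0.pt", "1.pt", "2.pt", "3.pt"]  # hypothetical checkpoint files
probs = opponent_distribution(population, "last_2")
opponent = random.choices(population, weights=probs, k=1)[0]
print(probs, opponent)  # e.g. [0.0, 0.0, 0.5, 0.5] 3.pt
```

A full PSRO setup would instead derive this distribution from an empirical payoff matrix (e.g. a Nash or uniform average over best responses); the sketch only shows where the meta-solver plugs in.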

## Details

Free-style Gomoku is a two-player zero-sum extensive-form game. The players alternately place black and white stones on the board, and the first to form an unbroken line of five or more stones of their own color wins. In the Multi-Agent Reinforcement Learning (MARL) setting, two agents learn competitively in the environment. On each turn, an agent's observation is the (encoded) current board state, and its action is the position at which to place a stone. We use action masking to prevent illegal moves. Winning yields a reward of +1, while losing incurs a penalty of -1.
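
The action-masking step can be illustrated with a short, self-contained sketch (an illustration only, not the environment's actual code; the encoding of 0 = empty cell is an assumption):

```python
import torch
from torch.distributions import Categorical

def sample_masked_action(logits: torch.Tensor, board: torch.Tensor) -> torch.Tensor:
    """Sample only legal moves for a batch of boards.

    logits: (B, N * N) raw policy logits over flattened board positions.
    board:  (B, N, N) with 0 for empty cells, non-zero for occupied cells.
    Returns a (B,) tensor of flat board indices.
    """
    legal = (board == 0).flatten(start_dim=1)                 # True where a stone may be placed
    masked_logits = logits.masked_fill(~legal, float("-inf"))  # illegal moves get probability 0
    return Categorical(logits=masked_logits).sample()

# Usage on a batch of 4 parallel 15x15 games with the center already occupied.
B, N = 4, 15
board = torch.zeros(B, N, N)
board[:, 7, 7] = 1.0
actions = sample_masked_action(torch.randn(B, N * N), board)
print(actions.shape)  # torch.Size([4])
```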

## Limitations
## TO DO

- Constrained to Free-style Gomoku support only.
- [x] Restructure the code to decouple rollout functionality from `GomokuEnv`.
- [ ] Enhance documentation.
- [ ] Further improvement

## References

2 changes: 1 addition & 1 deletion cfg/algo/ppo.yaml
@@ -17,7 +17,7 @@ share_network: true
optimizer:
name: adam
kwargs:
lr: 5e-4
lr: 1e-4


num_channels: 64
6 changes: 3 additions & 3 deletions cfg/train_InRL.yaml
@@ -1,5 +1,5 @@
seed: 0
num_envs: 1024
num_envs: 512
board_size: 15
device: cuda
out_device: cpu
@@ -12,8 +12,8 @@ steps: 128
save_interval: 300


black_checkpoint: black_final.pt # pretrained_models/${board_size}_${board_size}/${algo.name}/0.pt
white_checkpoint: white_final.pt # pretrained_models/${board_size}_${board_size}/${algo.name}/1.pt
black_checkpoint: pretrained_models/${board_size}_${board_size}/${algo.name}/0.pt
white_checkpoint: pretrained_models/${board_size}_${board_size}/${algo.name}/1.pt

wandb:
group: ${board_size}_${board_size}_${algo.name}_InRL
2 changes: 1 addition & 1 deletion cfg/train_psro.yaml
@@ -1,5 +1,5 @@
seed: 0
num_envs: 1024
num_envs: 512
board_size: 15
device: cuda
out_device: cpu
6 changes: 3 additions & 3 deletions cfg/train_psro_sp.yaml
@@ -1,5 +1,5 @@
seed: 0
num_envs: 1024
num_envs: 512
board_size: 15
device: cuda
out_device: cpu
@@ -12,15 +12,15 @@ steps: 128
save_interval: -1

# psro
meta_solver: uniform
meta_solver: last_2
mean_threshold: 0.99
std_threshold: 0.005
min_iter_steps: 30
max_iter_steps: 500

checkpoint: pretrained_models/${board_size}_${board_size}/${algo.name}/0.pt

population_dir: pretrained_models/${board_size}_${board_size}/${algo.name}
# population_dir: pretrained_models/${board_size}_${board_size}/${algo.name}

wandb:
group: ${board_size}_${board_size}_${algo.name}_sp
10 changes: 4 additions & 6 deletions docs/index.rst
@@ -56,14 +56,12 @@ After training, play Gomoku with your model using the `scripts/demo.py` script:
python scripts/demo.py
Pretrained models for a :math:`15\times15` board are available under `pretrained_models/15_15/`. Be aware that using the wrong model for the board size will lead to loading errors due to mismatches in AI architectures. In PPO, when `share_network=True`, the actor and the critic could utilize a shared encoding module. At present, a `PPOPolicy` object with a shared encoder cannot load from a checkpoint without sharing.



Pretrained models for a :math:`15\times15` board are available under `pretrained_models/15_15/`. Be aware that loading a model trained for a different board size fails because the network architectures do not match. In PPO, when `share_network=True`, the actor and the critic share an encoding module; at present, a `PPO` object with a shared encoder cannot load a checkpoint that was saved without sharing.

.. toctree::
:maxdepth: 2
:maxdepth: 4
:caption: Contents:

modules
gomoku_rl


30 changes: 26 additions & 4 deletions gomoku_rl/collector.py
@@ -259,6 +259,11 @@ def rollout(self, steps: int) -> tuple[TensorDict, dict]:
self._t,
) = self_play_step(self._env, self._policy, self._t_minus_1, self._t)

# truncate the last transition
if i == steps-2:
transition["next", "done"] = torch.ones(
transition["next", "done"].shape, dtype=torch.bool, device=transition.device)

if self._augment:
transition = augment_transition(transition)

@@ -310,10 +315,10 @@ def rollout(self, steps: int) -> tuple[TensorDict, TensorDict, dict]:
steps (int): The number of steps to execute in the environment for this rollout. It is adjusted to be an even number to ensure an equal number of actions for both players.
Returns:
tuple: A tuple containing three elements:
- A TensorDict of transitions collected for the black player, with each transition representing a game state before the black player's action, the action taken, and the resulting state.
- A TensorDict of transitions collected for the white player, structured similarly to the black player's transitions. Note that for the first step, the white player does not take an action, so their collection starts from the second step.
- A dictionary containing additional information about the rollout.
tuple: A tuple containing three elements:
- A TensorDict of transitions collected for the black player, with each transition representing a game state before the black player's action, the action taken, and the resulting state.
- A TensorDict of transitions collected for the white player, structured similarly to the black player's transitions. Note that for the first step, the white player does not take an action, so their collection starts from the second step.
- A dictionary containing additional information about the rollout.
"""

@@ -363,6 +368,13 @@ def rollout(self, steps: int) -> tuple[TensorDict, TensorDict, dict]:
self._t,
) = round(self._env, self._policy_black, self._policy_white, self._t_minus_1, self._t)

# truncate the last transition
if i == steps//2-1:
transition_black["next", "done"] = torch.ones(
transition_black["next", "done"].shape, dtype=torch.bool, device=transition_black.device)
transition_white["next", "done"] = torch.ones(
transition_white["next", "done"].shape, dtype=torch.bool, device=transition_white.device)

if self._augment:
transition_black = augment_transition(transition_black)
if i != 0:
@@ -468,6 +480,11 @@ def rollout(self, steps: int) -> tuple[TensorDict, dict]:
self._t,
) = round(self._env, self._policy_black, self._policy_white, self._t_minus_1, self._t, return_black_transitions=True, return_white_transitions=False)

# truncate the last transition
if i == steps//2-1:
transition_black["next", "done"] = torch.ones(
transition_black["next", "done"].shape, dtype=torch.bool, device=transition_black.device)

if self._augment:
transition_black = augment_transition(transition_black)

@@ -569,6 +586,11 @@ def rollout(self, steps: int) -> tuple[TensorDict, dict]:
self._t,
) = round(self._env, self._policy_black, self._policy_white, self._t_minus_1, self._t, return_black_transitions=False, return_white_transitions=True)

# truncate the last transition
if i == steps//2-1:
transition_white["next", "done"] = torch.ones(
transition_white["next", "done"].shape, dtype=torch.bool, device=transition_white.device)

if self._augment:
if i != 0 and len(transition_white) > 0:
transition_white = augment_transition(transition_white)
44 changes: 33 additions & 11 deletions gomoku_rl/policy/__init__.py
@@ -1,10 +1,9 @@
from .base import Policy
from .ppo import PPOPolicy
from .dqn import DQNPolicy
from .ppo import PPO
from .dqn import DQN

from torchrl.data.tensor_specs import DiscreteTensorSpec, TensorSpec
from omegaconf import DictConfig
from torch.cuda import _device_t
import torch


@@ -13,14 +12,23 @@ def get_policy(
cfg: DictConfig,
action_spec: DiscreteTensorSpec,
observation_spec: TensorSpec,
device: _device_t = "cuda",
device="cuda",
) -> Policy:
policies = {
"ppo": PPOPolicy,
"dqn": DQNPolicy,
}
assert name.lower() in policies
cls = policies[name.lower()]
"""
Retrieves a policy object based on the specified policy name, configuration, action and observation specifications, and device.
Args:
name (str): The name of the policy to retrieve, which should match a key in the Policy registry.
cfg (DictConfig): Configuration settings for the policy, typically containing hyperparameters and other policy-specific settings.
action_spec (DiscreteTensorSpec): The specification of the action space, defining the shape, type, and bounds of actions the policy can take.
observation_spec (TensorSpec): The specification of the observation space, defining the shape and type of observations the policy will receive from the environment.
device: The computing device ('cuda' or 'cpu') where the policy computations will be performed. Defaults to "cuda".
Returns:
Policy: An instance of the requested policy class, initialized with the provided configurations, action and observation specifications, and device.
"""

cls = Policy.REGISTRY[name.lower()]
return cls(
cfg=cfg,
action_spec=action_spec,
@@ -35,8 +43,22 @@ def get_pretrained_policy(
action_spec: DiscreteTensorSpec,
observation_spec: TensorSpec,
checkpoint_path: str,
device: _device_t = "cuda",
device="cuda",
) -> Policy:
"""
Initializes and returns a pretrained policy object based on the specified policy name, configuration, action and observation specifications, checkpoint path, and device.
Args:
name (str): The name of the policy to be loaded, corresponding to a key in the Policy registry.
cfg (DictConfig): Configuration settings for the policy, typically containing hyperparameters and other policy-specific settings.
action_spec (DiscreteTensorSpec): The specification of the action space, detailing the shape, type, and bounds of actions the policy can execute.
observation_spec (TensorSpec): The specification of the observation space, detailing the shape and type of observations the policy will receive from the environment.
checkpoint_path (str): The file path to the saved model checkpoint from which the policy's state should be loaded.
device: The computing device ('cuda' or 'cpu') on which the policy computations will be executed. Defaults to "cuda".
Returns:
Policy: An instance of the specified policy class, initialized with the provided configurations, action and observation specifications, and pretrained weights loaded from the given checkpoint path.
"""
policy = get_policy(
name=name,
cfg=cfg,
17 changes: 14 additions & 3 deletions gomoku_rl/policy/base.py
@@ -1,20 +1,31 @@
import abc
from typing import Dict
from typing import Dict, Type
from tensordict import TensorDict
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.tensor_specs import DiscreteTensorSpec, TensorSpec
from omegaconf import DictConfig


class Policy(abc.ABC):

REGISTRY: dict[str, Type["Policy"]] = {}

@classmethod
def __init_subclass__(cls, **kwargs):
if cls.__name__ in Policy.REGISTRY:
raise ValueError
super().__init_subclass__(**kwargs)
Policy.REGISTRY[cls.__name__] = cls
Policy.REGISTRY[cls.__name__.lower()] = cls

@abc.abstractmethod
def __init__(
self,
cfg: DictConfig,
action_spec: DiscreteTensorSpec,
observation_spec: TensorSpec,
device = "cuda",
) -> None:
device="cuda",
):
"""Initializes the policy.
Args:
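
As an aside, a tiny illustration of how the registry added above behaves once `gomoku_rl.policy` is imported (assuming the package is installed; subclasses are registered under both their class name and its lowercase form, which is what `get_policy` looks up):

```python
from gomoku_rl.policy import Policy, PPO, DQN

# Defining the PPO and DQN subclasses registers them automatically via
# Policy.__init_subclass__, so config values like algo.name: ppo resolve
# to the right class without a hand-maintained mapping.
assert Policy.REGISTRY["ppo"] is PPO
assert Policy.REGISTRY["DQN"] is DQN
print(sorted(Policy.REGISTRY))
```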
8 changes: 5 additions & 3 deletions gomoku_rl/policy/common.py
@@ -1,3 +1,6 @@
from typing import Generator


from torch.optim import Optimizer, Adam, AdamW
import torch
import torch.nn as nn
@@ -166,10 +169,9 @@ def make_ppo_ac(
return ActorValueOperator(common_module, policy_module, value_module)


def make_dataset_naive(tensordict: TensorDict, batch_size: int):
tensordict=tensordict.reshape(-1)
assert tensordict.shape[0] >= batch_size
def make_dataset_naive(tensordict: TensorDict, batch_size: int) -> Generator[TensorDict, None, None]:
tensordict = tensordict.reshape(-1)
assert tensordict.shape[0] >= batch_size
perm = torch.randperm(
(tensordict.shape[0] // batch_size) * batch_size,
device=tensordict.device,
2 changes: 1 addition & 1 deletion gomoku_rl/policy/dqn.py
@@ -36,7 +36,7 @@ def get_replay_buffer(
return buffer


class DQNPolicy(Policy):
class DQN(Policy):
def __init__(
self,
cfg: DictConfig,
2 changes: 1 addition & 1 deletion gomoku_rl/policy/ppo.py
@@ -20,7 +20,7 @@
)


class PPOPolicy(Policy):
class PPO(Policy):
def __init__(
self,
cfg: DictConfig,