Return deterministic actions #5597

Merged
3 changes: 3 additions & 0 deletions com.unity.ml-agents/CHANGELOG.md
@@ -28,6 +28,9 @@ and this project adheres to
1. env_params.max_lifetime_restarts (--max-lifetime-restarts) [default=10]
2. env_params.restarts_rate_limit_n (--restarts-rate-limit-n) [default=1]
3. env_params.restarts_rate_limit_period_s (--restarts-rate-limit-period-s) [default=60]

- Added a new `--deterministic` CLI flag to deterministically select the most probable action in the policy. The same behavior can
be achieved by adding `deterministic: true` under `network_settings` in the run options configuration.
### Bug Fixes
- Fixed the bug where curriculum learning would crash because of the incorrect run_options parsing. (#5586)

1 change: 1 addition & 0 deletions docs/Training-Configuration-File.md
@@ -44,6 +44,7 @@ choice of the trainer (which we review on subsequent sections).
| `network_settings -> normalize` | (default = `false`) Whether normalization is applied to the vector observation inputs. This normalization is based on the running average and variance of the vector observation. Normalization can be helpful in cases with complex continuous control problems, but may be harmful with simpler discrete control problems. |
| `network_settings -> vis_encode_type` | (default = `simple`) Encoder type for encoding visual observations. <br><br> `simple` (default) uses a simple encoder which consists of two convolutional layers, `nature_cnn` uses the CNN implementation proposed by [Mnih et al.](https://www.nature.com/articles/nature14236), consisting of three convolutional layers, and `resnet` uses the [IMPALA Resnet](https://arxiv.org/abs/1802.01561) consisting of three stacked layers, each with two residual blocks, making a much larger network than the other two. `match3` is a smaller CNN ([Gudmundsoon et al.](https://www.researchgate.net/publication/328307928_Human-Like_Playtesting_with_Deep_Learning)) that can capture more granular spatial relationships and is optimized for board games. `fully_connected` uses a single fully connected dense layer as encoder without any convolutional layers. <br><br> Due to the size of convolution kernel, there is a minimum observation size limitation that each encoder type can handle - `simple`: 20x20, `nature_cnn`: 36x36, `resnet`: 15 x 15, `match3`: 5x5. `fully_connected` doesn't have convolutional layers and thus no size limits, but since it has less representation power it should be reserved for very small inputs. Note that using the `match3` CNN with very large visual input might result in a huge observation encoding and thus potentially slow down training or cause memory issues. |
| `network_settings -> conditioning_type` | (default = `hyper`) Conditioning type for the policy using goal observations. <br><br> `none` treats the goal observations as regular observations, `hyper` (default) uses a HyperNetwork with goal observations as input to generate some of the weights of the policy. Note that when using `hyper` the number of parameters of the network increases greatly. Therefore, it is recommended to reduce the number of `hidden_units` when using this `conditioning_type`
| `network_settings -> deterministic` | (default = `false`) When set to `true`, actions are selected deterministically from the model's output, making results predictable and reproducible. This can be overridden by the `--deterministic` flag on the CLI. |


## Trainer-specific Configurations
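For reference, a minimal sketch of how a `deterministic: true` entry in the run options flows into the parsed settings, following the same pattern as the `test_settings.py` changes further down; the behavior name `MyBehavior` is hypothetical:

```python
# Minimal sketch: parse a run-options dict (as would be loaded from YAML) and check the
# deterministic flag on the resulting network settings. "MyBehavior" is a made-up name.
from mlagents.trainers.settings import RunOptions

run_options_dict = {
    "behaviors": {
        "MyBehavior": {
            "network_settings": {"deterministic": True},  # same effect as --deterministic
        }
    }
}
run_options = RunOptions.from_dict(run_options_dict)
assert run_options.behaviors["MyBehavior"].network_settings.deterministic is True
```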
7 changes: 7 additions & 0 deletions ml-agents/mlagents/trainers/cli_utils.py
@@ -91,6 +91,13 @@ def _create_parser() -> argparse.ArgumentParser:
"before resuming training. This option is only valid when the models exist, and have the same "
"behavior names as the current agents in your scene.",
)
argparser.add_argument(
"--deterministic",
default=False,
dest="deterministic",
action=DetectDefaultStoreTrue,
help="Whether to select actions deterministically in policy. `dist.mean` for continuous action space, and `dist.argmax` for deterministic action space ",
)
argparser.add_argument(
"--force",
default=False,
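The `DetectDefaultStoreTrue` action records that the flag was passed explicitly, so only explicitly set CLI arguments override the YAML configuration later. A simplified, self-contained stand-in (not the library's actual implementation; `TrackingStoreTrue` and `explicitly_set` are illustrative names) that shows the idea:

```python
# Simplified analogue of DetectDefaultStoreTrue: store True and remember that the flag
# was passed explicitly on the command line.
import argparse

explicitly_set = set()  # stand-in for DetectDefault.non_default_args

class TrackingStoreTrue(argparse.Action):
    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, True)
        explicitly_set.add(self.dest)

argparser = argparse.ArgumentParser()
argparser.add_argument(
    "--deterministic", default=False, dest="deterministic",
    action=TrackingStoreTrue, nargs=0,
)

args = argparser.parse_args(["--deterministic"])
assert args.deterministic is True and "deterministic" in explicitly_set
```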
13 changes: 13 additions & 0 deletions ml-agents/mlagents/trainers/settings.py
@@ -151,6 +151,7 @@ def _check_valid_memory_size(self, attribute, value):
vis_encode_type: EncoderType = EncoderType.SIMPLE
memory: Optional[MemorySettings] = None
goal_conditioning_type: ConditioningType = ConditioningType.HYPER
deterministic: bool = parser.get_default("deterministic")


@attr.s(auto_attribs=True)
@@ -928,9 +929,11 @@ def from_argparse(args: argparse.Namespace) -> "RunOptions":
key
)
)

# Override with CLI args
# Keep deprecated --load working, TODO: remove
argparse_args["resume"] = argparse_args["resume"] or argparse_args["load_model"]

for key, val in argparse_args.items():
if key in DetectDefault.non_default_args:
if key in attr.fields_dict(CheckpointSettings):
@@ -950,6 +953,16 @@ def from_argparse(args: argparse.Namespace) -> "RunOptions":
if isinstance(final_runoptions.behaviors, TrainerSettings.DefaultTrainerDict):
# configure whether or not we should require all behavior names to be found in the config YAML
final_runoptions.behaviors.set_config_specified(_require_all_behaviors)

_non_default_args = DetectDefault.non_default_args

# Prioritize the --deterministic CLI flag over the config file setting for deterministic actions.
if "deterministic" in _non_default_args:
for behaviour in final_runoptions.behaviors.keys():
final_runoptions.behaviors[
behaviour
].network_settings.deterministic = argparse_args["deterministic"]

return final_runoptions

@staticmethod
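A sketch of the precedence rule implemented in `from_argparse` above: when `--deterministic` is among the explicitly passed CLI arguments, it overrides the per-behavior YAML values. The behavior names here are hypothetical:

```python
# Sketch of the CLI-over-config precedence; mirrors the loop in from_argparse.
from mlagents.trainers.settings import RunOptions

run_options = RunOptions.from_dict({
    "behaviors": {
        "BehaviorA": {"network_settings": {"deterministic": False}},
        "BehaviorB": {"network_settings": {"deterministic": False}},
    }
})

# What from_argparse does when "deterministic" is in DetectDefault.non_default_args:
for name in run_options.behaviors.keys():
    run_options.behaviors[name].network_settings.deterministic = True

assert all(b.network_settings.deterministic for b in run_options.behaviors.values())
```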
8 changes: 7 additions & 1 deletion ml-agents/mlagents/trainers/tests/test_settings.py
@@ -389,6 +389,7 @@ def test_exportable_settings(use_defaults):
init_entcoef: 0.5
reward_signal_steps_per_update: 10.0
network_settings:
deterministic: true
normalize: false
hidden_units: 256
num_layers: 3
@@ -528,7 +529,10 @@ def test_environment_settings():

def test_default_settings():
# Make default settings, one nested and one not.
default_settings = {"max_steps": 1, "network_settings": {"num_layers": 1000}}
default_settings = {
"max_steps": 1,
"network_settings": {"num_layers": 1000, "deterministic": True},
}
behaviors = {"test1": {"max_steps": 2, "network_settings": {"hidden_units": 2000}}}
run_options_dict = {"default_settings": default_settings, "behaviors": behaviors}
run_options = RunOptions.from_dict(run_options_dict)
@@ -541,7 +545,9 @@ def test_default_settings():
test1_settings = run_options.behaviors["test1"]
assert test1_settings.max_steps == 2
assert test1_settings.network_settings.hidden_units == 2000
assert test1_settings.network_settings.deterministic is True
assert test1_settings.network_settings.num_layers == 1000

# Change the overridden fields back, and check if the rest are equal.
test1_settings.max_steps = 1
test1_settings.network_settings.hidden_units == default_settings_cls.network_settings.hidden_units
34 changes: 32 additions & 2 deletions ml-agents/mlagents/trainers/tests/torch/test_action_model.py
@@ -11,10 +11,10 @@
from mlagents_envs.base_env import ActionSpec


def create_action_model(inp_size, act_size):
def create_action_model(inp_size, act_size, deterministic=False):
mask = torch.ones([1, act_size * 2])
action_spec = ActionSpec(act_size, tuple(act_size for _ in range(act_size)))
action_model = ActionModel(inp_size, action_spec)
action_model = ActionModel(inp_size, action_spec, deterministic=deterministic)
return action_model, mask


@@ -43,6 +43,36 @@ def test_sample_action():
assert _disc.shape == (1, 1)


def test_deterministic_sample_action():
inp_size = 4
act_size = 2
action_model, masks = create_action_model(inp_size, act_size, deterministic=True)
sample_inp = torch.ones((1, inp_size))
dists = action_model._get_dists(sample_inp, masks=masks)
agent_action1 = action_model._sample_action(dists)
agent_action2 = action_model._sample_action(dists)
agent_action3 = action_model._sample_action(dists)
assert torch.equal(agent_action1.continuous_tensor, agent_action2.continuous_tensor)

[Review comment from a Contributor]: some tests on discrete actions would be great!

assert torch.equal(agent_action1.continuous_tensor, agent_action3.continuous_tensor)
assert torch.equal(agent_action1.discrete_tensor, agent_action2.discrete_tensor)
assert torch.equal(agent_action1.discrete_tensor, agent_action3.discrete_tensor)

action_model, masks = create_action_model(inp_size, act_size, deterministic=False)
sample_inp = torch.ones((1, inp_size))
dists = action_model._get_dists(sample_inp, masks=masks)
agent_action1 = action_model._sample_action(dists)
agent_action2 = action_model._sample_action(dists)
agent_action3 = action_model._sample_action(dists)
assert not torch.equal(
agent_action1.continuous_tensor, agent_action2.continuous_tensor
)
assert not torch.equal(
agent_action1.continuous_tensor, agent_action3.continuous_tensor
)
assert not torch.equal(agent_action1.discrete_tensor, agent_action2.discrete_tensor)
assert not torch.equal(agent_action1.discrete_tensor, agent_action3.discrete_tensor)


def test_get_probs_and_entropy():
inp_size = 4
act_size = 2
18 changes: 15 additions & 3 deletions ml-agents/mlagents/trainers/torch/action_model.py
@@ -32,6 +32,7 @@ def __init__(
action_spec: ActionSpec,
conditional_sigma: bool = False,
tanh_squash: bool = False,
deterministic: bool = False,

[Review comment from a Contributor]: please update the docstring

):
"""
A torch module that represents the action space of a policy. The ActionModel may contain
Expand All @@ -43,6 +44,7 @@ def __init__(
:params action_spec: The ActionSpec defining the action space dimensions and distributions.
:params conditional_sigma: Whether or not the std of a Gaussian is conditioned on state.
:params tanh_squash: Whether to squash the output of a Gaussian with the tanh function.
:params deterministic: Whether to select actions deterministically in policy.
"""
super().__init__()
self.encoding_size = hidden_size
Expand All @@ -66,22 +68,32 @@ def __init__(
# During training, clipping is done in TorchPolicy, but we need to clip before ONNX
# export as well.
self._clip_action_on_export = not tanh_squash
self._deterministic = deterministic

def _sample_action(self, dists: DistInstances) -> AgentAction:
"""
Samples actions from a DistInstances tuple
:params dists: The DistInstances tuple
:return: An AgentAction corresponding to the actions sampled from the DistInstances
"""

continuous_action: Optional[torch.Tensor] = None
discrete_action: Optional[List[torch.Tensor]] = None
# This checks None because mypy complains otherwise
if dists.continuous is not None:
continuous_action = dists.continuous.sample()
if self._deterministic:
continuous_action = dists.continuous.deterministic_sample()
else:
continuous_action = dists.continuous.sample()
if dists.discrete is not None:
discrete_action = []
for discrete_dist in dists.discrete:
discrete_action.append(discrete_dist.sample())
if self._deterministic:
for discrete_dist in dists.discrete:
discrete_action.append(discrete_dist.deterministic_sample())
else:
for discrete_dist in dists.discrete:
discrete_action.append(discrete_dist.sample())
return AgentAction(continuous_action, discrete_action)

def _get_dists(self, inputs: torch.Tensor, masks: torch.Tensor) -> DistInstances:
13 changes: 13 additions & 0 deletions ml-agents/mlagents/trainers/torch/distributions.py
@@ -16,6 +16,13 @@ def sample(self) -> torch.Tensor:
"""
pass

@abc.abstractmethod
def deterministic_sample(self) -> torch.Tensor:
"""
Return the most probable sample from this distribution.
"""
pass

@abc.abstractmethod
def log_prob(self, value: torch.Tensor) -> torch.Tensor:
"""
@@ -59,6 +66,9 @@ def sample(self):
sample = self.mean + torch.randn_like(self.mean) * self.std
return sample

def deterministic_sample(self):
return self.mean

def log_prob(self, value):
var = self.std ** 2
log_scale = torch.log(self.std + EPSILON)
@@ -113,6 +123,9 @@ def __init__(self, logits):
def sample(self):
return torch.multinomial(self.probs, 1)

def deterministic_sample(self):
return torch.argmax(self.probs, dim=1, keepdim=True)

def pdf(self, value):
# This function is equivalent to torch.diag(self.probs.T[value.flatten().long()]),
# but torch.diag is not supported by ONNX export.
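In plain PyTorch terms (not the `DistInstance` classes themselves), the difference between the stochastic and deterministic sampling paths added above looks roughly like this:

```python
# Illustrative sketch of stochastic vs. deterministic sampling, using plain torch ops.
import torch

# Continuous (Gaussian): sample() adds scaled noise; deterministic_sample() is the mean.
mean, std = torch.zeros(1, 2), torch.ones(1, 2)
stochastic_continuous = mean + torch.randn_like(mean) * std
deterministic_continuous = mean

# Discrete (categorical): sample() draws from the probabilities;
# deterministic_sample() takes the argmax, i.e. the most probable action.
probs = torch.tensor([[0.1, 0.7, 0.2]])
stochastic_discrete = torch.multinomial(probs, 1)
deterministic_discrete = torch.argmax(probs, dim=1, keepdim=True)
assert deterministic_discrete.item() == 1  # index of the highest probability
```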
1 change: 1 addition & 0 deletions ml-agents/mlagents/trainers/torch/networks.py
@@ -617,6 +617,7 @@ def __init__(
action_spec,
conditional_sigma=conditional_sigma,
tanh_squash=tanh_squash,
deterministic=network_settings.deterministic,
)

@property