- mlagents.trainers.trainer.on_policy_trainer
- mlagents.trainers.trainer.off_policy_trainer
- mlagents.trainers.trainer.rl_trainer
- mlagents.trainers.trainer.trainer
- mlagents.trainers.settings
class OnPolicyTrainer(RLTrainer)
The PPOTrainer is an implementation of the PPO algorithm.
| __init__(behavior_name: str, reward_buff_cap: int, trainer_settings: TrainerSettings, training: bool, load: bool, seed: int, artifact_path: str)
Responsible for collecting experiences and training an on-policy model.
Arguments:
behavior_name
: The name of the behavior associated with trainer config.
reward_buff_cap
: Max reward history to track in the reward buffer.
trainer_settings
: The parameters for the trainer.
training
: Whether the trainer is set for training.
load
: Whether the model should be loaded.
seed
: The seed the model will be initialized with.
artifact_path
: The directory within which to store artifacts from this trainer.
| add_policy(parsed_behavior_id: BehaviorIdentifiers, policy: Policy) -> None
Adds policy to trainer.
Arguments:
parsed_behavior_id
: Behavior identifiers that the policy should belong to.
policy
: Policy to associate with name_behavior_id.
class OffPolicyTrainer(RLTrainer)
The SACTrainer is an implementation of the SAC algorithm, with support for discrete actions and recurrent networks.
| __init__(behavior_name: str, reward_buff_cap: int, trainer_settings: TrainerSettings, training: bool, load: bool, seed: int, artifact_path: str)
Responsible for collecting experiences and training an off-policy model.
Arguments:
behavior_name
: The name of the behavior associated with trainer config.
reward_buff_cap
: Max reward history to track in the reward buffer.
trainer_settings
: The parameters for the trainer.
training
: Whether the trainer is set for training.
load
: Whether the model should be loaded.
seed
: The seed the model will be initialized with.
artifact_path
: The directory within which to store artifacts from this trainer.
| save_model() -> None
Saves the final training model to memory. Overrides the default to also save the replay buffer.
| save_replay_buffer() -> None
Save the training buffer's update buffer to a pickle file.
| load_replay_buffer() -> None
Loads the last saved replay buffer from a file.
| add_policy(parsed_behavior_id: BehaviorIdentifiers, policy: Policy) -> None
Adds policy to trainer.
class RLTrainer(Trainer)
This class is the base class for trainers that use Reward Signals.
| end_episode() -> None
A signal that the Episode has ended. The buffer must be reset. Only called when the academy resets.
| @abc.abstractmethod
| create_optimizer() -> TorchOptimizer
Creates an Optimizer object
| save_model() -> None
Saves the policy associated with this trainer.
| advance() -> None
Steps the trainer, taking in trajectories and updates if ready. Will block and wait briefly if there are no trajectories.
class Trainer(abc.ABC)
This class is the base class for the mlagents_envs.trainers
| __init__(brain_name: str, trainer_settings: TrainerSettings, training: bool, load: bool, artifact_path: str, reward_buff_cap: int = 1)
Responsible for collecting experiences and training a neural network model.
Arguments:
brain_name
: Brain name of brain to be trained.
trainer_settings
: The parameters for the trainer (dictionary).
training
: Whether the trainer is set for training.
artifact_path
: The directory within which to store artifacts from this trainer.
reward_buff_cap
: Max reward history to track in the reward buffer.
| @property
| stats_reporter()
Returns the stats reporter associated with this Trainer.
| @property
| parameters() -> TrainerSettings
Returns the trainer parameters of the trainer.
| @property
| get_max_steps() -> int
Returns the maximum number of steps. Is used to know when the trainer should be stopped.
Returns:
The maximum number of steps of the trainer
| @property
| get_step() -> int
Returns the number of steps the trainer has performed
Returns:
the step count of the trainer
| @property
| threaded() -> bool
Whether or not to run the trainer in a thread. True allows the trainer to update the policy while the environment is taking steps. Set to False to enforce strict on-policy updates (i.e. don't update the policy when taking steps.)
| @property
| should_still_train() -> bool
Returns whether or not the trainer should train. A Trainer could stop training if it wasn't training to begin with, or if max_steps is reached.
| @property
| reward_buffer() -> Deque[float]
Returns the reward buffer. The reward buffer contains the cumulative rewards of the most recent episodes completed by agents using this trainer.
Returns:
the reward buffer.
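As a rough illustration of how a capped reward history behaves, the sketch below uses a bounded `collections.deque`. It assumes that `reward_buff_cap` acts as the deque's `maxlen`; the helper `make_reward_buffer` is hypothetical and is not part of the documented API.

```python
from collections import deque
from typing import Deque


def make_reward_buffer(reward_buff_cap: int = 1) -> Deque[float]:
    """Hypothetical helper: a bounded history of cumulative episode rewards."""
    # Assumption: the cap is enforced with deque(maxlen=...), which silently
    # discards the oldest entry once the buffer is full.
    return deque(maxlen=reward_buff_cap)


reward_buffer = make_reward_buffer(reward_buff_cap=3)
for episode_reward in [1.0, 2.0, 3.0, 4.0]:
    reward_buffer.append(episode_reward)

# Only the three most recent episode rewards remain.
recent = list(reward_buffer)
```

With a cap of 3, the first appended reward is dropped as soon as the fourth one arrives.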
| @abc.abstractmethod
| save_model() -> None
Saves model file(s) for the policy or policies associated with this trainer.
| @abc.abstractmethod
| end_episode()
A signal that the Episode has ended. The buffer must be reset. Only called when the academy resets.
| @abc.abstractmethod
| create_policy(parsed_behavior_id: BehaviorIdentifiers, behavior_spec: BehaviorSpec) -> Policy
Creates a Policy object
| @abc.abstractmethod
| add_policy(parsed_behavior_id: BehaviorIdentifiers, policy: Policy) -> None
Adds policy to trainer.
| get_policy(name_behavior_id: str) -> Policy
Gets policy associated with name_behavior_id
Arguments:
name_behavior_id
: Fully qualified behavior name
Returns:
Policy associated with name_behavior_id
| @abc.abstractmethod
| advance() -> None
Advances the trainer. Typically, this means grabbing trajectories from all subscribed trajectory queues (self.trajectory_queues), and updating a policy using the steps in them, and if needed pushing a new policy onto the right policy queues (self.policy_queues).
| publish_policy_queue(policy_queue: AgentManagerQueue[Policy]) -> None
Adds a policy queue to the list of queues to publish to when this Trainer makes a policy update
Arguments:
policy_queue
: Policy queue to publish to.
| subscribe_trajectory_queue(trajectory_queue: AgentManagerQueue[Trajectory]) -> None
Adds a trajectory queue to the list of queues for the trainer to ingest Trajectories from.
Arguments:
trajectory_queue
: Trajectory queue to read from.
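The relationship between `subscribe_trajectory_queue`, `publish_policy_queue` and `advance` can be sketched with the standard-library `queue` module. This is a toy model only: `ToyTrainer` is a hypothetical stand-in, the "policy" is just an integer counter, and none of this is the ML-Agents implementation or its `AgentManagerQueue` type.

```python
import queue
from typing import List


class ToyTrainer:
    """Hypothetical stand-in illustrating the queue pattern described above."""

    def __init__(self) -> None:
        self.trajectory_queues: List[queue.Queue] = []
        self.policy_queues: List[queue.Queue] = []
        self.policy_version = 0  # toy "policy": total steps consumed so far

    def subscribe_trajectory_queue(self, trajectory_queue: queue.Queue) -> None:
        self.trajectory_queues.append(trajectory_queue)

    def publish_policy_queue(self, policy_queue: queue.Queue) -> None:
        self.policy_queues.append(policy_queue)

    def advance(self) -> None:
        updated = False
        # Drain every subscribed trajectory queue without blocking.
        for trajectory_queue in self.trajectory_queues:
            while True:
                try:
                    trajectory = trajectory_queue.get_nowait()
                except queue.Empty:
                    break
                # Toy "update": count the steps in the trajectory.
                self.policy_version += len(trajectory)
                updated = True
        # Push the new policy only if something changed.
        if updated:
            for policy_queue in self.policy_queues:
                policy_queue.put(self.policy_version)


trainer = ToyTrainer()
trajectories: queue.Queue = queue.Queue()
policies: queue.Queue = queue.Queue()
trainer.subscribe_trajectory_queue(trajectories)
trainer.publish_policy_queue(policies)

trajectories.put(["step0", "step1"])
trajectories.put(["step2"])
trainer.advance()
latest_policy = policies.get_nowait()
```

In this sketch a second call to `advance()` with no new trajectories publishes nothing, which mirrors the "if needed" wording in the description of `advance`.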
deep_update_dict(d: Dict, update_d: Mapping) -> None
Similar to dict.update(), but works for nested dicts of dicts as well.
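To make the difference from `dict.update()` concrete, here is a minimal sketch of a recursive update. The function `deep_update_sketch` and the sample config keys are illustrative assumptions, not the library's own code.

```python
from collections.abc import Mapping
from typing import Dict


def deep_update_sketch(d: Dict, update_d: Mapping) -> None:
    """Illustrative recursive update; mutates d in place."""
    for key, value in update_d.items():
        # Recurse only when both sides hold a nested mapping for this key.
        if isinstance(value, Mapping) and isinstance(d.get(key), dict):
            deep_update_sketch(d[key], value)
        else:
            d[key] = value


config = {
    "hyperparameters": {"batch_size": 64, "learning_rate": 3e-4},
    "max_steps": 1000,
}
deep_update_sketch(config, {"hyperparameters": {"batch_size": 128}})

# A plain dict.update() replaces the whole nested dict instead.
shallow = {"hyperparameters": {"batch_size": 64, "learning_rate": 3e-4}}
shallow.update({"hyperparameters": {"batch_size": 128}})
```

After the deep update, `learning_rate` is preserved next to the new `batch_size`, whereas the shallow update loses it.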
@attr.s(auto_attribs=True)
class RewardSignalSettings()
| @staticmethod
| structure(d: Mapping, t: type) -> Any
Helper method to structure a Dict of RewardSignalSettings class. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure(). This is needed to handle the special Enum selection of RewardSignalSettings classes.
@attr.s(auto_attribs=True)
class ParameterRandomizationSettings(abc.ABC)
| __str__() -> str
Helper method to output sampler stats to console.
| @staticmethod
| structure(d: Union[Mapping, float], t: type) -> "ParameterRandomizationSettings"
Helper method to structure a ParameterRandomizationSettings class. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure(). This is needed to handle the special Enum selection of ParameterRandomizationSettings classes.
| @staticmethod
| unstructure(d: "ParameterRandomizationSettings") -> Mapping
Helper method to unstructure a ParameterRandomizationSettings class. Meant to be registered with cattr.register_unstructure_hook() and called with cattr.unstructure().
| @abc.abstractmethod
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over EnvironmentParametersChannel. Calls the appropriate sampler type set method.
Arguments:
key
: Environment parameter to be sampled.
env_channel
: The EnvironmentParametersChannel used to communicate sampler settings to the environment.
@attr.s(auto_attribs=True)
class ConstantSettings(ParameterRandomizationSettings)
| __str__() -> str
Helper method to output sampler stats to console.
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over EnvironmentParametersChannel. Calls the constant sampler type set method.
Arguments:
key
: Environment parameter to be sampled.
env_channel
: The EnvironmentParametersChannel used to communicate sampler settings to the environment.
@attr.s(auto_attribs=True)
class UniformSettings(ParameterRandomizationSettings)
| __str__() -> str
Helper method to output sampler stats to console.
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over EnvironmentParametersChannel. Calls the uniform sampler type set method.
Arguments:
key
: Environment parameter to be sampled.
env_channel
: The EnvironmentParametersChannel used to communicate sampler settings to the environment.
@attr.s(auto_attribs=True)
class GaussianSettings(ParameterRandomizationSettings)
| __str__() -> str
Helper method to output sampler stats to console.
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over EnvironmentParametersChannel. Calls the gaussian sampler type set method.
Arguments:
key
: Environment parameter to be sampled.
env_channel
: The EnvironmentParametersChannel used to communicate sampler settings to the environment.
@attr.s(auto_attribs=True)
class MultiRangeUniformSettings(ParameterRandomizationSettings)
| __str__() -> str
Helper method to output sampler stats to console.
| apply(key: str, env_channel: EnvironmentParametersChannel) -> None
Helper method to send sampler settings over EnvironmentParametersChannel. Calls the multirangeuniform sampler type set method.
Arguments:
key
: Environment parameter to be sampled.
env_channel
: The EnvironmentParametersChannel used to communicate sampler settings to the environment.
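The sampler settings classes above share one pattern: an abstract `apply(key, env_channel)` that each subclass implements by calling the channel method matching its sampler type. The sketch below mimics that pattern with hypothetical names throughout. `RecordingChannel`, `set_constant`, `set_uniform`, `SamplerSketch`, `ConstantSketch` and `UniformSketch` are assumptions made for illustration, not the real EnvironmentParametersChannel API or the settings classes themselves.

```python
import abc
from dataclasses import dataclass
from typing import List, Tuple


class RecordingChannel:
    """Hypothetical channel that records the messages it would send."""

    def __init__(self) -> None:
        self.messages: List[Tuple] = []

    def set_constant(self, key: str, value: float) -> None:
        self.messages.append(("constant", key, value))

    def set_uniform(self, key: str, min_value: float, max_value: float) -> None:
        self.messages.append(("uniform", key, min_value, max_value))


class SamplerSketch(abc.ABC):
    """Hypothetical base class: each sampler knows how to send itself."""

    @abc.abstractmethod
    def apply(self, key: str, env_channel: RecordingChannel) -> None:
        ...


@dataclass
class ConstantSketch(SamplerSketch):
    value: float = 0.0

    def apply(self, key: str, env_channel: RecordingChannel) -> None:
        env_channel.set_constant(key, self.value)


@dataclass
class UniformSketch(SamplerSketch):
    min_value: float = 0.0
    max_value: float = 1.0

    def apply(self, key: str, env_channel: RecordingChannel) -> None:
        env_channel.set_uniform(key, self.min_value, self.max_value)


channel = RecordingChannel()
ConstantSketch(value=9.8).apply("gravity", channel)
UniformSketch(min_value=0.5, max_value=1.5).apply("mass", channel)
```

Callers only need the abstract `apply` interface, so adding a new sampler type does not change the code that iterates over environment parameters.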
@attr.s(auto_attribs=True)
class CompletionCriteriaSettings()
CompletionCriteriaSettings contains the information needed to figure out if the next lesson must start.
| need_increment(progress: float, reward_buffer: List[float], smoothing: float) -> Tuple[bool, float]
Given measures, this method returns a boolean indicating if the lesson needs to change now, and a float corresponding to the new smoothed value.
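One way to picture the "new smoothed value" in the return tuple is a reward-based check with exponential smoothing. The sketch below is an assumption-laden illustration: `need_increment_sketch`, `threshold`, `min_lesson_length`, `signal_smoothing` and `smoothing_weight` are hypothetical names and defaults, and the `progress` argument of the documented method is deliberately left out.

```python
from typing import List, Tuple


def need_increment_sketch(
    reward_buffer: List[float],
    smoothing: float,
    threshold: float,
    min_lesson_length: int = 0,
    signal_smoothing: bool = True,
    smoothing_weight: float = 0.25,
) -> Tuple[bool, float]:
    """Hypothetical reward-based lesson check; returns (change_lesson, smoothed)."""
    # Not enough completed episodes yet: keep the lesson and the old value.
    if len(reward_buffer) < max(min_lesson_length, 1):
        return False, smoothing
    measure = sum(reward_buffer) / len(reward_buffer)
    if signal_smoothing:
        # Blend the previous smoothed value with the current mean reward.
        measure = smoothing_weight * smoothing + (1 - smoothing_weight) * measure
        smoothing = measure
    return measure > threshold, smoothing


# First check: the smoothed value starts at 0.0, so it lags behind the mean.
changed_1, smoothed_1 = need_increment_sketch([1.0, 1.0], 0.0, threshold=0.8)
# Second check: feed the returned value back in as the new smoothing input.
changed_2, smoothed_2 = need_increment_sketch([1.0, 1.0], smoothed_1, threshold=0.8)
```

Feeding the returned float back into the next call is what lets the smoothed measure catch up gradually, so a single lucky episode is less likely to trigger a lesson change.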
@attr.s(auto_attribs=True)
class Lesson()
Gathers the data of one lesson for one environment parameter, including its name, the condition that must be fulfilled for the lesson to be completed, and a sampler for the environment parameter. If the completion_criteria is None, then this is the last lesson in the curriculum.
@attr.s(auto_attribs=True)
class EnvironmentParameterSettings()
EnvironmentParameterSettings is an ordered list of lessons for one environment parameter.
| @staticmethod
| structure(d: Mapping, t: type) -> Dict[str, "EnvironmentParameterSettings"]
Helper method to structure a Dict of EnvironmentParameterSettings class. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure().
@attr.s(auto_attribs=True)
class TrainerSettings(ExportableSettings)
| @staticmethod
| structure(d: Mapping, t: type) -> Any
Helper method to structure a TrainerSettings class. Meant to be registered with cattr.register_structure_hook() and called with cattr.structure().
@attr.s(auto_attribs=True)
class CheckpointSettings()
| prioritize_resume_init() -> None
Prioritize explicit command line resume/init over conflicting yaml options. If both resume and init are set in the same place, use resume.
@attr.s(auto_attribs=True)
class RunOptions(ExportableSettings)
| @staticmethod
| from_argparse(args: argparse.Namespace) -> "RunOptions"
Takes an argparse.Namespace as specified in parse_command_line, loads input configuration files from file paths, and converts to a RunOptions instance.
Arguments:
args
: collection of command-line parameters passed to mlagents-learn
Returns:
RunOptions representing the passed in arguments, with trainer config, curriculum and sampler configs loaded from files.