
This Wiki is merely the author's notebook rather than the "ultimate truth". Suggestions, remarks and corrections, raised as an Issue, are more than welcome!

Features of the model

The model demonstrates, among other things:

  • The environment is a bounded grid world and the agents move within a Von Neumann neighborhood.
  • At most one agent of the same type per cell
  • Three agent types: Predator, Prey and Grass
  • Two learning agent types: Predator and Prey, learning to move in a Von Neumann neighborhood
  • Learning agents have only partial observations of the model state; in the baseline, Prey can see farther than Predators
  • Predators and Prey learn behavior to avoid being eaten or starving to death
  • Predators and Prey lose energy due to movement and homeostasis (a minimal sketch of this energy bookkeeping follows the list)
  • Grass gains energy due to photosynthesis
  • Dynamically removing agents from the grid when eaten (Prey and Grass) or starving to death (Predator and Prey)
  • Dynamically adding agents to the grid when Predators or Prey gain sufficient energy to reproduce asexually
  • Optionally, a rectangular spawning area per agent type can be specified within the grid world and narrowed down
  • Part of the energy of the parent is transferred to the child if reproduction occurs
  • Grass is removed from the grid world after being eaten by prey, but regrows at the same spot after a certain number of steps
  • Episode ends when either all Predators or all Prey are dead
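To make the energy mechanics above concrete, the following is a minimal sketch of the per-step energy bookkeeping and asexual reproduction described in the list. The class, constant names and values are illustrative assumptions, not the repository's actual implementation.

    from dataclasses import dataclass
    from typing import Optional

    # Illustrative constants; names and values are assumptions, not taken
    # from the repository.
    MOVE_COST = 0.1              # energy lost per step by Predators and Prey
    PHOTOSYNTHESIS_GAIN = 0.2    # energy gained per step by Grass
    REPRODUCTION_THRESHOLD = 10.0
    CHILD_ENERGY_FRACTION = 0.5  # share of the parent's energy given to the child

    @dataclass
    class SimpleAgent:
        agent_type: str          # "predator", "prey" or "grass"
        energy: float
        alive: bool = True

    def step_energy(agent: SimpleAgent) -> None:
        """Apply the per-step energy change for a single agent."""
        if agent.agent_type in ("predator", "prey"):
            agent.energy -= MOVE_COST        # movement and homeostasis
            if agent.energy <= 0:
                agent.alive = False          # starved to death
        else:                                # grass
            agent.energy += PHOTOSYNTHESIS_GAIN

    def maybe_reproduce(parent: SimpleAgent) -> Optional[SimpleAgent]:
        """Asexual reproduction once the parent has gathered enough energy."""
        if parent.agent_type in ("predator", "prey") and parent.energy >= REPRODUCTION_THRESHOLD:
            child_energy = parent.energy * CHILD_ENERGY_FRACTION
            parent.energy -= child_energy    # part of the parent's energy moves to the child
            return SimpleAgent(parent.agent_type, child_energy)
        return None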

Design restrictions and workarounds for the environment and the PPO algorithm

The AEC Environment Architecture

Due to unexpected behavior when agents terminate during a simulation in PettingZoo AEC (https://github.com/Farama-Foundation/PettingZoo/issues/713), the architecture was modified. The 'AECEnv.agents' array remains unchanged after agent death or creation; removal and creation of agents is instead managed through 'PredPreyGrass.predator_instance_list' and 'PredPreyGrass.prey_instance_list'. In addition, the active status of every agent is tracked by its boolean attribute 'alive'. Optionally, a number of agents have 'alive' set to False at reset, which leaves room for the creation of agents at run time. When such agents are created, they are (re)added to the appropriate instance list.

This architecture avoids the unexpected behavior of individual agents terminating during simulation in the standard PettingZoo API and circumvents the PPO algorithm's requirement that the number of agents remains unchanged during training. In that sense it is comparable to SuperSuit's "Black Death" wrapper.
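As a rough illustration of this fixed-roster approach, the sketch below keeps the full set of agents unchanged and only toggles an 'alive' flag and membership of an instance list. The helper names ('possible_predators', 'kill', 'spawn') are assumptions for illustration; the repository uses 'PredPreyGrass.predator_instance_list' and 'PredPreyGrass.prey_instance_list'.

    # Sketch of the fixed-roster pattern described above (predators only, for
    # brevity); helper names are illustrative, not the repository's API.
    class DiscreteAgent:
        def __init__(self, agent_name):
            self.agent_name = agent_name
            self.alive = True            # toggled instead of removing the agent

    class PredPreyGrassSketch:
        def __init__(self, n_possible_predators, n_initial_predators):
            # The full roster is created once and never shrinks or grows.
            self.possible_predators = [
                DiscreteAgent(f"predator_{i}") for i in range(n_possible_predators)
            ]
            # Agents beyond the initial number start "dead", leaving room for
            # agents to be created (born) during the episode.
            for agent in self.possible_predators[n_initial_predators:]:
                agent.alive = False
            # Only active agents are kept in the instance list that the step
            # logic iterates over.
            self.predator_instance_list = [
                a for a in self.possible_predators if a.alive
            ]

        def kill(self, agent):
            agent.alive = False
            self.predator_instance_list.remove(agent)

        def spawn(self):
            for agent in self.possible_predators:
                if not agent.alive:
                    agent.alive = True
                    self.predator_instance_list.append(agent)
                    return agent
            return None                  # roster exhausted: no inactive slot left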

Restrictions for Multi Agent Reinforcement Learning and the PPO Algorithm

MarkovVectorEnv does not support environments with a varying number of active agents. Environments following the Farama Gymnasium interface, a common standard, advance their state through the step method. The step method takes an action as input and returns five values: the new observation (state), the reward, a boolean indicating whether the agent has terminated, a boolean indicating whether the episode was truncated, and additional info. The order of these return values is fixed, so the caller knows that, for example, the second value is the reward.

Here's a simplified example:

observation, reward, terminated, truncated, info = env.step(action)

In this line of code, reward is the reward for the action taken. The variable name doesn't matter; what matters is the position of the returned value in the tuple.

So, the PPO algorithm (or any reinforcement learning algorithm) doesn't need to know the variable name in the environment that represents the reward. It just needs to know the structure of the data returned by the environment's step method.
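As a small illustration (assuming the Gymnasium five-tuple API; the function and returned transition layout below are hypothetical), a training loop can unpack the step result purely by position and store whatever it needs for PPO:

    # Hypothetical collection step: only the position of each returned value
    # matters, not the variable names used inside the environment.
    def collect_transition(env, policy, observation):
        action = policy(observation)
        next_obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated   # PPO only needs an end-of-episode flag
        return (observation, action, reward, done, next_obs)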

Environment workarounds

ad 1. Implement an overall max_observation_range and a specific (smaller) observation range per agent by masking ("zero-ing") all non-observable cells. Note that setting max_observation_range needlessly high wastes computation time.
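A possible implementation of this masking, assuming square observation windows centred on the agent with an odd number of cells per side (the function name and array layout are assumptions, not the repository's API):

    import numpy as np

    def mask_observation(obs, max_observation_range, observation_range):
        """Zero all cells outside the agent-specific observation range.

        `obs` is assumed to be a (channels, max_observation_range,
        max_observation_range) array centred on the agent, with both ranges odd.
        """
        mask = np.zeros_like(obs)
        offset = (max_observation_range - observation_range) // 2
        mask[:, offset:offset + observation_range,
                offset:offset + observation_range] = 1
        return obs * mask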

ad 2. Implement an overall maximum observation space. In this case, a specific observation channel can have its upper bound set to max(n_predators, n_prey, n_max). Note that this is not yet implemented; all learning agents (Predator and Prey) currently have the same action movements.

ad 3. Implement an overall (maximal) action_range and (heavily) penalize actions that are normally prohibited. Note that setting the action_range needlessly high wastes computation time.
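A minimal sketch of this workaround (the penalty value and helper names are assumptions): every agent shares the same maximal action space, and actions not allowed for a given agent type are left unexecuted and penalized.

    # Illustrative only: one shared (maximal) action space for all agents.
    INVALID_ACTION_PENALTY = -10.0     # assumed value, not from the repository

    def apply_action(agent, action, allowed_actions, move):
        """Execute `action` if allowed for this agent type, else penalize."""
        if action not in allowed_actions[agent.agent_type]:
            return INVALID_ACTION_PENALTY   # discourages choosing this action
        move(agent, action)                 # hypothetical movement helper
        return 0.0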

ad 4. At reset, a fixed number of agents is initialized and remains constant. However, agents are either Active ("alive") or Inactive ("dead", or not yet "born"), which is checked at the beginning of the step function; Inactive agents are not changed during a step. Inactive agents are handled similarly to the "Black Death" wrapper. [The standard PettingZoo procedure of removing agents from the 'self.agents' array cannot be used; compare PettingZoo's Knights-Archer-Zombies environment, whose documentation states: "This environment allows agents to spawn and die, so it requires using SuperSuit’s Black Death wrapper, which provides blank observations to dead agents rather than removing them from the environment."]

Workaround used: maintain the self.agents array from creation onwards and implement an "alive" boolean. At death:

  • remove the agent from the agent layer, so other agents can no longer observe the dead agent;
  • set all relevant values (observations, energy level) to zero.
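A sketch of how the step logic can skip inactive agents while still returning blank observations and zero reward, analogous to the "Black Death" wrapper (attribute and method names are illustrative):

    import numpy as np

    def step_one_agent(env, agent, action):
        """Per-agent step logic; inactive ("dead") agents are left untouched."""
        if not agent.alive:
            # Blank observation and zero reward for dead or not-yet-born agents.
            env.observations[agent.agent_name] = np.zeros(env.observation_shape)
            env.rewards[agent.agent_name] = 0.0
            return
        env.apply_action(agent, action)                 # move / eat / reproduce
        env.rewards[agent.agent_name] = env.compute_reward(agent)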

PPO in MARL

Proximal Policy Optimization (PPO) itself does not inherently distinguish between different types of agents in a multi-agent reinforcement learning (MARL) scenario. PPO is a policy-optimization algorithm that can be applied in environments with multiple agents, but it treats each agent as an independent learner. Each agent maintains its own policy and updates it from its own observations, actions and rewards; PPO optimizes each policy to maximize that agent's individual expected cumulative reward. In other words, PPO simply adapts to whatever observation and reward representations the environment provides and learns a policy for each agent from that agent's individual experience.
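Conceptually, this independent-learner setup can be pictured as one policy and one experience buffer per agent; the 'PPOLearner' class below is a hypothetical stand-in, not a specific library's API.

    # Conceptual sketch: one independent PPO learner per agent.
    class PPOLearner:
        def __init__(self, agent_name):
            self.agent_name = agent_name
            self.buffer = []                     # this agent's own experience

        def store(self, obs, action, reward, done):
            self.buffer.append((obs, action, reward, done))

        def update(self):
            # A real learner would compute advantages and apply the clipped
            # PPO objective here; this sketch only clears the buffer.
            self.buffer.clear()

    # Each agent's transitions are routed to its own learner, and each learner
    # updates its own policy from its own experience only.
    learners = {name: PPOLearner(name) for name in ("predator_0", "prey_0")}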