Confusing results in simple spread environment #236

Destiny000621 · 2024-05-26T18:01:17Z

I ran the COMA, HATRPO, and MAPPO algorithms in the Simple Spread environment for 500,000 timesteps each. None of them reached a reward higher than -100, whereas most rewards in the results folder fall in the range of -30 to -40. After training, the reward is even lower than it was at the start. I used the same model parameters as the ones in the results folder.

```python
from marllib import marl

# prepare env
env = marl.make_env(environment_name="mpe", map_name="simple_spread", force_coop=True)

# initialize algorithm with appointed hyper-parameters
coma = marl.algos.coma(hyperparam_source='mpe')

# build agent model based on env + algorithms + user preference
model = marl.build_model(env, coma, {"core_arch": "gru", "encode_layer": "128-256"})

# start training
coma.fit(env, model, stop={'timesteps_total': 500000}, share_policy='group', checkpoint_freq=100000, checkpoint_end=True)
```


florin-pop · 2024-06-24T13:37:32Z

Are you plotting episode_reward_mean or episode_reward_max? I suspect that the "reward" in the results CSV is ray/tune/episode_reward_max, but I may be wrong.
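One way to check this is to compare the mean and max reward columns of the same run. Below is a minimal sketch, assuming training was logged through Ray Tune with the usual RLlib column names `episode_reward_mean` and `episode_reward_max` in the trial's `progress.csv`; the helpers `summarize_rewards` and `summarize_progress_csv` are hypothetical, not part of MARLlib. Since the max is the best single episode in each reporting window, it will generally sit well above the mean, which could account for a gap like -100 vs. -30 to -40.

```python
import csv
from typing import Dict, Iterable, List


def summarize_rewards(rows: Iterable[Dict[str, str]]) -> Dict[str, float]:
    """Return the first and last logged mean/max episode rewards.

    Assumes Ray Tune / RLlib style columns "episode_reward_mean" and
    "episode_reward_max"; rows where either is empty (e.g. no finished
    episodes yet) are skipped.
    """
    means: List[float] = []
    maxes: List[float] = []
    for row in rows:
        mean = row.get("episode_reward_mean")
        best = row.get("episode_reward_max")
        if not mean or not best:  # covers missing keys, None and ""
            continue
        means.append(float(mean))
        maxes.append(float(best))
    if not means:
        raise ValueError("no rows with both episode_reward_mean and episode_reward_max")
    return {
        "first_mean": means[0],
        "last_mean": means[-1],
        "first_max": maxes[0],
        "last_max": maxes[-1],
    }


def summarize_progress_csv(path: str) -> Dict[str, float]:
    # Hypothetical convenience wrapper: read a progress.csv written by Ray Tune.
    with open(path, newline="") as f:
        return summarize_rewards(csv.DictReader(f))
```

If the numbers in the results folder line up with `last_max` rather than `last_mean` from your own run, the two are simply reporting different statistics.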
