
[Bug] Inconsistencies with EvalCallback tensorboard logs. Logs are written at timesteps different from the ones at which they were recorded. #457

Closed
skandermoalla-Q opened this issue Jun 2, 2021 · 4 comments · Fixed by #492
Labels
bug Something isn't working

Comments

skandermoalla-Q commented Jun 2, 2021

🐛 Bug

EvalCallback records logs that can be used with TensorBoard, but does not dump them.

These logs stay in the logger's memory and are only written when the Algorithm dumps its rollout logs during training. However, this does not necessarily happen at the timestep at which EvalCallback recorded its logs, so those evaluation logs end up being written with the wrong timestep.

Synchronizing EvalCallback recordings with the Algorithm's log dumping is sometimes not possible: EvalCallback records the evaluation every n timesteps, while some algorithms, like DQN, dump their rollout logs every n episodes.

Moreover, even with PPO, which dumps its rollout logs every n rollout steps, setting EvalCallback to evaluate every n rollout steps leads to the last evaluation logs (when using a StopTrainingOnRewardThreshold callback) not being recorded, as the rollout is interrupted.
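For concreteness, a configuration like the following would run into that interrupted-rollout case (a minimal sketch with illustrative values, not taken from the report below):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# eval_freq is aligned with one PPO rollout (n_steps with a single env), so evaluation
# and rollout logging would normally coincide, but StopTrainingOnRewardThreshold can
# interrupt training before the rollout logs (and the pending eval logs) are dumped.
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=475, verbose=1)
eval_callback = EvalCallback(gym.make('CartPole-v1'), eval_freq=2048,
                             callback_on_new_best=stop_callback)
model = PPO('MlpPolicy', 'CartPole-v1', n_steps=2048, tensorboard_log="./tb_logs", verbose=1)
model.learn(100_000, callback=eval_callback)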

To Reproduce

import gym
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import EvalCallback

eval_env = gym.make('CartPole-v1')
eval_callback = EvalCallback(eval_env)

model = DQN('MlpPolicy', 'CartPole-v1', tensorboard_log="./tb_logs", verbose=1)
model.learn(50000, callback=eval_callback)
...

Eval num_timesteps=10000, episode_reward=9.40 +/- 0.80
Episode length: 9.40 +/- 0.80
New best mean reward!
----------------------------------
| eval/               |          |
|    mean_ep_length   | 9.4      |
|    mean_reward      | 9.4      |
| rollout/            |          |
|    ep_len_mean      | 22.9     |
|    ep_rew_mean      | 22.9     |
|    exploration rate | 0.05     |
| time/               |          |
|    episodes         | 444      |
|    fps              | 13712    |
|    time_elapsed     | 0        |
|    total timesteps  | 10047    |
----------------------------------

...

Eval num_timesteps=30000, episode_reward=9.60 +/- 1.02
Episode length: 9.60 +/- 1.02
New best mean reward!
----------------------------------
| eval/               |          |
|    mean_ep_length   | 9.6      |
|    mean_reward      | 9.6      |
| rollout/            |          |
|    ep_len_mean      | 20.9     |
|    ep_rew_mean      | 20.9     |
|    exploration rate | 0.05     |
| time/               |          |
|    episodes         | 1356     |
|    fps              | 14140    |
|    time_elapsed     | 2        |
|    total timesteps  | 30030    |
----------------------------------

...

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21.7     |
|    ep_rew_mean      | 21.7     |
|    exploration rate | 0.05     |
| time/               |          |
|    episodes         | 2252     |
|    fps              | 14347    |
|    time_elapsed     | 3        |
|    total timesteps  | 49947    |
----------------------------------
Eval num_timesteps=50000, episode_reward=9.20 +/- 0.40
Episode length: 9.20 +/- 0.40

Problems:

  1. Evaluation at num_timesteps=10000 written with timestep=10047.
  2. Last evaluation not written in logs.


Expected behavior

EvalCallback's logs should be written with the timestep at which they were recorded (i.e. at the timestep the environment was evaluated).
This should be consistent irrespective of the algorithm used for training.

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)
skandermoalla-Q added the bug label Jun 2, 2021
araffin changed the title from "[Bug] Inconsistencies with EvalCallback logs. Logs are written at timesteps different from the ones at which they were recorded." to "[Bug] Inconsistencies with EvalCallback tensorboard logs. Logs are written at timesteps different from the ones at which they were recorded." Jun 2, 2021

araffin commented Jun 2, 2021

Hello,
The inconsistencies you are describing do exist; however, I'm afraid there is no good solution, unless you have one?

Synchronizing EvalCallback recordings with the Algorithm's log dumping is sometimes not possible: EvalCallback records the evaluation every n timesteps, while some algorithms, like DQN, dump their rollout logs every n episodes.

The reason behind it is how those algorithms work. A2C and PPO usually use multiple envs, so every call to env.step() corresponds to n_envs timesteps, which makes it harder to rely on episodes for logging. DQN and other off-policy algorithms only use one env (for now, see #179), hence the logging by episode.
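As a quick illustration of the first point (a minimal sketch, not part of the original comment):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# With 8 parallel envs, num_timesteps advances in jumps of 8 per call to env.step()
# during learn(), e.g. ..., 9992, 10000, 10008, ..., so episode boundaries (which
# differ across the parallel envs) are a poor anchor for deciding when to dump logs.
vec_env = make_vec_env("CartPole-v1", n_envs=8)
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(10_000)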

Overall, TensorBoard should be used only for visualization; for proper comparisons, please use the logs generated by the callbacks. We have scripts in the RL Zoo training framework to make those comparison plots.

Evaluation at num_timesteps=10000 written with timestep=10047.

Is that really a big issue when visualizing different runs?

Last evaluation not written in logs.

For that, you can simply give a larger training budget to make sure the latest evaluation is logged.

not being recorded, as the rollout is interrupted.

Something can probably be done for that one, though (moving the break after the logging part).
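Roughly, the idea would be the following (a toy, self-contained sketch with stand-in values, not the actual SB3 training loop):

# Dump whatever has been recorded before honoring the callback's stop request.
def learn(total_timesteps: int) -> None:
    num_timesteps = 0
    while num_timesteps < total_timesteps:
        num_timesteps += 2048                       # stand-in for collecting one rollout
        continue_training = num_timesteps < 6144    # stand-in for the callbacks' verdict
        print(f"dumping logs at timestep {num_timesteps}")  # the log dump happens first...
        if not continue_training:
            break                                    # ...and only then does training stop

learn(20_000)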

skandermoalla-Q (Author) commented

Hello,

I'm afraid there is no good solution, unless you have one?

Maybe I'm naive, but wouldn't dumping the logs with self.logger.dump(self.num_timesteps) after recording them solve the issue?
The potential problems with this, as far as I can imagine, are:

  1. Flushing recorded logs that were not meant to be dumped at that timestep (for example, rollout/exploration rate is in the logger's memory when EvalCallback's logs are recorded). -> In this case, I think it's those logs that shouldn't wait that long in the logger's memory.
  2. Dumping a second time with the same timestep overwrites the first logs. -> I tested it and it works fine; there is no override.
  3. There would be multiple tables printed if verbose=1. -> I don't think this is an issue.
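A minimal user-side sketch of this idea, assuming the callback has access to the logger via self.logger as in the proposal above (the class and wiring are illustrative, not the eventual fix):

import gym
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import EvalCallback

class DumpingEvalCallback(EvalCallback):
    """Hypothetical workaround: dump right after an evaluation has been recorded."""

    def _on_step(self) -> bool:
        continue_training = super()._on_step()
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Flush the freshly recorded eval/* values with the current timestep.
            self.logger.dump(self.num_timesteps)
        return continue_training

eval_callback = DumpingEvalCallback(gym.make('CartPole-v1'), eval_freq=10_000)
model = DQN('MlpPolicy', 'CartPole-v1', tensorboard_log="./tb_logs", verbose=1)
model.learn(50000, callback=eval_callback)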

Another solution could be splitting the logger's memory into different sections that can be dumped asynchronously.

Is that really a big issue when visualizing different runs?

Well, in this case it's not, because this is a dummy example to highlight the bug, but when I first came across this inconsistency it was skewing the visualization and our interpretation of the results, so yes.
The number of timesteps varies between episodes, so in general the logs will not be recorded at the same timesteps.

For that, you can simply give a larger training budget to make sure the latest evaluation is logged.

Not if I am using StopTrainingOnRewardThreshold callback, which was the case when I discovered the issue.

Something can be probably done for that one though (moving the break after the logging part).

Sorry, I meant "not dumped", not "not recorded", but yes, it's still the same issue.


araffin commented Jun 3, 2021

Maybe I'm naive, but wouldn't dumping the logs with self.logger.dump(self.num_timesteps) after recording them solve the issue?

Good point, this should do the trick (even though it would then print a table with only the eval values in the terminal, but I think that's OK).

We would welcome a PR that fixes this issue then ;)

skandermoalla (Contributor) commented

Working on it.

skandermoalla added a commit to skandermoalla/stable-baselines3 that referenced this issue Jun 29, 2021
araffin added a commit that referenced this issue Jul 1, 2021
* Make EvalCallback dump the evaluation logs it records #457.

* Make test deterministic

Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>