
[Bug] Inconsistencies with EvalCallback tensorboard logs. Logs are written at timesteps different from the ones at which they were recorded. #457

Closed
skandermoalla-Q opened this issue Jun 2, 2021 · 4 comments · Fixed by #492
Labels
bug Something isn't working

Comments

skandermoalla-Q commented Jun 2, 2021

🐛 Bug

EvalCallback records logs that can be used with TensorBoard, but does not dump them.

These logs stay in the logger's memory and are only written when the Algorithm dumps its rollout logs during training. However, this does not necessarily happen at the timestep at which EvalCallback recorded its logs, so those evaluation logs end up being written with the wrong timestep.

Synchronizing EvalCallback recordings with the Algorithm's log dumping is sometimes not possible: EvalCallback records the evaluation every n timesteps, while some algorithms, like DQN, dump their rollout logs every n episodes.

Moreover, even with PPO, which dumps its rollout logs every n rollout steps, setting EvalCallback to evaluate every n rollout steps leads to the last evaluation logs (when using a StopTrainingOnRewardThreshold callback) not being recorded, as the rollout is interrupted.
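For concreteness, a configuration like the following would run into that interrupted-rollout case (a minimal sketch with illustrative values, not taken from the report below):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback, StopTrainingOnRewardThreshold

# eval_freq is aligned with one PPO rollout (n_steps with a single env), so evaluation
# and rollout logging would normally coincide, but StopTrainingOnRewardThreshold can
# interrupt training before the rollout logs (and the pending eval logs) are dumped.
stop_callback = StopTrainingOnRewardThreshold(reward_threshold=475, verbose=1)
eval_callback = EvalCallback(gym.make('CartPole-v1'), eval_freq=2048,
                             callback_on_new_best=stop_callback)
model = PPO('MlpPolicy', 'CartPole-v1', n_steps=2048, tensorboard_log="./tb_logs", verbose=1)
model.learn(100_000, callback=eval_callback)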

To Reproduce

import gym
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import EvalCallback

eval_env = gym.make('CartPole-v1')
eval_callback = EvalCallback(eval_env)

model = DQN('MlpPolicy', 'CartPole-v1', tensorboard_log="./tb_logs", verbose=1)
model.learn(50000, callback=eval_callback)
...

Eval num_timesteps=10000, episode_reward=9.40 +/- 0.80
Episode length: 9.40 +/- 0.80
New best mean reward!
----------------------------------
| eval/               |          |
|    mean_ep_length   | 9.4      |
|    mean_reward      | 9.4      |
| rollout/            |          |
|    ep_len_mean      | 22.9     |
|    ep_rew_mean      | 22.9     |
|    exploration rate | 0.05     |
| time/               |          |
|    episodes         | 444      |
|    fps              | 13712    |
|    time_elapsed     | 0        |
|    total timesteps  | 10047    |
----------------------------------

...

Eval num_timesteps=30000, episode_reward=9.60 +/- 1.02
Episode length: 9.60 +/- 1.02
New best mean reward!
----------------------------------
| eval/               |          |
|    mean_ep_length   | 9.6      |
|    mean_reward      | 9.6      |
| rollout/            |          |
|    ep_len_mean      | 20.9     |
|    ep_rew_mean      | 20.9     |
|    exploration rate | 0.05     |
| time/               |          |
|    episodes         | 1356     |
|    fps              | 14140    |
|    time_elapsed     | 2        |
|    total timesteps  | 30030    |
----------------------------------

...

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 21.7     |
|    ep_rew_mean      | 21.7     |
|    exploration rate | 0.05     |
| time/               |          |
|    episodes         | 2252     |
|    fps              | 14347    |
|    time_elapsed     | 3        |
|    total timesteps  | 49947    |
----------------------------------
Eval num_timesteps=50000, episode_reward=9.20 +/- 0.40
Episode length: 9.20 +/- 0.40

Problems:

  1. Evaluation at num_timesteps=10000 written with timestep=10047.
  2. Last evaluation not written in logs.


Expected behavior

EvalCallback's logs should be written with the timestep at which they were recorded (i.e. at the timestep the environment was evaluated).
This should be consistent irrespective of the algorithm used for training.

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)
skandermoalla-Q added the bug label Jun 2, 2021
araffin changed the title from "[Bug] Inconsistencies with EvalCallback logs. Logs are written at timesteps different from the ones at which they were recorded." to "[Bug] Inconsistencies with EvalCallback tensorboard logs. Logs are written at timesteps different from the ones at which they were recorded." Jun 2, 2021

araffin commented Jun 2, 2021

Hello,
The inconsistencies you are describing do exist; however, I'm afraid there is no good solution, unless you have one?

Synchronizing EvalCallback recordings with the Algorithm's log dumping is sometimes not possible: EvalCallback records the evaluation every n timesteps, while some algorithms, like DQN, dump their rollout logs every n episodes.

The reason behind it is how those algorithms work. A2C and PPO usually use multiple envs, so every call to env.step() corresponds to n_envs timesteps, which makes it harder to rely on episodes for logging. DQN and other off-policy algorithms only use one env (for now, see #179), hence the logging by episode.
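As a quick illustration of the first point (a minimal sketch, not part of the original comment):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# With 8 parallel envs, num_timesteps advances in jumps of 8 per call to env.step()
# during learn(), e.g. ..., 9992, 10000, 10008, ..., so episode boundaries (which
# differ across the parallel envs) are a poor anchor for deciding when to dump logs.
vec_env = make_vec_env("CartPole-v1", n_envs=8)
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(10_000)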

Overall, TensorBoard should be used only for visualization; for proper comparisons, please use the logs generated by the callbacks. We have scripts in the RL Zoo training framework to make those comparison plots.

Evaluation at num_timesteps=10000 written with timestep=10047.

Is that really a big issue when visualizing different runs?

Last evaluation not written in logs.

For that, you can simply give a larger training budget to make sure the latest evaluation is logged.

not being recorded, as the rollout is interrupted.

Something can probably be done for that one, though (moving the break after the logging part).
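Roughly, the idea would be the following (a toy, self-contained sketch with stand-in values, not the actual SB3 training loop):

# Dump whatever has been recorded before honoring the callback's stop request.
def learn(total_timesteps: int) -> None:
    num_timesteps = 0
    while num_timesteps < total_timesteps:
        num_timesteps += 2048                       # stand-in for collecting one rollout
        continue_training = num_timesteps < 6144    # stand-in for the callbacks' verdict
        print(f"dumping logs at timestep {num_timesteps}")  # the log dump happens first...
        if not continue_training:
            break                                    # ...and only then does training stop

learn(20_000)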

skandermoalla-Q (Author) commented

Hello,

I'm afraid there is no good solution, unless you have one?

Maybe I'm naive, but wouldn't dumping the logs with self.logger.dump(self.num_timesteps) after recording them solve the issue?
The potential problems with this, as far as I can imagine, are:

  1. Flushing recorded logs that were not meant to be dumped at that timestep (for example, rollout/exploration rate is in the logger's memory when EvalCallback's logs are recorded). -> In this case, I think it's those logs that shouldn't wait that long in the logger's memory.
  2. Dumping a second time with the same timestep overwrites the first logs. -> I tested it and it works fine; there is no override.
  3. There would be multiple tables printed if verbose=1. -> I don't think this is an issue.
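A minimal user-side sketch of this idea, assuming the callback has access to the logger via self.logger as in the proposal above (the class and wiring are illustrative, not the eventual fix):

import gym
from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import EvalCallback

class DumpingEvalCallback(EvalCallback):
    """Hypothetical workaround: dump right after an evaluation has been recorded."""

    def _on_step(self) -> bool:
        continue_training = super()._on_step()
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Flush the freshly recorded eval/* values with the current timestep.
            self.logger.dump(self.num_timesteps)
        return continue_training

eval_callback = DumpingEvalCallback(gym.make('CartPole-v1'), eval_freq=10_000)
model = DQN('MlpPolicy', 'CartPole-v1', tensorboard_log="./tb_logs", verbose=1)
model.learn(50000, callback=eval_callback)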

Another solution could be splitting the logger's memory into different sections that can be dumped asynchronously.

Is that really a big issue when visualizing different runs?

Well, in this case it's not, because this is a dummy example to highlight the bug, but when I first came across this inconsistency it was skewing the visualization and our interpretation of the results, so yes.
The number of timesteps varies between episodes, so in general the logs will not be recorded at the same timesteps.

For that, you can simply give a larger training budget to make sure the latest evaluation is logged.

Not if I am using StopTrainingOnRewardThreshold callback, which was the case when I discovered the issue.

Something can be probably done for that one though (moving the break after the logging part).

Sorry, I meant "not dumped", not "not recorded", but yes, it's still the same issue.


araffin commented Jun 3, 2021

Maybe I'm naive, but wouldn't dumping the logs with self.logger.dump(self.num_timesteps) after recording them solve the issue?

Good point, this should do the trick (even though it would then print a table with only the eval values in the terminal, but I think that's OK).

We would welcome a PR that fixes this issue then ;)

skandermoalla (Contributor) commented

Working on it.

skandermoalla added a commit to skandermoalla/stable-baselines3 that referenced this issue Jun 29, 2021
araffin added a commit that referenced this issue Jul 1, 2021
* Make EvalCallback dump the evaluation logs it records #457.

* Make test deterministic

Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org>