- use max_rounds instead of max_generations.
- allow deterministic actions during evaluation (see the sketch after this list).
- add prep_training() => train() and prep_rollout() => eval() mode switches (see the sketch after this list).
- fix the misuse of obs vs. state; see return_compute, for example.
- don't call actor.forward or critic.forward directly; use the policy's interfaces instead.
- update the DQN code to match the MAPPO structure.
- review the usage of the active mask.
- implement new_gae_trace (see the GAE sketch after this list).
- eval rollouts can also be used as training data.
- TensorBoard graphs should use num_steps, not num_rollouts, as the x-axis.
- plot the gradient norm (see the logging sketch after this list).
- add TensorBoard graphs with time as the x-axis.
- add learning rate decay (see the sketch after this list).
- optimization: allow right controllable to be 0 (just set to 1 for now; the speed wouldn't differ much).
- allow sampling steps instead of episodes.
- add some classic multi-agent environments.
- allow skipping the extra value-function call during return computation.
- allow composition of configs.
- true single-agent version of MAT.
- add data shape definitions and checks.
- (?) re-add double (dual) clipping (Tencent; see the sketch after this list).
- implement true V-trace; refer to IMPALA (see the sketch after this list).
- check the sync-training performance, especially the evaluation performance.
- do we support mini-batches now? => move_to_gpu (and the return computation) needs to move inside the mini-batch iteration (see the sketch after this list).
- implement the RNN data generator (see the sketch after this list).
- remove the modified MAPPO code.
- make models classes rather than modules; merge similar models; maybe use a registry (see the sketch after this list)?
- review MAPPO's loss code.
- maybe we should add a dedicated scheduler and runner for cooperative tasks.
- the reshape in return_compute may be unnecessary.
- use consistent names; for example, "share_obs" and "states" refer to the same thing, so pick one!
- refactor the PopArt (value normalization) implementation (see the sketch after this list).
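
For the deterministic-evaluation item: a minimal sketch of a `deterministic` flag on the policy's action interface. The `Policy` class and attribute names here are hypothetical, not the repo's actual API.

```python
import torch


class Policy:
    """Hypothetical policy wrapper; `self.actor` returns a torch.distributions object."""

    def __init__(self, actor):
        self.actor = actor

    @torch.no_grad()
    def get_actions(self, obs, deterministic=False):
        dist = self.actor(obs)
        if deterministic:
            # During evaluation, take the distribution mode instead of sampling.
            action = dist.probs.argmax(dim=-1) if hasattr(dist, "probs") else dist.mean
        else:
            action = dist.sample()
        return action
```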
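For the prep_training()/prep_rollout() item: a sketch of what the two switches could do, assuming the trainer holds the actor and critic as nn.Modules (names are illustrative).

```python
class Trainer:
    """Hypothetical trainer holding an actor and a critic (both nn.Module)."""

    def __init__(self, actor, critic):
        self.actor = actor
        self.critic = critic

    def prep_training(self):
        # Enable dropout / batch-norm updates before gradient steps.
        self.actor.train()
        self.critic.train()

    def prep_rollout(self):
        # Switch to inference behaviour when collecting rollouts or evaluating.
        self.actor.eval()
        self.critic.eval()
```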
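For the new_gae_trace item: a standard GAE(λ) backward recursion over a [T, N] rollout, given per-step rewards, value predictions (with a bootstrap value at index T), and done flags. The function name and tensor layout are assumptions, not the repo's actual signature.

```python
import numpy as np


def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards/dones: [T, N]; values: [T + 1, N] (last row is the bootstrap value)."""
    T = rewards.shape[0]
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual, masked where the episode terminated.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns
```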
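For the gradient-norm item: `clip_grad_norm_` already returns the total norm before clipping, so it can be logged directly. This also illustrates the num_steps x-axis convention mentioned above; the tag, log directory, and function signature are assumptions.

```python
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")  # hypothetical log dir


def update_and_log(params, optimizer, loss, num_steps, max_grad_norm=10.0):
    optimizer.zero_grad()
    loss.backward()
    # Total gradient norm before clipping; log it as a scalar against num_steps.
    grad_norm = nn.utils.clip_grad_norm_(params, max_grad_norm)
    optimizer.step()
    writer.add_scalar("train/grad_norm", grad_norm.item(), global_step=num_steps)
```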
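For learning-rate decay: MAPPO-style linear annealing is a few lines per update round (torch's built-in schedulers are an alternative). The argument names below are assumptions; they deliberately reuse the max_rounds naming from the first item.

```python
def linear_lr_decay(optimizer, current_round, max_rounds, initial_lr):
    """Linearly anneal the learning rate from initial_lr towards 0 over max_rounds updates."""
    lr = initial_lr * (1.0 - current_round / float(max_rounds))
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr
```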
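For the double-clip item: a sketch of the dual-clip PPO objective from Tencent's MOBA work, which adds a second clip constant c > 1 that bounds the objective from below when the advantage is negative. Variable names are assumptions.

```python
import torch


def dual_clip_ppo_loss(log_probs, old_log_probs, advantages, clip_eps=0.2, dual_clip=3.0):
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    clipped = torch.min(surr1, surr2)
    # Dual clip: for negative advantages, do not let the objective fall below c * A.
    dual_clipped = torch.max(clipped, dual_clip * advantages)
    return -torch.where(advantages < 0, dual_clipped, clipped).mean()
```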
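For the V-trace item: a sketch of the off-policy corrected targets from the IMPALA paper (Espeholt et al., 2018), with truncated importance weights rho_bar and c_bar. The [T, N] tensor layout and argument names are assumptions.

```python
import torch


@torch.no_grad()
def vtrace_targets(behaviour_log_probs, target_log_probs, rewards, values,
                   bootstrap_value, discounts, rho_bar=1.0, c_bar=1.0):
    """All sequence inputs are [T, N]; bootstrap_value is [N]. Returns vs and pg_advantages."""
    rhos = torch.exp(target_log_probs - behaviour_log_probs)
    clipped_rhos = torch.clamp(rhos, max=rho_bar)
    cs = torch.clamp(rhos, max=c_bar)

    values_t_plus_1 = torch.cat([values[1:], bootstrap_value.unsqueeze(0)], dim=0)
    deltas = clipped_rhos * (rewards + discounts * values_t_plus_1 - values)

    # Backward recursion: (vs_t - V_t) = delta_t + discount_t * c_t * (vs_{t+1} - V_{t+1}).
    acc = torch.zeros_like(bootstrap_value)
    vs_minus_v = []
    for t in reversed(range(values.shape[0])):
        acc = deltas[t] + discounts[t] * cs[t] * acc
        vs_minus_v.append(acc)
    vs = torch.stack(list(reversed(vs_minus_v)), dim=0) + values

    vs_t_plus_1 = torch.cat([vs[1:], bootstrap_value.unsqueeze(0)], dim=0)
    pg_advantages = clipped_rhos * (rewards + discounts * vs_t_plus_1 - values)
    return vs, pg_advantages
```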
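For the mini-batch item: a sketch of the intended loop structure, where only the sampled mini-batch is moved to the GPU instead of the whole buffer up front. The buffer generator and trainer method names are hypothetical.

```python
import torch


def train_epoch(buffer, trainer, num_mini_batch, device):
    """buffer.feed_forward_generator yields dicts of numpy arrays (hypothetical API)."""
    for mini_batch in buffer.feed_forward_generator(num_mini_batch):
        # Move only this mini-batch to the GPU, inside the iteration.
        batch = {k: torch.as_tensor(v).to(device) for k, v in mini_batch.items()}
        trainer.ppo_update(batch)
```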
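For the RNN data generator: a sketch that samples fixed-length chunks so only the hidden state at each chunk boundary needs to be stored and fed back into the RNN. Shapes and field names are assumptions.

```python
import numpy as np


def recurrent_generator(obs, actions, rnn_states, num_mini_batch, chunk_length):
    """obs/actions: [T, N, ...]; rnn_states: [T, N, hidden]. Yields chunked mini-batches."""
    T, N = obs.shape[:2]
    num_chunks = (T // chunk_length) * N
    indices = np.random.permutation(num_chunks)
    chunks_per_batch = num_chunks // num_mini_batch
    for start in range(0, num_chunks, chunks_per_batch):
        batch_obs, batch_actions, batch_rnn_states = [], [], []
        for idx in indices[start:start + chunks_per_batch]:
            t0 = (idx // N) * chunk_length   # chunk start time
            n = idx % N                      # environment index
            batch_obs.append(obs[t0:t0 + chunk_length, n])
            batch_actions.append(actions[t0:t0 + chunk_length, n])
            # Only the hidden state at the chunk start is needed to unroll the RNN.
            batch_rnn_states.append(rnn_states[t0, n])
        yield (np.stack(batch_obs, axis=1),          # [chunk_length, B, ...]
               np.stack(batch_actions, axis=1),
               np.stack(batch_rnn_states, axis=0))   # [B, hidden]
```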
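For the model-registry item: a minimal registry decorator so models can be built from a config by name; all names here are illustrative.

```python
MODEL_REGISTRY = {}


def register_model(name):
    """Decorator that registers a model class under a string key."""
    def _register(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return _register


def build_model(name, **kwargs):
    """Look up a registered model class and instantiate it from config kwargs."""
    return MODEL_REGISTRY[name](**kwargs)


@register_model("mlp_actor")
class MLPActor:
    def __init__(self, obs_dim, act_dim):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
```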
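For the PopArt item: a sketch of value normalization as a standalone module that tracks a debiased running mean/std of the return targets and exposes normalize/denormalize. Note this is the simpler running-statistics "ValueNorm" variant used in MAPPO-style code, not full PopArt output-layer rescaling.

```python
import torch
import torch.nn as nn


class ValueNorm(nn.Module):
    """Running mean/std normalizer for value targets (ValueNorm variant, not full PopArt)."""

    def __init__(self, beta=0.999, eps=1e-5):
        super().__init__()
        self.beta, self.eps = beta, eps
        self.register_buffer("running_mean", torch.zeros(1))
        self.register_buffer("running_mean_sq", torch.zeros(1))
        self.register_buffer("debias_term", torch.zeros(1))

    @torch.no_grad()
    def update(self, targets):
        # Exponential moving averages of the targets and their squares.
        self.running_mean.mul_(self.beta).add_((1 - self.beta) * targets.mean())
        self.running_mean_sq.mul_(self.beta).add_((1 - self.beta) * (targets ** 2).mean())
        self.debias_term.mul_(self.beta).add_(1 - self.beta)

    def _stats(self):
        mean = self.running_mean / self.debias_term.clamp(min=self.eps)
        mean_sq = self.running_mean_sq / self.debias_term.clamp(min=self.eps)
        std = (mean_sq - mean ** 2).clamp(min=self.eps).sqrt()
        return mean, std

    def normalize(self, x):
        mean, std = self._stats()
        return (x - mean) / std

    def denormalize(self, x):
        mean, std = self._stats()
        return x * std + mean
```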