- use max_rounds instead of max_generations.
- allow deterministic actions during evaluation (see the sketch after this list).
- add prep_training() => train() and prep_rollout() => eval() mode switches (see the sketch after this list).
- fix the misuse of obs vs. state; see return_compute, for example.
- don't call actor.forward or critic.forward directly; use the policy's interfaces instead.
- update the DQN code to match the MAPPO structure.
- review the usage of the active mask.
- implement new_gae_trace (see the GAE sketch after this list).
- eval rollouts can also be used as training data.
- TensorBoard graphs should use num_steps, not num_rollouts, as the x-axis.
- plot the gradient norm (see the logging sketch after this list).
- add TensorBoard graphs with time as the x-axis.
- add learning rate decay (see the sketch after this list).
- optimization: allow right controllable to be 0 (just set to 1 for now; the speed wouldn't differ much).
- allow sampling steps instead of episodes.
- add some classic multi-agent environments.
- allow skipping the extra value-function call during return computation.
- allow composition of configs.
- true single-agent version of MAT.
- add data shape definitions and checks.
- (?) re-add double (dual) clipping (Tencent; see the sketch after this list).
- implement true V-trace; refer to IMPALA (see the sketch after this list).
- check the sync-training performance, especially the evaluation performance.
- do we support mini-batches now? => move_to_gpu (and the return computation) needs to move inside the mini-batch iteration (see the sketch after this list).
- implement the RNN data generator (see the sketch after this list).
- remove the modified MAPPO code.
- make models classes rather than modules; merge similar models; maybe use a registry (see the sketch after this list)?
- review MAPPO's loss code.
- maybe we should add a dedicated scheduler and runner for cooperative tasks.
- the reshape in return_compute may be unnecessary.
- use consistent names; for example, "share_obs" and "states" refer to the same thing, so pick one!
- refactor the PopArt (value normalization) implementation (see the sketch after this list).
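
For the deterministic-evaluation item: a minimal sketch of a `deterministic` flag on the policy's action interface. The `Policy` class and attribute names here are hypothetical, not the repo's actual API.

```python
import torch


class Policy:
    """Hypothetical policy wrapper; `self.actor` returns a torch.distributions object."""

    def __init__(self, actor):
        self.actor = actor

    @torch.no_grad()
    def get_actions(self, obs, deterministic=False):
        dist = self.actor(obs)
        if deterministic:
            # During evaluation, take the distribution mode instead of sampling.
            action = dist.probs.argmax(dim=-1) if hasattr(dist, "probs") else dist.mean
        else:
            action = dist.sample()
        return action
```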
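For the prep_training()/prep_rollout() item: a sketch of what the two switches could do, assuming the trainer holds the actor and critic as nn.Modules (names are illustrative).

```python
class Trainer:
    """Hypothetical trainer holding an actor and a critic (both nn.Module)."""

    def __init__(self, actor, critic):
        self.actor = actor
        self.critic = critic

    def prep_training(self):
        # Enable dropout / batch-norm updates before gradient steps.
        self.actor.train()
        self.critic.train()

    def prep_rollout(self):
        # Switch to inference behaviour when collecting rollouts or evaluating.
        self.actor.eval()
        self.critic.eval()
```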
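For the new_gae_trace item: a standard GAE(λ) backward recursion over a [T, N] rollout, given per-step rewards, value predictions (with a bootstrap value at index T), and done flags. The function name and tensor layout are assumptions, not the repo's actual signature.

```python
import numpy as np


def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards/dones: [T, N]; values: [T + 1, N] (last row is the bootstrap value)."""
    T = rewards.shape[0]
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual, masked where the episode terminated.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    return advantages, returns
```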
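For the gradient-norm item: `clip_grad_norm_` already returns the total norm before clipping, so it can be logged directly. This also illustrates the num_steps x-axis convention mentioned above; the tag, log directory, and function signature are assumptions.

```python
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")  # hypothetical log dir


def update_and_log(params, optimizer, loss, num_steps, max_grad_norm=10.0):
    optimizer.zero_grad()
    loss.backward()
    # Total gradient norm before clipping; log it as a scalar against num_steps.
    grad_norm = nn.utils.clip_grad_norm_(params, max_grad_norm)
    optimizer.step()
    writer.add_scalar("train/grad_norm", grad_norm.item(), global_step=num_steps)
```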
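For learning-rate decay: MAPPO-style linear annealing is a few lines per update round (torch's built-in schedulers are an alternative). The argument names below are assumptions; they deliberately reuse the max_rounds naming from the first item.

```python
def linear_lr_decay(optimizer, current_round, max_rounds, initial_lr):
    """Linearly anneal the learning rate from initial_lr towards 0 over max_rounds updates."""
    lr = initial_lr * (1.0 - current_round / float(max_rounds))
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr
```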
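For the double-clip item: a sketch of the dual-clip PPO objective from Tencent's MOBA work, which adds a second clip constant c > 1 that bounds the objective from below when the advantage is negative. Variable names are assumptions.

```python
import torch


def dual_clip_ppo_loss(log_probs, old_log_probs, advantages, clip_eps=0.2, dual_clip=3.0):
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    clipped = torch.min(surr1, surr2)
    # Dual clip: for negative advantages, do not let the objective fall below c * A.
    dual_clipped = torch.max(clipped, dual_clip * advantages)
    return -torch.where(advantages < 0, dual_clipped, clipped).mean()
```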
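For the V-trace item: a sketch of the off-policy corrected targets from the IMPALA paper (Espeholt et al., 2018), with truncated importance weights rho_bar and c_bar. The [T, N] tensor layout and argument names are assumptions.

```python
import torch


@torch.no_grad()
def vtrace_targets(behaviour_log_probs, target_log_probs, rewards, values,
                   bootstrap_value, discounts, rho_bar=1.0, c_bar=1.0):
    """All sequence inputs are [T, N]; bootstrap_value is [N]. Returns vs and pg_advantages."""
    rhos = torch.exp(target_log_probs - behaviour_log_probs)
    clipped_rhos = torch.clamp(rhos, max=rho_bar)
    cs = torch.clamp(rhos, max=c_bar)

    values_t_plus_1 = torch.cat([values[1:], bootstrap_value.unsqueeze(0)], dim=0)
    deltas = clipped_rhos * (rewards + discounts * values_t_plus_1 - values)

    # Backward recursion: (vs_t - V_t) = delta_t + discount_t * c_t * (vs_{t+1} - V_{t+1}).
    acc = torch.zeros_like(bootstrap_value)
    vs_minus_v = []
    for t in reversed(range(values.shape[0])):
        acc = deltas[t] + discounts[t] * cs[t] * acc
        vs_minus_v.append(acc)
    vs = torch.stack(list(reversed(vs_minus_v)), dim=0) + values

    vs_t_plus_1 = torch.cat([vs[1:], bootstrap_value.unsqueeze(0)], dim=0)
    pg_advantages = clipped_rhos * (rewards + discounts * vs_t_plus_1 - values)
    return vs, pg_advantages
```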
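For the mini-batch item: a sketch of the intended loop structure, where only the sampled mini-batch is moved to the GPU instead of the whole buffer up front. The buffer generator and trainer method names are hypothetical.

```python
import torch


def train_epoch(buffer, trainer, num_mini_batch, device):
    """buffer.feed_forward_generator yields dicts of numpy arrays (hypothetical API)."""
    for mini_batch in buffer.feed_forward_generator(num_mini_batch):
        # Move only this mini-batch to the GPU, inside the iteration.
        batch = {k: torch.as_tensor(v).to(device) for k, v in mini_batch.items()}
        trainer.ppo_update(batch)
```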
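For the RNN data generator: a sketch that samples fixed-length chunks so only the hidden state at each chunk boundary needs to be stored and fed back into the RNN. Shapes and field names are assumptions.

```python
import numpy as np


def recurrent_generator(obs, actions, rnn_states, num_mini_batch, chunk_length):
    """obs/actions: [T, N, ...]; rnn_states: [T, N, hidden]. Yields chunked mini-batches."""
    T, N = obs.shape[:2]
    num_chunks = (T // chunk_length) * N
    indices = np.random.permutation(num_chunks)
    chunks_per_batch = num_chunks // num_mini_batch
    for start in range(0, num_chunks, chunks_per_batch):
        batch_obs, batch_actions, batch_rnn_states = [], [], []
        for idx in indices[start:start + chunks_per_batch]:
            t0 = (idx // N) * chunk_length   # chunk start time
            n = idx % N                      # environment index
            batch_obs.append(obs[t0:t0 + chunk_length, n])
            batch_actions.append(actions[t0:t0 + chunk_length, n])
            # Only the hidden state at the chunk start is needed to unroll the RNN.
            batch_rnn_states.append(rnn_states[t0, n])
        yield (np.stack(batch_obs, axis=1),          # [chunk_length, B, ...]
               np.stack(batch_actions, axis=1),
               np.stack(batch_rnn_states, axis=0))   # [B, hidden]
```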
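For the model-registry item: a minimal registry decorator so models can be built from a config by name; all names here are illustrative.

```python
MODEL_REGISTRY = {}


def register_model(name):
    """Decorator that registers a model class under a string key."""
    def _register(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return _register


def build_model(name, **kwargs):
    """Look up a registered model class and instantiate it from config kwargs."""
    return MODEL_REGISTRY[name](**kwargs)


@register_model("mlp_actor")
class MLPActor:
    def __init__(self, obs_dim, act_dim):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
```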
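For the PopArt item: a sketch of value normalization as a standalone module that tracks a debiased running mean/std of the return targets and exposes normalize/denormalize. Note this is the simpler running-statistics "ValueNorm" variant used in MAPPO-style code, not full PopArt output-layer rescaling.

```python
import torch
import torch.nn as nn


class ValueNorm(nn.Module):
    """Running mean/std normalizer for value targets (ValueNorm variant, not full PopArt)."""

    def __init__(self, beta=0.999, eps=1e-5):
        super().__init__()
        self.beta, self.eps = beta, eps
        self.register_buffer("running_mean", torch.zeros(1))
        self.register_buffer("running_mean_sq", torch.zeros(1))
        self.register_buffer("debias_term", torch.zeros(1))

    @torch.no_grad()
    def update(self, targets):
        # Exponential moving averages of the targets and their squares.
        self.running_mean.mul_(self.beta).add_((1 - self.beta) * targets.mean())
        self.running_mean_sq.mul_(self.beta).add_((1 - self.beta) * (targets ** 2).mean())
        self.debias_term.mul_(self.beta).add_(1 - self.beta)

    def _stats(self):
        mean = self.running_mean / self.debias_term.clamp(min=self.eps)
        mean_sq = self.running_mean_sq / self.debias_term.clamp(min=self.eps)
        std = (mean_sq - mean ** 2).clamp(min=self.eps).sqrt()
        return mean, std

    def normalize(self, x):
        mean, std = self._stats()
        return (x - mean) / std

    def denormalize(self, x):
        mean, std = self._stats()
        return x * std + mean
```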