merge code #1

Merged
merged 29 commits into from
Jul 7, 2022
Changes from all commits
29 commits
c901a00
fix(lxy): fix import path error in lunarlander (#362)
karroyan Jun 14, 2022
85ce729
fix(wzl): add dt entry in entry/__init__ (#367)
zerlinwang Jun 15, 2022
e00c5bf
style(nyz): update readme and enable dmc docker(dmc2gym docker)
PaParaZz1 Jun 15, 2022
0246f05
feature(zlx): support async reset for envpool env manager (#250)
LuciusMos Jun 16, 2022
f0210eb
fix(nyz): fix gail unittest ci bug
PaParaZz1 Jun 16, 2022
ce286cd
polish(zjow): impala cnn encoder refactor. (#378)
zjowowen Jun 16, 2022
5178676
fix(zjow): fix for dmc env replay and opengl settings
zjowowen Jun 20, 2022
bec0d8d
test(wyh):add plot test code (#370)
Weiyuhong-1998 Jun 20, 2022
268d77d
fix(nyz): fix normed nn unittest bug(dmc2gym docker)
PaParaZz1 Jun 20, 2022
8fd08a8
feature(nyz): add pure ppo policy gradient policy (#382)
PaParaZz1 Jun 21, 2022
8e8e53c
fix(nyz): fix world model unittest repeat name bug
PaParaZz1 Jun 21, 2022
412bc26
fix(nyz): fix bc policy unittest
PaParaZz1 Jun 21, 2022
549f2eb
style(nyz): update mujoco docker download path (#386)
PaParaZz1 Jun 21, 2022
47940ef
v0.4.0
PaParaZz1 Jun 21, 2022
bac009e
doc(lxl): add buffer api description (#371)
lixl-st Jun 22, 2022
0f8bd29
fix(zjow): fix related bugs of dmc2gym env (dmc2gym docker) (#391)
zjowowen Jun 23, 2022
f843b19
feature(pu): add board games environments (#356)
puyuan1996 Jun 23, 2022
93a299c
feature(zzh): add STEVE algorithm (#363)
ZHZisZZ Jun 24, 2022
8c817b6
fix(xjx): remove pace controller (#400)
sailxjx Jun 25, 2022
c302382
feature(whl): add trex new pipeline example (#380)
kxzxvbk Jun 27, 2022
63029a4
feature(lisong): add sqil_sac new pipeline example (#374)
song2181 Jun 27, 2022
b89d477
feature(rjy): add discrete pendulum env (#395)
nighood Jun 28, 2022
7575a7c
demo(lwq): add new pipeline continuous examples: ddpg, td3 and d4pg (…
Hcnaeg Jun 28, 2022
5e2265e
fix(nyz): fix random action policy randomness
PaParaZz1 Jun 30, 2022
43d4ea9
fix(nyz): fix new pipeline ddpg/td3/d4pg act_scale bug
PaParaZz1 Jun 30, 2022
83b94ec
polish(lwq): polish VAE implementation (#404)
Hcnaeg Jun 29, 2022
dc0e2e6
fix(nyz): fix action_space seed comaptibility bug
PaParaZz1 Jun 30, 2022
0bbd6a5
fix(xjx): discard message sent by self in redis mq (#354)
sailxjx Jul 1, 2022
3a65fd8
feature(zp): add c51/qrdqn/iqn new pipeline example (#407)
zhangpaipai Jul 6, 2022
42 changes: 42 additions & 0 deletions CHANGELOG
@@ -1,3 +1,45 @@
2022.06.21(v0.4.0)
- env: add MAPPO/MASAC all configs in SMAC (#310) **(SOTA results in SMAC!!!)**
- env: add dmc2gym env (#344) (#360)
- env: remove DI-star requirements of dizoo/smac, use official pysc2 (#302)
- env: add latest GAIL mujoco config (#298)
- env: polish procgen env (#311)
- env: add MBPO ant and humanoid config for mbpo (#314)
- env: fix slime volley env obs space bug when agent_vs_agent
- env: fix smac env obs space bug
- env: fix import path error in lunarlander (#362)
- algo: add Decision Transformer algorithm (#327) (#364)
- algo: add on-policy PPG algorithm (#312)
- algo: add DDPPO & add model-based SAC with lambda-return algorithm (#332)
- algo: add infoNCE loss and ST-DIM algorithm (#326)
- algo: add FQF distributional RL algorithm (#274)
- algo: add continuous BC algorithm (#318)
- algo: add pure policy gradient PPO algorithm (#382)
- algo: add SQIL + SAC algorithm (#348)
- algo: polish NGU and related modules (#283) (#343) (#353)
- algo: add marl distributional td loss (#331)
- feature: add new worker middleware (#236)
- feature: refactor model-based RL pipeline (ding/world_model) (#332)
- feature: refactor logging system in the whole DI-engine (#316)
- feature: add env supervisor design (#330)
- feature: support async reset for envpool env manager (#250)
- feature: add log videos to tensorboard (#320)
- feature: refactor impala cnn encoder interface (#378)
- fix: env save replay bug
- fix: transformer mask inplace operation bug
- fix: transtion_with_policy_data bug in SAC and PPG
- style: add dockerfile for ding:hpc image (#337)
- style: fix mpire 2.3.5 which handles default processes more elegantly (#306)
- style: use FORMAT_DIR instead of ./ding (#309)
- style: update quickstart colab link (#347)
- style: polish comments in ding/model/common (#315)
- style: update mujoco docker download path (#386)
- style: fix protobuf new version compatibility bug
- style: fix torch1.8.0 torch.div compatibility bug
- style: update doc links in readme
- style: add outline in readme and update wechat image
- style: update head image and refactor docker dir

2022.04.23(v0.3.1)
- env: polish and standardize dizoo config (#252) (#255) (#249) (#246) (#262) (#261) (#266) (#273) (#263) (#280) (#259) (#286) (#277) (#290) (#289) (#299)
- env: add GRF academic env and config (#281)
28 changes: 14 additions & 14 deletions README.md
@@ -32,22 +32,22 @@
[![Contributors](https://img.shields.io/github/contributors/opendilab/DI-engine)](https://github.com/opendilab/DI-engine/graphs/contributors)
[![GitHub license](https://img.shields.io/github/license/opendilab/DI-engine)](https://github.com/opendilab/DI-engine/blob/master/LICENSE)

Updated on 2022.04.22 DI-engine-v0.3.1
Updated on 2022.06.21 DI-engine-v0.4.0


## Introduction to DI-engine (beta)
[DI-engine doc](https://di-engine-docs.readthedocs.io/en/latest/) | [中文文档](https://di-engine-docs.readthedocs.io/zh_CN/latest/)

**DI-engine** is a generalized decision intelligence engine. It supports **various [deep reinforcement learning](https://di-engine-docs.readthedocs.io/en/latest/10_concepts/index.html) algorithms** ([link](https://di-engine-docs.readthedocs.io/en/latest/12_policies/index.html)):

- Most basic DRL algorithms, such as DQN, PPO, SAC, R2D2
- Most basic DRL algorithms, such as DQN, PPO, SAC, R2D2, IMPALA
- Multi-agent RL algorithms like QMIX, MAPPO
- Imitation learning algorithms (BC/IRL/GAIL) , such as GAIL, SQIL, Guided Cost Learning
- Exploration algorithms like HER, RND, ICM
- Offline RL algorithms: CQL, TD3BC
- Model-based RL algorithms: MBPO
- Exploration algorithms like HER, RND, ICM, NGU
- Offline RL algorithms: CQL, TD3BC, Decision Transformer
- Model-based RL algorithms: SVG, MVE, STEVE / MBPO, DDPPO

**DI-engine** aims to **standardize different RL enviroments and applications**. Various training pipelines and customized decision AI applications are also supported.
**DI-engine** aims to **standardize different Decision Intelligence environments and applications**. Various training pipelines and customized decision AI applications are also supported.

- Traditional academic environments
- [DI-zoo](https://github.com/opendilab/DI-engine#environment-versatility)
@@ -109,6 +109,7 @@ And our dockerhub repo can be found [here](https://hub.docker.com/repository/doc
- mujoco: opendilab/ding:nightly-mujoco
- smac: opendilab/ding:nightly-smac
- grf: opendilab/ding:nightly-grf
- dmc: opendilab/ding:nightly-dmc2gym

The detailed documentation are hosted on [doc](https://di-engine-docs.readthedocs.io/en/latest/) | [中文文档](https://di-engine-docs.readthedocs.io/zh_CN/latest/).

@@ -118,8 +119,6 @@ The detailed documentation are hosted on [doc](https://di-engine-docs.readthedoc

[3 Minutes Kickoff (colab)](https://colab.research.google.com/drive/1K3DGi3dOT9fhFqa6bBtinwCDdWkOM3zE?usp=sharing)

[3 分钟上手中文版 (kaggle)](https://www.kaggle.com/fallinx/di-engine/)

[How to migrate a new **RL Env**](https://di-engine-docs.readthedocs.io/en/latest/11_dizoo/index.html) | [如何迁移一个新的**强化学习环境**](https://di-engine-docs.readthedocs.io/zh_CN/latest/11_dizoo/index_zh.html)

**Bonus: Train RL agent in one line code:**
@@ -170,12 +169,13 @@ ding -m serial -e cartpole -p dqn -s 0
| 34 | [ICM](https://arxiv.org/pdf/1705.05363.pdf) | ![exp](https://img.shields.io/badge/-exploration-orange) | [ICM中文文档](https://di-engine-docs.readthedocs.io/zh_CN/latest/12_policies/icm_zh.html)<br>[reward_model/icm](https://github.com/opendilab/DI-engine/blob/main/ding/reward_model/icm_reward_model.py) | python3 -u cartpole_ppo_icm_config.py |
| 35 | [CQL](https://arxiv.org/pdf/2006.04779.pdf) | ![offline](https://img.shields.io/badge/-offlineRL-darkblue) | [policy/cql](https://github.com/opendilab/DI-engine/blob/main/ding/policy/cql.py) | python3 -u d4rl_cql_main.py |
| 36 | [TD3BC](https://arxiv.org/pdf/2106.06860.pdf) | ![offline](https://img.shields.io/badge/-offlineRL-darkblue) | [policy/td3_bc](https://github.com/opendilab/DI-engine/blob/main/ding/policy/td3_bc.py) | python3 -u mujoco_td3_bc_main.py |
| 37 | MBSAC([SAC](https://arxiv.org/abs/1801.01290)+[VE](https://arxiv.org/abs/1803.00101)+[SVG](https://arxiv.org/abs/1510.09142)) | ![continuous](https://img.shields.io/badge/-continous-green)![mbrl](https://img.shields.io/badge/-ModelBasedRL-lightblue) | [policy/mbpolicy/mbsac](https://github.com/opendilab/DI-engine/blob/main/ding/policy/mbpolicy/mbsac.py) | python3 -u pendulum_mbsac_mbpo_config.py \ python3 -u pendulum_mbsac_ddppo_config.py |
| 38 | [MBPO](https://arxiv.org/pdf/1906.08253.pdf) | ![mbrl](https://img.shields.io/badge/-ModelBasedRL-lightblue) | [world_model/mbpo](https://github.com/opendilab/DI-engine/blob/main/ding/world_model/mbpo.py) | python3 -u pendulum_sac_mbpo_config.py |
| 39 | [DDPPO](https://openreview.net/forum?id=rzvOQrnclO0) | ![mbrl](https://img.shields.io/badge/-ModelBasedRL-lightblue) | [world_model/ddppo](https://github.com/opendilab/DI-engine/blob/main/ding/world_model/ddppo.py) | python3 -u pendulum_mbsac_ddppo_config.py |
| 40 | [PER](https://arxiv.org/pdf/1511.05952.pdf) | ![other](https://img.shields.io/badge/-other-lightgrey) | [worker/replay_buffer](https://github.com/opendilab/DI-engine/blob/main/ding/worker/replay_buffer/advanced_buffer.py) | `rainbow demo` |
| 41 | [GAE](https://arxiv.org/pdf/1506.02438.pdf) | ![other](https://img.shields.io/badge/-other-lightgrey) | [rl_utils/gae](https://github.com/opendilab/DI-engine/blob/main/ding/rl_utils/gae.py) | `ppo demo` |
| 42 | [ST-DIM](https://arxiv.org/pdf/1906.08226.pdf) | ![other](https://img.shields.io/badge/-other-lightgrey) | [torch_utils/loss/contrastive_loss](https://github.com/opendilab/DI-engine/blob/main/ding/torch_utils/loss/contrastive_loss.py) | ding -m serial -c cartpole_dqn_stdim_config.py -s 0 |
| 37 | MBSAC([SAC](https://arxiv.org/abs/1801.01290)+[MVE](https://arxiv.org/abs/1803.00101)+[SVG](https://arxiv.org/abs/1510.09142)) | ![continuous](https://img.shields.io/badge/-continous-green)![mbrl](https://img.shields.io/badge/-ModelBasedRL-lightblue) | [policy/mbpolicy/mbsac](https://github.com/opendilab/DI-engine/blob/main/ding/policy/mbpolicy/mbsac.py) | python3 -u pendulum_mbsac_mbpo_config.py \ python3 -u pendulum_mbsac_ddppo_config.py |
| 38 | STEVESAC([SAC](https://arxiv.org/abs/1801.01290)+[STEVE](https://arxiv.org/abs/1807.01675)+[SVG](https://arxiv.org/abs/1510.09142)) | ![continuous](https://img.shields.io/badge/-continous-green)![mbrl](https://img.shields.io/badge/-ModelBasedRL-lightblue) | [policy/mbpolicy/mbsac](https://github.com/opendilab/DI-engine/blob/main/ding/policy/mbpolicy/mbsac.py) | python3 -u pendulum_stevesac_mbpo_config.py |
| 39 | [MBPO](https://arxiv.org/pdf/1906.08253.pdf) | ![mbrl](https://img.shields.io/badge/-ModelBasedRL-lightblue) | [world_model/mbpo](https://github.com/opendilab/DI-engine/blob/main/ding/world_model/mbpo.py) | python3 -u pendulum_sac_mbpo_config.py |
| 40 | [DDPPO](https://openreview.net/forum?id=rzvOQrnclO0) | ![mbrl](https://img.shields.io/badge/-ModelBasedRL-lightblue) | [world_model/ddppo](https://github.com/opendilab/DI-engine/blob/main/ding/world_model/ddppo.py) | python3 -u pendulum_mbsac_ddppo_config.py |
| 41 | [PER](https://arxiv.org/pdf/1511.05952.pdf) | ![other](https://img.shields.io/badge/-other-lightgrey) | [worker/replay_buffer](https://github.com/opendilab/DI-engine/blob/main/ding/worker/replay_buffer/advanced_buffer.py) | `rainbow demo` |
| 42 | [GAE](https://arxiv.org/pdf/1506.02438.pdf) | ![other](https://img.shields.io/badge/-other-lightgrey) | [rl_utils/gae](https://github.com/opendilab/DI-engine/blob/main/ding/rl_utils/gae.py) | `ppo demo` |
| 43 | [ST-DIM](https://arxiv.org/pdf/1906.08226.pdf) | ![other](https://img.shields.io/badge/-other-lightgrey) | [torch_utils/loss/contrastive_loss](https://github.com/opendilab/DI-engine/blob/main/ding/torch_utils/loss/contrastive_loss.py) | ding -m serial -c cartpole_dqn_stdim_config.py -s 0 |

![discrete](https://img.shields.io/badge/-discrete-brightgreen) means discrete action space, which is only label in normal DRL algorithms (1-18)

2 changes: 1 addition & 1 deletion conda/meta.yaml
@@ -1,7 +1,7 @@
{% set data = load_setup_py_data() %}
package:
name: di-engine
version: v0.3.1
version: v0.4.0

source:
path: ..
2 changes: 1 addition & 1 deletion ding/__init__.py
@@ -1,7 +1,7 @@
import os

__TITLE__ = 'DI-engine'
__VERSION__ = 'v0.3.1'
__VERSION__ = 'v0.4.0'
__DESCRIPTION__ = 'Decision AI Engine'
__AUTHOR__ = "OpenDILab Contributors"
__AUTHOR_EMAIL__ = "opendilab.contact@gmail.com"
82 changes: 81 additions & 1 deletion ding/data/buffer/deque_buffer.py
@@ -47,16 +47,34 @@ def clear(self):


class DequeBuffer(Buffer):
"""
Overview:
A buffer implementation based on the deque structure.
"""

def __init__(self, size: int) -> None:
"""
Overview:
The initialization method of DequeBuffer.
Arguments:
- size (:obj:`int`): The maximum number of objects that the buffer can hold.
"""
super().__init__(size=size)
self.storage = deque(maxlen=size)
# Meta index is a dict which use deque as values
self.indices = BufferIndex(maxlen=size)
# Meta index is a dict which uses deque as values
self.meta_index = {}

@apply_middleware("push")
def push(self, data: Any, meta: Optional[dict] = None) -> BufferedData:
"""
Overview:
The method that inputs the object and the related meta information into the buffer.
Arguments:
- data (:obj:`Any`): The input object, which can be in any format.
- meta (:obj:`Optional[dict]`): A dict that helps describe the data, such as\
category, label, priority, etc. Defaults to ``None``.
"""
return self._push(data, meta)

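The eviction behaviour implied by the ``deque(maxlen=size)`` storage above can be sketched with plain standard-library code. This is a hypothetical minimal sketch, not the real ``DequeBuffer`` API — the names ``MiniBuffer``, ``push``, and ``count`` are illustrative:

```python
from collections import deque

# Minimal sketch of the deque-backed push idea: a fixed-size deque
# silently evicts the oldest record once the buffer reaches capacity.
class MiniBuffer:
    def __init__(self, size: int) -> None:
        self.storage = deque(maxlen=size)

    def push(self, data, meta=None):
        record = (data, meta or {})
        self.storage.append(record)  # at capacity, the oldest record is dropped
        return record

    def count(self) -> int:
        return len(self.storage)

buf = MiniBuffer(size=3)
for i in range(5):
    buf.push(i)
# after 5 pushes into a size-3 buffer, only the 3 newest items remain
```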
@apply_middleware("sample")
@@ -70,6 +88,30 @@ def sample(
groupby: Optional[str] = None,
unroll_len: Optional[int] = None
) -> Union[List[BufferedData], List[List[BufferedData]]]:
"""
Overview:
The method that randomly samples data from the buffer, or retrieves certain data by indices.
Arguments:
- size (:obj:`Optional[int]`): The number of objects to be obtained from the buffer.
If ``indices`` is not specified, ``size`` is required to randomly sample the\
corresponding number of objects from the buffer.
- indices (:obj:`Optional[List[str]]`): Only used when you want to retrieve data by indices.
Defaults to ``None``.
- replace (:obj:`bool`): As the sampling process is carried out one by one, this parameter\
determines whether the previous samples will be put back into the buffer for subsequent\
sampling. Defaults to ``False``, which means that duplicate samples will not appear in one\
``sample`` call.
- sample_range (:obj:`Optional[slice]`): The range of indices to sample from. Defaults to ``None``,\
which means no restriction on the range of indices for the sampling process.
- ignore_insufficient (:obj:`bool`): Whether to suppress the ``ValueError`` raised when the\
sampled size is smaller than the required size. Defaults to ``False``.
- groupby (:obj:`Optional[str]`): If this parameter is activated, the method will return\
``size`` groups of objects, grouped by this meta key.
- unroll_len (:obj:`Optional[int]`): The unroll length of a trajectory, used only when\
``groupby`` is activated.
Returns:
- sampled_data (:obj:`Union[List[BufferedData], List[List[BufferedData]]]`): The sampling result.
"""
storage = self.storage
if sample_range:
storage = list(itertools.islice(self.storage, sample_range.start, sample_range.stop, sample_range.step))
@@ -124,6 +166,14 @@ def sample(

@apply_middleware("update")
def update(self, index: str, data: Optional[Any] = None, meta: Optional[dict] = None) -> bool:
"""
Overview:
The method that updates the data and the related meta information at a certain index.
Arguments:
- index (:obj:`str`): The index of the record to be updated.
- data (:obj:`Optional[Any]`): The data which is supposed to replace the old one. If you set it\
to ``None``, nothing will happen to the old record.
- meta (:obj:`Optional[dict]`): The new dict which is supposed to merge with the old one.
"""
if not self.indices.has(index):
return False
i = self.indices.get(index)
@@ -138,6 +188,12 @@ def update(self, index: str, data: Optional[Any] = None, meta: Optional[dict] =

@apply_middleware("delete")
def delete(self, indices: Union[str, Iterable[str]]) -> None:
"""
Overview:
The method that deletes the data and the related meta information by specific indices.
Arguments:
- indices (:obj:`Union[str, Iterable[str]]`): The indices of the data to be deleted from the buffer.
"""
if isinstance(indices, str):
indices = [indices]
del_idx = []
@@ -154,22 +210,46 @@ def delete(self, indices: Union[str, Iterable[str]]) -> None:
self.indices = BufferIndex(self.storage.maxlen, key_value_pairs)

def count(self) -> int:
"""
Overview:
The method that returns the current length of the buffer.
"""
return len(self.storage)

def get(self, idx: int) -> BufferedData:
"""
Overview:
The method that returns the BufferedData object given a specific index.
"""
return self.storage[idx]

@apply_middleware("clear")
def clear(self) -> None:
"""
Overview:
The method that clears all data, indices, and meta information in the buffer.
"""
self.storage.clear()
self.indices.clear()
self.meta_index = {}

def import_data(self, data_with_meta: List[Tuple[Any, dict]]) -> None:
"""
Overview:
The method that pushes data into the buffer in sequence.
Arguments:
- data_with_meta (:obj:`List[Tuple[Any, dict]]`): The sequence of ``(data, meta)`` tuples.
"""
for data, meta in data_with_meta:
self._push(data, meta)

def export_data(self) -> List[BufferedData]:
"""
Overview:
The method that exports all data in the buffer in sequence.
Returns:
- storage (:obj:`List[BufferedData]`): All ``BufferedData`` objects stored in the buffer.
"""
return list(self.storage)

def _push(self, data: Any, meta: Optional[dict] = None) -> BufferedData:
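The size/replace/sample_range behaviour documented in the ``sample`` docstring above can be sketched as follows. This is a hypothetical stdlib-only sketch under the assumption of a list-like storage; the function name is illustrative, not the real API:

```python
import random

# Sampling without replacement from stored data, optionally restricted
# to a slice range -- loosely mirroring sample(size=..., sample_range=...).
def sample_without_replacement(storage, size, sample_range=None):
    pool = list(storage)
    if sample_range is not None:
        pool = pool[sample_range]
    if size > len(pool):
        # mirrors the "insufficient sample" situation described in the docstring
        raise ValueError(f"pool has {len(pool)} items, but {size} were requested")
    return random.sample(pool, size)  # no duplicates within one call

picked = sample_without_replacement(range(10), size=4, sample_range=slice(5, None))
# every picked item is unique and comes from indices 5..9
```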
7 changes: 4 additions & 3 deletions ding/data/buffer/middleware/clone_object.py
@@ -5,9 +5,10 @@

def clone_object():
"""
This middleware freezes the objects saved in memory buffer as a copy,
try this middleware when you need to keep the object unchanged in buffer, and modify
the object after sampling it (usually in multiple threads)
Overview:
This middleware freezes the objects saved in the memory buffer and returns copies during sampling.
Try this middleware when you need to keep the object unchanged in the buffer while modifying\
the object after sampling it (usually in multiple threads).
"""

def push(chain: Callable, data: Any, *args, **kwargs) -> BufferedData:
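The freeze-and-copy idea behind ``clone_object`` can be sketched with ``copy.deepcopy``. This is a hypothetical sketch of the concept, not the actual middleware; the function name is illustrative:

```python
import copy

# Hand out a deep copy at sample time, so callers may freely mutate the
# sampled object without corrupting the record kept in the buffer.
def sample_with_clone(storage, index):
    return copy.deepcopy(storage[index])

storage = [{"obs": [1, 2, 3]}]
sampled = sample_with_clone(storage, 0)
sampled["obs"].append(4)   # mutate the copy only
# the buffered original still holds [1, 2, 3]
```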
16 changes: 16 additions & 0 deletions ding/data/buffer/middleware/priority.py
@@ -9,6 +9,10 @@


class PriorityExperienceReplay:
"""
Overview:
The middleware that implements priority experience replay (PER).
"""

def __init__(
self,
@@ -18,6 +22,18 @@ def __init__(
IS_weight_power_factor: float = 0.4,
IS_weight_anneal_train_iter: int = int(1e5),
) -> None:
"""
Arguments:
- buffer (:obj:`Buffer`): The buffer to apply PER to.
- IS_weight (:obj:`bool`): Whether to use importance sampling or not.
- priority_power_factor (:obj:`float`): The factor that adjusts the sensitivity between\
the sampling probability and the priority level.
- IS_weight_power_factor (:obj:`float`): The factor that adjusts the sensitivity between\
the sample rarity and the sampling probability in importance sampling.
- IS_weight_anneal_train_iter (:obj:`int`): The factor that controls the increase of\
``IS_weight_power_factor`` during training.
"""

self.buffer = buffer
self.buffer_idx = {}
self.buffer_size = buffer.size
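The core sampling rule of priority experience replay described above can be sketched as follows. This is a hypothetical simplification (the real middleware uses a segment tree and annealing schedule); ``per_sample``, ``alpha``, and ``beta`` are illustrative names roughly corresponding to ``priority_power_factor`` and ``IS_weight_power_factor``:

```python
import random

# Record i is sampled with probability priority_i**alpha / sum_j priority_j**alpha;
# an importance-sampling weight (1 / (N * P(i)))**beta, normalized by its
# maximum, corrects the bias introduced by non-uniform sampling.
def per_sample(priorities, alpha=0.6, beta=0.4):
    n = len(priorities)
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    idx = random.choices(range(n), weights=probs, k=1)[0]
    weights = [(1.0 / (n * p)) ** beta for p in probs]
    w_max = max(weights)
    return idx, weights[idx] / w_max
```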
7 changes: 7 additions & 0 deletions ding/data/buffer/middleware/sample_range_view.py
@@ -5,6 +5,13 @@


def sample_range_view(buffer_: 'Buffer', start: Optional[int] = None, end: Optional[int] = None) -> Callable:
"""
Overview:
The middleware that places restrictions on the range of indices during sampling.
Arguments:
- start (:obj:`Optional[int]`): The starting index; a negative value counts back from the end of the buffer.
- end (:obj:`Optional[int]`): The ending index (exclusive); a negative value counts back from the end of the buffer.
"""
assert start is not None or end is not None
if start and start < 0:
start = buffer_.size + start
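The negative-bound normalization visible in the snippet above (``start = buffer_.size + start``) can be sketched as a small pure function. This is a hypothetical sketch with illustrative names, assuming bounds are interpreted relative to the buffer size:

```python
# A negative start/end is interpreted relative to the buffer size, so
# start=-100 on a size-1000 buffer selects the newest 100 records.
def normalize_range(size, start=None, end=None):
    assert start is not None or end is not None
    if start is not None and start < 0:
        start = size + start
    if end is not None and end < 0:
        end = size + end
    return slice(start, end)

rng = normalize_range(1000, start=-100)   # slice(900, None)
```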
2 changes: 2 additions & 0 deletions ding/data/buffer/middleware/staleness_check.py
@@ -9,6 +9,8 @@ def staleness_check(buffer_: 'Buffer', max_staleness: int = float("inf")) -> Cal
This middleware aims to check staleness before each sample operation.
Staleness = train_iter_sample_data - train_iter_data_collected, which measures how old/off-policy the data is.
If a record's staleness is strictly greater than max_staleness, it will be removed from the buffer as soon as possible.
Arguments:
- max_staleness (:obj:`int`): The maximum legal span between the time of collection and the time of sampling.
"""

def push(next: Callable, data: Any, *args, **kwargs) -> Any:
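The staleness rule quoted in the docstring above can be sketched directly. This is a hypothetical sketch with illustrative names, assuming each record carries the training iteration at which it was collected:

```python
# staleness = train_iter_sample_data - train_iter_data_collected; a record
# is dropped when its staleness strictly exceeds max_staleness.
def is_stale(train_iter_now, train_iter_collected, max_staleness):
    return (train_iter_now - train_iter_collected) > max_staleness

def filter_fresh(records, train_iter_now, max_staleness):
    # each record is a (data, train_iter_collected) pair
    return [r for r in records if not is_stale(train_iter_now, r[1], max_staleness)]
```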