polish(davide) add example of Gail entry + config for Mujoco and Cartpole #114

davide97l · 2021-11-01T06:26:14Z

Description

This PR adds a config and an entry for training an imitation policy that imitates an expert policy using GAIL. The same entry can be generalized to any reward model whose interface is similar to GAIL's. It includes the data collection phase and can save both the IL policy and the reward model during training.

Important: this PR also changes the formulas of the GAIL loss and reward. The changes were approved after a discussion with @Will-Nie and @PaParaZz1. The loss function now follows the formula defined in the original paper.
Check the related comment for more details.

Related Issue

TODO

Check List

merge the latest version source branch/repo, and resolve all the conflicts
pass style check
pass all the tests

Now the loss is the same as in the paper

davide97l · 2021-11-02T08:36:11Z

As we can see from the plot above, with the new formula GAIL can train Cartpole faster:

The loss of GAIL changes in the following way:

[Figure: GAIL loss curve]

davide97l · 2021-11-04T05:46:25Z

The reward function has also been changed, which further improves GAIL performance.

Will-Nie · 2021-11-04T06:58:37Z

ding/reward_model/gail_irl_model.py

        reward = torch.chunk(reward, reward.shape[0], dim=0)
        for item, rew in zip(data, reward):
-            item['reward'] = rew
+            item['reward'] = -torch.log(rew)


Shall we add a small number inside the log to avoid a potential explosion? E.g. -torch.log(rew + 1e-8)

It makes sense, I will add it. Thank you for catching this.
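A minimal sketch of why the small constant matters, written with the standard library only so it runs without PyTorch (the PR applies the same arithmetic to tensors with torch.log). The helper name `gail_reward` and its `eps` default are illustrative assumptions; only the `-log(rew + 1e-8)` form comes from the review comment.

```python
import math


def gail_reward(d_out: float, eps: float = 1e-8) -> float:
    """Reward -log(D(s, a) + eps) for one discriminator output in [0, 1].

    `gail_reward` is a hypothetical helper used for illustration only.
    """
    return -math.log(d_out + eps)


# Without eps, an output of exactly 0 would raise in math.log
# (torch.log would return -inf, so the reward would become +inf).
print(gail_reward(0.0))  # large but finite
print(gail_reward(0.5))  # close to log(2)
```

With the constant in place the reward stays bounded even when the discriminator saturates at 0.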

Will-Nie · 2021-11-07T12:58:54Z

dizoo/classic_control/cartpole/config/cartpole_dqn_gail_config.py

+        learning_rate=1e-3,
+        update_per_collect=100,
+        expert_data_path='cartpole_dqn/expert_data_train.pkl',
+        load_path='cartpole_dqn_gail/reward_model/ckpt/ckpt_last.pth.tar',


Does load_path refer to the well-trained reward model that is saved at the end? Can we make this clearer, so that other users directly understand that 'load_path' refers to the final reward model the algorithm learned?

Also, why do we need the load_path key here?

There are two load_path keys:

policy.load_path is the path where the state_dict of the policy is saved.

reward_model.load_path is the path where the state_dict of the reward model is saved.

I will write a comment in the config file to clarify it.
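A sketch of how the two keys could sit side by side in a config, with the clarifying comments discussed here. The nesting and the variable name `gail_config` are assumptions for illustration, and the policy path is a made-up placeholder; only the reward_model values appear in the diff under review.

```python
# Layout assumed for illustration; not the actual config file from the PR.
gail_config = dict(
    policy=dict(
        # Hypothetical placeholder path: state_dict of the policy.
        load_path='path/to/policy_ckpt.pth.tar',
    ),
    reward_model=dict(
        learning_rate=1e-3,
        update_per_collect=100,
        expert_data_path='cartpole_dqn/expert_data_train.pkl',
        # state_dict of the reward model (not the policy).
        load_path='cartpole_dqn_gail/reward_model/ckpt/ckpt_last.pth.tar',
    ),
)

print(gail_config['policy']['load_path'])
print(gail_config['reward_model']['load_path'])
```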

Will-Nie · 2021-11-07T13:00:50Z

dizoo/classic_control/cartpole/entry/cartpole_dqn_gail_main.py

+            os.makedirs(path)
+        except FileExistsError:
+            pass
+    path = os.path.join(path, 'ckpt_last.pth.tar')


Maybe 'ckpt_final_reward_model.pth.tar' is clearer? (We had ckpt_best.pth.tar before, referring to the best model for training the policy.)

Line 143, path = os.path.join(cfg.exp_name, 'reward_model', 'ckpt'), creates a new directory (reward_model/ckpt) in which to save the state dict of the reward model. Given the explicit name, it should be clear that the pth.tar files inside that directory hold the state dict of the reward model.
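The path construction described above can be sketched as follows. The function name `reward_model_ckpt_path` is hypothetical, and `os.makedirs(..., exist_ok=True)` is swapped in for the `try`/`except FileExistsError` shown in the diff; the directory and file names are the ones quoted in this thread.

```python
import os
import tempfile


def reward_model_ckpt_path(exp_name: str) -> str:
    """Create <exp_name>/reward_model/ckpt if needed and return the file path.

    Hypothetical helper for illustration; not a function from the PR.
    """
    ckpt_dir = os.path.join(exp_name, 'reward_model', 'ckpt')
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(ckpt_dir, exist_ok=True)
    return os.path.join(ckpt_dir, 'ckpt_last.pth.tar')


# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = reward_model_ckpt_path(os.path.join(tmp, 'cartpole_dqn_gail'))
    print(os.path.relpath(path, tmp))
```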

PaParaZz1 · 2021-11-16T12:35:50Z

ding/reward_model/gail_irl_model.py

@@ -119,7 +163,7 @@ def _train(self, train_data: torch.Tensor, expert_data: torch.Tensor) -> float:
        loss_1: torch.Tensor = torch.log(out_1 + 1e-8).mean()
        out_2: torch.Tensor = self.reward_model(expert_data)
        loss_2: torch.Tensor = torch.log(1 - out_2 + 1e-8).mean()
-        loss: torch.Tensor = loss_1 + loss_2
+        loss: torch.Tensor = - (loss_1 + loss_2)


Add a comment for this modification.

I will add this comment: log(x) with 0 < x < 1 is negative, so loss_1 + loss_2 is negative; we negate the sum and minimize that instead.
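A small numeric sketch of the sign change, using plain Python lists instead of tensors so it runs without PyTorch. It assumes `out_1` is the discriminator output on the policy's data (that line is not shown in the diff) and `out_2` the output on expert data; the function name `discriminator_loss` is illustrative.

```python
import math
from typing import Sequence


def discriminator_loss(out_policy: Sequence[float],
                       out_expert: Sequence[float],
                       eps: float = 1e-8) -> float:
    """Negated sum of mean log(D) on policy data and mean log(1 - D) on expert data."""
    loss_1 = sum(math.log(d + eps) for d in out_policy) / len(out_policy)
    loss_2 = sum(math.log(1 - d + eps) for d in out_expert) / len(out_expert)
    # Both means are negative for outputs strictly inside (0, 1), so the
    # negation yields a positive quantity that shrinks as the discriminator
    # separates the two data sources.
    return -(loss_1 + loss_2)


separated = discriminator_loss([0.9, 0.8], [0.1, 0.2])
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
print(separated, confused)
```

A discriminator that separates policy and expert samples well gets a smaller loss than one that outputs 0.5 everywhere, which is the behavior a minimizer needs.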

PaParaZz1 · 2021-11-16T13:55:50Z

ding/reward_model/gail_irl_model.py

+        self.fc2 = nn.Linear(64 + 1, 1)  # here we add 1 to take consideration of the action concat
+        self.a = nn.Sigmoid()
+
+    def forward(self, x: torch.Tensor, ) -> torch.Tensor:


remove this redundant comma

PaParaZz1 · 2021-11-16T13:56:30Z

ding/reward_model/gail_irl_model.py

+        self.conv3 = nn.Conv2d(16, 16, 3, stride=1)
+        self.conv4 = nn.Conv2d(16, 16, 3, stride=1)
+        self.fc1 = nn.Linear(784, 64)
+        self.fc2 = nn.Linear(64 + 1, 1)  # here we add 1 to take consideration of the action concat


you should set action_size argument

PaParaZz1 · 2021-11-16T13:58:03Z

ding/reward_model/gail_irl_model.py

+
+    def forward(self, x: torch.Tensor, ) -> torch.Tensor:
+        # input: x = [B, 4 x 84 x 84 + 1], last element is action
+        actions = torch.unsqueeze(x[:, -1], -1)  # [B, 1]


The action should be transformed into a one-hot vector, because it is a categorical variable rather than a scalar variable.
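A minimal sketch of the suggested encoding, in plain Python so it runs without PyTorch (for tensors, `torch.nn.functional.one_hot` does the same job). The helper names and the feature size of 64 are taken from or modeled on the diff for illustration; feeding `64 + action_size` inputs to the last layer instead of `64 + 1` is an assumption about how the change would be wired in.

```python
from typing import List, Sequence


def one_hot(action: int, action_size: int) -> List[float]:
    """Encode a discrete action index as a one-hot vector of length action_size."""
    if not 0 <= action < action_size:
        raise ValueError(f"action {action} outside [0, {action_size})")
    return [1.0 if i == action else 0.0 for i in range(action_size)]


def concat_action(features: Sequence[float], action: int, action_size: int) -> List[float]:
    """Append the one-hot action to a feature vector (hypothetical helper)."""
    return list(features) + one_hot(action, action_size)


print(one_hot(2, 4))
print(len(concat_action([0.0] * 64, 3, 6)))
```

Concatenating the raw index would impose an ordering on the actions (action 3 "greater than" action 1), which a one-hot vector avoids.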

ding/reward_model/gail_irl_model.py

dizoo/classic_control/cartpole/config/cartpole_dqn_config.py

dizoo/box2d/lunarlander/entry/lunarlander_dqn_gail_main.py

…pole (opendilab#114)

* added gail entry
* added lunarlander and cartpole config
* added gail mujoco config
* added mujoco exp
* update22-10
* added third exp
* added metric to evaluate policies
* added GAIL entry and config for Cartpole and Walker2d
* checked style and unittest
* restored lunarlander env
* style problems
* bug correction
* Delete expert_data_train.pkl
* changed loss of GAIL
* Update walker2d_ddpg_gail_config.py
* changed gail reward from -D(s, a) to -log(D(s, a))
* added small constant to reward function
* added comment to clarify config
* Update walker2d_ddpg_gail_config.py
* added lunarlander entry + config
* Added Atari discriminator + Pong entry config
* Update gail_irl_model.py
* Update gail_irl_model.py
* added gail serial pipeline and onehot actions for gail atari
* related to previous commit
* removed main files
* removed old comment


davide97l added 14 commits October 18, 2021 16:50

added gail entry

3a975df

added lunarlander and cartpole config

2de346f

added gail mujoco config

2c0b62f

added mujoco exp

c554094

update22-10

46f01d6

added third exp

180bc49

added metric to evaluate policies

2767f5a

Merge branch 'opendilab:main' into gail-entry-config

9bbd325

added GAIL entry and config for Cartpole and Walker2d

147310d

checked style and unittest

804e88f

restored lunarlander env

0dbcf26

style problems

fb3dde0

bug correction

a4d8871

Delete expert_data_train.pkl

8c8998d

PaParaZz1 added the algo (Add new algorithm or improve old one) and serial (Serial training related) labels Nov 1, 2021

davide97l added 3 commits November 2, 2021 15:42

changed loss of GAIL

30884c4

Now the loss is the same as in the paper

Merge branch 'opendilab:main' into gail-entry-config

d3cb245

Update walker2d_ddpg_gail_config.py

1ecd8b4

davide97l changed the title ~~Added example of Gail entry + config for Mujoco and Cartpole~~ (davide) add example of Gail entry + config for Mujoco and Cartpole Nov 2, 2021

davide97l added 2 commits November 3, 2021 16:08

Merge branch 'opendilab:main' into gail-entry-config

4a41d0b

changed gail reward from -D(s, a) to -log(D(s, a))

db0ade6

PaParaZz1 changed the title ~~(davide) add example of Gail entry + config for Mujoco and Cartpole~~ polish(davide) add example of Gail entry + config for Mujoco and Cartpole Nov 4, 2021

Will-Nie reviewed Nov 4, 2021

added small constant to reward function

7d54853

Will-Nie reviewed Nov 7, 2021

added comment to clarify config

fa28c4a

davide97l added 5 commits November 8, 2021 10:01

Update walker2d_ddpg_gail_config.py

7ac7627

added lunarlander entry + config

c4fe1ca

Added Atari discriminator + Pong entry config

93c173a

Update gail_irl_model.py

a1ab0d9

Update gail_irl_model.py

56b1ad5

PaParaZz1 requested changes Nov 16, 2021

davide97l added 4 commits November 19, 2021 17:19

added gail serial pipeline and onehot actions for gail atari

6b38eee

related to previous commit

c8743a1

removed main files

7499588

removed old comment

1284543

PaParaZz1 approved these changes Nov 19, 2021

PaParaZz1 merged commit d1bc138 into opendilab:main Nov 19, 2021
