Continuous action example for IMPALA #543

Closed · Tracked by #548
supersglzc opened this issue Nov 15, 2022 · 2 comments
Labels
algo Add new algorithm or improve old one

Comments

@supersglzc commented Nov 15, 2022

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • system worker bug
    • system utils bug
    • code design/refactor
    • documentation request
    • new feature request
  • I have visited the readme and doc
  • I have searched through the issue tracker and pr tracker
  • I have mentioned version numbers, operating system and environment, where applicable:
    import ding, torch, sys
    print(ding.__version__, torch.__version__, sys.version, sys.platform)

Hi, I find that DI-Engine aims to support IMPALA for both continuous and discrete action spaces; however, there are no config examples for continuous action tasks. One thing I notice is that IMPALA uses arg-max for evaluation, which I suppose is designed specifically for discrete action tasks. I am wondering if we could have some demos of IMPALA on continuous action tasks?
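
For context, here is a minimal sketch of what I mean (my own illustration, not DI-Engine's actual evaluation code): a discrete head emits logits that can be arg-maxed, while a continuous (Gaussian) head would typically be evaluated with its mean instead.

import torch

def eval_action_discrete(logits: torch.Tensor) -> torch.Tensor:
    # Greedy evaluation for a discrete policy: index of the largest logit.
    return logits.argmax(dim=-1)

def eval_action_continuous(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    # Deterministic evaluation for a Gaussian policy: use the mean, ignore sigma.
    return mu

logits = torch.tensor([0.1, 2.3, -0.5])   # 3 discrete actions
mu = torch.zeros(4)                        # 4-dim continuous action (e.g. BipedalWalker)
sigma = torch.ones(4)
print(eval_action_discrete(logits))        # tensor(1)
print(eval_action_continuous(mu, sigma))   # tensor([0., 0., 0., 0.])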

I tried to make one, but it fails to run:

from easydict import EasyDict

bipedalwalker_impala_config = dict(
    exp_name='bipedalwalker_impala_seed0',
    env=dict(
        env_id='BipedalWalker-v3',
        collector_env_num=8,
        evaluator_env_num=5,
        # (bool) Scale output action into legal range.
        act_scale=True,
        n_evaluator_episode=5,
        stop_value=300,
        rew_clip=True,
        # The path to save the game replay
        # replay_path='./bipedalwalker_ppo_seed0/video',
    ),
    policy=dict(
        cuda=False,
        action_space='continuous',
        model=dict(
            action_space='continuous',
            obs_shape=24,
            action_shape=4,
        ),
        learn=dict(
            # (int) collect n_sample data, train model update_per_collect times
            # here we follow ppo serial pipeline
            update_per_collect=4,
            # (int) the number of data for a train iteration
            batch_size=16,
            learning_rate=0.0005,
            # (float) loss weight of the value network, the weight of policy network is set to 1
            value_weight=0.5,
            # (float) loss weight of the entropy regularization, the weight of policy network is set to 1
            entropy_weight=0.0001,
            # (float) discount factor for future reward, in range [0, 1]
            discount_factor=0.9,
            # (float) additional discounting parameter
            lambda_=0.95,
            # (int) the trajectory length to calculate v-trace target
            unroll_len=32,
            # (float) clip ratio of importance weights
            rho_clip_ratio=1.0,
            # (float) clip ratio of importance weights
            c_clip_ratio=1.0,
            # (float) clip ratio of importance sampling
            rho_pg_clip_ratio=1.0,
        ),
        collect=dict(
            # (int) collect n_sample data, train model n_iteration times
            n_sample=16,
            # (int) the trajectory length to calculate v-trace target
            unroll_len=32,
            # (float) discount factor for future reward, in range [0, 1]
            discount_factor=0.9,
            gae_lambda=0.95,
            collector=dict(collect_print_freq=1000, ),
        ),
        eval=dict(evaluator=dict(eval_freq=200, )),
        other=dict(replay_buffer=dict(
            replay_buffer_size=1000,
            max_use=16,
        ), ),
    ),
)
bipedalwalker_impala_config = EasyDict(bipedalwalker_impala_config)
main_config = bipedalwalker_impala_config
bipedalwalker_impala_create_config = dict(
    env=dict(
        type='bipedalwalker',
        import_names=['dizoo.box2d.bipedalwalker.envs.bipedalwalker_env'],
    ),
    env_manager=dict(type='base'),
    policy=dict(type='impala'),
)
bipedalwalker_impala_create_config = EasyDict(bipedalwalker_impala_create_config)
create_config = bipedalwalker_impala_create_config

if __name__ == "__main__":
    # or you can enter `ding -m serial_onpolicy -c bipedalwalker_impala_config.py -s 0`
    from ding.entry import serial_pipeline_onpolicy
    serial_pipeline_onpolicy([main_config, create_config], seed=0)
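
As a side note on the clipping parameters in the config above, here is a minimal numpy sketch of how rho_clip_ratio and c_clip_ratio enter the v-trace target. It is my own paraphrase of the recursion from the IMPALA paper, not DI-Engine's internals.

import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.9, rho_clip=1.0, c_clip=1.0):
    # rewards, values, log_rhos: 1-D arrays of length T; bootstrap_value: V(x_T).
    rhos = np.exp(log_rhos)                      # importance ratios pi(a|x) / mu(a|x)
    clipped_rhos = np.minimum(rho_clip, rhos)    # rho_t = min(rho_bar, pi/mu)
    clipped_cs = np.minimum(c_clip, rhos)        # c_t   = min(c_bar,   pi/mu)
    next_values = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * next_values - values)
    vs_minus_v = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):      # backward recursion over the unroll
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v                   # v-trace targets v_s

T = 5
rng = np.random.default_rng(0)
print(vtrace_targets(rng.random(T), rng.random(T), 0.5, np.log(rng.random(T) + 0.5)))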

@PaParaZz1 (Member)

The current implementation only supports the discrete action space; we will add a continuous version and related examples next week. You can continue to follow this issue.

PaParaZz1 added the algo (Add new algorithm or improve old one) label on Nov 16, 2022
@PaParaZz1 (Member)

This issue has been resolved in #551.
