-
I have read the paper Learning Off-Policy with Online Planning and took a quick look at the code of SafeLOOP, but I'm not clear about the meaning and function of some algorithm configs. Could you please explain the configs below? dynamics_cfgs
-
Certainly! Here's the explanation for these parameters:

- elite_size: The number of elite individuals selected from the sampled individuals. It is commonly used in the action-planning stage of model-based RL. (For more details, see Constrained Model-based Reinforcement Learning with Robust Cross-Entropy Method and The Cross-Entropy Method for Combinatorial and Continuous Optimization.)
- use_decay: Whether to use the weight-decay technique when updating the dynamics model.
- num_iterations: The number of iterations performed in planning.
- num_particles: During planning, the number of models used to evaluate an action trajectory.
- num_samples: The number of candidate action sequences sampled in each planning iteration.
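To see how these parameters fit together, here is a highly simplified cross-entropy-method planning loop. This is only an illustrative sketch, not the actual SafeLOOP/ARC planner: `evaluate_with_model` is a hypothetical stand-in for rolling an action sequence through the learned dynamics model, and the default values are arbitrary.

```python
import numpy as np

def cem_plan(evaluate_with_model, action_dim, horizon,
             num_iterations=8, num_samples=400, elite_size=40, num_particles=20):
    """Simplified CEM planner illustrating the configs above (not the real SafeLOOP code)."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(num_iterations):  # num_iterations: CEM refinement steps
        # num_samples: candidate action sequences drawn per iteration.
        candidates = mean + std * np.random.randn(num_samples, horizon, action_dim)
        # num_particles: how many model rollouts are averaged per candidate
        # (in practice, different members of a dynamics-model ensemble).
        returns = np.array([
            np.mean([evaluate_with_model(seq) for _ in range(num_particles)])
            for seq in candidates
        ])
        # elite_size: keep the best candidates and refit the sampling distribution.
        elites = candidates[np.argsort(returns)[-elite_size:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute the first action of the refined plan
```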
-
Hello,
-
Does SafeLOOP work in a vector environment? Is it feasible to directly modify the
-
When the SafeLOOP algorithm is applied in a custom environment, it cannot converge. So I decided to disable the cost limit until a certain epoch and then reinstate it. While the limit is disabled, the agent can learn effectively and gradually improve its performance. However, as soon as the restriction is reinstated, SafeARC's actions immediately become very bad. Note that the cost of most trajectories still meets the limit at this point, so SafeARC still uses the reward to update. Why is this happening, and how do I fix the problem?
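For context, here is a minimal sketch of the kind of cost-limit schedule described in this question, assuming the limit can be recomputed and passed to the algorithm each epoch. The function name and all numeric values are hypothetical; annealing the limit gradually, rather than switching it back on abruptly, is one variant worth trying.

```python
def cost_limit_schedule(epoch: int, warmup_epochs: int = 100,
                        ramp_epochs: int = 50, loose_limit: float = 500.0,
                        final_limit: float = 25.0) -> float:
    """Hypothetical cost-limit schedule.

    During warm-up the limit is so loose that it never binds; afterwards it
    is annealed down to `final_limit` over `ramp_epochs` instead of being
    reinstated abruptly as described above.
    """
    if epoch < warmup_epochs:
        return loose_limit  # limit effectively disabled: agent optimizes reward freely
    progress = min(1.0, (epoch - warmup_epochs) / ramp_epochs)
    return loose_limit + progress * (final_limit - loose_limit)
```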
-
When I use the SafeLOOP algorithm, I have to limit the output of the environment model because of the specifics of my environment. In some cases, ARC produces harmful actions when the real environment should have terminated: the output of the environment model has reached its boundary, so the state predicted for any action is the same. How do I deal with termination in my environment model? How can I avoid ARC's harmful actions and train the agent in this case?
-
Based on your description, it appears that in certain scenarios your environment may reach a terminal state. To address this, you should implement a termination_function. This function is critical during the planning phase: it evaluates each state to determine whether it is terminal. If a state is identified as terminal, the reward of the trajectory containing that state should be set to zero. This adjustment ensures that the planning algorithm does not select actions corresponding to these trajectories. For a practical implementation, you might find the following references useful:
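As a rough illustration of the idea above (not the actual SafeLOOP implementation, and separate from the referenced code), here is a hedged PyTorch sketch. The boundary check inside `termination_fn` is a hypothetical placeholder that you would replace with your environment model's actual termination condition.

```python
import torch

def termination_fn(states: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask (num_trajectories, horizon) marking imagined states
    that should be treated as terminal. The boundary check below is a
    hypothetical placeholder."""
    # e.g. terminal when the first state dimension sits at its clip boundary
    return states[..., 0].abs() >= 10.0

def mask_terminal_trajectories(rewards: torch.Tensor, states: torch.Tensor) -> torch.Tensor:
    """Zero the reward of every imagined trajectory that contains a terminal
    state, so the planner never selects the actions that produced it.
    rewards: (num_trajectories, horizon); states: (num_trajectories, horizon, obs_dim)."""
    contains_terminal = termination_fn(states).any(dim=1, keepdim=True)
    return rewards * (~contains_terminal).float()
```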