-
I have read the paper Learning Off-Policy with Online Planning and took a quick look at the code of SafeLOOP, but I'm not clear about the meaning and function of some algorithm configs. Could you please explain the configs below? dynamics_cfgs
-
Certainly! Here's the explanation for these parameters:

- elite_size: The number of elite individuals selected from the sampled individuals. It is commonly used in the action-planning stage of model-based RL. (For more details, see Constrained Model-based Reinforcement Learning with Robust Cross-Entropy Method and The Cross-Entropy Method for Combinatorial and Continuous Optimization.)
- use_decay: Whether to use the weight-decay technique when updating the dynamics model.
- num_iterations: The number of iterations performed in planning.
- num_particles: During planning, the number of models used to evaluate an action trajectory.
- num_samples: The number of candidate action sequences sampled in each planning iteration.
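To see how these parameters fit together, here is a highly simplified cross-entropy-method planning loop. This is only an illustrative sketch, not the actual SafeLOOP/ARC planner: `evaluate_with_model` is a hypothetical stand-in for rolling an action sequence through the learned dynamics model, and the default values are arbitrary.

```python
import numpy as np

def cem_plan(evaluate_with_model, action_dim, horizon,
             num_iterations=8, num_samples=400, elite_size=40, num_particles=20):
    """Simplified CEM planner illustrating the configs above (not the real SafeLOOP code)."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(num_iterations):  # num_iterations: CEM refinement steps
        # num_samples: candidate action sequences drawn per iteration.
        candidates = mean + std * np.random.randn(num_samples, horizon, action_dim)
        # num_particles: how many model rollouts are averaged per candidate
        # (in practice, different members of a dynamics-model ensemble).
        returns = np.array([
            np.mean([evaluate_with_model(seq) for _ in range(num_particles)])
            for seq in candidates
        ])
        # elite_size: keep the best candidates and refit the sampling distribution.
        elites = candidates[np.argsort(returns)[-elite_size:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute the first action of the refined plan
```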
-
Hello,
-
Does SafeLOOP work in a vector environment? Is it feasible to directly modify the
-
When the SafeLOOP algorithm is applied in a custom environment, it cannot converge. So I decided to disable the cost limit until a certain epoch and then reinstate it. While the limit is disabled, the agent can learn effectively and gradually improve its performance. However, as soon as the restriction is reinstated, SafeARC's actions immediately become very bad. Note that the cost of most trajectories still meets the limit at this point, so SafeARC still uses the reward to update. Why is this happening, and how do I fix the problem?
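For context, here is a minimal sketch of the kind of cost-limit schedule described in this question, assuming the limit can be recomputed and passed to the algorithm each epoch. The function name and all numeric values are hypothetical; annealing the limit gradually, rather than switching it back on abruptly, is one variant worth trying.

```python
def cost_limit_schedule(epoch: int, warmup_epochs: int = 100,
                        ramp_epochs: int = 50, loose_limit: float = 500.0,
                        final_limit: float = 25.0) -> float:
    """Hypothetical cost-limit schedule.

    During warm-up the limit is so loose that it never binds; afterwards it
    is annealed down to `final_limit` over `ramp_epochs` instead of being
    reinstated abruptly as described above.
    """
    if epoch < warmup_epochs:
        return loose_limit  # limit effectively disabled: agent optimizes reward freely
    progress = min(1.0, (epoch - warmup_epochs) / ramp_epochs)
    return loose_limit + progress * (final_limit - loose_limit)
```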
-
When I use the SafeLOOP algorithm, I have to limit the output of the environment model because of the specifics of my environment. In some cases, ARC produces harmful actions when the real environment should have terminated: the output of the environment model has reached its boundary, so the state predicted for any action is the same. How do I deal with termination in my environment model? How can I avoid ARC's harmful actions and train the agent in this case?
-
Based on your description, it appears that in certain scenarios your environment may reach a terminal state. To address this, you should implement a termination_function. This function is critical during the planning phase: it evaluates each state to determine whether it is terminal. If a state is identified as terminal, the reward of the trajectory containing that state should be set to zero. This adjustment ensures that the planning algorithm does not select actions corresponding to these trajectories. For a practical implementation, you might find the following references useful:
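As a rough illustration of the idea above (not the actual SafeLOOP implementation, and separate from the referenced code), here is a hedged PyTorch sketch. The boundary check inside `termination_fn` is a hypothetical placeholder that you would replace with your environment model's actual termination condition.

```python
import torch

def termination_fn(states: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask (num_trajectories, horizon) marking imagined states
    that should be treated as terminal. The boundary check below is a
    hypothetical placeholder."""
    # e.g. terminal when the first state dimension sits at its clip boundary
    return states[..., 0].abs() >= 10.0

def mask_terminal_trajectories(rewards: torch.Tensor, states: torch.Tensor) -> torch.Tensor:
    """Zero the reward of every imagined trajectory that contains a terminal
    state, so the planner never selects the actions that produced it.
    rewards: (num_trajectories, horizon); states: (num_trajectories, horizon, obs_dim)."""
    contains_terminal = termination_fn(states).any(dim=1, keepdim=True)
    return rewards * (~contains_terminal).float()
```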