Issue in forward(....) function of class ActorCriticPolicy while working on Custom Gym Environment. #2043
Comments
Yes, yes, I have done that already. After the line
Because earlier, I was getting the error:
Sorry, I read too fast; I thought the issue was the concatenation. By the way, SB3 does support dict obs. I'll have a look later, but there is no reason the transition from the step method would not be taken into account.
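As a minimal sketch of that point: PPO can consume a Dict observation space directly via the built-in "MultiInputPolicy", with no FlattenObservation wrapper. The environment name DictObsEnv and the numbers below are placeholders for illustration, not code from this thread.

import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import PPO


class DictObsEnv(gym.Env):
    # Hypothetical environment with a Dict observation space, for illustration only.
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Dict(
            {
                "context": spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32),
                "score": spaces.Box(low=0, high=1, shape=(1,), dtype=np.float32),
            }
        )
        self.action_space = spaces.Discrete(5)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return {"context": np.zeros(2, dtype=np.float32), "score": np.zeros(1, dtype=np.float32)}, {}

    def step(self, action):
        obs = {"context": np.zeros(2, dtype=np.float32), "score": np.array([0.5], dtype=np.float32)}
        return obs, 0.0, False, False, {}


# "MultiInputPolicy" handles the Dict observation space natively.
model = PPO("MultiInputPolicy", DictObsEnv(), n_steps=8, batch_size=8, verbose=0).learn(16)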
Using this minimal code, I don't see any problem:

import gymnasium as gym
import numpy as np
import torch
import torch as th
from gymnasium import spaces
from gymnasium.wrappers import FlattenObservation

from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.policies import ActorCriticPolicy


class CustomEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Dict(
            {
                "context": spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32),
                "score": spaces.Box(low=0, high=1, shape=(1,), dtype=np.float32),
            }
        )
        self.action_space = gym.spaces.Discrete(5)

    def reset(self, seed=None, options=None):
        return {
            "context": np.array([1.0, 2.0], dtype=np.float32),
            "score": np.array([0.0], dtype=np.float32),
        }, {}

    def step(self, action):
        reward = 0.0
        terminated = False
        truncated = False
        info = {}
        return (
            {
                "context": np.array([1.0, 2.0], dtype=np.float32),
                "score": np.array([0.98], dtype=np.float32),
            },
            reward,
            terminated,
            truncated,
            info,
        )


class CustomPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, **kwargs):
        super().__init__(observation_space, action_space, lr_schedule, **kwargs)
        # Policy head (action logits) and value head
        self.policy_head = torch.nn.Linear(10, self.action_space.n)
        self.value_head = torch.nn.Linear(10, 1)

    def forward(self, obs: th.Tensor, deterministic: bool = False) -> tuple[th.Tensor, th.Tensor, th.Tensor]:
        """
        Forward pass in all the networks (actor and critic)

        :param obs: Observation
        :param deterministic: Whether to sample or use deterministic actions
        :return: action, value and log probability of the action
        """
        print("obs in the implicitly defined forward function is: ", obs)
        return super().forward(obs, deterministic)


env = CustomEnv()
env = FlattenObservation(env)
check_env(env)

model = PPO(
    CustomPolicy,
    env,
    verbose=1,
    n_steps=8,
    batch_size=8,
    n_epochs=1,
).learn(10)
Hi, thanks for the reply. I just made a small change in the code that you provided, and I reproduced the same problem regarding the forward function. The change is the following:

import gymnasium as gym
import numpy as np
import json
import torch
import torch as th
from gymnasium import spaces
from gymnasium.wrappers import FlattenObservation

from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.policies import ActorCriticPolicy


class CustomEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.Threshold = 0.95
        self.observation_space = spaces.Dict(
            {
                "context": spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32),
                "score": spaces.Box(low=0, high=1, shape=(1,), dtype=np.float32),
            }
        )
        self.action_space = gym.spaces.Discrete(5)

    def reset(self, seed=None, options=None):
        return {
            "context": np.array([1.0, 2.0], dtype=np.float32),
            "score": np.array([0.0], dtype=np.float32),
        }, {}

    def step(self, action):
        reward = 0.0
        terminated = 0.98 >= self.Threshold
        truncated = False
        info = {}
        return {
            "context": np.array([1.0, 2.0], dtype=np.float32),
            "score": np.array([0.98], dtype=np.float32),
        }, reward, terminated, truncated, info


class CustomPolicy(ActorCriticPolicy):
    def __init__(self, observation_space, action_space, lr_schedule, **kwargs):
        super().__init__(observation_space, action_space, lr_schedule, **kwargs)
        # Policy head (action logits) and value head
        self.policy_head = torch.nn.Linear(10, self.action_space.n)
        self.value_head = torch.nn.Linear(10, 1)


env = CustomEnv()
env = FlattenObservation(env)
check_env(env)

model = PPO(
    CustomPolicy,
    env,
    verbose=1,
    n_steps=5,
).learn(10)

In terms of execution, this is now the order: reset --> forward --> step --> reset --> forward --> step --> reset --> forward --> and so on. Actually, in the code, I am using training data that I pass to the
This is correct: the terminal observation is only used to predict the value, not the action. See stable-baselines3/stable_baselines3/common/on_policy_algorithm.py, lines 234 to 245 at commit daaebd0.
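For context, the referenced block in collect_rollouts() does roughly the following; this is a paraphrased sketch, not the verbatim source at that commit, and the wrapper function name is purely illustrative. When a sub-environment finishes by truncation, the value of the observation stored under info["terminal_observation"] is predicted and added (discounted) to the last reward, so the terminal observation only ever passes through the value network.

import torch as th

def bootstrap_truncated_episodes(policy, rewards, dones, infos, gamma):
    # Hypothetical standalone wrapper around the logic; `rewards`, `dones`, `infos`
    # are the per-env outputs of the VecEnv step, `policy` is the ActorCriticPolicy.
    for idx, done in enumerate(dones):
        if (
            done
            and infos[idx].get("terminal_observation") is not None
            and infos[idx].get("TimeLimit.truncated", False)
        ):
            terminal_obs = policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
            with th.no_grad():
                # Value head only: no action is ever sampled for the terminal observation.
                terminal_value = policy.predict_values(terminal_obs)[0]
            # Bootstrap the return of the truncated episode.
            rewards[idx] += gamma * float(terminal_value)
    return rewards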
(See the links in our documentation and related issues that explain the distinction between termination and truncation.) Predicting an action here would not make sense, since the episode is over (so the action would not be executed). Note: the
Thanks for the reply and the code. I looked at the code and the documentation.

def step(self, action):
    reward = 0.0
    terminated = bool(0.98 >= self.Threshold)
    truncated = False
    info = {}
    obs = {
        "context": np.array([1.0, 2.0], dtype=np.float32),
        "score": np.array([0.98], dtype=np.float32),
    }
    if terminated:
        info = {"terminal_observation": obs}
        return obs, reward, False, truncated, info
    return obs, reward, terminated, truncated, info

I guess now it will cover the case where the episode terminates in the first step.
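If it helps, one way to double-check what forward() actually receives with this change; this is a hypothetical sketch that assumes the CustomPolicy printing obs in forward() from the first snippet and a CustomEnv whose step() is the version above:

# Since terminated is now always reported as False, the VecEnv never auto-resets,
# so forward() receives the observation returned by step() (score = 0.98) instead
# of a fresh reset observation.
env = FlattenObservation(CustomEnv())
model = PPO(CustomPolicy, env, n_steps=4, batch_size=4, n_epochs=1, verbose=0)
model.learn(8)  # the printed obs should now include the 0.98 score after the first step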
yes
I'm not sure about that one. But it sounds about right. If your episodes are single steps, then it will use single steps to train (which sounds reasonable).
It seems that you are changing the problem you want to solve. Finishing after one step doesn't seem bad if it reaches the objective you are asking for.
🐛 Bug
I have created a custom environment as well as a custom ActorCritic policy. In the custom environment, I have two functions, reset and step. I initialize a variable score to 0 in the reset function, but in the step function this variable is calculated and gets a non-zero value (say 0.95). I return this score variable in both the reset and step functions, but when I print the variable obs inside the forward function of the class ActorCriticPolicy given in the file common/policies.py, it always prints the value 0, i.e. it always takes the value of the score that I initialize in the reset function, not the one from the step function.

Code example
As shown in the above code, I didn't implement a custom forward method for my CustomPolicy class. When I try to print the obs variable inside the forward function defined in the ActorCriticPolicy class in the file common/policies.py, I always see the value of the score as 0. The same goes for the variable context. The problem is that I am afraid that during rollout, when the time comes to update the policy, it doesn't take into account the score and context values (in the observation) from the step function, but from the reset function.

Observation returned by the reset method is:

Observation returned by the step method is:

But obs shown in the forward function of the ActorCriticPolicy class is:

As you can see, the forward function shows the context and score returned from the reset function, not from the step function.

Relevant log output / Error message
No response

System Info
No response

Checklist