This part adds the divergence between the predicted trajectory and the sampled trajectory as an additional cost term,
i.e. (Kx + k - u).T * inverse_policy_variance_matrix * (Kx + k - u)
where u is the action sampled from the data
and Kx + k is the action predicted by the global policy network.
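To make the term above concrete, here is a minimal NumPy sketch of that quadratic penalty for a single time step. The function name `policy_divergence_cost` and the variable name `inv_pol_S` are made up for illustration and are not taken from the repo; the numbers are arbitrary.

```python
import numpy as np

def policy_divergence_cost(K, k, inv_pol_S, x, u):
    """Quadratic penalty (Kx + k - u)^T inv_pol_S (Kx + k - u).

    K, k      : linear policy terms (predicted action is K @ x + k)
    inv_pol_S : inverse policy covariance (hypothetical name)
    x, u      : state and sampled action for one time step
    """
    diff = K @ x + k - u
    return float(diff @ inv_pol_S @ diff)

# Arbitrary illustrative values, not from the repo.
K = np.array([[1.0, 0.0], [0.0, 2.0]])
k = np.array([0.5, -0.5])
inv_pol_S = np.eye(2)
x = np.array([1.0, 1.0])

# The predicted action here is K @ x + k = [1.5, 1.5].
cost_match = policy_divergence_cost(K, k, inv_pol_S, x, np.array([1.5, 1.5]))
cost_off = policy_divergence_cost(K, k, inv_pol_S, x, np.array([0.5, 1.5]))
```

The penalty vanishes when the sampled action equals the predicted one and grows quadratically with the mismatch, weighted by the inverse covariance.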
@wangsd01 Hi, I am also looking at these lines. Have you solved the problem?
I am not 100% sure what's happening, but one thing that looks especially suspicious to me is that the derivative with respect to u is Cov^{-1}.dot(k_old).
Looking at the forward pass in the code repo, it uses u = Kx + k rather than u = K(x - x_old) + k + u_old.
Therefore, I kinda feel that if we actually take the derivative of the KL penalty wrt u, we should get something like Cov^{-1}.dot(u_new - u_old) = Cov^{-1}.dot(K_new.dot(x) - K_old.dot(x) + k_new - k_old), which is != Cov^{-1}.dot(k_old) in general.
Not sure if I missed anything. Be great if you could help :( @cbfinn
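For anyone who wants to check this numerically, below is a rough sketch that compares the gradient expression above against finite differences. It assumes the penalty is written with a 1/2 factor, i.e. 0.5 * (u - u_old).T * Cov^{-1} * (u - u_old); without the 1/2 the gradient is simply doubled. All names and values are made up for the check and do not come from the repo.

```python
import numpy as np

def kl_penalty(u, u_old, inv_cov):
    # Assumed form: 0.5 * (u - u_old)^T Cov^{-1} (u - u_old)
    d = u - u_old
    return 0.5 * d @ inv_cov @ d

def numerical_grad(f, u, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(u)
    for i in range(u.size):
        e = np.zeros_like(u)
        e[i] = eps
        g[i] = (f(u + e) - f(u - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
dX, dU = 3, 2  # arbitrary state and action dimensions
K_new, K_old = rng.normal(size=(dU, dX)), rng.normal(size=(dU, dX))
k_new, k_old = rng.normal(size=dU), rng.normal(size=dU)
x = rng.normal(size=dX)
A = rng.normal(size=(dU, dU))
inv_cov = A @ A.T + np.eye(dU)  # symmetric positive definite

u_new = K_new @ x + k_new  # forward pass of the form u = Kx + k
u_old = K_old @ x + k_old

analytic = inv_cov @ (u_new - u_old)
numeric = numerical_grad(lambda u: kl_penalty(u, u_old, inv_cov), u_new)
k_only = inv_cov @ k_old  # the expression questioned above

print("analytic:", analytic)
print("numeric: ", numeric)
print("k_only:  ", k_only)
```

Under these assumptions the analytic and numeric gradients agree, while Cov^{-1}.dot(k_old) differs for generic K and x. This does not rule out that the repo uses a different parameterization in which the two coincide.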
Could you point me to a reference for this part of the code? Thank you!