Hello, I am currently studying offline reinforcement learning and came across BCQ. It's great work and well worth delving into. However, I have some questions about the paper that I'd like to clarify to make sure I haven't misunderstood anything. My questions may be numerous, but I genuinely want to understand the experimental details.
Here are my questions:
In Figure 1, does "Off-policy DDPG" refer to DDPG trained on a fixed dataset without any interaction with the environment? Additionally, as the benchmark for comparison, does "Behavioral" refer to DDPG trained with the standard online training process?
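To make sure I'm reading this correctly, here is a minimal sketch of the distinction as I understand it (the `behavioral_agent`, `offline_agent`, `env`, and `buffer` names are hypothetical placeholders, not the paper's code):

```python
# Hypothetical sketch of my understanding -- not the actual experiment code.

# "Behavioral" DDPG: the usual online loop, collecting its own transitions.
for step in range(num_steps):
    action = behavioral_agent.select_action(state, noise=True)
    next_state, reward, done, _ = env.step(action)
    buffer.add(state, action, reward, next_state, done)  # buffer keeps growing
    behavioral_agent.train(buffer.sample(batch_size))
    state = env.reset() if done else next_state

# "Off-policy" DDPG: trained purely from a given buffer, never calling env.step().
for step in range(num_steps):
    offline_agent.train(buffer.sample(batch_size))
```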
In Figure 1, for the three experiments with different buffers, is "Final" to be understood as: train a behavioral DDPG agent, record the transitions generated during its training, and then use the resulting final buffer as a fixed dataset for off-policy DDPG training (with no new transitions added to the buffer during that training)? Can "Concurrent" be understood simply as the off-policy DDPG agent gradually seeing transitions from early-stage to late-stage as the behavioral agent trains, rather than having the chance to sample late-stage transitions right from the beginning?
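In pseudocode, my reading of the two buffer settings is roughly the following (again a hypothetical sketch with placeholder names, not the actual experiment code):

```python
# Final buffer: behavioral training finishes first; the offline agent then
# trains on the frozen buffer and can sample any transition from the start.
for step in range(num_behavioral_steps):
    run_behavioral_step(behavioral_agent, env, buffer)  # fills the buffer
for step in range(num_offline_steps):
    offline_agent.train(buffer.sample(batch_size))      # buffer is now fixed

# Concurrent: both agents update at every step, so early in training the
# offline agent can only sample early-stage transitions.
for step in range(num_steps):
    run_behavioral_step(behavioral_agent, env, buffer)  # buffer grows
    offline_agent.train(buffer.sample(batch_size))      # samples only what exists so far
```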
In Figure 1, do the orange horizontal lines in (a) and (c) represent the average episode return of the behavioral policy, computed after the complete buffer has been collected (i.e., after behavioral training concludes)? Is that also why there is no such line in (b), since in the concurrent setting the buffer is still being collected during training?
Based on the experiments in Figure 1, is the following understanding correct?
(1) Even if offline RL uses a dataset with sufficient coverage, the extrapolation error (caused by the DDPG actor selecting out-of-distribution actions) leads to suboptimal performance? (A toy sketch of what I mean follows this list.)
(2) Even if offline RL uses the same buffer as the behavioral agent, there is still a distribution-shift issue, because the transitions in the buffer were not generated by the offline agent itself.
(3) Even if offline RL is trained with expert or near-expert data, without encountering "bad" (early-stage) data it may fail to learn which actions should be avoided, resulting in worse performance than with the final and concurrent buffers.
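For point (1), my mental model of the extrapolation error is the toy example below: a Q-network fit only on actions from a narrow behavioral distribution is unconstrained at the out-of-distribution actions an actor would propose. This is purely illustrative and not taken from the paper:

```python
import torch
import torch.nn as nn

# Toy illustration of extrapolation error: fit Q(s, a) only on actions drawn
# from a narrow behavioral distribution, then query it at an action far
# outside the data support.
torch.manual_seed(0)
q = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(q.parameters(), lr=1e-3)

s = torch.zeros(256, 1)             # a single dummy state, repeated
a_data = 0.1 * torch.randn(256, 1)  # dataset actions concentrated near 0
true_q = -a_data.pow(2)             # "true" value peaks at a = 0

for _ in range(2000):               # regress Q on in-distribution pairs only
    pred = q(torch.cat([s, a_data], dim=1))
    loss = (pred - true_q).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

a_ood = torch.tensor([[2.0]])       # an action far outside the data support
q_ood = q(torch.cat([s[:1], a_ood], dim=1)).item()
print(q_ood)  # nothing in the data constrains this estimate; the true value is -4.0
# A DDPG actor maximizing Q would chase exactly these unconstrained estimates.
```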