
Some questions about the experiments for demonstrating extrapolation error. #15

Open
awecefil opened this issue Dec 14, 2023 · 0 comments

Comments

@awecefil

Hello, I am currently studying offline reinforcement learning and came across BCQ. It's great work, well worth delving into. However, I have some questions about the paper that I'd like to clarify, to make sure I haven't misunderstood anything. My questions may be numerous, but I genuinely want to understand the experimental details in the paper.

Here are my questions:

  1. In Figure 1, does "Off-policy DDPG" refer to DDPG trained on a fixed dataset without any interaction with the environment? Additionally, as the benchmark for comparison, does "Behavioral" refer to DDPG trained with the standard online training process?

  2. In Figure 1, for the three experiments with different buffers, is "Final" to be understood as training a Behavioral DDPG, recording all transitions during that training, and then using the resulting final buffer as the dataset for training an Off-Policy DDPG (with no new transitions added to the buffer during its training)? Can "Concurrent" be understood simply as the Off-Policy DDPG seeing transitions gradually, from early-stage to late-stage, as the Behavioral agent trains, rather than being able to sample late-stage transitions right from the beginning? (See the sketch after this list for how I picture these two settings.)

  3. In Figure 1, do the orange horizontal lines in (a) and (c) represent the average episode return of the Behavioral agent, evaluated after its training concludes and the complete buffer has been collected? Is that also why there is no such line in (b) (because the buffer is still being collected during training)?

  4. Based on the experiments in Figure 1, can it be understood that
    (1) even if offline RL uses a dataset with sufficient coverage, extrapolation error (caused by the DDPG actor selecting out-of-distribution actions) still leads to suboptimal performance?
    (2) even if offline RL uses the same buffer as Behavioral, there is still a distribution shift issue, because the transitions in the buffer were not generated by the offline agent itself?
    (3) even if offline RL is trained on expert or near-expert data, then without ever encountering "bad" (early-stage) data it may fail to learn which actions should be avoided, which is why its performance is worse than with the final and concurrent buffers?
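
To make sure I'm reading the setup correctly, here is a minimal sketch of how I currently picture the "concurrent" and "final buffer" settings. `DDPGAgent` and `ReplayBuffer` are hypothetical stand-ins (their interfaces here are my guesses, not this repo's exact API), and the environment and step counts are just placeholders:

```python
import gym

env = gym.make("Hopper-v2")  # placeholder for any MuJoCo task from Figure 1

behavioral = DDPGAgent(env.observation_space, env.action_space)  # interacts with env
concurrent = DDPGAgent(env.observation_space, env.action_space)  # never interacts
buffer = ReplayBuffer()

state, done = env.reset(), False
for t in range(1_000_000):
    # The Behavioral agent collects data online and logs every transition.
    action = behavioral.select_action(state)  # plus exploration noise
    next_state, reward, done, _ = env.step(action)
    buffer.add(state, action, next_state, reward, done)
    behavioral.train(buffer, iterations=1)

    # "Concurrent": the off-policy agent trains at the same time on the growing
    # buffer, so early on it can only ever sample early-stage transitions.
    concurrent.train(buffer, iterations=1)

    state = env.reset() if done else next_state

# "Final buffer": only after Behavioral training ends does a fresh off-policy
# agent train, sampling freely from the full, now-fixed buffer (no new data added).
final = DDPGAgent(env.observation_space, env.action_space)
for t in range(1_000_000):
    final.train(buffer, iterations=1)
```

Is this roughly the right picture of the two data-collection schemes?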
