Saving and then loading the DQN agent did not preserve four attributes needed to resume training:
This caused the agent's performance to differ between evaluating it in a single uninterrupted run and evaluating it after saving the agent, killing the program, restarting it, and loading the agent back from disk.
Fig 1 - Training without checkpoints (i.e., the same program run from start to finish)
Fig 2 - Training with checkpoints (i.e., the program killed every t steps and the agent reloaded from disk)
My proposed solution (working, but so far applied only to the DQN agent) is to add new save_snapshot and load_snapshot methods to the agent's class, without overwriting the original save and load methods, so the replay buffer does not have to be written on every regular save:
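For reference, below is a minimal, self-contained sketch of the pattern (not the actual patch): the toy agent class, the attribute names, and the file layout are placeholders chosen only to illustrate how save_snapshot/load_snapshot can persist the training state that the untouched save/load methods omit, with the replay buffer written only when a snapshot is taken.

```python
# Hypothetical sketch of the save_snapshot/load_snapshot idea, NOT the real patch.
# ToyDQNAgent and its attributes stand in for the actual agent class and the
# four attributes that the regular save()/load() do not cover.
import pickle
from pathlib import Path


class ToyDQNAgent:
    def __init__(self):
        self.q_weights = [0.0] * 4       # stands in for the network parameters
        self.replay_buffer = []          # stands in for the replay buffer
        self.num_timesteps = 0           # example of state lost by save()/load()
        self.exploration_rate = 1.0      # another example of lost state

    # --- original behaviour, left untouched: only the model is persisted ---
    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump({"q_weights": self.q_weights}, f)

    def load(self, path):
        with open(path, "rb") as f:
            self.q_weights = pickle.load(f)["q_weights"]

    # --- new methods: persist everything needed to resume training ---
    def save_snapshot(self, directory):
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        self.save(directory / "model.pkl")  # reuse the original save()
        extra = {
            "num_timesteps": self.num_timesteps,
            "exploration_rate": self.exploration_rate,
        }
        with open(directory / "extra_state.pkl", "wb") as f:
            pickle.dump(extra, f)
        # The replay buffer is only written here, never by the regular save().
        with open(directory / "replay_buffer.pkl", "wb") as f:
            pickle.dump(self.replay_buffer, f)

    def load_snapshot(self, directory):
        directory = Path(directory)
        self.load(directory / "model.pkl")
        with open(directory / "extra_state.pkl", "rb") as f:
            extra = pickle.load(f)
        self.num_timesteps = extra["num_timesteps"]
        self.exploration_rate = extra["exploration_rate"]
        with open(directory / "replay_buffer.pkl", "rb") as f:
            self.replay_buffer = pickle.load(f)


# Usage: snapshot before the process is killed, reload to resume training.
agent = ToyDQNAgent()
agent.save_snapshot("checkpoint")
restored = ToyDQNAgent()
restored.load_snapshot("checkpoint")
```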
This change works as intended; training resumes properly after the agent is reloaded from disk:
Fig 3 - Training with checkpoints, new patch (i.e., the program killed every t steps and the agent reloaded from disk)