How to change auto-requeue hpc.ckpt path
#20357
Closed
arijit-hub started this conversation in General
Replies: 1 comment
-
I figured it out. One needs to specifically set the |
-
Hi,
I am using a SLURM environment and requeuing the job with Lightning's automatic SLURM handler. It works flawlessly, but I have one small issue: the temporary checkpoints `hpc_ckpt_*.ckpt` are saved in the current working directory instead of the directory I specified for model checkpoint saving. This breaks my experiments when I start a new job while an earlier job is waiting to be requeued. What I mean is this:

(1) An old job hit the wall-time, saved a temporary checkpoint, and was requeued. It will use the `hpc_ckpt_*.ckpt` to resume training.

(2) A new experiment launched with the same `.sh` file will not start from scratch, because it assumes the existing `hpc_ckpt_*.ckpt` is meant for it.

Is there any fix for this?
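To make the collision in (1) and (2) concrete, here is a minimal standard-library sketch (not Lightning's actual code) of why two jobs sharing one directory see the same temporary checkpoint, and how giving each experiment its own root directory keeps them apart. The helpers `latest_hpc_checkpoint` and `experiment_root` are hypothetical names made up for illustration. The idea of passing the resulting directory to the Trainer through `default_root_dir` rests on the assumption that the `hpc_ckpt_*.ckpt` files are written under that directory, which is worth checking against the Lightning version in use.

```python
import glob
import os
import re


def latest_hpc_checkpoint(directory):
    """Return the highest-numbered hpc_ckpt_<n>.ckpt in `directory`, or None.

    Hypothetical helper: it only mimics the kind of lookup that makes a
    fresh job pick up a stale temporary checkpoint left in a shared folder.
    """
    best_path, best_version = None, -1
    pattern = os.path.join(glob.escape(directory), "hpc_ckpt_*.ckpt")
    for path in glob.glob(pattern):
        match = re.fullmatch(r"hpc_ckpt_(\d+)\.ckpt", os.path.basename(path))
        if match and int(match.group(1)) > best_version:
            best_path, best_version = path, int(match.group(1))
    return best_path


def experiment_root(base_dir, experiment_name):
    """Create and return a directory that only this experiment writes to."""
    root = os.path.join(base_dir, experiment_name)
    os.makedirs(root, exist_ok=True)
    return root


# Hypothetical usage inside the training script:
# root = experiment_root("runs", "exp_a")
# Assumption: the temporary checkpoints follow the Trainer's root directory.
# trainer = lightning.Trainer(default_root_dir=root)
```

With one root per experiment, a new run launched from the same `.sh` file looks in an empty directory and starts from scratch, while the requeued job still finds its own checkpoint. The experiment name has to stay the same across a requeue for this to work; one option is to derive it from `SLURM_JOB_ID`, assuming the requeued job keeps its original job ID.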