This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

SSL is not picking up recovery checkpoints #608

Closed
ant0nsc opened this issue Dec 7, 2021 · 1 comment

Comments

@ant0nsc
Contributor

ant0nsc commented Dec 7, 2021

When SSL jobs get pre-empted, they seem to start afresh instead of resuming from the recovery checkpoints.

As a first step, add diagnostics: print out all checkpoints that are found, to see whether a recovered job actually sees the previously written checkpoints.
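A minimal sketch of such a diagnostic, not the code that was actually added to the repository. The function name `log_found_checkpoints`, the folder argument, and the `.ckpt` suffix are assumptions made for illustration; adjust them to wherever the job really writes its checkpoints.

```python
import logging
from pathlib import Path
from typing import List

logger = logging.getLogger(__name__)


def log_found_checkpoints(checkpoint_folder: Path, suffix: str = ".ckpt") -> List[Path]:
    """Log every checkpoint file found below `checkpoint_folder` and return the list.

    Hypothetical helper: the folder layout and the ".ckpt" suffix are assumptions.
    """
    if not checkpoint_folder.is_dir():
        logger.warning("Checkpoint folder %s does not exist.", checkpoint_folder)
        return []
    # Search recursively, oldest file first, so the log shows the order of writing.
    checkpoints = sorted(
        (p for p in checkpoint_folder.rglob(f"*{suffix}") if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    if not checkpoints:
        logger.warning("No checkpoints found in %s.", checkpoint_folder)
    for path in checkpoints:
        logger.info("Found checkpoint %s (%d bytes)", path, path.stat().st_size)
    return checkpoints
```

Calling this once at job start, before training resumes, would show in the logs whether a restarted job sees any of the files written by the pre-empted run.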

AB#4774

@ant0nsc
Contributor Author

ant0nsc commented Dec 14, 2021

A bug in the AML restart functionality has been reported. Added #614 as a workaround.

@ant0nsc ant0nsc closed this as completed Dec 14, 2021