booktest: add hangcheck timer to print current file+line, and later backtrace #4504
To help with debugging #4493, cc @fingolfin:
When the same example file is running for 10 minutes it will try to print the current file+line, and then again every 5 min.
Once at > 20 min (i.e. 25 min) it will try to send a `USR1`/`INFO` signal to the process (with a short delay) to get a backtrace of the currently running computation.
Edit: added a commit to reduce the timeouts for testing, will be removed later.
All this will only work if the code has some yield points to allow the task to run. If this doesn't work we need to move this to a separate thread / process.
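To make the schedule above concrete, here is a minimal sketch of the timing logic only. It is written in Python for illustration and is not the code from this PR. The names `hangcheck_events`, `FIRST_CHECK_MIN`, `INTERVAL_MIN`, and `BACKTRACE_AFTER_MIN` are hypothetical, and it assumes that every check past the 20 min mark both prints the location and requests a backtrace.

```python
# Timings as described above (in minutes); the constant names are hypothetical.
FIRST_CHECK_MIN = 10      # first hangcheck after 10 min on the same file
INTERVAL_MIN = 5          # then one check every 5 min
BACKTRACE_AFTER_MIN = 20  # checks strictly later than this also request a backtrace


def hangcheck_events(runtime_min):
    """List the (minute, action) pairs that fire while one file runs for runtime_min minutes."""
    events = []
    t = FIRST_CHECK_MIN
    while t <= runtime_min:
        # Assumption: past the threshold, each check also sends the signal.
        if t > BACKTRACE_AFTER_MIN:
            action = "location+backtrace"
        else:
            action = "location"
        events.append((t, action))
        t += INTERVAL_MIN
    return events


if __name__ == "__main__":
    for minute, action in hangcheck_events(26):
        print(f"+{minute}min: {action}")
```

Under these assumptions a file that takes between 10 and 15 minutes triggers exactly one location printout and no backtrace, and the first backtrace request happens at the 25 min check.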
With the current timings, the hangcheck will probably trigger once on a successful run:
Edit2:
In the testrun (with shorter limits) a backtrace is printed here:
https://github.com/oscar-system/Oscar.jl/actions/runs/12986964827/job/36214891463?pr=4504#step:11:6489
And a hangcheck warning here:
https://github.com/oscar-system/Oscar.jl/actions/runs/12986964827/job/36214891463?pr=4504#step:11:752
I am slightly confused about the timing of the backtrace: I think it should have been printed directly after the +3min hangcheck and not shortly before the +4min one, but I don't think this is important.