handle time limit better #37
Comments
Agreed. But core-to-core speed is also not that different. I'd say if it runs in 10 minutes on an AVX2-based system, a walltime of 30 minutes should be enough for everyone (famous last words, I know... "640K ought to be enough for anyone"), especially if the error is clear enough (we can include a suggestion to simply override the time limit).

Admittedly, GPU-to-GPU times might vary a lot more. But a liberal limit is less of a problem here: I'm not that interested in whether it runs in 2 minutes or 5, so I'd personally be happy with a walltime of 30 minutes on each of those tests (as is the case now). We could implement some (very liberal) time limits based on a look-up table, or some basic logic, to at least allow a higher walltime for the slowest tests (< 16 cores or so).
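For illustration, a minimal sketch of what such a look-up table could be (Python, since the tests are ReFrame-based); the core-count thresholds and walltimes below are placeholders, not values agreed on in this thread:

```python
# Sketch only: map a core count to a (liberal) walltime string.
# Thresholds and limits are illustrative placeholders.
WALLTIME_BY_CORES = [
    (16, '1h'),     # slowest runs (few cores) get the most generous limit
    (64, '45m'),
    (None, '30m'),  # everything larger falls through to the default
]


def walltime_for(num_cores):
    """Return a walltime string for the given core count."""
    for max_cores, walltime in WALLTIME_BY_CORES:
        if max_cores is None or num_cores <= max_cores:
            return walltime
```

A test could then apply this in a hook, e.g. `self.time_limit = walltime_for(self.num_tasks * self.num_cpus_per_task)`; whether total core count is the right proxy for "slowest tests" is still up for discussion.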
We could. At this point, my first approach would be to use variant-specific time limits; I think it's easier to implement.
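A rough sketch of how variant-specific limits could look in a ReFrame test (assuming a ReFrame version with class-body builtins like `parameter` and `run_after`); the class name, the `nb_impl` parameter, and the per-variant values are hypothetical, not the actual test-suite code:

```python
import reframe as rfm


@rfm.simple_test
class GromacsVariantWalltimeSketch(rfm.RunOnlyRegressionTest):
    '''Illustrative only: per-variant time limits, not the real GROMACS test.'''

    nb_impl = parameter(['cpu', 'gpu'])  # hypothetical variant parameter
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'gmx_mpi'               # placeholder

    @run_after('init')
    def set_variant_time_limit(self):
        # Liberal, per-variant walltimes; the values are placeholders.
        self.time_limit = {'cpu': '1h', 'gpu': '30m'}[self.nb_impl]
```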
Ugh, true. Or decide we don't support those (at least for now) :P
copying discussion of #28
from @casparvl
Point number 2 makes me think. First of all, the error is quite non-descriptive:
Maybe we should add a standard sanity check that verifies the job output does not contain something like SLURM's "CANCELLED AT ... DUE TO TIME LIMIT" message.
I'm not sure how this generalizes to other systems (I'm assuming all SLURM-based systems print this by default), but even so, it doesn't hurt to check. At least there is a better chance of getting a clear error message.
Secondly, how do we make sure we don't run out of walltime? Sure, we could just specify a very long time, but that can be problematic as well (not satisfying max walltimes on a queue, and in our case, jobs <1h actually get backfilled behind a floating reservation that we have in order to reserve some dedicated nodes for short jobs). Should we scale the max walltime based on the amount of resources?
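One way to read "scale the max walltime based on the amount of resources" is to start from a measured reference runtime and assume roughly inverse scaling with core count, with a generous safety factor plus a floor and a ceiling; a sketch with made-up reference numbers:

```python
def scaled_walltime_minutes(num_cores, ref_minutes=10, ref_cores=64,
                            safety_factor=3, floor=15, ceiling=60):
    """Estimate a liberal walltime (in minutes) for num_cores.

    Assumes runtime scales roughly inversely with core count; all the
    reference numbers here are hypothetical.
    """
    estimate = ref_minutes * ref_cores / num_cores
    walltime = safety_factor * estimate
    # The ceiling is where per-queue policies come in, e.g. keeping jobs
    # under 1h so they still fit the short-job backfill window mentioned above.
    return int(min(max(walltime, floor), ceiling))
```

The floor keeps large-core runs from getting an unrealistically tight limit; the ceiling would have to respect queue maximums and the <1h backfill consideration.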
from @smoors
about the walltime issue:
about checking for the exceeded time limit message:
I think that's a good idea. We could expand it to also check for an out-of-memory message, and maybe other SLURM messages I'm not aware of. Even better would be to check that the job error file is empty, but Gromacs prints some output to stderr; it would be nice if there were a way to force Gromacs not to do that.
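A sketch of what such a check could look like with ReFrame's sanity functions; the patterns below (SLURM's time-limit and OOM messages) are examples whose exact wording can differ between SLURM versions and site configurations, and the helper name is made up:

```python
import reframe.utility.sanity as sn

# Example patterns for common SLURM messages; wording may vary per site/version.
SLURM_ERROR_PATTERNS = [
    r'DUE TO TIME LIMIT',        # job cancelled because it hit its walltime
    r'oom-kill|Out Of Memory',   # job (step) killed by the OOM killer
]


def assert_no_slurm_errors(test):
    """Deferrable check that the job's stderr matches none of the patterns."""
    return sn.all([
        sn.assert_not_found(patt, test.stderr,
                            msg=f'SLURM reported a problem: {patt}')
        for patt in SLURM_ERROR_PATTERNS
    ])
```

A test could then combine this with its existing sanity check, e.g. `sn.all([assert_no_slurm_errors(self), <existing check>])`, which also sidesteps the need for a completely empty stderr (and thus the Gromacs-prints-to-stderr problem).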