Job submission fails without internet connection #270

Closed
aiida-bot opened this issue Sep 22, 2016 · 8 comments

Comments

@aiida-bot

Originally reported by: Aliaksandr Yakutovich (Bitbucket: yakutovich, GitHub: yakutovicha)


Dear AiiDA developers,

I would like to draw your attention to the fact that, at least in my case, AiiDA jobs often fail for a simple reason: an absent or unstable internet connection.

Of course, the user could apply a workaround, but I think it would be great if you could solve this problem at the "low level".

I believe this is a rather important issue to solve, because the user cannot always control the internet connection (for example, while a workflow is running). Moreover, this would allow offline job submission: one could prepare calculations and submit them, and the rest would be done automatically once the internet connection is back. (This can happen if you work on the train, for example.)

Hope this will help to improve AiiDA.
Best,
Sasha


@aiida-bot
Author

Original comment by Jens Broeder (Bitbucket: broeder-j, GitHub: broeder-j):


Dear Sasha,

I (an AiiDA user) am aware of that problem. The way I solve it: I shut down the daemon when I work 'offline'.
This won't solve the problem if a computing resource is only temporarily unavailable or only available in a certain network. But I think if a resource is unavailable the submission should fail.

'Low level' solutions I could think of:

a) A calculation whose submission failed can easily be resubmitted
(either automatically after a daemon restart,
or after a daemon restart with some flag,
or via a verdi command (this would be my choice), for example: verdi calculation resubmit pk/all).
This is also wanted in a workflow.

Drawback: the user has to take care of 'submission failed' calculations that should not be resubmitted.

b) Calculations stay in the tosubmit state if the resource is unavailable ... But I think this is really unwanted, because the daemon would be slowed down and there shouldn't be many/endless connection attempts. I like the way it is.

Best, Jens

@aiida-bot
Author

Original comment by Aliaksandr Yakutovich (Bitbucket: yakutovich, GitHub: yakutovicha):


Dear Jens,

Thank you for your comments.

I do actually agree with the 'low level' ways you proposed to solve the problem, and I personally believe that both of them should be implemented.

But I wouldn't agree with two things:

  1. "if resource is unavailable the submission should fail".

Why? Suppose you run a workflow. Your computer submits the jobs in a given order. Then, for some reason, you lose your internet connection, and your job fails. You would need to restart it somehow. To me it is really unclear why failing is better than waiting until the computer is up again.

  2. "calculations stay in tosubmit state, if resource unavailable ... But I think this is really unwanted ...".

As far as I know, the daemon tries to connect every 30 seconds (to check the running state). What would change if it first checked whether the computer is online? I do not see a problem here.

Best,

Sasha
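
The "check whether the computer is online first" idea from the comment above can be sketched roughly as follows. This is only an illustration, not AiiDA's daemon code: `submit_when_reachable`, `is_reachable` and `submit` are hypothetical names, and the 30-second default simply mirrors the polling interval mentioned in the comment.

```python
import time


def submit_when_reachable(is_reachable, submit, interval=30.0,
                          max_checks=None, sleep=time.sleep):
    """Call `submit()` once, but only after `is_reachable()` returns True.

    Returns the result of `submit()`, or None if `max_checks` checks were
    used up while the computer was still unreachable. In that case nothing
    was submitted, so nothing failed and the job can simply stay pending.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if is_reachable():
            return submit()
        # Only a cheap reachability check was made; wait and try again.
        sleep(interval)
    return None
```

Passing the check and the submission in as callables keeps the sketch independent of any particular transport; a real daemon would interleave this with its other tasks rather than block in a loop.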

@giovannipizzi
Member

If there is no network connection, the error one gets is e.g.

gaierror: [Errno 8] nodename nor servname provided, or not known
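
For illustration, this kind of error can be recognised with Python's standard `socket` module, which raises `socket.gaierror` when a host name cannot be resolved. The helper below is a hypothetical sketch (the name `can_resolve`, the default port 22 and the injectable `resolver` argument are assumptions made here, not part of AiiDA). Successful name resolution does not guarantee that the computer itself is reachable.

```python
import socket


def can_resolve(hostname, port=22, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves, False on a name-resolution error."""
    try:
        resolver(hostname, port)
    except socket.gaierror:
        # Raised, for example, as
        # "[Errno 8] nodename nor servname provided, or not known"
        # when there is no network connection.
        return False
    return True
```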

@ltalirz
Member

ltalirz commented Dec 18, 2017

Proposal:
Add a new job state "SUBMISSIONONHOLD", which will be triggered, when

Add a command verdi calculation resume [<pk1> <pk2> ....] (with an optional flag --all), which puts the calculations back into the TOSUBMIT state.

Potential things to take care of

  • DbCalcState table prevents jobs from going twice through the same state. Thus, the resume command needs to remove TOSUBMIT, SUBMISSIONFAILED, SUBMITTING and SUBMISSIONONHOLD states from the table before changing the state of the calculation.
  • The daemon should not 'seal' the calculation when it goes to SUBMISSIONONHOLD
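
The two points above can be illustrated with a toy model. It is a sketch under stated assumptions and does not use AiiDA's actual classes or database schema: `ToyCalculation` and its methods are hypothetical, and its `history` list merely stands in for the rows of the DbCalcState table.

```python
# States that a resume operation would clear from the history (from the proposal).
RESETTABLE = {"TOSUBMIT", "SUBMITTING", "SUBMISSIONFAILED", "SUBMISSIONONHOLD"}


class ToyCalculation:
    """Toy stand-in for a calculation whose states may each be visited once."""

    def __init__(self):
        self.history = []    # plays the role of the DbCalcState rows
        self.sealed = False  # never set by hold(), so the job can be resumed
        self.set_state("TOSUBMIT")

    @property
    def state(self):
        return self.history[-1]

    def set_state(self, new_state):
        # Mimics the constraint that a state cannot be entered twice.
        if new_state in self.history:
            raise ValueError("state %s was already visited" % new_state)
        self.history.append(new_state)

    def hold(self):
        # Going on hold must not seal the calculation.
        self.set_state("SUBMISSIONONHOLD")

    def resume(self):
        # Drop the submission-related states so TOSUBMIT can be entered again.
        self.history = [s for s in self.history if s not in RESETTABLE]
        self.set_state("TOSUBMIT")
```

Without the clean-up in `resume`, the second `set_state("TOSUBMIT")` would raise, which is the situation the first bullet point warns about.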

This depends on the work by @muhrin on the new daemon and should be implemented afterwards.
Assigning @yakutovicha since he is working on workflows and reported the initial issue.

@broeder-j
Member

broeder-j commented Feb 14, 2018

Another common case to consider: AiiDA should put jobs in 'SUBMISSIONHOLD' if an HPC resource is under maintenance. Currently the calculation gets submitted and fails, because it is not put into the queue and is soon considered done; it then fails because there is no output file. I currently have no idea how to detect this.

Another feature idea might be to give a computer an optional 'available' property (user-specific if applicable) tied to time (submit calcs on this machine only from 21 p.m. to 5 p.m. in Jan, Feb, ...).

@giovannipizzi
Member

I agree that it should go on SUBMISSIONHOLD.
Note that you can now disable the computer by hand. You could have a cron job or an external script that decides when to disable computers. I am not sure we want to implement the scheduling in AiiDA: it is complicated for the user and very few people would use it (because maintenance is often not at a fixed time, but just announced a few days in advance).

@broeder-j
Member

You are right. I thought about that. Does this not reject your calculation right away? If I add some logic, this should work. Thanks for pointing it out.

@sphuber
Contributor

sphuber commented Sep 12, 2018

This is now implemented through PR #1903 and will be released with v1.0.0

@sphuber sphuber closed this as completed Sep 12, 2018

6 participants