server: Workaround for "finish file present too long" client bug #3300

Merged
merged 1 commit into from
Sep 25, 2019

stream1972
Contributor

Fixes second scenario of #3017 and improves #3019

Description of the Change
This patch deals with the "finish file" problem completely, bringing tasks affected by this bug "back to life" on the server side.
As noted in #3017 (comment), the problem may also appear when the client is shutting down, and that case is not fixed yet (I'll explain the scenario below). #3019 mentioned that "it should be treated as successful regardless of whether it exits, but right now we don't have a mechanism for killing a job and marking it as success."
This patch marks the job as successful on the server side, so it fixes the problem for all client versions.

Cause of the problem
There are two causes. The first is when an application really does exit too slowly because of high system load and swap thrashing. That was fixed in #3019 by increasing the exit timeout to 5 minutes.
But in real life, on PrimeGrid and on my own Private GFN Server, I've encountered only the second type of problem. It is a race condition with the following scenario:

  1. The client is asked to shut down. It could be a user request or a system reboot; it does not matter.
  2. At the same time, the application finishes successfully and writes the boinc_finish file.
  3. The client quits without noticing or processing the termination of the application.
  4. When the client is restarted, by the user or after a system reboot, it does not check for the presence of a boinc_finish file in the slot directories.
  5. This means that a task which has already finished is started again.
  6. If the application keeps internal state or a checkpoint and can determine that nothing remains to be done, you're in luck. But many applications delete their own checkpoints before completion, so after such a restart the application does its job again from the beginning. But the boinc_finish file is already there...
  7. The client notices the presence of boinc_finish, but the application is not going to quit... Finally, it kills the application with the "finish file present too long" diagnostic. The stderr log clearly shows that the application was run twice (and killed on the second run).

Alternate Designs
Of course, it would be nice if this problem were also fixed in the client itself (check for boinc_finish before starting programs in slot directories for the first time after a reboot). But that will not help users with old clients.
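
A minimal sketch of what such a client-side check could look like, written in Python for brevity (the real client is C++). The slot directory layout, the marker file name `boinc_finish_called`, and the helper `slots_with_finish_file` are assumptions made for illustration; they are not taken from this patch and should be checked against the client source.

```python
from pathlib import Path

# Assumption: name of the marker file an application leaves on completion.
FINISH_FILE = "boinc_finish_called"


def slots_with_finish_file(slots_root, finish_file=FINISH_FILE):
    """Return the slot directories that already contain a finish file.

    Hypothetical helper: a client could call this once at startup and
    treat the tasks in these slots as completed instead of restarting them.
    """
    root = Path(slots_root)
    if not root.is_dir():
        return []
    return sorted(
        slot
        for slot in root.iterdir()
        if slot.is_dir() and (slot / finish_file).is_file()
    )
```

The point of the sketch is only the ordering: the scan happens before any application in a slot directory is started again.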
Also, a wrapper or native BOINC application could check for boinc_finish itself and call exit(0) immediately. But it's not possible to update existing wrappers and applications.
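
For illustration, the application-side variant could be sketched like this (Python, not real wrapper code). The marker file name `boinc_finish_called` and the function `run_once` are hypothetical; `do_work` stands in for the actual computation and is expected to return an exit code.

```python
import os

# Assumption: name of the marker file written when the application finishes.
FINISH_FILE = "boinc_finish_called"


def run_once(slot_dir, do_work, finish_file=FINISH_FILE):
    """Run do_work() unless a finish file from an earlier run is present.

    Returns the exit code the process should use.
    """
    if os.path.isfile(os.path.join(slot_dir, finish_file)):
        # A previous run already completed: report success without recomputing.
        return 0
    return do_work()
```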
So this server patch seems to be the least of the evils. It was first implemented on my Private GFN Server and works great. Every day, a few tasks are silently recovered.

Release Notes
N/A

@stream1972
Contributor Author

Here is an example of a task affected by this bug; it hit a PrimeGrid user right after I wrote this pull request. https://www.primegrid.com/result.php?resultid=1025214123
This result will eventually be purged from the database, so I'll quote the important parts here.

Exit status | 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT

Stderr output

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
finish file present too long</message>
<stderr_txt>
BOINC llr wrapper (version 8.04)
Using Jean Penne's llr (64 bit)
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8

LLR command line: primegrid_cllr.exe -d -oDiskWriteTime=1 -oThreadsPerTest=6 llr.in
Using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 6 threads, a = 3
16:50:55 (10108): called boinc_finish(0)
BOINC llr wrapper (version 8.04)
Using Jean Penne's llr (64 bit)
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8

LLR command line: primegrid_cllr.exe -d -oDiskWriteTime=1 -oThreadsPerTest=6 llr.in
Using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 6 threads, a = 3

</stderr_txt>
]]>

You can see that the task finished successfully at 16:50:55, but the BOINC client didn't notice. After the client restarted, the task was started again from the beginning and was finally killed by the client because the finish file was already there.
PrimeGrid has a cron job which attempts to recover tasks with these fake errors, and this task was successfully recovered and validated; the user got credit. But it was too late: the server had already sent a third (unneeded) task for this workunit, and some CPU power was wasted. This patch eliminates that problem by doing the recovery immediately.
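
To make the recovery idea concrete, here is a rough sketch in Python of the kind of test a server could apply to a reported result. It is not the code from this patch: the function name `is_recoverable_finish_error` is hypothetical, and the three conditions are simply the markers visible in the stderr quoted above (exit status 194, the "finish file present too long" message, and a logged `called boinc_finish(0)`).

```python
EXIT_ABORTED_BY_CLIENT = 194  # exit status shown in the quoted result
FINISH_TOO_LONG = "finish file present too long"
FINISH_OK = "called boinc_finish(0)"


def is_recoverable_finish_error(exit_status, stderr_out):
    """Guess whether a reported error is really a successful run.

    Assumption: a result qualifies when the client aborted it for the
    finish-file reason and the application had already logged a
    successful boinc_finish(0) call.
    """
    return (
        exit_status == EXIT_ABORTED_BY_CLIENT
        and FINISH_TOO_LONG in stderr_out
        and FINISH_OK in stderr_out
    )
```

A result that passes such a test could be treated as a success before the server decides whether another copy of the workunit is needed.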

@davidpanderson
Contributor

Thanks. I submitted a PR to fix this in the client as well.

@stream1972 stream1972 deleted the stream_for_merge branch September 25, 2019 10:19
@AenBleidd AenBleidd added this to the Server Release 1.4.1 milestone Aug 14, 2023