server: Workaround for "finish file present too long" client bug #3300

stream1972 · 2019-09-21T20:15:17Z

Fixes the second scenario of #3017 and improves on #3019

Description of the Change
This patch deals with the "finish file problem" completely, bringing tasks affected by this bug "back to life" on the server side.
As was noted in #3017 (comment), this problem may also appear when the client is shutting down, and that case is not fixed yet (I'll explain the scenario below). #3019 mentioned that "it should be treated as successful regardless of whether it exits, but right now we don't have a mechanism for killing a job and marking it as success."
This patch marks the job as successful on the server side, so it fixes the problem for all client versions.

Cause of the problem
There are two causes. The first is when an application really does exit too slowly, due to high system load and swap thrashing. That was fixed in #3019 by increasing the exit timeout to 5 minutes.
But in real life, on PrimeGrid and on my own Private GFN Server, I've only encountered the second type of problem. It is a race condition with the following scenario:

1. The client is asked to shut down. It could be a user request or a system reboot; it does not matter.
2. At the same time, the application finishes successfully and writes the boinc_finish file.
3. The client quits without noticing or processing the termination of the application.
4. When the client is restarted by the user or after a system reboot, it does not check for the presence of a boinc_finish file in the slot directories.
5. This means that a task which has already finished is started again.
6. If the application keeps internal state or a checkpoint and can determine that nothing remains to be done, you're in luck. But many applications delete their own checkpoints before completion, so such a restart makes the application do its job again from the beginning. But the boinc_finish file is already there...
7. The client notices the presence of boinc_finish, but the application is not going to quit... Finally, the client kills the application with the "finish file present too long" diagnostic. The stderr log clearly shows that the application was run twice (and killed on the second run).

Alternate Designs
Of course, it would be nice if this problem were also fixed in the client itself (check for boinc_finish before starting programs in the slot directories for the first time after a reboot). But that will not help users with old clients.
Also, a wrapper or native BOINC application could check for boinc_finish itself and call exit(0) immediately. But it's not possible to update existing wrappers and applications.
So this server patch seems to be the least of the evils. It was first implemented on my Private GFN Server and works great. Every day, a few tasks are silently recovered.
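
The client-side alternative described above (look for a finish file in each slot directory before starting anything) could be sketched roughly as follows. This is only an illustration in Python, not the BOINC client code: the function name is hypothetical, and the file name `boinc_finish_called` and the one-subdirectory-per-slot layout are assumptions, since this thread only refers to the file as "boinc_finish".

```python
from pathlib import Path

# Assumption: the finish marker is a file with this name inside each slot
# directory. The thread itself only calls it "boinc_finish".
FINISH_FILE_NAME = "boinc_finish_called"


def slots_with_finish_file(slots_root, finish_name=FINISH_FILE_NAME):
    """Return the names of slot directories that already contain a finish file.

    A client could run a check like this once at startup and treat the
    listed slots as "task already finished" instead of restarting the app.
    """
    root = Path(slots_root)
    if not root.is_dir():
        return []
    return sorted(
        slot.name
        for slot in root.iterdir()
        if slot.is_dir() and (slot / finish_name).is_file()
    )
```

For example, `slots_with_finish_file("slots")` would list the slots whose task finished while the client was shutting down. As noted above, a check like this only helps users who upgrade their client.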

Release Notes
N/A

stream1972 · 2019-09-22T07:59:35Z

Here is an example of a task affected by this bug. It hit one of the PrimeGrid users right after I wrote this pull request: https://www.primegrid.com/result.php?resultid=1025214123
This result will eventually be purged from the database, so I'll quote the important parts here.

Exit status | 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT

Stderr output

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
finish file present too long</message>
<stderr_txt>
BOINC llr wrapper (version 8.04)
Using Jean Penne's llr (64 bit)
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8

LLR command line: primegrid_cllr.exe -d -oDiskWriteTime=1 -oThreadsPerTest=6 llr.in
Using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 6 threads, a = 3
16:50:55 (10108): called boinc_finish(0)
BOINC llr wrapper (version 8.04)
Using Jean Penne's llr (64 bit)
LLR Program - Version 3.8.23, using Gwnum Library Version 29.8

LLR command line: primegrid_cllr.exe -d -oDiskWriteTime=1 -oThreadsPerTest=6 llr.in
Using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 6 threads, a = 3

</stderr_txt>
]]>

You can see that the task finished successfully at 16:50:55, but the BOINC client didn't notice this. After the client restarted, the task was started once again from the beginning and was finally killed by the client because the finish file was already there.
PrimeGrid has a cron job which attempts to recover tasks with these fake errors, and this task was successfully recovered and validated; the user got credit. But it was too late: the server had already sent a third (unneeded) task for this workunit, and some CPU power was wasted. This patch eliminates the problem by doing the recovery immediately.
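
To make the recovery idea concrete, here is a minimal sketch of how a reported result like the one quoted above might be recognized as a false failure. It is not the code from this patch: the function name and the exact criteria are assumptions drawn only from the quoted output (exit status 194, the "finish file present too long" message, and an earlier "called boinc_finish(0)" line).

```python
EXIT_ABORTED_BY_CLIENT = 194  # exit status shown in the quoted result
FINISH_TOO_LONG_MSG = "finish file present too long"
FINISH_OK_MARKER = "called boinc_finish(0)"


def looks_like_false_failure(exit_status, stderr_out):
    """Heuristic: was this task killed only because of the finish-file race?

    True when the client aborted the task with the "finish file present
    too long" diagnostic, yet stderr shows that an earlier run had already
    called boinc_finish(0), i.e. the first run completed successfully.
    """
    return (
        exit_status == EXIT_ABORTED_BY_CLIENT
        and FINISH_TOO_LONG_MSG in stderr_out
        and FINISH_OK_MARKER in stderr_out
    )
```

A server applying a check of this kind while handling the reported result could treat it as a success right away, rather than relying on a later cron job after a replacement task has already been sent.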

davidpanderson · 2019-09-25T04:05:24Z

Thanks. I submitted a PR to fix this in the client as well.

server: Workaround for "finish file present too long" client bug

8204b8e

davidpanderson mentioned this pull request Sep 24, 2019

client: on startup, check for active task finish files #3303

Merged

davidpanderson merged commit 6a7f48d into BOINC:master Sep 25, 2019

stream1972 deleted the stream_for_merge branch September 25, 2019 10:19

AenBleidd added this to the Server Release 1.4.1 milestone Aug 14, 2023
