Race condition when suspending tasks #1024
The scheduler was completely changed between BOINC 6 and 7. Is this still an issue when running the same scenario on BOINC 7.6?
I assume this happens because the GUI RPC protocol can only suspend one task per RPC. Suspending a block of tasks (multi-select) is something I do regularly, but I don't usually include a running task in the mix. Presumably the Manager implements it by sending a stream of single-task RPCs in quick succession, so the behaviour is determined by the relative speeds of the host running the Manager, the host running the client, the communication link between them, and the reaction time of the running task to a suspend message.

On my network (fast modern machines with gigabit LAN), the suspend RPCs are processed fast enough that no new task can start before the block suspend completes. I can see the possibility that 'suspend task, start next task' completes before 'suspend next task' can be acted upon, leaving a number of tasks suspended, waiting to run, with 1 second of elapsed time showing. That creates a large number of unnecessary slot directories, and possibly occupies extra memory, but it isn't fatal.

Eliminating the race condition would involve rewriting the RPC mechanism to allow batching. I think that's too much work to resolve what is in reality a minor problem that can easily be worked around on the rare systems where it is a problem: suspend only the ready-to-start tasks in the batch, and then suspend the running task singly once all the other suspends have been processed.
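The ordering workaround described above can be sketched as client-side logic. This is a hedged illustration only: `suspend_task` and `is_running` are hypothetical callbacks standing in for whatever single-task suspend RPC the Manager issues; they are not BOINC's actual API.

```python
# Sketch of the suggested workaround: suspend all ready-to-start tasks
# first, and only then suspend running tasks, so the scheduler never
# simultaneously sees a free slot and a ready task that is about to be
# suspended.

def suspend_batch(tasks, suspend_task, is_running):
    """suspend_task(t) and is_running(t) are hypothetical stand-ins for
    single-task GUI RPCs; the ordering is the point, not the API."""
    waiting = [t for t in tasks if not is_running(t)]
    running = [t for t in tasks if is_running(t)]
    # 1. Suspend every not-yet-running task. No CPU slot is freed here,
    #    so the scheduler cannot start any task in the meantime.
    for t in waiting:
        suspend_task(t)
    # 2. Only now suspend the running tasks. Any slot freed at this
    #    point can only be filled by tasks outside the batch.
    for t in running:
        suspend_task(t)
    return waiting + running  # order in which the suspends were issued
```

For example, suspending `["a", "b", "c"]` where `"b"` is running issues the suspends in the order `a`, `c`, `b`, closing the race window for the batch.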
I agree that this is not high priority.
10-18-18, BOINC 7.14.2:

On a Windows 7 Pro box (Intel Core Duo E8500 @ 3.16 GHz, 4 GB DRAM, nVIDIA GF 8400 GS as display adapter, GPU suspended in BOINC), I was able to suspend multiple combinations of running and waiting (SETI and Einstein) tasks, resume them, and re-suspend them multiple times, with instantaneous change to all highlighted tasks as viewed in the Tasks window. When an already-suspended task was included in the selection, the Resume/Suspend button was grayed out.

On a SuperMicro 2 x 6-core Xeon X5650 (12 physical cores, 16 GB ECC RAM, no GPU) running Win10 Pro, I was able to suspend and resume running and waiting (Einstein, MW 12-core WU, and WCG) tasks multiple times with instantaneous change as viewed in the Tasks window. No evidence of the race condition was observed.

The above tests were done at the computer's console keyboard and mouse.
I just tested this a few times on my Core 2 Duo T7250 laptop running Ubuntu 18.04.6 and I couldn't reproduce it, for what it's worth. Regardless, a solution may be to suspend the tasks that are not active first, and only then the active tasks?
Reported by Martin Suchan on 21 Sep 07:06 UTC
I've just noticed this issue when suspending tasks manually in BOINC Manager - situation:
Win7 x86, BM 6.12.15, only WCG project, Core2Duo - 2 cores
I had about 10 downloaded tasks: one completed and reported, two running, and the rest not yet started but allowed to start once a running task finishes.
**I selected all not-started tasks PLUS one running task and clicked the Suspend button** in the left command bar.
I expected that all tasks would be marked as Suspended at once and that the running one would stop as well.
What actually happened? **One not-yet-started task ran for about 1 second and was then suspended.** My guess is that **"changing status to suspended" is not done in a transactional way**: some function received the list of tasks to suspend and suspended them one at a time. First it suspended the one running task. At that moment another thread noticed there was a free slot, found a ready task, and started it (a typical race condition); meanwhile the first thread finished suspending the remaining tasks, including the one just started by the other thread.
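The suspected interleaving can be shown with a deterministic toy model. This is not BOINC code: the state strings, `scheduler_step`, and `suspend_batch` are all invented for illustration, with the scheduler allowed to run between single-task suspend "RPCs".

```python
# Toy model of the race: the client suspends batch members one RPC at a
# time, and between RPCs the scheduler fills any free CPU slot from the
# ready queue -- even with a task that is itself about to be suspended.

def scheduler_step(tasks):
    # One CPU slot: if nothing is running, start the first ready task.
    if not any(state == "running" for state in tasks.values()):
        for name, state in tasks.items():
            if state == "ready":
                tasks[name] = "running"
                return name
    return None

def suspend_batch(tasks, batch, briefly_ran):
    # Each suspend is a separate RPC; the scheduler runs in between,
    # which is exactly the race window described in the report.
    for name in batch:
        tasks[name] = "suspended"
        started = scheduler_step(tasks)
        if started is not None and started in batch:
            briefly_ran.append(started)

tasks = {"t0": "running", "t1": "ready", "t2": "ready"}
briefly_ran = []
suspend_batch(tasks, ["t0", "t1", "t2"], briefly_ran)
print(tasks)        # every task ends up suspended...
print(briefly_ran)  # ...but t1 and t2 each briefly ran first
```

Suspending `t0` frees the slot, so the scheduler starts `t1` before its own suspend RPC arrives, then the same happens to `t2`: the end state is correct, but each "suspended" task accrued a moment of run time, matching the 1-second elapsed time in the report.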
This should be fixed, in my opinion. It could lead to bigger problems on 8+ core systems running a lot of projects.
Event log:
task faah19421_ZINC17130909_xmdEq_1TW7_02_0 is running
task HFCC_L4_01202033_L4_0001_0 is in the group selected for suspending, but it runs for 1 second
Migrated-From: http://boinc.berkeley.edu/trac/ticket/1048