ThreadPoolExecutor#getActiveCount returns the approximate number of active threads.
For example, if a thread has just been started and getActiveCount is called immediately afterwards, the newly started thread may not be reflected in the count.
To test this, I added some info logs to MultiThreadAgent#run as follows.
@Override
public void run()
{
    while (!stop) {
        try {
            synchronized (newTaskLock) {
                if (executor.isShutdown()) {
                    break;
                }
                int max = Math.min(executor.getMaximumPoolSize() - executor.getActiveCount(), 10);
                logger.info("max={}", max); // Added for testing
                if (max > 0) {
                    List<TaskRequest> reqs = taskServer.lockSharedAgentTasks(max, agentId, config.getLockRetentionTime(), 1000);
                    logger.info("reqs.size={}", reqs.size()); // Added for testing
                    for (TaskRequest req : reqs) {
                        executor.submit(() -> {
                            try {
                                runner.run(req);
                            }
                            ...
In this case, the task with id 160 (task +loop-10+run) has not been executed yet and is waiting in the queue.
Since this task is inactive and cannot send a heartbeat, its lock expires after 5 minutes (unless another task finishes during this time), and the task is retried with the following log.
2017-02-13 15:44:45 +0900 [WARN] (lock-expire-0) io.digdag.core.database.DatabaseTaskQueueServer: 1 task locks are expired. Tasks will be retried.
I have also confirmed that retry_count and lock_expire_time were updated on id 160 at this time.
MultiThreadAgent is the only instance that can increase the number of active threads. There are no other threads that may start another thread right after the executor.getActiveCount() call. Thus my guess at the cause is that getActiveCount() returns a snapshot of an old value, which may not reflect the previous executor.submit() call even if executor.getActiveCount() and executor.submit() are called from the same single thread.
So the solution would be something like this: instead of checking the number of active threads, MultiThreadAgent needs to monitor the number of active (non-finished) tasks.
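A minimal sketch of that idea, not taken from digdag itself: the class name ActiveTaskCounter and its methods are hypothetical, and a fixed-size pool is assumed. The counter is incremented on the submitting thread before the task is handed to the executor and decremented when the task finishes, so a task that is still waiting to start is already counted.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not part of digdag): tracks submitted-but-unfinished tasks
// instead of relying on ThreadPoolExecutor#getActiveCount.
final class ActiveTaskCounter {
    private final ExecutorService executor;
    private final int maxThreads;
    private final AtomicInteger activeTasks = new AtomicInteger();

    ActiveTaskCounter(int maxThreads) {
        this.maxThreads = maxThreads;
        this.executor = Executors.newFixedThreadPool(maxThreads);
    }

    // Free slots = pool size minus tasks that were submitted and have not finished yet.
    int availableSlots() {
        return maxThreads - activeTasks.get();
    }

    void submit(Runnable task) {
        // Count the task on the submitting thread, before the pool sees it,
        // so the next availableSlots() call always reflects this submission.
        activeTasks.incrementAndGet();
        try {
            executor.submit(() -> {
                try {
                    task.run();
                } finally {
                    // Release the slot whether the task succeeded or threw.
                    activeTasks.decrementAndGet();
                }
            });
        } catch (RuntimeException e) {
            // For example RejectedExecutionException: the task will never run.
            activeTasks.decrementAndGet();
            throw e;
        }
    }

    // Stops accepting tasks and waits for the running ones to finish.
    void shutdown() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

With something like this, the agent loop would compute max from availableSlots() rather than from getMaximumPoolSize() - getActiveCount(), so it would not lock more tasks than there are free slots.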
Summary
I often get duplicate task executions when running many tasks in parallel under the following conditions.
--max-task-threads to digdag server.
--max-task-threads in parallel.
Steps to reproduce
--max-task-threads option.
test-duplicate.dig
scripts/sleep-6min.sh
test-duplicate. The example of this execution result is:
In this case, tasks +loop-10+run and +loop-19+run were executed twice.
Details
I think the cause of this problem is how MultiThreadAgent#run starts task execution threads.
https://github.com/treasure-data/digdag/blob/v0.9.4/digdag-core/src/main/java/io/digdag/core/agent/MultiThreadAgent.java#L89
ThreadPoolExecutor#getActiveCount returns the approximate number of active threads.
For example, if a thread has just been started and getActiveCount is called immediately afterwards, the newly started thread may not be reflected in the count.
To test this, I added some info logs to MultiThreadAgent#run as follows.
The execution result is:
At this time, queued_task_locks had 11 records with an updated lock_expire_time.
In this case, the task with id 160 (task +loop-10+run) has not been executed yet and is waiting in the queue.
Since this task is inactive and cannot send a heartbeat, its lock expires after 5 minutes (unless another task finishes during this time), and the task is retried with the following log.
I have also confirmed that retry_count and lock_expire_time were updated on id 160 at this time.
This may cause the duplicate task execution.
I also confirmed that the task with id 169 (task +loop-19+run) behaved similarly.
System configuration
digdag version: 0.9.4