Heartbeat timeout on long running tasks #359

nstott · 2019-02-13T01:28:48Z

Hi all, I'm relatively new to simpleflow and am having some trouble understanding the best practice for long-running jobs.

My workflow consists of a few tasks, one of which involves running an external process to crunch some data, and can take anywhere between 1 and 2 hours.

While this long task is running, the worker doesn't seem to send heartbeats, so I've set the heartbeat timeout to an unreasonably long value so that the SWF task doesn't fail due to a timeout.

The problem is that my worker processes occasionally crash (OOM, or other general Kubernetes malfeasance), and because of the long heartbeat timeout, the workflow doesn't retry the failed task until the very end.

I'm looking for a way to keep sending heartbeats while the worker is occupied, or some other way to retry quickly after a worker fails. I'm not sure what the right pattern is for this.

I'm not sure if this is related to #239
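
To make the question concrete, here is a minimal sketch of the generic "heartbeat from a background thread" pattern in plain Python. It is not simpleflow's API: `run_with_heartbeat`, `send_heartbeat`, and `interval` are hypothetical names used only for illustration, and how a simpleflow activity would get hold of something to call for a heartbeat is exactly the open question.

```python
import subprocess
import threading


def run_with_heartbeat(cmd, send_heartbeat, interval=30.0):
    """Run `cmd` as a subprocess and call `send_heartbeat()` every
    `interval` seconds until the subprocess exits.

    `send_heartbeat` is a hypothetical zero-argument callable supplied by
    the caller; this sketch does not know how it talks to SWF.
    """
    stop = threading.Event()

    def beat():
        # Event.wait() returns False on timeout and True once `stop` is set,
        # so the loop sends one heartbeat per interval and exits promptly.
        while not stop.wait(interval):
            send_heartbeat()

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        # Blocks until the external process finishes; raises
        # CalledProcessError on a non-zero exit code.
        return subprocess.run(cmd, check=True)
    finally:
        # Always stop the heartbeat thread, even if the process failed.
        stop.set()
        thread.join()
```

With something like this, the heartbeat timeout could presumably be set to a small multiple of `interval` rather than to the full task duration: if the whole worker process is killed, the heartbeat thread dies with it and the heartbeats stop. With boto3, the callable might wrap the SWF client's `record_activity_task_heartbeat(taskToken=...)`, assuming the task token is accessible from inside the task.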
