Hi all, I'm relatively new to simpleflow and am having trouble understanding the best practice for long-running jobs.
My workflow consists of a few tasks, one of which runs an external process to crunch some data and can take anywhere between 1 and 2 hours.
While this long task is running, the worker doesn't seem to send heartbeats, so I've set the heartbeat timeout to an unreasonably large value so that the SWF task doesn't fail with a timeout.
The problem is that my worker processes periodically crash (OOM, or other general Kubernetes malfeasance), and because of the long heartbeat timeout, the workflow doesn't retry the failed task until that timeout finally expires.
I'm looking for a way to keep sending heartbeats while the worker is occupied, or some other way to retry quickly when a worker fails. I'm not sure what the right pattern is for this.
I'm not sure if this is related to #239
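To make the question concrete, here is a rough sketch of the heartbeat-while-busy pattern I have in mind. It is not simpleflow API: `run_with_heartbeat` and the `send_heartbeat` callback are hypothetical names, and I'm assuming the callback would wrap whatever call actually records the heartbeat (with a plain boto3 SWF client that would presumably be `record_activity_task_heartbeat` with the task token).

```python
import subprocess
import sys
import time


def run_with_heartbeat(cmd, send_heartbeat, interval=30.0, poll=0.5):
    """Run `cmd` as a child process and call `send_heartbeat()` roughly
    every `interval` seconds until the process exits.

    `send_heartbeat` is a hypothetical callback supplied by the caller.
    Returns (exit_code, number_of_heartbeats_sent).
    """
    proc = subprocess.Popen(cmd)
    beats = 0
    next_beat = time.monotonic() + interval
    try:
        while True:
            try:
                # Wait briefly for the child; returns as soon as it exits.
                exit_code = proc.wait(timeout=poll)
                return exit_code, beats
            except subprocess.TimeoutExpired:
                pass  # still running, fall through to the heartbeat check
            if time.monotonic() >= next_beat:
                send_heartbeat()
                beats += 1
                next_beat = time.monotonic() + interval
    finally:
        # If the heartbeat call raised (for example because the task was
        # cancelled), do not leave the external process running.
        if proc.poll() is None:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    # Stand-in for the long data-crunching job: a child that sleeps 1 second.
    job = [sys.executable, "-c", "import time; time.sleep(1)"]
    code, beats = run_with_heartbeat(job, lambda: print("heartbeat"), interval=0.2)
    print("exit code:", code, "heartbeats:", beats)
```

With something like this, the heartbeat timeout could be set to a small multiple of `interval` instead of hours, so a crashed worker would be noticed after a missed heartbeat or two and the task retried quickly.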