-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
In a previous commit, the detection of a failure became too aggressive. #386
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -111,13 +111,12 @@ def select_columns(self): | |
WHEN end_attempt_ok IS NOT NULL | ||
THEN end_attempt_ok.ts_epoch | ||
WHEN {table_name}.last_heartbeat_ts IS NOT NULL | ||
AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold} | ||
AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff} | ||
THEN NULL | ||
ELSE {table_name}.last_heartbeat_ts*1000 | ||
END) AS finished_at | ||
""".format( | ||
table_name=table_name, | ||
heartbeat_threshold=HEARTBEAT_THRESHOLD, | ||
heartbeat_cutoff=RUN_INACTIVE_CUTOFF_TIME, | ||
), | ||
""" | ||
|
@@ -130,14 +129,12 @@ def select_columns(self): | |
WHEN end_attempt_ok IS NOT NULL AND end_attempt_ok.value IS FALSE | ||
THEN 'failed' | ||
WHEN {table_name}.last_heartbeat_ts IS NOT NULL | ||
AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold} | ||
AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. default for the cutoff is Do you have some examples of the scenarios where the threshold is too eager? Is it that a running run is being marked as failed, before flipping back to "running" when heartbeats resume. I assume this is because running tasks are the only ones refreshing a run level heartbeat, and sometimes all new tasks can get stuck in the scheduler so refreshing the run heartbeat stops momentarily? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the threshold is fully configurable as well, but it affects maybe a separate run level threshold? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe we already use cutoff (see line 154 for duration for example). And yes, the issue was that 1minute was too short because tasks pretty much always take more than one minute to be scheduled and find a node and start executing (not always but very frequently). Internally, I have set the cuttoff to something like 5 minutes so it's not 1h. I think that cutoff makes more sense as well. We can have another configuration too if you want but 1 minute is definitely too short. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note that it used to be different. I simplified this to not make it depend on whether or not there was a successful last task. Before, it would wait for longer if there was a successful last task or something like that. |
||
THEN 'running' | ||
ELSE 'failed' | ||
END) AS status | ||
""".format( | ||
table_name=table_name, | ||
heartbeat_threshold=HEARTBEAT_THRESHOLD, | ||
cutoff=OLD_RUN_FAILURE_CUTOFF_TIME, | ||
heartbeat_cutoff=RUN_INACTIVE_CUTOFF_TIME, | ||
), | ||
""" | ||
|
@@ -157,7 +154,6 @@ def select_columns(self): | |
END) AS duration | ||
""".format( | ||
table_name=table_name, | ||
heartbeat_threshold=HEARTBEAT_THRESHOLD, | ||
cutoff=OLD_RUN_FAILURE_CUTOFF_TIME, | ||
), | ||
] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed this to
heartbeat_cutoff
as well so we don't end up in a situation where a run has afinished_at
timestamp, yet is still marked as "running". At a glance it seemed that the query prior to changes in #338 was also relying on cutoff rather than threshold for this case.Does the change seem reasonable @romain-intel ?