Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In a previous commit, the detection of a failure became too aggressive. #386

Merged
merged 3 commits into from
Sep 5, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 2 additions & 6 deletions services/ui_backend_service/data/db/tables/run.py
Original file line number Diff line number Diff line change
Expand Up @@ -111,13 +111,12 @@ def select_columns(self):
WHEN end_attempt_ok IS NOT NULL
THEN end_attempt_ok.ts_epoch
WHEN {table_name}.last_heartbeat_ts IS NOT NULL
AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold}
AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed this to heartbeat_cutoff as well so we don't end up in a situation where a run has a finished_at timestamp, yet is still marked as "running". At a glance it seemed that the query prior to changes in #338 was also relying on cutoff rather than threshold for this case.

Does the change seem reasonable @romain-intel ?

THEN NULL
ELSE {table_name}.last_heartbeat_ts*1000
END) AS finished_at
""".format(
table_name=table_name,
heartbeat_threshold=HEARTBEAT_THRESHOLD,
heartbeat_cutoff=RUN_INACTIVE_CUTOFF_TIME,
),
"""
Expand All @@ -130,14 +129,12 @@ def select_columns(self):
WHEN end_attempt_ok IS NOT NULL AND end_attempt_ok.value IS FALSE
THEN 'failed'
WHEN {table_name}.last_heartbeat_ts IS NOT NULL
AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold}
AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default for the cutoff is 1 day, whereas the threshold is 1 minute. This would result in failed runs to remain stuck in "running" state for a whole day, instead of being detected within a minute or so.

Do you have some examples of the scenarios where the threshold is too eager? Is it that a running run is being marked as failed, before flipping back to "running" when heartbeats resume.

I assume this is because running tasks are the only ones refreshing a run level heartbeat, and sometimes all new tasks can get stuck in the scheduler so refreshing the run heartbeat stops momentarily?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the threshold is fully configurable as well, but it affects status/duration/finished_at. Decoupling the status from the other two can also lead to odd visuals, where a run is still "running", but it has a finished_at and duration

maybe a separate run level threshold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we already use cutoff (see line 154 for duration for example). And yes, the issue was that 1minute was too short because tasks pretty much always take more than one minute to be scheduled and find a node and start executing (not always but very frequently). Internally, I have set the cuttoff to something like 5 minutes so it's not 1h. I think that cutoff makes more sense as well. We can have another configuration too if you want but 1 minute is definitely too short.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it used to be different. I simplified this to not make it depend on whether or not there was a successful last task. Before, it would wait for longer if there was a successful last task or something like that.

THEN 'running'
ELSE 'failed'
END) AS status
""".format(
table_name=table_name,
heartbeat_threshold=HEARTBEAT_THRESHOLD,
cutoff=OLD_RUN_FAILURE_CUTOFF_TIME,
heartbeat_cutoff=RUN_INACTIVE_CUTOFF_TIME,
),
"""
Expand All @@ -157,7 +154,6 @@ def select_columns(self):
END) AS duration
""".format(
table_name=table_name,
heartbeat_threshold=HEARTBEAT_THRESHOLD,
cutoff=OLD_RUN_FAILURE_CUTOFF_TIME,
),
]
Expand Down