In a previous commit, the detection of a failure became too aggressive. #386

romain-intel · 2023-08-24T16:05:04Z

This fixes that by considering a run 'failed' only if the heartbeat hasn't been updated within the heartbeat_cutoff time, as opposed to the heartbeat_threshold time.

saikonen · 2023-08-24T16:16:15Z

services/ui_backend_service/data/db/tables/run.py

@@ -130,7 +130,7 @@ def select_columns(self):
            WHEN end_attempt_ok IS NOT NULL AND end_attempt_ok.value IS FALSE
            THEN 'failed'
            WHEN {table_name}.last_heartbeat_ts IS NOT NULL
-                AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold}
+                AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff}


The default for the cutoff is 1 day, whereas the threshold is 1 minute. This would cause failed runs to remain stuck in the "running" state for a whole day, instead of being detected within a minute or so.

Do you have some examples of scenarios where the threshold is too eager? Is it that a running run is being marked as failed before flipping back to "running" when heartbeats resume?

I assume this is because running tasks are the only ones refreshing a run-level heartbeat, and sometimes all new tasks can get stuck in the scheduler, so refreshing the run heartbeat stops momentarily?

The threshold is fully configurable as well, but it affects status/duration/finished_at. Decoupling the status from the other two can also lead to odd visuals, where a run is still "running" but has a finished_at and duration.

maybe a separate run level threshold?

romain-intel · 2023-08-27

I believe we already use cutoff (see line 154 for duration, for example). And yes, the issue was that 1 minute was too short, because tasks pretty much always take more than one minute to be scheduled, find a node, and start executing (not always, but very frequently). Internally, I have set the cutoff to something like 5 minutes so it's not 1h. I think that cutoff makes more sense as well. We can have another configuration too if you want, but 1 minute is definitely too short.

Note that it used to be different. I simplified this to not make it depend on whether or not there was a successful last task. Before, it would wait for longer if there was a successful last task or something like that.
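
To make the trade-off discussed in this thread concrete, here is a minimal Python sketch of the two status branches visible in the diff hunk. It is not the service's code: `run_status`, its parameters, and the fallthrough to 'failed' are assumptions made for illustration, and the real query has branches that are not shown in the hunk. The window values come from the thread (1 minute threshold, 1 day cutoff) and from the later change of the cutoff default to 6 minutes (#392).

```python
import time


def run_status(last_heartbeat_ts, window_seconds, end_attempt_ok=None, now=None):
    """Approximate the two CASE branches shown in the diff (illustrative only)."""
    now = time.time() if now is None else now
    # Branch 1: an explicit failed end attempt always means 'failed'.
    if end_attempt_ok is False:
        return "failed"
    # Branch 2: a heartbeat inside the window keeps the run 'running'.
    # abs() mirrors the Postgres `@` absolute-value operator used in the query.
    if last_heartbeat_ts is not None and abs(now - last_heartbeat_ts) <= window_seconds:
        return "running"
    # Assumed fallthrough: without a fresh heartbeat the run is treated as failed.
    return "failed"


NOW = 1_700_000_000           # fixed "current time" in epoch seconds
THRESHOLD = 60                # heartbeat_threshold default mentioned in the thread
ONE_DAY_CUTOFF = 24 * 60 * 60 # heartbeat_cutoff default mentioned in the thread
SIX_MIN_CUTOFF = 6 * 60       # cutoff default after #392

queued_run_hb = NOW - 180           # tasks waiting in the scheduler for 3 minutes
crashed_run_hb = NOW - 2 * 60 * 60  # last heartbeat 2 hours ago

for label, window in [
    ("threshold (60s)", THRESHOLD),
    ("cutoff (1 day)", ONE_DAY_CUTOFF),
    ("cutoff (6 min)", SIX_MIN_CUTOFF),
]:
    print(
        label,
        "| queued run:", run_status(queued_run_hb, window, now=NOW),
        "| crashed run:", run_status(crashed_run_hb, window, now=NOW),
    )
```

Under these assumptions, the 1 minute window marks the queued run as failed, the 1 day window leaves the crashed run "running", and a window of a few minutes handles both cases as intended.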

saikonen · 2023-09-05T14:43:46Z

services/ui_backend_service/data/db/tables/run.py

@@ -111,13 +111,12 @@ def select_columns(self):
            WHEN end_attempt_ok IS NOT NULL
            THEN end_attempt_ok.ts_epoch
            WHEN {table_name}.last_heartbeat_ts IS NOT NULL
-                AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold}
+                AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff}


Changed this to heartbeat_cutoff as well, so we don't end up in a situation where a run has a finished_at timestamp yet is still marked as "running". At a glance, it seemed that the query prior to the changes in #338 was also relying on cutoff rather than threshold for this case.

Does the change seem reasonable, @romain-intel?

romain-intel · 2023-09-05T16:00:51Z

LGTM. Thanks for catching that.
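
The inconsistency this thread fixes can be illustrated with a small Python sketch. It assumes, without confirmation from the hunk, that a run shows a finished_at value once its heartbeat falls outside the window used by the finished_at branch; `heartbeat_is_fresh` and `describe_run` are hypothetical helpers, not functions from the service.

```python
def heartbeat_is_fresh(heartbeat_age_s, window_s):
    """True if the last heartbeat falls inside the given window (seconds)."""
    return abs(heartbeat_age_s) <= window_s


def describe_run(heartbeat_age_s, status_window_s, finished_window_s):
    """Return (status, has_finished_at) for a run with no end attempt recorded.

    Assumption: status and finished_at are each derived from a heartbeat
    freshness check, possibly against different windows.
    """
    status = "running" if heartbeat_is_fresh(heartbeat_age_s, status_window_s) else "failed"
    has_finished_at = not heartbeat_is_fresh(heartbeat_age_s, finished_window_s)
    return status, has_finished_at


THRESHOLD = 60        # seconds
CUTOFF = 24 * 60 * 60 # seconds
age = 180             # last heartbeat 3 minutes ago

# Status on cutoff, finished_at still on threshold: the mixed state.
print(describe_run(age, status_window_s=CUTOFF, finished_window_s=THRESHOLD))

# Both on cutoff, as after this change: status and finished_at agree.
print(describe_run(age, status_window_s=CUTOFF, finished_window_s=CUTOFF))
```

With mismatched windows, any heartbeat age between the threshold and the cutoff yields a "running" run that also has a finished_at; using the same window for both removes that band.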

* Upgrade Github actions used in `dockerimage` action (#379)
* upgrade github actions used in dockerimage action
* remove setup-buildx-action and pin to hashes.
* change deprecated pkg_resources to importlib.metadata (#387)
* In a previous commit, the detection of a failure became too aggressive. (#386)
* In a previous commit, the detection of a failure became too aggressive. This remediates this by considering a run 'failed' if the hb hasn't been updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time
* change run finished at query to heartbeat_cutoff from threshold
* clean up unused values from run query

---------

Co-authored-by: Sakari Ikonen <sakari.a.ikonen@gmail.com>

* fix PATH_PREFIX handling in metadata service so it doesn't interfere with mfgui routes (#388)
* Configurable SSL Connection (#373)
* [TRIS-297] Configurable SSL Connection (#1)
* Configurable SSL connection
* Update services/utils/__init__.py
* no ssl unit testing (#3)
* ssl seperate test (#4)
* dsn generator sslmode none (#5)
* fix run_goose.py not working without SSL mode env variables. (#390)
* change run inactive cutoff default to 6 minutes. cleanup unused constant (#392)
* clarify comment on read replica hosts
* make USE_SEPARATE_READER_POOL a boolean
* remove unnecessary conditionals for pool choice in execute_sql

---------

Co-authored-by: Tom Furmston <tfurmston@googlemail.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: RikishK <69884402+RikishK@users.noreply.github.com>

…nection pools. (#344)

* Changes for using a separate reader pool for Aurora-like use cases
* Avoid some expensive logging operations when not needed
* Refactoring execute_sql implementations and separating reader/writer endpoints choosing the right pool in execute_sql
* Adding documentation for using separate reader pools
* use [PREFIX]_READ_REPLICA_HOST as a feature gate instead of localhost
* In a previous commit, the detection of a failure became too aggressive. This remediates this by considering a run 'failed' if the hb hasn't been updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time
* Patch pjoshi aurora (#395)
* Upgrade Github actions used in `dockerimage` action (#379)
* upgrade github actions used in dockerimage action
* remove setup-buildx-action and pin to hashes.
* change deprecated pkg_resources to importlib.metadata (#387)
* In a previous commit, the detection of a failure became too aggressive. (#386)
* In a previous commit, the detection of a failure became too aggressive. This remediates this by considering a run 'failed' if the hb hasn't been updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time
* change run finished at query to heartbeat_cutoff from threshold
* clean up unused values from run query

---------

Co-authored-by: Sakari Ikonen <sakari.a.ikonen@gmail.com>

* fix PATH_PREFIX handling in metadata service so it doesn't interfere with mfgui routes (#388)
* Configurable SSL Connection (#373)
* [TRIS-297] Configurable SSL Connection (#1)
* Configurable SSL connection
* Update services/utils/__init__.py
* no ssl unit testing (#3)
* ssl seperate test (#4)
* dsn generator sslmode none (#5)
* fix run_goose.py not working without SSL mode env variables. (#390)
* change run inactive cutoff default to 6 minutes. cleanup unused constant (#392)
* clarify comment on read replica hosts
* make USE_SEPARATE_READER_POOL a boolean
* remove unnecessary conditionals for pool choice in execute_sql

---------

Co-authored-by: Tom Furmston <tfurmston@googlemail.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: RikishK <69884402+RikishK@users.noreply.github.com>

* fix broken connection string after conflict resolve
* make codestyles happy
* fix test cases
* cleanup
* merge run_goose.py from master
* revert unnecessary changes

---------

Co-authored-by: Preetam Joshi <preetamj@netflix.com>
Co-authored-by: Romain Cledat <rcledat@netflix.com>
Co-authored-by: Chaoying Wang <chaoyingw@netflix.com>
Co-authored-by: Sakari Ikonen <64256562+saikonen@users.noreply.github.com>
Co-authored-by: Tom Furmston <tfurmston@googlemail.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: RikishK <69884402+RikishK@users.noreply.github.com>
Co-authored-by: Sakari Ikonen <sakari.a.ikonen@gmail.com>

In a previous commit, the detection of a failure became too aggressive.

459c18d

This remediates this by considering a run 'failed' if the hb hasn't been updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time

romain-intel requested a review from saikonen August 24, 2023 16:05

saikonen reviewed Aug 24, 2023

saikonen added the "in review" label Aug 24, 2023

saikonen added 2 commits September 5, 2023 17:37

change run finished at query to heartbeat_cutoff from threshold

a178157

clean up unused values from run query

a8c6777

saikonen reviewed Sep 5, 2023

saikonen approved these changes Sep 5, 2023

romain-intel merged commit 749efcf into master Sep 5, 2023

saikonen deleted the less_agressive_timeout branch November 3, 2023 16:00
