In a previous commit, the detection of a failure became too aggressive. #386
This remediates the problem by considering a run 'failed' only if the heartbeat has not been updated within the heartbeat_cutoff time, as opposed to the heartbeat_threshold time.
@@ -130,7 +130,7 @@ def select_columns(self):
  WHEN end_attempt_ok IS NOT NULL AND end_attempt_ok.value IS FALSE
  THEN 'failed'
  WHEN {table_name}.last_heartbeat_ts IS NOT NULL
- AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold}
+ AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff}
The default for the cutoff is 1 day, whereas the threshold is 1 minute. This would result in failed runs remaining stuck in the "running" state for a whole day, instead of being detected within a minute or so.
Do you have some examples of scenarios where the threshold is too eager? Is it that a running run is being marked as failed, before flipping back to "running" when heartbeats resume?
I assume this is because running tasks are the only ones refreshing the run-level heartbeat, and sometimes all new tasks can get stuck in the scheduler, so refreshing the run heartbeat stops momentarily?
The threshold is fully configurable as well, but it affects status/duration/finished_at. Decoupling the status from the other two can also lead to odd visuals, where a run is still "running" but already has a finished_at and a duration.
Maybe a separate run-level threshold?
I believe we already use the cutoff (see line 154 for duration, for example). And yes, the issue was that 1 minute was too short, because tasks very frequently (though not always) take more than one minute to be scheduled, find a node, and start executing. Internally, I have set the cutoff to something like 5 minutes so it's not 1h. I think that the cutoff makes more sense as well. We can add another configuration option too if you want, but 1 minute is definitely too short.
Note that it used to be different. I simplified this to not make it depend on whether or not there was a successful last task. Before, it would wait for longer if there was a successful last task or something like that.
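To make the trade-off in this thread concrete, here is a minimal Python sketch of the heartbeat branch of the status CASE. It is not the service's code: `run_status`, its parameters, and the "completed" label are hypothetical, and the fall-through to 'failed' for a stale heartbeat is an assumption based on the PR description. The 1-minute and 1-day values are the defaults quoted in the review, and 5 minutes is the internal setting mentioned above. `abs()` mirrors the PostgreSQL `@` absolute-value operator used in the query.

```python
from typing import Optional

HEARTBEAT_THRESHOLD = 60         # seconds; the 1-minute default quoted in the review
HEARTBEAT_CUTOFF = 24 * 60 * 60  # seconds; the 1-day default quoted in the review


def run_status(now: float,
               last_heartbeat_ts: Optional[float],
               end_attempt_ok: Optional[bool],
               window: float) -> str:
    """Simplified, hypothetical model of the status CASE expression."""
    # An explicit end-of-attempt marker always wins over the heartbeat.
    if end_attempt_ok is not None:
        return "completed" if end_attempt_ok else "failed"
    # A heartbeat inside the window means the run is still considered alive.
    if last_heartbeat_ts is not None and abs(now - last_heartbeat_ts) <= window:
        return "running"
    # Assumption: a stale or missing heartbeat is reported as a failure.
    return "failed"


now = 1_000_000.0
waiting_on_scheduler = now - 180   # last heartbeat 3 minutes ago
crashed_long_ago = now - 2 * 3600  # last heartbeat 2 hours ago

# A 1-minute window flags a run that is only waiting for its next task.
print(run_status(now, waiting_on_scheduler, None, HEARTBEAT_THRESHOLD))
# A 5-minute window tolerates the scheduling gap.
print(run_status(now, waiting_on_scheduler, None, 5 * 60))
# A 1-day window keeps a run that died two hours ago in "running".
print(run_status(now, crashed_long_ago, None, HEARTBEAT_CUTOFF))
```

Under these assumptions the window size trades false failures during scheduling gaps against the delay before a dead run is reported, which is why a cutoff of a few minutes was suggested.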
@@ -111,13 +111,12 @@ def select_columns(self):
  WHEN end_attempt_ok IS NOT NULL
  THEN end_attempt_ok.ts_epoch
  WHEN {table_name}.last_heartbeat_ts IS NOT NULL
- AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold}
+ AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff}
Changed this to heartbeat_cutoff as well, so we don't end up in a situation where a run has a finished_at timestamp yet is still marked as "running". At a glance, it seemed that the query prior to the changes in #338 was also relying on the cutoff rather than the threshold for this case.
Does the change seem reasonable, @romain-intel?
LGTM. Thanks for catching that.
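The inconsistency this second hunk avoids can be sketched in Python as follows. This is an illustration only: `is_running` and `finished_at` are hypothetical helpers, and the fallback of finished_at to the last heartbeat for a stale run is an assumption, not something shown in the diff.

```python
from typing import Optional


def is_running(now: float, last_heartbeat_ts: Optional[float], window: float) -> bool:
    """True while the last heartbeat lies inside the window (hypothetical helper)."""
    return last_heartbeat_ts is not None and abs(now - last_heartbeat_ts) <= window


def finished_at(now: float,
                last_heartbeat_ts: Optional[float],
                end_ts: Optional[float],
                window: float) -> Optional[float]:
    """Simplified, hypothetical model of the finished_at CASE expression."""
    if end_ts is not None:
        return end_ts      # explicit end marker, as in the first WHEN branch
    if is_running(now, last_heartbeat_ts, window):
        return None        # still alive, so there is no end time yet
    # Assumption: a stale run is treated as having ended at its last heartbeat.
    return last_heartbeat_ts


now = 1_000_000.0
last_hb = now - 180        # heartbeat 3 minutes old
threshold, cutoff = 60, 5 * 60

# Mixed windows: status uses the cutoff, finished_at still uses the threshold.
mixed = (is_running(now, last_hb, cutoff), finished_at(now, last_hb, None, threshold))
# Same window for both, as after this change.
aligned = (is_running(now, last_hb, cutoff), finished_at(now, last_hb, None, cutoff))

print(mixed)    # running, yet with a finished_at timestamp
print(aligned)  # running, with no finished_at
```

With a single window driving both expressions, finished_at is empty exactly when the heartbeat branch reports the run as running.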
* Upgrade Github actions used in `dockerimage` action (#379)
* upgrade github actions used in dockerimage action
* remove setup-buildx-action and pin to hashes.
* change deprecated pkg_resources to importlib.metadata (#387)
* In a previous commit, the detection of a failure became too aggressive. (#386)
* In a previous commit, the detection of a failure became too aggressive. This remediates this by considering a run 'failed' if the hb hasn't been updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time
* change run finished at query to heartbeat_cutoff from threshold
* clean up unused values from run query
---------
Co-authored-by: Sakari Ikonen <sakari.a.ikonen@gmail.com>
* fix PATH_PREFIX handling in metadata service so it doesn't interfere with mfgui routes (#388)
* Configurable SSL Connection (#373)
* [TRIS-297] Configurable SSL Connection (#1)
* Configurable SSL connection
* Update services/utils/__init__.py
* no ssl unit testing (#3)
* ssl seperate test (#4)
* dsn generator sslmode none (#5)
* fix run_goose.py not working without SSL mode env variables. (#390)
* change run inactive cutoff default to 6 minutes. cleanup unused constant (#392)
* clarify comment on read replica hosts
* make USE_SEPARATE_READER_POOL a boolean
* remove unnecessary conditionals for pool choice in execute_sql
---------
Co-authored-by: Tom Furmston <tfurmston@googlemail.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: RikishK <69884402+RikishK@users.noreply.github.com>
…nection pools. (#344)
* Changes for using a separate reader pool for Aurora-like use cases
* Avoid some expensive logging operations when not needed
* Refactoring execute_sql implementations and separating reader/writer endpoints choosing the right pool in execute_sql
* Adding documentation for using separate reader pools
* use [PREFIX]_READ_REPLICA_HOST as a feature gate instead of localhost
* In a previous commit, the detection of a failure became too aggressive. This remediates this by considering a run 'failed' if the hb hasn't been updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time
* Patch pjoshi aurora (#395)
* Upgrade Github actions used in `dockerimage` action (#379)
* upgrade github actions used in dockerimage action
* remove setup-buildx-action and pin to hashes.
* change deprecated pkg_resources to importlib.metadata (#387)
* In a previous commit, the detection of a failure became too aggressive. (#386)
* In a previous commit, the detection of a failure became too aggressive. This remediates this by considering a run 'failed' if the hb hasn't been updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time
* change run finished at query to heartbeat_cutoff from threshold
* clean up unused values from run query
---------
Co-authored-by: Sakari Ikonen <sakari.a.ikonen@gmail.com>
* fix PATH_PREFIX handling in metadata service so it doesn't interfere with mfgui routes (#388)
* Configurable SSL Connection (#373)
* [TRIS-297] Configurable SSL Connection (#1)
* Configurable SSL connection
* Update services/utils/__init__.py
* no ssl unit testing (#3)
* ssl seperate test (#4)
* dsn generator sslmode none (#5)
* fix run_goose.py not working without SSL mode env variables. (#390)
* change run inactive cutoff default to 6 minutes. cleanup unused constant (#392)
* clarify comment on read replica hosts
* make USE_SEPARATE_READER_POOL a boolean
* remove unnecessary conditionals for pool choice in execute_sql
---------
Co-authored-by: Tom Furmston <tfurmston@googlemail.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: RikishK <69884402+RikishK@users.noreply.github.com>
* fix broken connection string after conflict resolve
* make codestyles happy
* fix test cases
* cleanup
* merge run_goose.py from master
* revert unnecessary changes
---------
Co-authored-by: Preetam Joshi <preetamj@netflix.com>
Co-authored-by: Romain Cledat <rcledat@netflix.com>
Co-authored-by: Chaoying Wang <chaoyingw@netflix.com>
Co-authored-by: Sakari Ikonen <64256562+saikonen@users.noreply.github.com>
Co-authored-by: Tom Furmston <tfurmston@googlemail.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: RikishK <69884402+RikishK@users.noreply.github.com>
Co-authored-by: Sakari Ikonen <sakari.a.ikonen@gmail.com>