In a previous commit, the detection of a failure became too aggressive. #386

Merged: romain-intel merged 3 commits into master on Sep 5, 2023

romain-intel
Contributor

This fixes the problem by considering a run 'failed' only if the heartbeat hasn't been updated within the heartbeat_cutoff time, as opposed to the heartbeat_threshold time.

@romain-intel romain-intel requested a review from saikonen August 24, 2023 16:05
@@ -130,7 +130,7 @@ def select_columns(self):
 WHEN end_attempt_ok IS NOT NULL AND end_attempt_ok.value IS FALSE
 THEN 'failed'
 WHEN {table_name}.last_heartbeat_ts IS NOT NULL
-AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold}
+AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff}
Collaborator

The default for the cutoff is 1 day, whereas the threshold is 1 minute. This would cause failed runs to remain stuck in the "running" state for a whole day, instead of being detected within a minute or so.

Do you have some examples of scenarios where the threshold is too eager? Is it that a running run is being marked as failed before flipping back to "running" when heartbeats resume?

I assume this is because running tasks are the only ones refreshing a run-level heartbeat, and sometimes all new tasks can get stuck in the scheduler, so refreshing the run heartbeat stops momentarily?
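
To make the trade-off concrete, here is a minimal Python sketch of the status branches visible in the hunk above. It is not the service's actual code: `run_status` and both constants are hypothetical names, the values are taken from the "1 day" and "1 minute" figures in this comment, and the `'running'` result plus the final `'failed'` fallthrough are assumptions, since the hunk does not show those parts of the CASE expression.

```python
from typing import Optional

# Values taken from the review comment; the constant names are hypothetical.
HEARTBEAT_THRESHOLD = 60            # seconds ("1 minute")
HEARTBEAT_CUTOFF = 24 * 60 * 60     # seconds ("1 day")


def run_status(
    now: float,
    last_heartbeat_ts: Optional[float],
    end_attempt_ok: Optional[bool],
    window: int,
) -> str:
    """Mirror the two CASE branches shown in the diff for a single run."""
    # WHEN end_attempt_ok IS NOT NULL AND end_attempt_ok.value IS FALSE
    if end_attempt_ok is False:
        return "failed"
    # WHEN last_heartbeat_ts IS NOT NULL AND @(now - last_heartbeat_ts) <= window
    # (Postgres prefix `@` is absolute value.)
    if last_heartbeat_ts is not None and abs(now - last_heartbeat_ts) <= window:
        return "running"  # assumption: the THEN clause is not shown in the hunk
    return "failed"       # assumption: fallthrough when the heartbeat is stale


# A run whose last heartbeat is 5 minutes old:
print(run_status(1000.0, 700.0, None, HEARTBEAT_THRESHOLD))  # failed
print(run_status(1000.0, 700.0, None, HEARTBEAT_CUTOFF))     # running
```

With the 1-minute window, a run that is merely waiting on the scheduler is reported as failed; with the 1-day window, a run that has truly died keeps reporting "running" until the window expires.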

Collaborator

the threshold is fully configurable as well, but it affects status/duration/finished_at. Decoupling the status from the other two can also lead to odd visuals, where a run is still "running", but it has a finished_at and duration

maybe a separate run level threshold?

Contributor Author

I believe we already use cutoff (see line 154 for duration, for example). And yes, the issue was that 1 minute was too short, because tasks pretty much always take more than one minute to be scheduled, find a node and start executing (not always, but very frequently). Internally, I have set the cutoff to something like 5 minutes, so it's not 1h. I think that cutoff makes more sense as well. We can have another configuration too if you want, but 1 minute is definitely too short.

Contributor Author

Note that it used to be different. I simplified this to not make it depend on whether or not there was a successful last task. Before, it would wait for longer if there was a successful last task or something like that.

@saikonen saikonen added the "in review" label Aug 24, 2023
@@ -111,13 +111,12 @@ def select_columns(self):
 WHEN end_attempt_ok IS NOT NULL
 THEN end_attempt_ok.ts_epoch
 WHEN {table_name}.last_heartbeat_ts IS NOT NULL
-AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_threshold}
+AND @(extract(epoch from now())-{table_name}.last_heartbeat_ts)<={heartbeat_cutoff}
Collaborator

changed this to heartbeat_cutoff as well so we don't end up in a situation where a run has a finished_at timestamp, yet is still marked as "running". At a glance it seemed that the query prior to changes in #338 was also relying on cutoff rather than threshold for this case.

Does the change seem reasonable @romain-intel ?
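
The inconsistency described in this comment can be reproduced with a small Python sketch. Everything here is illustrative: the function names are hypothetical, and the values returned for `finished_at` (nothing while the heartbeat is fresh, otherwise the last heartbeat time) are assumptions, because the hunk does not show the THEN and ELSE clauses.

```python
from typing import Optional

THRESHOLD = 60          # seconds; hypothetical short window
CUTOFF = 24 * 60 * 60   # seconds; hypothetical long window


def heartbeat_is_fresh(now: float, last_hb: Optional[float], window: int) -> bool:
    """True when a heartbeat exists and is no older than `window` seconds."""
    return last_hb is not None and abs(now - last_hb) <= window


def status(now: float, last_hb: Optional[float], window: int) -> str:
    return "running" if heartbeat_is_fresh(now, last_hb, window) else "failed"


def finished_at(now: float, last_hb: Optional[float], window: int) -> Optional[float]:
    # Assumption: a fresh heartbeat means "not finished yet" (None);
    # a stale one falls back to the last heartbeat time.
    return None if heartbeat_is_fresh(now, last_hb, window) else last_hb


now, last_hb = 1000.0, 700.0  # heartbeat is 5 minutes old

# Mixed windows: the run is "running" yet already has a finished_at value.
print(status(now, last_hb, CUTOFF), finished_at(now, last_hb, THRESHOLD))  # running 700.0

# Same window for both columns: the two values agree.
print(status(now, last_hb, CUTOFF), finished_at(now, last_hb, CUTOFF))     # running None
```

Using one window for both columns keeps the two derived values consistent, whichever window is chosen.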

@romain-intel
Contributor Author

LGTM. Thanks for catching that.

@romain-intel romain-intel merged commit 749efcf into master Sep 5, 2023
saikonen added a commit that referenced this pull request Oct 25, 2023
* Upgrade Github actions used in `dockerimage` action (#379)

* upgrade github actions used in dockerimage action

* remove setup-buildx-action and pin to hashes.

* change deprecated pkg_resources to importlib.metadata (#387)

* In a previous commit, the detection of a failure became too aggressive. (#386)

* In a previous commit, the detection of a failure became too aggressive.

This remediates this by considering a run 'failed' if the hb hasn't been
updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time

* change run finished at query to heartbeat_cutoff from threshold

* clean up unused values from run query

---------

Co-authored-by: Sakari Ikonen <sakari.a.ikonen@gmail.com>

* fix PATH_PREFIX handling in metadata service so it doesn't interfere with mfgui routes (#388)

* Configurable SSL Connection (#373)

* [TRIS-297] Configurable SSL Connection (#1)

* Configurable SSL connection

* Update services/utils/__init__.py

* no ssl unit testing (#3)

* ssl seperate test (#4)

* dsn generator sslmode none (#5)

* fix run_goose.py not working without SSL mode env variables. (#390)

* change run inactive cutoff default to 6 minutes. cleanup unused constant (#392)

* clarify comment on read replica hosts

* make USE_SEPARATE_READER_POOL a boolean

* remove unnecessary conditionals for pool choice in execute_sql

---------

Co-authored-by: Tom Furmston <tfurmston@googlemail.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: RikishK <69884402+RikishK@users.noreply.github.com>
saikonen added a commit that referenced this pull request Oct 30, 2023
…nection pools. (#344)

* Changes for using a separate reader pool for Aurora-like use cases

* Avoid some expensive logging operations when not needed

* Refactoring execute_sql implementations and separating reader/writer endpoints

choosing the right pool in execute_sql

* Adding documentation for using separate reader pools

* use [PREFIX]_READ_REPLICA_HOST as a feature gate instead of localhost

* In a previous commit, the detection of a failure became too aggressive.

This remediates this by considering a run 'failed' if the hb hasn't been
updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time

* Patch pjoshi aurora (#395)

* Upgrade Github actions used in `dockerimage` action (#379)

* upgrade github actions used in dockerimage action

* remove setup-buildx-action and pin to hashes.

* change deprecated pkg_resources to importlib.metadata (#387)

* In a previous commit, the detection of a failure became too aggressive. (#386)

* In a previous commit, the detection of a failure became too aggressive.

This remediates this by considering a run 'failed' if the hb hasn't been
updated within heartbeat_cutoff time as opposed to the heartbeat_threshold time

* change run finished at query to heartbeat_cutoff from threshold

* clean up unused values from run query

---------

Co-authored-by: Sakari Ikonen <sakari.a.ikonen@gmail.com>

* fix PATH_PREFIX handling in metadata service so it doesn't interfere with mfgui routes (#388)

* Configurable SSL Connection (#373)

* [TRIS-297] Configurable SSL Connection (#1)

* Configurable SSL connection

* Update services/utils/__init__.py

* no ssl unit testing (#3)

* ssl seperate test (#4)

* dsn generator sslmode none (#5)

* fix run_goose.py not working without SSL mode env variables. (#390)

* change run inactive cutoff default to 6 minutes. cleanup unused constant (#392)

* clarify comment on read replica hosts

* make USE_SEPARATE_READER_POOL a boolean

* remove unnecessary conditionals for pool choice in execute_sql

---------

Co-authored-by: Tom Furmston <tfurmston@googlemail.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: RikishK <69884402+RikishK@users.noreply.github.com>

* fix broken connection string after conflict resolve

* make codestyles happy

* fix test cases

* cleanup

* merge run_goose.py from master

* revert unnecessary changes

---------

Co-authored-by: Preetam Joshi <preetamj@netflix.com>
Co-authored-by: Romain Cledat <rcledat@netflix.com>
Co-authored-by: Chaoying Wang <chaoyingw@netflix.com>
Co-authored-by: Sakari Ikonen <64256562+saikonen@users.noreply.github.com>
Co-authored-by: Tom Furmston <tfurmston@googlemail.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>
Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: RikishK <69884402+RikishK@users.noreply.github.com>
Co-authored-by: Sakari Ikonen <sakari.a.ikonen@gmail.com>
@saikonen saikonen deleted the less_agressive_timeout branch November 3, 2023 16:00