[Platform] Health Check should provide some alert if it isn't running #6581
Comments
A combination of email alert and indication in the UI there is some problem with HealthChecks would be ideal. cc @chirag-yb
SergeyPotachev added a commit that referenced this issue Jan 15, 2021
…6581

Summary:
- Code for email sending moved from cluster_health.py to Java code;
- Added some alerts when health checker fails by own reasons;

The HTML preparation is left the same as it was in Python code. The proposed way to use templates is not implemented here and will not be implemented here as it requires a separate issue.

Test Plan:
Common scenarios:
1. Check that health emails are correctly formed;
2. Check that emails with alerts are sent;
3. Check that occasional errors in cluster_health.py lead to alert emails;

Repeat scenarios 1-3 for different SMTP settings in Customer's Profile, for different states of flags "Only include errors in alert emails", "Send backup failure notification".

Some useful information can be found here:
[[ https://phabricator.dev.yugabyte.com/D7816 | D7816 ]]
[[ https://phabricator.dev.yugabyte.com/D8309 | D8309 ]]
Also test scenarios from [[ https://phabricator.dev.yugabyte.com/D9660 | D9660 ]].

Reviewers: sanketh, daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10177
SergeyPotachev added a commit that referenced this issue Jan 17, 2021
Summary:
The problem introduced while implementing #6581. Actual fix is very small - a few lines in EmailHelper.java: lines 116-118 + 125 and in line 216
Other lines - junit tests improvements + small logic corrections (according to changes in EmailHelper).

Test Plan:
Test cases from #6581 + one specific scenario:
1. New customer
2. SMTP is never configured (see picture from the issue).
3. There is a universe with failures (in health checks, as example).

Check that there are no exceptions in YW logs related to emails sending (with alerts).

Reviewers: daniel, sb-yb
Reviewed By: sb-yb
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10380
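
The "SMTP is never configured" scenario above can be illustrated with a minimal Python sketch. This is not the code from EmailHelper.java; `SmtpSettings`, `send_alert_email`, `transport`, and `log` are hypothetical names, and the sketch assumes the intended behavior is to skip the email and log a message instead of raising.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SmtpSettings:
    """Hypothetical stand-in for a customer's SMTP configuration."""
    host: str
    port: int = 25


def send_alert_email(
    subject: str,
    settings: Optional[SmtpSettings],
    transport: Callable[[SmtpSettings, str], None],
    log: Callable[[str], None],
) -> bool:
    """Send an alert email, or skip quietly when SMTP was never configured.

    Returns True if the email was handed to the transport, False if skipped.
    """
    if settings is None or not settings.host:
        # Assumed behavior: a missing configuration is logged, not raised.
        log("SMTP is not configured; skipping alert email: " + subject)
        return False
    transport(settings, subject)
    return True
```

The point of the guard is that a new customer without SMTP settings produces one log line per skipped alert rather than an exception in the logs.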
SergeyPotachev added a commit that referenced this issue Jan 19, 2021
…in YW logs) #6911

Summary:
Actually the problem was existing before the submit made in #6581 but it is quite rare case + it had less exceptions in logs (before).

Test Plan:
The scenario is very hard to reproduce without having locally deployed YW (need to look logs at runtime).
1. Take normal universe without any active alerts;
2. Make something that will lead to health check failure;
3. Wait while health check is started for this universe (it is easier to do this having logs running in another terminal window);
4. Delete the universe right after the step 3 is started.
5. Observe logs after the health check is finished for some time (1-2 minutes are enough).

Expected result: There are no continuously appearing exceptions in logs (related to an alert with the disappeared universe). It can be one exception at the beginning of the observations but not more (or no exceptions at all).
Actual result: There are exceptions appearing each minute.

Reviewers: daniel
Reviewed By: daniel
Subscribers: yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10390
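
One way to get the expected result above (no exceptions repeating every minute for an alert whose universe has been deleted) is sketched below in Python. The thread does not describe how the actual fix works, so this is an assumption: `Alert`, `process_alerts`, and `notify` are hypothetical names, and the sketch simply closes alerts that point at a universe that no longer exists.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass
class Alert:
    """Hypothetical alert record tied to a universe."""
    universe_id: str
    message: str
    active: bool = True


def process_alerts(
    alerts: Iterable[Alert],
    existing_universes: Set[str],
    notify: Callable[[Alert], None],
) -> int:
    """Notify for active alerts; close alerts whose universe is gone.

    Returns the number of alerts closed in this pass.
    """
    closed = 0
    for alert in alerts:
        if not alert.active:
            continue
        if alert.universe_id not in existing_universes:
            # The universe was deleted: close the alert so the next
            # periodic pass does not trip over it again.
            alert.active = False
            closed += 1
            continue
        notify(alert)
    return closed
```

Because the orphaned alert is deactivated on the first pass, later passes skip it, which matches the "at most one exception at the beginning" expectation in the test plan.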
SergeyPotachev added a commit that referenced this issue Feb 1, 2021
… isn't running #6581

Summary:
- Code for email sending moved from cluster_health.py to Java code;
- Added some alerts when health checker fails by own reasons;

The HTML preparation is left the same as it was in Python code. The proposed way to use templates is not implemented here and will not be implemented here as it requires a separate issue.

Original diffs:
#6581 - https://phabricator.dev.yugabyte.com/D10177
#6901 - https://phabricator.dev.yugabyte.com/D10380
#6911 - https://phabricator.dev.yugabyte.com/D10390

Test Plan:
Jenkins: rebase: 2.4

Common scenarios:
1. Check that health emails are correctly formed;
2. Check that emails with alerts are sent;
3. Check that occasional errors in cluster_health.py lead to alert emails;

Repeat steps 1-3 for different SMTP settings in Customer's Profile, for different states of flags "Only include errors in alert emails", "Send backup failure notification".

Some useful information can be found here:
[[ https://phabricator.dev.yugabyte.com/D7816 | D7816 ]]
[[ https://phabricator.dev.yugabyte.com/D8309 | D8309 ]]
Also test scenarios from [[ https://phabricator.dev.yugabyte.com/D9660 | D9660 ]].
-------------------------------------------------
Additional scenario for #6901:
1. New customer
2. SMTP is never configured (see picture from the issue).
3. There is a universe with failures (in health checks, as example).

Check that there are no exceptions in YW logs related to emails sending (with alerts).
-------------------------------------------------
Additional scenario for #6911:
It is very hard to reproduce without having locally deployed YW (need to look logs at runtime).
1. Take normal universe without any active alerts;
2. Make something that will lead to health check failure;
3. Wait while health check is started for this universe (it is easier to do this having logs running in another terminal window);
4. Delete the universe right after the step 3 is started.
5. Observe logs after the health check is finished for some time (1-2 minutes are enough).

Expected result: There are no continuously appearing exceptions in logs (related to an alert with the disappeared universe). It can be one exception at the beginning of the observations but not more (or no exceptions at all).
Actual result: There are exceptions appearing each minute.
-------------------------------------------------
Reviewers: daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10496
polarweasel pushed a commit to lizayugabyte/yugabyte-db that referenced this issue Mar 9, 2021
…ugabyte#6581

Summary:
- Code for email sending moved from cluster_health.py to Java code;
- Added some alerts when health checker fails by own reasons;

The HTML preparation is left the same as it was in Python code. The proposed way to use templates is not implemented here and will not be implemented here as it requires a separate issue.

Test Plan:
Common scenarios:
1. Check that health emails are correctly formed;
2. Check that emails with alerts are sent;
3. Check that occasional errors in cluster_health.py lead to alert emails;

Repeat scenarios 1-3 for different SMTP settings in Customer's Profile, for different states of flags "Only include errors in alert emails", "Send backup failure notification".

Some useful information can be found here:
[[ https://phabricator.dev.yugabyte.com/D7816 | D7816 ]]
[[ https://phabricator.dev.yugabyte.com/D8309 | D8309 ]]
Also test scenarios from [[ https://phabricator.dev.yugabyte.com/D9660 | D9660 ]].

Reviewers: sanketh, daniel
Reviewed By: daniel
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10177
polarweasel pushed a commit to lizayugabyte/yugabyte-db that referenced this issue Mar 9, 2021
Summary:
The problem introduced while implementing yugabyte#6581. Actual fix is very small - a few lines in EmailHelper.java: lines 116-118 + 125 and in line 216
Other lines - junit tests improvements + small logic corrections (according to changes in EmailHelper).

Test Plan:
Test cases from yugabyte#6581 + one specific scenario:
1. New customer
2. SMTP is never configured (see picture from the issue).
3. There is a universe with failures (in health checks, as example).

Check that there are no exceptions in YW logs related to emails sending (with alerts).

Reviewers: daniel, sb-yb
Reviewed By: sb-yb
Subscribers: jenkins-bot, yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10380
polarweasel pushed a commit to lizayugabyte/yugabyte-db that referenced this issue Mar 9, 2021
…in YW logs) yugabyte#6911

Summary:
Actually the problem was existing before the submit made in yugabyte#6581 but it is quite rare case + it had less exceptions in logs (before).

Test Plan:
The scenario is very hard to reproduce without having locally deployed YW (need to look logs at runtime).
1. Take normal universe without any active alerts;
2. Make something that will lead to health check failure;
3. Wait while health check is started for this universe (it is easier to do this having logs running in another terminal window);
4. Delete the universe right after the step 3 is started.
5. Observe logs after the health check is finished for some time (1-2 minutes are enough).

Expected result: There are no continuously appearing exceptions in logs (related to an alert with the disappeared universe). It can be one exception at the beginning of the observations but not more (or no exceptions at all).
Actual result: There are exceptions appearing each minute.

Reviewers: daniel
Reviewed By: daniel
Subscribers: yugaware
Differential Revision: https://phabricator.dev.yugabyte.com/D10390
Verified on 2.4.2.
Currently, if the health check service itself fails for some reason, the user's only indication is that health checks stop incrementing in the yugaware UI.
Ideally, an email would be sent when the health check service itself fails.
As an example, edit the "cluster_health.py" script and add a line that causes Python to fail.
The health check will then fail silently.
Obviously, the problem in this example is an invalid Python file, but the concern is that the health check could fail in several ways without triggering an email alert.
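
To make the request concrete, here is a minimal Python sketch of the behavior being asked for: the caller distinguishes "the health check script itself did not run" from "the script ran and reported problems", and raises an alert in the first case. It is only an illustration, not how the platform invokes cluster_health.py; `CheckOutcome`, `run_health_script`, `check_and_alert`, and `send_alert` are hypothetical names, and a non-zero exit code is assumed to mean the script itself failed.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckOutcome:
    ran_ok: bool  # True if the script itself ran to completion
    detail: str   # stdout on success, failure reason otherwise


def run_health_script(argv: List[str], timeout_s: float = 60.0) -> CheckOutcome:
    """Run the health check command and report whether it ran at all."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return CheckOutcome(False, f"timed out after {timeout_s}s")
    except OSError as exc:
        return CheckOutcome(False, f"could not start: {exc}")
    if proc.returncode != 0:
        # Assumption: non-zero exit means the checker itself broke
        # (for example, a syntax error in the script).
        reason = proc.stderr.strip() or f"exit code {proc.returncode}"
        return CheckOutcome(False, reason)
    return CheckOutcome(True, proc.stdout)


def check_and_alert(
    argv: List[str], send_alert: Callable[[str], None]
) -> CheckOutcome:
    """Run the check and alert when the checker itself failed."""
    outcome = run_health_script(argv)
    if not outcome.ran_ok:
        send_alert("Health check did not run: " + outcome.detail)
    return outcome


if __name__ == "__main__":
    # Simulate an invalid script, as in the example from the issue.
    broken = [sys.executable, "-c", "this is not valid python"]
    check_and_alert(broken, send_alert=print)
```

With this shape, an invalid script no longer fails silently: the interpreter's error text ends up in the alert message, and the same path covers timeouts and a missing interpreter.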