[Platform] Health Check should provide some alert if it isn't running #6581

Closed
tylarb opened this issue Dec 8, 2020 · 2 comments

tylarb commented Dec 8, 2020

Currently, if the health check service fails for some reason, the only sign the user gets is that health checks stop incrementing in the yugaware UI.

Ideally, an email would be sent when the health check service itself fails.

As an example, edit the "cluster_health.py" script and add a line that causes Python to fail, such as

testing = not_a_variable

The health check will then fail silently.
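
To illustrate the failure mode: the line above raises a NameError at runtime, so the script exits with a non-zero status and a traceback on stderr. The following is a minimal sketch, not Yugabyte Platform code; the helper `run_script` and the one-line stand-in script are hypothetical. It shows that a caller which inspects the exit code can notice the failure instead of ignoring it.

```python
import os
import subprocess
import sys
import tempfile


def run_script(source: str) -> subprocess.CompletedProcess:
    """Write `source` to a temporary .py file and run it in a child interpreter."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        return subprocess.run([sys.executable, path], capture_output=True, text=True)
    finally:
        os.remove(path)


# Hypothetical stand-in for a broken cluster_health.py: the name is never defined.
result = run_script("testing = not_a_variable\n")

# A non-zero exit code plus the traceback on stderr is what the caller can act on.
script_failed = result.returncode != 0
failure_reason = result.stderr.strip().splitlines()[-1] if script_failed else ""
```

Here `failure_reason` holds the last line of the traceback, which is the kind of detail an alert email about the health check itself could include.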

Obviously, the problem above is just an invalid Python file, but the concern is that the health check could fail in several ways without triggering an email alert.

@tylarb tylarb added the area/platform Yugabyte Platform label Dec 8, 2020
@streddy-yb streddy-yb added this to the 2.5.x milestone Dec 10, 2020
@streddy-yb

A combination of an email alert and an indication in the UI that there is a problem with health checks would be ideal. cc @chirag-yb

SergeyPotachev added a commit that referenced this issue Jan 15, 2021
…6581

Summary:
- Moved the email-sending code from cluster_health.py to Java code;
- Added alerts for when the health checker itself fails;

The HTML preparation is unchanged from the Python code. The proposed template-based approach is not implemented here, because it requires a separate issue.
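
The idea of alerting when the checker itself fails can be sketched as follows. This is an illustration in Python under stated assumptions, not the Java code from this commit: `HealthCheckOutcome`, `run_health_check`, the two sample checkers, and the JSON report shape are all hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HealthCheckOutcome:
    report: Optional[dict]  # parsed report when the checker ran successfully
    alert: Optional[str]    # alert text when the checker itself failed


def run_health_check(run_checker: Callable[[], str]) -> HealthCheckOutcome:
    """Run the checker and turn any failure of the checker into an alert."""
    try:
        raw_output = run_checker()
        report = json.loads(raw_output)
    except Exception as exc:  # the checker crashed or printed unparseable output
        return HealthCheckOutcome(report=None, alert=f"Health check did not run: {exc!r}")
    return HealthCheckOutcome(report=report, alert=None)


def broken_checker() -> str:
    # Hypothetical checker that fails the same way as the example in the issue.
    raise NameError("name 'not_a_variable' is not defined")


def healthy_checker() -> str:
    # Hypothetical checker output; the real report format is not shown in this issue.
    return json.dumps({"has_error": False, "data": []})
```

Because the caller owns both the report and the alert, a crash in the checker produces an alert instead of producing nothing.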

Test Plan:
Common scenarios:
1. Check that health emails are correctly formed;
2. Check that emails with alerts are sent;
3. Check that occasional errors in cluster_health.py lead to alert emails;

Repeat scenarios 1-3 for different SMTP settings in Customer's Profile, for different states of flags "Only include errors in alert emails", "Send backup failure notification".
Some useful information can be found here:
[[ https://phabricator.dev.yugabyte.com/D7816 | D7816 ]]
[[ https://phabricator.dev.yugabyte.com/D8309 | D8309 ]]

Also test scenarios from [[ https://phabricator.dev.yugabyte.com/D9660 | D9660 ]].

Reviewers: sanketh, daniel

Reviewed By: daniel

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D10177
SergeyPotachev added a commit that referenced this issue Jan 17, 2021
Summary:
The problem was introduced while implementing #6581.
The actual fix is very small, a few lines in EmailHelper.java:
  lines 116-118, 125, and 216

The other changes are JUnit test improvements and small logic corrections (to match the changes in EmailHelper).

Test Plan:
Test cases from #6581 + one specific scenario:
   1. New customer
   2. SMTP is never configured (see picture from the issue).
   3. There is a universe with failures (in health checks, for example).
Check that there are no exceptions in the YW logs related to sending alert emails.
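
The commit message does not show the fix itself. One common shape for handling this scenario is a guard that skips delivery when nothing is configured. The sketch below is an assumption for illustration only: `SmtpSettings`, `send_alert_email`, and the injected `deliver` callable are hypothetical and do not reflect the contents of EmailHelper.java.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SmtpSettings:
    host: str
    port: int = 25


def send_alert_email(
    smtp: Optional[SmtpSettings],
    recipients: List[str],
    body: str,
    deliver: Callable[[SmtpSettings, List[str], str], None],
) -> bool:
    """Return True if delivery was attempted, False if it was skipped."""
    if smtp is None or not recipients:
        # Nothing is configured yet: skip quietly instead of raising.
        return False
    deliver(smtp, recipients, body)
    return True
```

Returning a boolean lets the caller log a single "email skipped" message for a new customer without SMTP settings, instead of logging an exception on every health check.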

Reviewers: daniel, sb-yb

Reviewed By: sb-yb

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D10380
SergeyPotachev added a commit that referenced this issue Jan 19, 2021
…in YW logs) #6911

Summary: The problem actually existed before the commit for #6581, but it is a rare case and it previously produced fewer exceptions in the logs.

Test Plan:
The scenario is very hard to reproduce without a locally deployed YW (you need to watch the logs at runtime).

1. Take a normal universe without any active alerts;
2. Do something that will lead to a health check failure;
3. Wait until the health check starts for this universe (this is easier with the logs running in another terminal window);
4. Delete the universe right after step 3 begins.
5. After the health check finishes, observe the logs for some time (1-2 minutes is enough).

Expected result:
  There are no continuously recurring exceptions in the logs (related to an alert for the deleted universe). There may be one exception at the beginning of the observation, but no more (or no exceptions at all).

Actual result:
  Exceptions appear every minute.
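
The expected behavior can be sketched as an alert-processing pass that discards alerts whose universe no longer exists, so a stale alert cannot fail again on every cycle. This is a minimal illustration, not the YW implementation; `Alert`, `process_alerts`, and the id-to-name dictionary are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    universe_id: str
    message: str


def process_alerts(alerts: List[Alert], universes: Dict[str, str]) -> List[str]:
    """Return email lines for alerts whose universe exists; drop stale alerts in place."""
    lines = []
    for alert in list(alerts):  # iterate over a copy so removal is safe
        name = universes.get(alert.universe_id)
        if name is None:
            # The universe was deleted: discard the alert instead of retrying it.
            alerts.remove(alert)
            continue
        lines.append(f"{name}: {alert.message}")
    return lines
```

After the first pass the stale alert is gone, so later passes over the same list do not encounter the deleted universe again.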

Reviewers: daniel

Reviewed By: daniel

Subscribers: yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D10390
@streddy-yb streddy-yb modified the milestones: 2.5.x, 2.4.1.0 Jan 22, 2021
SergeyPotachev added a commit that referenced this issue Feb 1, 2021
… isn't running #6581

Summary:
- Moved the email-sending code from cluster_health.py to Java code;
- Added alerts for when the health checker itself fails;

The HTML preparation is unchanged from the Python code. The proposed
template-based approach is not implemented here, because it requires a
separate issue.

Original diffs:
 #6581 - https://phabricator.dev.yugabyte.com/D10177
 #6901 - https://phabricator.dev.yugabyte.com/D10380
 #6911 - https://phabricator.dev.yugabyte.com/D10390

Test Plan:
Jenkins: rebase: 2.4

Common scenarios:
1. Check that health emails are correctly formed;
2. Check that emails with alerts are sent;
3. Check that occasional errors in cluster_health.py lead to alert
emails;

Repeat steps 1-3 for different SMTP settings in Customer's Profile, for
different states of flags "Only include errors in alert emails", "Send
backup failure notification".
Some useful information can be found here:
[[ https://phabricator.dev.yugabyte.com/D7816 | D7816 ]]
[[ https://phabricator.dev.yugabyte.com/D8309 | D8309 ]]

Also test scenarios from [[ https://phabricator.dev.yugabyte.com/D9660 |
D9660 ]].

-------------------------------------------------
Additional scenario for #6901:
   1. New customer
   2. SMTP is never configured (see picture from the issue).
   3. There is a universe with failures (in health checks, for example).
Check that there are no exceptions in the YW logs related to sending
alert emails.

-------------------------------------------------
Additional scenario for #6911:
It is very hard to reproduce without a locally deployed YW (you need to
watch the logs at runtime).

1. Take a normal universe without any active alerts;
2. Do something that will lead to a health check failure;
3. Wait until the health check starts for this universe (this is easier
with the logs running in another terminal window);
4. Delete the universe right after step 3 begins.
5. After the health check finishes, observe the logs for some time (1-2
minutes is enough).

Expected result:
  There are no continuously recurring exceptions in the logs (related to
an alert for the deleted universe). There may be one exception at the
beginning of the observation, but no more (or no exceptions at all).

Actual result:
  Exceptions appear every minute.
-------------------------------------------------

Reviewers: daniel

Reviewed By: daniel

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D10496
polarweasel pushed a commit to lizayugabyte/yugabyte-db that referenced this issue Mar 9, 2021
…ugabyte#6581
polarweasel pushed a commit to lizayugabyte/yugabyte-db that referenced this issue Mar 9, 2021
polarweasel pushed a commit to lizayugabyte/yugabyte-db that referenced this issue Mar 9, 2021
…in YW logs) yugabyte#6911

@SergeyPotachev

Verified on 2.4.2.
