[Platform][Master] Alerting and Notification Improvements #8212

Open
SergeyPotachev opened this issue Apr 28, 2021 · 0 comments
Labels: area/platform (Yugabyte Platform), kind/enhancement (This is an enhancement of an existing feature)

Motivation

  • Real-time database alerts based on a user alert policy: Users can set alert policies based on their universe performance metrics. An alert policy notifies them when a performance metric rises above or falls below a threshold they set.
  • Out-of-the-box (OOTB) intelligent database health checks and default alerts: YB Platform will provide intelligent OOTB health checks and alerts, both when something goes wrong and when it predicts that something may go wrong in the future, so users can stay ahead of issues that may arise.
  • Forward notifications to third-party centralized notification systems: Alert notifications can integrate with a customer's choice of centralized notification system so they get a 360-degree view of their entire application stack. To start with, we will allow forwarding notifications to SMTP destinations, and then integrate with other systems: Slack, PagerDuty, and webhooks.
  • Build your own alerting: Allow forwarding and scraping metrics from Prometheus.
  • Allow interacting with the alerting stack programmatically via APIs: While YB Platform will provide advanced alerting and notifications via the UI, customers can also interact with the stack via APIs so that their Ops teams (with minimal knowledge of YB Platform) can turn alerts on or off during maintenance windows.
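
The threshold-based alert policy described in the first bullet can be sketched as follows. This is only an illustration of the idea, not YB Platform's implementation: `AlertPolicy`, `Condition`, `firing_policies`, and the metric names are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Condition(Enum):
    ABOVE = "above"
    BELOW = "below"


@dataclass(frozen=True)
class AlertPolicy:
    """Hypothetical user-defined policy: fire when a metric crosses a threshold."""
    metric: str
    condition: Condition
    threshold: float

    def is_firing(self, value: float) -> bool:
        # Strict comparison: a value exactly at the threshold does not fire.
        if self.condition is Condition.ABOVE:
            return value > self.threshold
        return value < self.threshold


def firing_policies(policies, samples):
    """Return the policies whose metric has a sample that violates the threshold."""
    return [
        p for p in policies
        if p.metric in samples and p.is_firing(samples[p.metric])
    ]


# Hypothetical metric names and values, for illustration only.
policies = [
    AlertPolicy("cpu_usage_percent", Condition.ABOVE, 90.0),
    AlertPolicy("free_disk_gb", Condition.BELOW, 10.0),
]
samples = {"cpu_usage_percent": 95.5, "free_disk_gb": 42.0}

for policy in firing_policies(policies, samples):
    print(f"ALERT: {policy.metric} is {policy.condition.value} {policy.threshold}")
```

With the sample values above, only the CPU policy fires; metrics with no current sample are skipped rather than treated as violations.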

Phase 1

Issues for tracking new functionality:

| Status | Subtask | GitHub Issue |
|---|---|---|
|  | [Platform] Update Prometheus to a more recent version | #8209 |
|  | [Platform] Alerts configuration per a universe basis (notification channels) | #8211 |
|  | [Platform] Change alert and alert definition to better match Prometheus data model | #8281 |
|  | [Platform] Implement prometheus alerting rules config generation | #8282 |
|  | [Platform] Implement retrieval of active alerts from prometheus | #8283 |
|  | [Platform] Move alert definition threshold from runtime config to alert definition | #8457 |
|  | [Platform] Use alert definitions for all the alerts | #8458 |
|  | [Platform] Modify alert definition controller to allow configuring custom metric | #8459 |
|  | [Platform] Get rid of raw json in AlertReceiver | #8830 |
|  | [Platform] Add swagger annotations to new functions in AlertController | #8831 |
|  | [Platform] Alerts control over API | #9053 |
|  | [Platform][Alerts] To implement AlertReceivers group entity | #9054 |
|  | [Platform] Implement Alert definition group entity and API | #9055 |
|  | [Platform] Implement Alert severities for Alert and Alert Definition | #9056 |
| ⬜️ | [Platform] Ability to send test alert for alert definition group | #9058 |
|  | [Platform][Alerts] Alert Acknowledgement functionality | #9059 |
|  | [Platform][Alerts] Default alert route functionality | #9120 |
|  | [Platform][Alerts] Repeat notifications after previous failures + code cleanup | #9198 |
|  | [Platform][Alerts] To implement Slack notifications | #9337 |
|  | [Platform][Alerts] Add sort and filter fields to alert controller | #9426 |
| ⬜️ | [Platform] Implement base node alerts | #9406 |
| ⬜️ | [Platform] Implement base YSQL/YCQL alerts | #9407 |
| ⬜️ | [Platform] Implement platform backed alerts | #9408 |
| ⬜️ | [Platform] Implement table/tablet alerts | #9409 |
| ⬜️ | [Platform] Implement K8S alerts | #9410 |
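
As a rough illustration of what the alerting rules config generation subtask (#8282) involves, the sketch below renders a list of alert definitions into the rule file layout used by Prometheus 2.x (`groups`, `rules`, `alert`, `expr`, `for`, `labels`, `annotations`). It is not the Platform's code: `AlertDefinition`, its fields, `render_rule_file`, the group name, and the metric `example_clock_skew_ms` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    expr: str       # PromQL expression with the threshold already substituted
    duration: str   # how long the condition must hold, e.g. "5m"
    severity: str   # e.g. "warning" or "severe"
    summary: str


def quote(value: str) -> str:
    """Render a YAML double-quoted scalar, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def render_rule_file(group_name, definitions):
    """Build the text of a Prometheus rule file containing one rule group."""
    lines = ["groups:", f"  - name: {quote(group_name)}", "    rules:"]
    for d in definitions:
        lines += [
            f"      - alert: {quote(d.name)}",
            f"        expr: {quote(d.expr)}",
            f"        for: {d.duration}",
            "        labels:",
            f"          severity: {quote(d.severity)}",
            "        annotations:",
            f"          summary: {quote(d.summary)}",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical definition; the metric name and threshold are made up.
definitions = [
    AlertDefinition(
        name="ClockSkewHigh",
        expr="max by (node) (example_clock_skew_ms) > 500",
        duration="5m",
        severity="warning",
        summary="Clock skew above 500 ms",
    )
]
print(render_rule_file("example-universe-alerts", definitions))
```

Keeping the threshold inside the definition (as #8457 proposes) means the generated `expr` is self-contained, so regenerating the file is enough to apply a threshold change.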

And issues related to alerts:

| Status | Subtask | GitHub Issue |
|---|---|---|
| ⬜️ | [Platform] yugaware should alert if ysql_max_connections (and equivalent ycql connection k/v) is >= $threshold | #7238 |
| ⬜️ | [platform] Alert in Yugaware if leaders are skewed in cluster | #8665 |
| ⬜️ | [Platform] Alerting Emails should be configurable by Universe | #8491 |
|  | [Platform] Alert spam for message "Clock Skew Alert Resolved" | #8426 |
| ⬜️ | [Platform] Improve alert scalability | #7957 |
| ⬜️ | [Platform] Need to have Alerts for replication Enabled by default | #7792 |

Phase 2

| Status | Subtask | GitHub Issue |
|---|---|---|
| ⬜️ | Suspend alerts during maintenance window |  |
| ⬜️ | User-defined notification frequency (just once, each time, or at most once per interval) |  |
| ⬜️ | Alerts access control: three permission levels for an alert (No Permissions, Can Run, and Can Manage) |  |
| ⬜️ | [Platform][Alerts] Upgrade node_exporter to a newer version | #8524 |
| ⬜️ | [Platform] Snooze Alert generation while Universe operations are in progress | #9057 |
| ⬜️ | [Platform] Metric federation to allow external alerting integration | #9052 |
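
The first two Phase 2 items (suspending alerts during a maintenance window, and a notification frequency of once, each time, or at most once per interval) could be combined into a single decision function. The sketch below shows one possible shape; `Frequency`, `MaintenanceWindow`, and `should_notify` are hypothetical names, and the semantics chosen here are assumptions rather than a description of the planned behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional, Sequence


class Frequency(Enum):
    ONCE = "once"                    # notify only the first time
    EACH_TIME = "each_time"          # notify on every evaluation that fires
    AT_MOST_EVERY = "at_most_every"  # notify no more often than `interval`


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime

    def contains(self, moment: datetime) -> bool:
        # Half-open interval: the window covers start but not end.
        return self.start <= moment < self.end


def should_notify(
    now: datetime,
    frequency: Frequency,
    last_sent: Optional[datetime],
    interval: Optional[timedelta] = None,
    windows: Sequence[MaintenanceWindow] = (),
) -> bool:
    """Decide whether a firing alert should produce a notification at `now`."""
    if any(w.contains(now) for w in windows):
        return False  # suspended while a maintenance window is active
    if frequency is Frequency.EACH_TIME or last_sent is None:
        return True
    if frequency is Frequency.ONCE:
        return False  # already notified once
    if interval is None:
        raise ValueError("AT_MOST_EVERY requires an interval")
    return now - last_sent >= interval


now = datetime(2021, 4, 28, 12, 0)
window = MaintenanceWindow(now, now + timedelta(hours=1))
print(should_notify(now, Frequency.EACH_TIME, None, windows=[window]))
```

Checking the maintenance window first means suppression applies regardless of the configured frequency, which matches the goal of letting Ops teams silence alerts during planned work.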

SergeyPotachev added the kind/enhancement and area/platform labels on Apr 28, 2021
SergeyPotachev self-assigned this on Apr 28, 2021
streddy-yb added this to the 2.7.x milestone on Apr 29, 2021
ymahajan changed the title from "[Platform][Master] Alerting Improvements" to "[Platform][Master] Alerting and Notification Improvements" on Jun 24, 2021