[Stack Monitoring] create alert per node, index, or cluster instead of always per cluster #102544

Conversation


@neptunian neptunian commented Jun 17, 2021

Resolves #100136. This is part of the Stack Monitoring alerting alignment effort: we want to create each alert where it actually occurred (node, index, or cluster) rather than always at the cluster level. This also means the alert IDs change to reflect the new scope (see the sketch after the list below).

  • Creates an alert per node or per index for some rules instead of always per cluster.
  • Action message template changes for each rule type.
  • Alert messaging changes to reflect the new context (per node, per index).
  • Bug fixes for alerts.
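
As a rough illustration of the new scoping (not the actual Kibana code; the type and helper here are made up for the sketch), the alert instance ID now depends on what the rule targets, following the formats described later in this description:

```typescript
// Illustrative sketch only — the type and helper are hypothetical, not this PR's implementation.
type RuleTarget =
  | { scope: 'node'; nodeId: string } // CPU, disk, JVM memory, missing data, threadpool rejections
  | { scope: 'index'; clusterUuid: string; shardIndex: string } // shard size
  | { scope: 'ccr'; remoteCluster: string; followerIndex: string } // CCR read exceptions
  | { scope: 'cluster'; clusterUuid: string }; // cluster health, version mismatch, nodes changed, license expiration

function alertInstanceId(target: RuleTarget): string {
  switch (target.scope) {
    case 'node':
      return target.nodeId; // just the node id (a UUID)
    case 'index':
      return `${target.clusterUuid}:${target.shardIndex}`; // the instance_id returned from Elasticsearch
    case 'ccr':
      // The PR only states that the remote cluster is part of the id; the exact component order here is an assumption.
      return `${target.remoteCluster}:${target.followerIndex}`;
    case 'cluster':
      return target.clusterUuid; // no trailing colon anymore
  }
}
```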

Changes

Bug fix for Threadpool alerts: I don't think the Threadpool search and write rejection alerts are working in production. Testing on master gives errors due to the wrong parameters being passed into executeActions. This was probably never caught because it has never been checked off in any test scenario, since testers have no easy way to trigger the alert. This PR resolves it.

(screenshots)

Bug fix for Threadpool alerts UI message: Threadpool rejection counts were not showing up in the UI message. Before/after:

(before/after screenshots)

CPU Usage Alert Messaging

internalShortMessage, before and after. The first alert in the screenshot is the old message (where, with the new alerting changes, there would only ever be one node in the list); the last couple of lines are the new message.
(screenshot)

internalFullMessage, before and after. The top messages in the screenshot are the old messages (again, only one node would ever appear in the list with the new alerting changes); the bottom messages are the new ones. Note that the link was changed to a global state link, like other alerts, instead of taking the user to the node view.
(screenshot)

Remove the template variables nodes and count. They will still be sent "behind the scenes" for users already relying on them, where count always equals 1 and nodes always contains a single node.
(screenshots)

Add the template variable node, which functions the same as nodes but only ever contains one node in its value, e.g. mynodeId:cpuUsage.
(screenshot)
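
A hedged sketch of how an action's message parameter might reference the new variable (the connector id is made up, and the exact {{context.*}} paths are an assumption rather than something stated in this PR):

```typescript
// Hedged sketch only: the connector id is hypothetical and the {{context.*}} variable
// paths are assumed, not confirmed by this PR.
const cpuUsageRuleAction = {
  id: 'my-server-log-connector', // hypothetical connector
  group: 'default',
  params: {
    message:
      'CPU usage alert for node {{context.node}} in cluster {{context.clusterName}}. ' +
      // The removed variables still work behind the scenes: count is always 1
      // and nodes contains a single node.
      'Legacy form: {{context.count}} node(s): {{context.nodes}}',
  },
};
```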

Disk Usage Alert Messaging

All the same changes as CPU Usage rule above, but replace "CPU usage" with "Disk usage"

Eg of internalShortMessage, before and after
(screenshot)

Eg of internalFullMessage, before and after
(screenshot)

Memory Usage (JVM) Alert Messaging

All the same changes as CPU Usage rule above, but replace "CPU usage" with "Memory usage"

Eg of internalShortMessage, before and after
(screenshot)

Eg of internalFullMessage, before and after
(screenshot)

Missing Monitoring Data Alert Messaging

All the same changes as CPU Usage rule above, but replace "CPU usage" with "Missing monitoring data"

Eg of internalShortMessage, before and after

(screenshot)

Eg of internalFullMessage, before and after

(screenshot)

Thread pool Search Rejections Alert Messaging

All the same changes as CPU Usage rule above, but replace "CPU usage" with "threadpool search rejections"

Eg of internalShortMessage, before and after

(screenshot)

Eg of internalFullMessage, before and after

(screenshot)

Thread pool Write Rejections Alert Messaging

All the same changes as the Thread pool Search Rejections rule above.

internalShortMessage

(screenshot)

internalFullMessage

(screenshot)

Shard Size Alert ID and Messaging change

This alert becomes a per index alert:

Before
(screenshot)

Now the alert id is the instance_id returned from Elasticsearch: clusterId:shardIndex.
(screenshot)

Messaging changes:
internalShortMessage
Before:
Large shard size alert is firing for the following indices: apm-8.0.0-error-000001, apm-8.0.0-metric-000001, apm-8.0.0-onboarding-2021.06.22, apm-8.0.0-profile-000001, apm-8.0.0-span-000001, apm-8.0.0-transaction-000001, metricbeat-8.0.0-2021.06.24-000001, twitter. Investigate indices with large shard sizes.

After:
Large shard size alert is firing for the following index: apm-8.0.0-onboarding-2021.06.22. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: apm-8.0.0-metric-000001. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: apm-8.0.0-error-000001. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: apm-8.0.0-span-000001. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: twitter. Investigate indices with large shard sizes.

internalFullMessage
Before:
Large shard size alert is firing for the following indices: apm-8.0.0-error-000001, apm-8.0.0-metric-000001, apm-8.0.0-onboarding-2021.06.22, apm-8.0.0-profile-000001, apm-8.0.0-span-000001, apm-8.0.0-transaction-000001, metricbeat-8.0.0-2021.06.24-000001, twitter. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))

After:
Large shard size alert is firing for the following index: apm-8.0.0-error-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/apm-8.0.0-error-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))

Large shard size alert is firing for the following index: apm-8.0.0-transaction-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/apm-8.0.0-transaction-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))

Large shard size alert is firing for the following index: metricbeat-8.0.0-2021.06.24-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/metricbeat-8.0.0-2021.06.24-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))

Large shard size alert is firing for the following index: apm-8.0.0-metric-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/apm-8.0.0-metric-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))

CCR Read Rejections alert ID and Messaging change

This alert becomes a per index alert.

Before/after (the bottom ID is the old id; the top two are the new ids). Since an index could have the same name in different remote clusters, we include the remote cluster in the ID as well. We do not use the "leader" cluster because, as discussed with @jasonrhodes, the alerting framework does not work across clusters.
(screenshot)

Messaging changes:
internalShortMessage

Before:
CCR read exceptions alert is firing for the following remote clusters: remoteCluster1, remoteCluster2. Verify follower and leader index relationships across the affected remote clusters.
After:
CCR read exceptions alert is firing for the following remote cluster: remoteCluster2. Verify follower and leader index relationships on the affected remote cluster.

CCR read exceptions alert is firing for the following remote cluster: remoteCluster1. Verify follower and leader index relationships on the affected remote cluster.

internalFullMessage

Before:
CCR read exceptions alert is firing for the following remote clusters: remoteCluster1, remoteCluster2. Current 'follower_index' indices are affected: followerIndex, followerIndex2. [View CCR stats](http://localhost:5601/ssg/app/monitoring#/elasticsearch/ccr?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ,ccs:ccs))
After:
CCR read exceptions alert is firing for the following remote cluster: remoteCluster1. Current 'follower_index' index affected: followerIndex. [View CCR stats](http://localhost:5601/ssg/app/monitoring#/elasticsearch/ccr?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ,ccs:ccs))
CCR read exceptions alert is firing for the following remote cluster: remoteCluster2. Current 'follower_index' index affected: followerIndex2. [View CCR stats](http://localhost:5601/ssg/app/monitoring#/elasticsearch/ccr?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ,ccs:ccs))

Elasticsearch cluster health, Elasticsearch version mismatch, Kibana version mismatch, Logstash version mismatch, Elasticsearch nodes changed, Elasticsearch license expiration

These alerts remain cluster level and the alert id is the cluster id, as before, minus a colon that used to be at the end of the id. Everything should function and look the same.

Test

  • The UI should look as it did previously. Previously, although we created only one alert, we still displayed it to the user as multiple alerts: for example, one alert firing for a cluster with 4 nodes was displayed as 4 alerts in the UI, listing each node the alert applies to. Now we actually do have 4 alerts, one per node, and each node should still be listed as an alert. Whether looking at an old alert or a new per-node alert, the API fetch handles them the same, so no backward-compatibility changes are necessary.

(screenshot)

  • When viewing the alerts in Stack Management, you should see alerts per node (with the id being the node id) and not per cluster. In this example, the active alerts are the new alerts, and the old alert with the old ID is no longer active.

(screenshot)

Helpful for generating the alerts:
#87377. For the Threadpool and CCR alerts I still need to find an easier way to reproduce them, but here are some notes: #93072 (comment), #85908

@neptunian neptunian added the Team:Infra Monitoring UI, v8.0.0, and v7.14.0 labels and removed the v8.0.0 label Jun 17, 2021
@neptunian neptunian changed the title [Monitoring] create alert per node instead of per cluster [Stack Monitoring] create alert per node instead of per cluster Jun 17, 2021
@neptunian neptunian force-pushed the 100136-alerts-per-node-change-stack-monitoring branch from c2ac780 to 27841f0 June 21, 2021 19:48
@neptunian neptunian force-pushed the 100136-alerts-per-node-change-stack-monitoring branch from e08158e to 65fe38b June 23, 2021 18:40
@neptunian neptunian marked this pull request as ready for review June 24, 2021 21:26
@neptunian neptunian requested a review from a team June 24, 2021 21:26
@elasticmachine

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@neptunian

@elasticmachine merge upstream

const key = this.alertOptions.accessorKey;

// for each node, update the alert's state with node state

Here is where I've moved the triggering of the new alert into the inner for loop that loops through each node, instead of in the cluster for loop.
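
In code terms, the pattern described here looks roughly like this (a simplified, self-contained sketch; the types and helper names are stand-ins, not the actual rule executor):

```typescript
// Simplified sketch of the pattern described above — not the actual monitoring rule executor.
// Types are illustrative stand-ins for the real internals.
interface NodeData {
  meta: { nodeId?: string; instanceId?: string };
  shouldFire: boolean;
}
interface ClusterData {
  clusterUuid: string;
  nodes: NodeData[];
}
interface RuleServices {
  alertInstanceFactory: (id: string) => {
    scheduleActions: (group: string, context: Record<string, unknown>) => void;
  };
}

function processClusters(clusters: ClusterData[], services: RuleServices) {
  for (const cluster of clusters) {
    for (const node of cluster.nodes) {
      // Before: one alert instance was created per cluster, outside this inner loop.
      // Now: one alert instance per node, keyed by the node id.
      const instance = services.alertInstanceFactory(
        node.meta.nodeId || node.meta.instanceId || cluster.clusterUuid
      );
      if (node.shouldFire) {
        instance.scheduleActions('default', { node: node.meta.nodeId });
      }
    }
  }
}
```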

@simianhacker simianhacker removed their request for review June 29, 2021 13:56
- const instance = services.alertInstanceFactory(node.meta.nodeId || node.meta.instanceId);
+ // quick fix for now so that non node level alerts will use the cluster id
+ const instance = services.alertInstanceFactory(
+   node.meta.nodeId || node.meta.instanceId || cluster.clusterUuid
+ );

Some alerts are cluster level and do not have a nodeId or instanceId property, so we fall back to the clusterUuid. This should be typed and improved at some point, or we should have a different function for cluster-level vs node-level processing.
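
Purely as an illustration of that follow-up idea (these helpers are hypothetical and not part of this PR), splitting the id resolution per level would make the missing-field case explicit instead of relying on the fallback chain:

```typescript
// Hypothetical follow-up sketch, not part of this PR.
function nodeInstanceId(node: { meta: { nodeId?: string; instanceId?: string } }): string {
  const id = node.meta.nodeId ?? node.meta.instanceId;
  if (!id) {
    throw new Error('Expected a node-level alert to carry a nodeId or instanceId');
  }
  return id;
}

function clusterInstanceId(cluster: { clusterUuid: string }): string {
  return cluster.clusterUuid;
}
```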

protected executeActions(
  instance: AlertInstance,
- alertStates: AlertThreadPoolRejectionsState[],
+ { alertStates }: { alertStates: AlertState[] },
  item: AlertData | null,

This was getting the wrong data and was not able to destructure the values further down, resulting in a bug for the Threadpool alerts.
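
To make the mismatch concrete, here is a minimal sketch of the shape of the bug, with simplified, hypothetical names (the real signatures live in the base rule and the threadpool rule classes):

```typescript
// Minimal sketch of the bug, with simplified/hypothetical names.
interface AlertState {
  count?: number;
}

// The caller passes an object wrapping the states:
function fireActions(executeActions: (payload: { alertStates: AlertState[] }) => void) {
  const alertStates: AlertState[] = [{ count: 1 }];
  executeActions({ alertStates });
}

// Before: the threadpool override effectively typed this parameter as a bare array,
// so downstream code received the wrapper object where it expected an array and
// could not destructure the values.
// After (as in the diff above): the parameter is destructured from the wrapper,
// matching what the caller actually passes.
const fixedExecuteActions = ({ alertStates }: { alertStates: AlertState[] }) => {
  console.log(alertStates.length); // now a real array
};

fireActions(fixedExecuteActions);
```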

@estermv estermv self-requested a review June 29, 2021 14:29

@estermv estermv left a comment


LGTM 🚀

I couldn't fire all the alerts, but the code looks good.

@neptunian neptunian merged commit a7b5d35 into elastic:master Jun 29, 2021
neptunian added a commit to neptunian/kibana that referenced this pull request Jun 29, 2021
…tic#102544)

* create alert per node instead of per cluster

* add comment

* fix test, replace alert state with empty array with no node is firing

* update cpu usage action messaging

* fix internationalization

* update disk usage rule action messaging

* update memory usage rule action messaging

* update other action messaging

* update missing monitoring data alert action messaging

* remove comment

* fix bug where threadpool alerts were not firing

* fix bug with threadpool rejections and update alert action messaging to be per node

* update comments

* unit test for thread pool write rejections alert

* update messaging for CCR read rejection

* fix cluster level alerts to use the cluster id when its not node level

* add more tests to nodes changed alert

* update default message

* update alert messaging for large shard size

* update default messaging

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@ravikesarwani

@neptunian If I understand this correctly we have changed the alert ID to be just the node ID.

When viewing the alerts in Stack Management, you should see alerts per node (with the id being the node id) and not per cluster. In this example, the active alerts are the new alerts, and the old alert with the old ID is no longer active.

In most customer use cases, a monitoring cluster is used to consolidate monitoring data from many production clusters. Each production cluster can have an instance named, say, "instance-0000000001". If the alert id is only the node ID, it may not be unique. Do we need "clusterId:nodeID" to make the alert ID unique?

@neptunian

@ravikesarwani: "instance-0000000001" is the node name, but the node id should be unique. You can see the ID used, bMihNvXbR7iUs4A2U-s-UQ, here in the URL:
(screenshot)

@jasonrhodes Do you see any issue with just using the node ID?

@jasonrhodes

I actually don't know what generates the node IDs but I imagine it's a UUID as well? If so, it should be safe/unique.

neptunian added a commit that referenced this pull request Jun 29, 2021
(#103719)

@simianhacker

@jasonrhodes @ravikesarwani I believe the Alerts team said that those ids only need to be unique per alert.

@jasonrhodes

@simianhacker yes, that's correct. I think Ravi is wondering whether, if you have multiple nodes whose CPU is spiking at the same time across different clusters, those nodes could each have the same node ID. It would all be within the one rule, but if those node IDs aren't unique, there could theoretically be a collision. I assume they're UUIDs, but I'm not sure; it'd be nice to confirm that.

@neptunian

We are using node_stats.node_id, which matches source_node.uuid (poIczYSkR-aSlh3jXwK6QQ in this example), so I think it's safe to assume it's a UUID.
(screenshot)

@neptunian neptunian changed the title [Stack Monitoring] create alert per node instead of per cluster [Stack Monitoring] create alert per node, index, or cluster instead of always per cluster Jul 6, 2021
@kibanamachine

💚 Build Succeeded

Metrics: unchanged



Successfully merging this pull request may close these issues.

[Stack Monitoring] Change SM rule types to generate discrete alert instances per node