[Stack Monitoring] create alert per node, index, or cluster instead of always per cluster #102544
Conversation
Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)
@elasticmachine merge upstream
const key = this.alertOptions.accessorKey;

// for each node, update the alert's state with node state
Here is where I've moved the triggering of the new alert into the inner loop that iterates over each node, instead of in the outer cluster loop.
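A minimal sketch of the restructuring this comment describes, using simplified stand-in types rather than the real Kibana interfaces (`NodeState`, `ClusterState`, and `AlertServicesLike` below are illustrative assumptions, not the PR's actual code):

```ts
interface NodeState {
  meta: { nodeId?: string; instanceId?: string };
  shouldFire: boolean;
}

interface ClusterState {
  clusterUuid: string;
  nodes: NodeState[];
}

interface AlertServicesLike {
  alertInstanceFactory(id: string): {
    scheduleActions(group: string, context?: Record<string, unknown>): void;
  };
}

function processClusters(clusters: ClusterState[], services: AlertServicesLike) {
  for (const cluster of clusters) {
    // Previously a single alert instance was created at this level, keyed by cluster.clusterUuid.
    for (const node of cluster.nodes) {
      if (!node.shouldFire) continue;
      // Now each firing node gets its own alert instance, keyed by its node (or instance) id,
      // falling back to the cluster UUID for cluster-level alerts.
      const instance = services.alertInstanceFactory(
        node.meta.nodeId || node.meta.instanceId || cluster.clusterUuid
      );
      instance.scheduleActions('default', { clusterUuid: cluster.clusterUuid });
    }
  }
}
```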
- const instance = services.alertInstanceFactory(node.meta.nodeId || node.meta.instanceId);
+ // quick fix for now so that non node level alerts will use the cluster id
+ const instance = services.alertInstanceFactory(
+   node.meta.nodeId || node.meta.instanceId || cluster.clusterUuid
Some alerts are cluster level and do not have a nodeId or instanceId property, so we use the clusterUuid. This should be typed and improved at some point, or there should be a different function for cluster vs node level processing.
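A hypothetical sketch of the typed follow-up this comment suggests (not part of this PR): make the node-level vs cluster-level distinction explicit instead of relying on optional-property fallbacks. All names below are illustrative.

```ts
// A discriminated union forces each code path to state which level it is handling.
type AlertScope =
  | { kind: 'node'; nodeId: string }
  | { kind: 'cluster'; clusterUuid: string };

function instanceIdFor(scope: AlertScope): string {
  return scope.kind === 'node' ? scope.nodeId : scope.clusterUuid;
}
```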
  protected executeActions(
    instance: AlertInstance,
-   alertStates: AlertThreadPoolRejectionsState[],
+   { alertStates }: { alertStates: AlertState[] },
    item: AlertData | null,
This was getting the wrong data and was not able to destructure the values further down, resulting in a bug for Threadpool alerts.
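A simplified illustration of the bug class described here; these are not the real Kibana signatures or types. The method was typed to take a bare array while the caller actually passed a wrapper object, so destructuring further down failed.

```ts
interface ThreadPoolRejectionsAlertState {
  nodeName?: string;
  rejectionCount?: number;
}

// Before: declared as an array parameter, but invoked at runtime with
// `{ alertStates: [...] }`, so indexing into it yields undefined.
function executeActionsBroken(alertStates: ThreadPoolRejectionsAlertState[]) {
  const { nodeName, rejectionCount } = alertStates[0]; // throws when alertStates[0] is undefined
  return `${nodeName}: ${rejectionCount}`;
}

// After: accept the wrapper shape that is actually passed and destructure it.
function executeActionsFixed({ alertStates }: { alertStates: ThreadPoolRejectionsAlertState[] }) {
  const { nodeName, rejectionCount } = alertStates[0];
  return `${nodeName}: ${rejectionCount}`;
}
```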
LGTM 🚀
I couldn't fire all the alerts, but the code looks good.
…tic#102544)

* create alert per node instead of per cluster
* add comment
* fix test, replace alert state with empty array when no node is firing
* update cpu usage action messaging
* fix internationalization
* update disk usage rule action messaging
* update memory usage rule action messaging
* update other action messaging
* update missing monitoring data alert action messaging
* remove comment
* fix bug where threadpool alerts were not firing
* fix bug with threadpool rejections and update alert action messaging to be per node
* update comments
* unit test for thread pool write rejections alert
* update messaging for CCR read rejection
* fix cluster level alerts to use the cluster id when it's not node level
* add more tests to nodes changed alert
* update default message
* update alert messaging for large shard size
* update default messaging

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@neptunian If I understand this correctly, we have changed the alert ID to be just the node ID.
In most customer use cases a monitoring cluster is used to consolidate monitoring data from many production clusters. Each production cluster can have an instance named, say, "instance-0000000001". If the alert id is only the node ID then it may not be unique. Do we need "clusterid:nodeID" to make the alert ID unique?
@ravikesarwani "instance-0000000001" is the node name, but the node id should be unique. You can see the ID used here in the URL. @jasonrhodes Do you see any issue with just using the node ID?
I actually don't know what generates the node IDs, but I imagine it's a UUID as well? If so, it should be safe/unique.
(#103719): backport of the same change set listed above. Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@jasonrhodes @ravikesarwani I believe the Alerts team said that those ids only need to be unique per alert.
@simianhacker yes, that's correct. I think Ravi is wondering whether, if you have multiple nodes whose CPU is spiking at the same time across different clusters, those nodes could each have the same Node ID. It would all be within the one rule, but if those Node IDs aren't unique, there could theoretically be a collision. I assume they're UUIDs but I'm not sure; it'd be nice to confirm that.
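A sketch of the safeguard being discussed here (not what this PR ships): prefixing the node UUID with the cluster UUID would keep the alert instance id unique even if two monitored clusters ever produced the same node id. Names and values are illustrative only.

```ts
function compositeInstanceId(clusterUuid: string, nodeUuid: string): string {
  return `${clusterUuid}:${nodeUuid}`;
}

// Illustrative values only:
// compositeInstanceId('cluster-a-uuid', 'node-uuid') === 'cluster-a-uuid:node-uuid'
```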
Resolves #100136. This is part of the Stack Monitoring alerting alignment work, where we want to create the alert for the entity where the problem occurred (node, index, or cluster) instead of always for the cluster. This means we also change the alert ids to reflect that.
Changes
Bug fix for Threadpool alerts
I don't think Threadpool search and write rejection alerts are working in production. Testing on master will give you errors due to the wrong parameters being passed into `executeActions`. This was probably not caught because it's never been checked off in any test scenarios due to testers not having an easy way to trigger the alert. This will be resolved in this PR.

Bug fix for Threadpool alerts UI message
Threadpool rejection counts were not showing up in the UI message. Before/after:

CPU Usage Alert Messaging
- `internalShortMessage`, before and after. The first alert is the old message, where with the new alerting changes there would only ever be 1 node in the list. The last couple of lines are the new message.
- `internalFullMessage`, before and after. The top messages are the old messages, where with the new alerting changes there would only ever be 1 node in the list. The last messages are the new messages. Note that the link was changed to a global state link, like other alerts, instead of taking the user to the node view.
- Remove template variables `nodes` and `count`, though they will still continue to be sent "behind the scenes" for users already using them, where `count` always equals 1 and `nodes` is always one node.
- Add template variable `node`, which functions the same as `nodes` but will only have one node in the value, e.g. `mynodeId:cpuUsage` (see the sketch after this list).
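A hedged example of how an action message might reference the new per-node variable, written as a TypeScript constant. It assumes the variable is exposed to actions through the usual `{{context.*}}` Mustache syntax; check the rule's action-variable list in Kibana for the exact names before relying on them.

```ts
// Illustrative action message template using the new single-node variable.
const cpuUsageActionMessage = `
{{context.internalShortMessage}}
Affected node: {{context.node}}
`;
```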
Disk Usage Alert Messaging
All the same changes as CPU Usage rule above, but replace "CPU usage" with "Disk usage"
- Eg of `internalShortMessage`, before and after
- Eg of `internalFullMessage`, before and after

Memory Usage (JVM) Alert Messaging
All the same changes as CPU Usage rule above, but replace "CPU usage" with "Memory usage"
- Eg of `internalShortMessage`, before and after
- Eg of `internalFullMessage`, before and after

Missing Monitoring Data Alert Messaging
All the same changes as CPU Usage rule above, but replace "CPU usage" with "Missing monitoring data"
- Eg of `internalShortMessage`, before and after
- Eg of `internalFullMessage`, before and after

Thread pool Search Rejections Alert Messaging
All the same changes as CPU Usage rule above, but replace "CPU usage" with "threadpool search rejections"
- Eg of `internalShortMessage`, before and after
- Eg of `internalFullMessage`, before and after

Thread pool Write Rejections Alert Messaging
All the same changes as the Thread pool Search Rejections rule above

- `internalShortMessage`
- `internalFullMessage`
Shard Size Alert ID and Messaging change
This alert becomes a per index alert:
Before
Now the alert id is the instance_id returned from Elasticsearch: `clusterId:shardIndex`.

Messaging changes:
`internalShortMessage`
Before:
Large shard size alert is firing for the following indices: apm-8.0.0-error-000001, apm-8.0.0-metric-000001, apm-8.0.0-onboarding-2021.06.22, apm-8.0.0-profile-000001, apm-8.0.0-span-000001, apm-8.0.0-transaction-000001, metricbeat-8.0.0-2021.06.24-000001, twitter. Investigate indices with large shard sizes.
After:
Large shard size alert is firing for the following index: apm-8.0.0-onboarding-2021.06.22. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: apm-8.0.0-metric-000001. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: apm-8.0.0-error-000001. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: apm-8.0.0-span-000001. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: twitter. Investigate indices with large shard sizes.
`internalFullMessage`
Before:
Large shard size alert is firing for the following indices: apm-8.0.0-error-000001, apm-8.0.0-metric-000001, apm-8.0.0-onboarding-2021.06.22, apm-8.0.0-profile-000001, apm-8.0.0-span-000001, apm-8.0.0-transaction-000001, metricbeat-8.0.0-2021.06.24-000001, twitter. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
After:
Large shard size alert is firing for the following index: apm-8.0.0-error-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/apm-8.0.0-error-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
Large shard size alert is firing for the following index: apm-8.0.0-transaction-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/apm-8.0.0-transaction-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
Large shard size alert is firing for the following index: metricbeat-8.0.0-2021.06.24-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/metricbeat-8.0.0-2021.06.24-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
Large shard size alert is firing for the following index: apm-8.0.0-metric-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/apm-8.0.0-metric-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
CCR Read Rejections alert ID and Messaging change
This alert becomes a per index alert.
Before/after (the bottom ID is the old id, the new ids are the top two). Since an index could have the same name in different remote clusters, we use the remote cluster in the ID as well. We do not use the "leader" cluster because, as discussed with @jasonrhodes, the alerting framework does not work across clusters.
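Illustrative only: one way to build a per-index CCR alert instance id that stays unique across remote clusters, as argued above. The exact id format used by the PR may differ from this sketch.

```ts
function ccrReadExceptionsInstanceId(remoteCluster: string, followerIndex: string): string {
  return `${remoteCluster}:${followerIndex}`;
}

// e.g. ccrReadExceptionsInstanceId('remoteCluster1', 'followerIndex')
//   === 'remoteCluster1:followerIndex'
```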
Messaging changes:
`internalShortMessage`
Before:
CCR read exceptions alert is firing for the following remote clusters: remoteCluster1, remoteCluster2. Verify follower and leader index relationships across the affected remote clusters.
After:
CCR read exceptions alert is firing for the following remote cluster: remoteCluster2. Verify follower and leader index relationships on the affected remote cluster.
CCR read exceptions alert is firing for the following remote cluster: remoteCluster1. Verify follower and leader index relationships on the affected remote cluster.
`internalFullMessage`
Before:
CCR read exceptions alert is firing for the following remote clusters: remoteCluster1, remoteCluster2. Current 'follower_index' indices are affected: followerIndex, followerIndex2. [View CCR stats](http://localhost:5601/ssg/app/monitoring#/elasticsearch/ccr?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ,ccs:ccs))
After:
CCR read exceptions alert is firing for the following remote cluster: remoteCluster1. Current 'follower_index' index affected: followerIndex. [View CCR stats](http://localhost:5601/ssg/app/monitoring#/elasticsearch/ccr?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ,ccs:ccs))
CCR read exceptions alert is firing for the following remote cluster: remoteCluster2. Current 'follower_index' index affected: followerIndex2. [View CCR stats](http://localhost:5601/ssg/app/monitoring#/elasticsearch/ccr?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ,ccs:ccs))
Elasticsearch cluster health, Elasticsearch version mismatch, Kibana version mismatch, Logstash version mismatch, Elasticsearch nodes changed, Elasticsearch license expiration
These alerts remain cluster level and the alert id is still the cluster id, as it was previously, minus a colon that used to be at the end of the id. Everything should function and look the same.
Test
Helpful for generating the alerts: #87377. For Threadpool and CCR alerts, I still need to find an easier way to reproduce them, but here are some notes: #93072 (comment), #85908