[Stack Monitoring] create alert per node, index, or cluster instead of always per cluster #102544
Conversation
Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)
@elasticmachine merge upstream
const key = this.alertOptions.accessorKey;

// for each node, update the alert's state with node state
Here is where I've moved the triggering of the new alert into the inner loop that iterates over each node, instead of in the outer cluster loop.
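A minimal sketch of the restructuring this comment describes, using simplified stand-in types rather than the real Kibana interfaces (`NodeState`, `ClusterState`, and `AlertServicesLike` below are illustrative assumptions, not the PR's actual code):

```ts
interface NodeState {
  meta: { nodeId?: string; instanceId?: string };
  shouldFire: boolean;
}

interface ClusterState {
  clusterUuid: string;
  nodes: NodeState[];
}

interface AlertServicesLike {
  alertInstanceFactory(id: string): {
    scheduleActions(group: string, context?: Record<string, unknown>): void;
  };
}

function processClusters(clusters: ClusterState[], services: AlertServicesLike) {
  for (const cluster of clusters) {
    // Previously a single alert instance was created at this level, keyed by cluster.clusterUuid.
    for (const node of cluster.nodes) {
      if (!node.shouldFire) continue;
      // Now each firing node gets its own alert instance, keyed by its node (or instance) id,
      // falling back to the cluster UUID for cluster-level alerts.
      const instance = services.alertInstanceFactory(
        node.meta.nodeId || node.meta.instanceId || cluster.clusterUuid
      );
      instance.scheduleActions('default', { clusterUuid: cluster.clusterUuid });
    }
  }
}
```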
- const instance = services.alertInstanceFactory(node.meta.nodeId || node.meta.instanceId);
+ // quick fix for now so that non node level alerts will use the cluster id
+ const instance = services.alertInstanceFactory(
+   node.meta.nodeId || node.meta.instanceId || cluster.clusterUuid
Some alerts are cluster level and do not have a nodeId or instanceId property, so we use the clusterUuid. This should be typed and improved at some point, or there should be a different function for cluster vs node level processing.
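A hypothetical sketch of the typed follow-up this comment suggests (not part of this PR): make the node-level vs cluster-level distinction explicit instead of relying on optional-property fallbacks. All names below are illustrative.

```ts
// A discriminated union forces each code path to state which level it is handling.
type AlertScope =
  | { kind: 'node'; nodeId: string }
  | { kind: 'cluster'; clusterUuid: string };

function instanceIdFor(scope: AlertScope): string {
  return scope.kind === 'node' ? scope.nodeId : scope.clusterUuid;
}
```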
  protected executeActions(
    instance: AlertInstance,
-   alertStates: AlertThreadPoolRejectionsState[],
+   { alertStates }: { alertStates: AlertState[] },
    item: AlertData | null,
This was getting the wrong data and was not able to destructure the values further down, resulting in a bug for Threadpool alerts.
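A simplified illustration of the bug class described here; these are not the real Kibana signatures or types. The method was typed to take a bare array while the caller actually passed a wrapper object, so destructuring further down failed.

```ts
interface ThreadPoolRejectionsAlertState {
  nodeName?: string;
  rejectionCount?: number;
}

// Before: declared as an array parameter, but invoked at runtime with
// `{ alertStates: [...] }`, so indexing into it yields undefined.
function executeActionsBroken(alertStates: ThreadPoolRejectionsAlertState[]) {
  const { nodeName, rejectionCount } = alertStates[0]; // throws when alertStates[0] is undefined
  return `${nodeName}: ${rejectionCount}`;
}

// After: accept the wrapper shape that is actually passed and destructure it.
function executeActionsFixed({ alertStates }: { alertStates: ThreadPoolRejectionsAlertState[] }) {
  const { nodeName, rejectionCount } = alertStates[0];
  return `${nodeName}: ${rejectionCount}`;
}
```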
LGTM 🚀
I couldn't fire all the alerts, but the code looks good.
…tic#102544)

* create alert per node instead of per cluster
* add comment
* fix test, replace alert state with empty array when no node is firing
* update cpu usage action messaging
* fix internationalization
* update disk usage rule action messaging
* update memory usage rule action messaging
* update other action messaging
* update missing monitoring data alert action messaging
* remove comment
* fix bug where threadpool alerts were not firing
* fix bug with threadpool rejections and update alert action messaging to be per node
* update comments
* unit test for thread pool write rejections alert
* update messaging for CCR read rejection
* fix cluster level alerts to use the cluster id when it's not node level
* add more tests to nodes changed alert
* update default message
* update alert messaging for large shard size
* update default messaging

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@neptunian If I understand this correctly, we have changed the alert ID to be just the node ID.
In most customer use cases a monitoring cluster is used to consolidate monitoring data from many production clusters. Each production cluster can have an instance named, say, "instance-0000000001". If the alert id is only the node ID then it may not be unique. Do we need "clusterid:nodeID" to make the alert ID unique?
@ravikesarwani "instance-0000000001" is the node name, but the node id should be unique. You can see the ID used here in the URL. @jasonrhodes Do you see any issue with just using the node ID?
I actually don't know what generates the node IDs, but I imagine it's a UUID as well? If so, it should be safe/unique.
(#103719): backport of the same change set listed above. Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@jasonrhodes @ravikesarwani I believe the Alerts team said that those ids only need to be unique per alert.
@simianhacker yes, that's correct. I think Ravi is wondering whether, if you have multiple nodes whose CPU is spiking at the same time across different clusters, those nodes could each have the same Node ID. It would all be within the one rule, but if those Node IDs aren't unique, there could theoretically be a collision. I assume they're UUIDs but I'm not sure; it'd be nice to confirm that.
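A sketch of the safeguard being discussed here (not what this PR ships): prefixing the node UUID with the cluster UUID would keep the alert instance id unique even if two monitored clusters ever produced the same node id. Names and values are illustrative only.

```ts
function compositeInstanceId(clusterUuid: string, nodeUuid: string): string {
  return `${clusterUuid}:${nodeUuid}`;
}

// Illustrative values only:
// compositeInstanceId('cluster-a-uuid', 'node-uuid') === 'cluster-a-uuid:node-uuid'
```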
Resolves #100136. This is part of the Stack Monitoring alerting alignment work, where we want to create the alert for the entity where the problem occurred (node, index, or cluster) instead of always for the cluster. This means we also change the alert ids to reflect that.
Changes
Bug fix for Threadpool alerts
I don't think Threadpool search and write rejection alerts are working in production. Testing on master will give you errors due to the wrong parameters being passed into `executeActions`. This was probably not caught because it's never been checked off in any test scenarios due to testers not having an easy way to trigger the alert. This will be resolved in this PR.

Bug fix for Threadpool alerts UI message
Threadpool rejection counts were not showing up in the UI message. Before/after:

CPU Usage Alert Messaging
- `internalShortMessage`, before and after. The first alert is the old message, where with the new alerting changes there would only ever be 1 node in the list. The last couple of lines are the new message.
- `internalFullMessage`, before and after. The top messages are the old messages, where with the new alerting changes there would only ever be 1 node in the list. The last messages are the new messages. Note that the link was changed to a global state link, like other alerts, instead of taking the user to the node view.
- Remove template variables `nodes` and `count`, though they will still continue to be sent "behind the scenes" for users already using them, where `count` always equals 1 and `nodes` is always one node.
- Add template variable `node`, which functions the same as `nodes` but will only have one node in the value, e.g. `mynodeId:cpuUsage` (see the sketch after this list).
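A hedged example of how an action message might reference the new per-node variable, written as a TypeScript constant. It assumes the variable is exposed to actions through the usual `{{context.*}}` Mustache syntax; check the rule's action-variable list in Kibana for the exact names before relying on them.

```ts
// Illustrative action message template using the new single-node variable.
const cpuUsageActionMessage = `
{{context.internalShortMessage}}
Affected node: {{context.node}}
`;
```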
Disk Usage Alert Messaging
All the same changes as CPU Usage rule above, but replace "CPU usage" with "Disk usage"
- Eg of `internalShortMessage`, before and after
- Eg of `internalFullMessage`, before and after

Memory Usage (JVM) Alert Messaging
All the same changes as CPU Usage rule above, but replace "CPU usage" with "Memory usage"
- Eg of `internalShortMessage`, before and after
- Eg of `internalFullMessage`, before and after

Missing Monitoring Data Alert Messaging
All the same changes as CPU Usage rule above, but replace "CPU usage" with "Missing monitoring data"
- Eg of `internalShortMessage`, before and after
- Eg of `internalFullMessage`, before and after

Thread pool Search Rejections Alert Messaging
All the same changes as CPU Usage rule above, but replace "CPU usage" with "threadpool search rejections"
- Eg of `internalShortMessage`, before and after
- Eg of `internalFullMessage`, before and after

Thread pool Write Rejections Alert Messaging
All the same changes as the Thread pool Search Rejections rule above

- `internalShortMessage`
- `internalFullMessage`
Shard Size Alert ID and Messaging change
This alert becomes a per index alert:
Before
Now the alert id is the instance_id returned from Elasticsearch: `clusterId:shardIndex`.

Messaging changes:
`internalShortMessage`
Before:
Large shard size alert is firing for the following indices: apm-8.0.0-error-000001, apm-8.0.0-metric-000001, apm-8.0.0-onboarding-2021.06.22, apm-8.0.0-profile-000001, apm-8.0.0-span-000001, apm-8.0.0-transaction-000001, metricbeat-8.0.0-2021.06.24-000001, twitter. Investigate indices with large shard sizes.
After:
Large shard size alert is firing for the following index: apm-8.0.0-onboarding-2021.06.22. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: apm-8.0.0-metric-000001. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: apm-8.0.0-error-000001. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: apm-8.0.0-span-000001. Investigate indices with large shard sizes.
Large shard size alert is firing for the following index: twitter. Investigate indices with large shard sizes.
`internalFullMessage`
Before:
Large shard size alert is firing for the following indices: apm-8.0.0-error-000001, apm-8.0.0-metric-000001, apm-8.0.0-onboarding-2021.06.22, apm-8.0.0-profile-000001, apm-8.0.0-span-000001, apm-8.0.0-transaction-000001, metricbeat-8.0.0-2021.06.24-000001, twitter. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
After:
Large shard size alert is firing for the following index: apm-8.0.0-error-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/apm-8.0.0-error-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
Large shard size alert is firing for the following index: apm-8.0.0-transaction-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/apm-8.0.0-transaction-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
Large shard size alert is firing for the following index: metricbeat-8.0.0-2021.06.24-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/metricbeat-8.0.0-2021.06.24-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
Large shard size alert is firing for the following index: apm-8.0.0-metric-000001. [View index shard size stats](http://localhost:5603/ssg/app/monitoring#/elasticsearch/indices/apm-8.0.0-metric-000001?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ))
CCR Read Rejections alert ID and Messaging change
This alert becomes a per index alert.
Before/after (the bottom ID is the old id, the new ids are the top two). Since an index could have the same name in different remote clusters, we use the remote cluster in the ID as well. We do not use the "leader" cluster because, as discussed with @jasonrhodes, the alerting framework does not work across clusters.
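Illustrative only: one way to build a per-index CCR alert instance id that stays unique across remote clusters, as argued above. The exact id format used by the PR may differ from this sketch.

```ts
function ccrReadExceptionsInstanceId(remoteCluster: string, followerIndex: string): string {
  return `${remoteCluster}:${followerIndex}`;
}

// e.g. ccrReadExceptionsInstanceId('remoteCluster1', 'followerIndex')
//   === 'remoteCluster1:followerIndex'
```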
Messaging changes:
`internalShortMessage`
Before:
CCR read exceptions alert is firing for the following remote clusters: remoteCluster1, remoteCluster2. Verify follower and leader index relationships across the affected remote clusters.
After:
CCR read exceptions alert is firing for the following remote cluster: remoteCluster2. Verify follower and leader index relationships on the affected remote cluster.
CCR read exceptions alert is firing for the following remote cluster: remoteCluster1. Verify follower and leader index relationships on the affected remote cluster.
`internalFullMessage`
Before:
CCR read exceptions alert is firing for the following remote clusters: remoteCluster1, remoteCluster2. Current 'follower_index' indices are affected: followerIndex, followerIndex2. [View CCR stats](http://localhost:5601/ssg/app/monitoring#/elasticsearch/ccr?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ,ccs:ccs))
After:
CCR read exceptions alert is firing for the following remote cluster: remoteCluster1. Current 'follower_index' index affected: followerIndex. [View CCR stats](http://localhost:5601/ssg/app/monitoring#/elasticsearch/ccr?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ,ccs:ccs))
CCR read exceptions alert is firing for the following remote cluster: remoteCluster2. Current 'follower_index' index affected: followerIndex2. [View CCR stats](http://localhost:5601/ssg/app/monitoring#/elasticsearch/ccr?_g=(cluster_uuid:22FsxZ7mRs6tINBAPq-9lQ,ccs:ccs))
Elasticsearch cluster health, Elasticsearch version mismatch, Kibana version mismatch, Logstash version mismatch, Elasticsearch nodes changed, Elasticsearch license expiration
These alerts remain cluster level and the alert id is still the cluster id, as it was previously, minus a colon that used to be at the end of the id. Everything should function and look the same.
Test
Helpful for generating the alerts: #87377. For Threadpool and CCR alerts, I still need to find an easier way to reproduce them, but here are some notes: #93072 (comment), #85908