[Fleet] Agent health, integration update availability alerts #124240
Comments
Pinging @elastic/fleet (Team:Fleet)
The scaling recommendations described here: https://www.elastic.co/guide/en/fleet/7.17/fleet-server-scalability.html#scaling-recommendations should also be available as an alert. For example, when reaching 2000 agents, there should be an alert that the current sizing might not be adequate and that a change to xyz should be performed.
The team is working on infrastructure that would allow status (be it agent status or the status of the inputs/integrations) to be propagated up to the Fleet UI. In this process the status changes will be stored in a specific data stream. Users will then have the flexibility to build these alerts based on the documents we store. I'm sure we can also create pre-baked alerts out of the box.
I will +1 this as a feature request. My clusters are all on-prem. Having a way to easily send notifications when agents/fleet are unhealthy is something that is sorely needed.
@mukeshelastic I'm guessing this is no longer an "8.6" candidate since that has been released?
@nimarezainia - is this on the product roadmap for the near future? I have a customer requesting this feature.
@defutek-tj it is one of our higher-priority items for users of the platform; however, it is currently not slated for delivery due to other higher-priority items on that list.
Hi @joshdover who should I speak with to get more info on this? - fellow Elastician here! JP
@jpsep-elastic happy to discuss.
@nimarezainia - any updates as to when this feature might be available?
@defutek-tj our 8.9 release brings agent health, including reporting on the health of inputs/integrations (see the agent details page). We don't have alerts built on the status changes yet, however.
Describe the feature:
TLDR;
Within the Fleet UI you can sort, select and search for agents, and also see how many are healthy or unhealthy and how many have or have not responded within x minutes.
Within Stack Monitoring for the Beats agents it is possible to create a custom alerting rule that uses the Elasticsearch query rule type inside Kibana and queries the .monitoring-beats... indices to check whether a certain beat is alive and sending in data.
Sometimes the infrastructure involved is not very dynamic, and an alert for certain Elastic Agents might be of interest. Currently there is no way to alert based on the health of an agent. For example, I would like to know whether my on-premise fleet server is healthy; if it breaks, I want to be alerted immediately, since this can cause cascading errors such as policies not updating and all agents becoming unhealthy.
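The Stack Monitoring workaround described above can be sketched as an Elasticsearch query body plus a check on the hit count. This is only an illustration: the field names `beats_stats.beat.name` and `timestamp`, the default 5-minute window, and both helper names are assumptions that do not come from this issue, so they would need to be checked against the actual documents in the .monitoring-beats... indices.

```python
from typing import Any, Dict


def build_beat_alive_query(beat_name: str, window: str = "5m") -> Dict[str, Any]:
    """Build a search body matching monitoring documents from one beat
    within a recent time window. Field names are assumed, not confirmed."""
    return {
        "size": 0,  # only the hit count matters, not the documents
        "track_total_hits": True,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"beats_stats.beat.name": beat_name}},
                    {"range": {"timestamp": {"gte": f"now-{window}"}}},
                ]
            }
        },
    }


def beat_is_alive(search_response: Dict[str, Any]) -> bool:
    """Treat the beat as alive if the search matched at least one document."""
    total = search_response["hits"]["total"]
    # Newer Elasticsearch versions return an object, older ones a plain integer.
    count = total["value"] if isinstance(total, dict) else total
    return count > 0
```

An alerting rule would fire when `beat_is_alive` returns False for the response to this query, which is roughly what a "number of matches is below 1" threshold expresses.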
A good way would be to allow some default rules, like we have in Stack Monitoring, where I can select "give me an alert every 12 hours to Slack with all unhealthy agents". This way I would learn when my infrastructure has issues or has changed, and when I might want to clean up and remove the unhealthy agents.
Default rules in Fleet UI
As of now, I have no idea when an integration has a new version and I would need to look into the agent policy, then check if there is an update, or even go one step deeper into the integration itself and update there first. This was commented here after I created an issue that I could not see the update on the agent.
An alert that runs once a day and sends me a mail with "integration xyz is ready to update, no breaking changes" would be good.
This is needed to give me an alert if my fleet server goes down. Currently I am using a heartbeat that does an HTTP request against the fleet server, and I have a status alert set on it. However, that involves running additional software on different hardware, whilst the data is already available within Elasticsearch.
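For context, the heartbeat-style workaround described above amounts to something like the following probe. It is a minimal sketch under assumptions: the `/api/status` path and the `HEALTHY` status value are assumed here and are not stated in this issue, and the function names are made up for illustration.

```python
import json
import urllib.error
import urllib.request


def parse_fleet_status(http_status: int, body: str) -> bool:
    """Return True only for HTTP 200 with a JSON body reporting HEALTHY."""
    if http_status != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return str(payload.get("status", "")).upper() == "HEALTHY"


def fleet_server_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Probe the (assumed) status endpoint; any network error counts as down."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/status", timeout=timeout) as resp:
            return parse_fleet_status(resp.status, resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False
```

The probe has to run on a separate host from the fleet server to be useful, which is exactly the extra moving part the request is trying to avoid.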
So an alert that fires immediately when an agent goes unhealthy would be good. Furthermore, a second rule that sends me a status report once a day with "10 agents unhealthy ... list of agent names" would help me clean up.
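The daily status report could be rendered along these lines. This is a sketch only: the agent records are assumed to be dictionaries with hypothetical `name` and `status` keys, and any status other than `"healthy"` is counted as unhealthy, which may not match how Fleet actually classifies agents.

```python
from typing import Dict, Iterable, List


def unhealthy_agent_report(agents: Iterable[Dict[str, str]]) -> str:
    """Summarise unhealthy agents as a one-line message for mail or Slack."""
    # Assumption: anything not explicitly "healthy" is reported.
    names: List[str] = sorted(
        agent["name"] for agent in agents if agent.get("status") != "healthy"
    )
    if not names:
        return "All agents healthy"
    noun = "agent" if len(names) == 1 else "agents"
    return f"{len(names)} {noun} unhealthy: {', '.join(names)}"
```

For example, two non-healthy records named `db-01` and `web-02` produce `2 agents unhealthy: db-01, web-02`.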