Measure and expose shipper queue performance metrics #11
Comments
@michel-laterman It would be good to have your input here too.
As far as diagnostics go, we would like to be able to access recent metrics/tracing data in order to include them with our bundles. If metrics/tracing are handled through the pipeline env then we can use a queue or buffer of some sort to capture recent data (similar to the ring buffer).
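A minimal sketch of the kind of ring buffer described above, in Go; the `Snapshot` type and all names here are illustrative assumptions, not the shipper's actual API:

```go
package diagnostics

import (
	"sync"
	"time"
)

// Snapshot is a hypothetical container for one metrics/tracing sample.
type Snapshot struct {
	Timestamp time.Time
	QueueSize int
}

// RingBuffer keeps the N most recent snapshots so they can be
// attached to a diagnostics bundle on request.
type RingBuffer struct {
	mu   sync.Mutex
	buf  []Snapshot
	next int
	full bool
}

func NewRingBuffer(size int) *RingBuffer {
	return &RingBuffer{buf: make([]Snapshot, size)}
}

// Add overwrites the oldest entry once the buffer is full.
func (r *RingBuffer) Add(s Snapshot) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.buf[r.next] = s
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// Recent returns the stored snapshots in oldest-to-newest order.
func (r *RingBuffer) Recent() []Snapshot {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.full {
		return append([]Snapshot(nil), r.buf[:r.next]...)
	}
	return append(append([]Snapshot(nil), r.buf[r.next:]...), r.buf[:r.next]...)
}
```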
Okay, so I'm just gonna use this to track the various intersecting issues needed for queue metrics:
@cmacknz @fearful-symmetry It sounds like the approach here is to add these stats to the diagnostics download from the agent. Please correct me if I am wrong. Would it be possible to have the stats displayed on the agent details page as well? (I don't think we need dashboards for agent output observability at this stage.) User persona: anyone debugging throughput issues of an agent (support, platform operators, etc.). Some additions to the above on what we should consider:
Average bytes on each write (one way to derive this is sketched below)
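"Average bytes on each write" can be derived from two cumulative counters rather than tracked inside the queue itself; a hedged sketch, with hypothetical names:

```go
package monitoring

// averageBytesPerWrite derives the mean write size from two cumulative
// counters (total bytes written and total write calls). The counter names
// and where they come from are assumptions for illustration only.
func averageBytesPerWrite(totalBytes, totalWrites uint64) float64 {
	if totalWrites == 0 {
		return 0
	}
	return float64(totalBytes) / float64(totalWrites)
}
```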
Removed the age of the oldest item in the queue from the metrics requirements for now. Let's start simple and just report the current size of the queue periodically and see if we need more information. Once we have the queue size over time in ES + Kibana we should be able to compute queue lag from there, for example.
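A rough sketch of the periodic reporting described above, assuming a hypothetical `queueSizer` interface rather than the shipper's real queue API:

```go
package monitoring

import "time"

// queueSizer is a stand-in for whatever interface the queue exposes;
// the real shipper code may differ.
type queueSizer interface {
	CurrentSize() int
}

// reportQueueSize samples the queue size on a fixed interval and hands it
// to a callback (e.g. a gauge update) until the stop channel is closed.
func reportQueueSize(q queueSizer, interval time.Duration, report func(int), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			report(q.CurrentSize())
		case <-stop:
			return
		}
	}
}
```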
I am moving the output metrics to a separate issue; let's just focus on the queue metrics with this issue.
Tried to think of things that would be useful for the disk queue; here is what I came up with.
Looks like it got auto-closed. Since this is a meta-issue, reopening for now...
@leehinman What happens when an event is "stuck"? Does it block?
This only happens if you have enabled infinite retries. But in that case an event that can't be delivered will be retried over and over again. So it will take up space in the queue, but other events will still be delivered. If you get enough events that can't be delivered then the queue will fill up. That would block reading of new events. This is expected behavior with infinite retries.
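A toy model of that behavior (a bounded queue, a consumer that re-enqueues undeliverable events, and the resulting blocking once the queue fills); this is not the shipper's actual code, just an illustration:

```go
package main

import "fmt"

// deliver pretends to ship an event; events marked "stuck" always fail.
func deliver(ev string) bool { return ev != "stuck" }

func main() {
	// Bounded queue with capacity 3. With infinite retries, an
	// undeliverable event is put back after every failed attempt, so it
	// keeps occupying a slot while other events are delivered.
	queue := make(chan string, 3)
	queue <- "stuck"
	queue <- "event-1"
	queue <- "event-2"

	for i := 0; i < 6; i++ {
		ev := <-queue
		if deliver(ev) {
			fmt.Println("delivered:", ev)
			continue
		}
		queue <- ev // infinite retry: re-enqueue, slot stays occupied
	}
	fmt.Println("still queued:", len(queue)) // 1 (the stuck event)
	// If every slot were taken by undeliverable events, the producer's
	// next send on `queue` would block: no new events could be read in.
}
```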
Not sure if this is an issue we need to deal with right now, but could we have a retry queue for events that have reached some retry threshold? Yank them out of the main queue and use a separate thread to retry.
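A hedged sketch of what such a retry path could look like; `Event`, `maxAttempts`, and the channel layout are illustrative assumptions, not the shipper's design:

```go
package shipper

import "time"

// Event is a stand-in for the shipper's real event type.
type Event struct {
	Payload  []byte
	Attempts int
}

const maxAttempts = 5 // hypothetical retry threshold

// routeFailure sends an event that has exceeded the retry threshold to a
// dedicated retry channel instead of putting it back on the main queue.
func routeFailure(ev Event, mainQueue, retryQueue chan<- Event) {
	ev.Attempts++
	if ev.Attempts >= maxAttempts {
		retryQueue <- ev
		return
	}
	mainQueue <- ev
}

// retryWorker drains the retry queue on its own goroutine with a backoff,
// so slow or dead destinations don't hold up the main queue.
func retryWorker(retryQueue <-chan Event, send func(Event) error, backoff time.Duration) {
	for ev := range retryQueue {
		for send(ev) != nil {
			time.Sleep(backoff)
		}
	}
}
```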
Lots of ways to address this. I'm thinking we wait a bit to see what happens with the queues from a performance perspective and then we can work on retries & failure states.
Closing, creating a new issue with the metrics that will be in the redesigned health dashboard.
Add metrics counters and gauges for the following set of metrics, and ensure they are exposed on the shipper's HTTP monitoring interface under /debug/vars:
elastic-agent-shipper/monitoring/queuemon.go, line 131 (commit e231e51)
elastic-agent-shipper/monitoring/queuemon.go, line 130 (commit e231e51)
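Since /debug/vars is the endpoint served by Go's standard expvar package, here is a minimal sketch of publishing a counter and a gauge there; the metric names are placeholders, not the shipper's actual keys:

```go
package main

import (
	"expvar" // importing expvar registers the /debug/vars handler on http.DefaultServeMux
	"log"
	"net/http"
)

var (
	// A counter only increases; a gauge is set to the latest sampled value.
	eventsPublished = expvar.NewInt("queue.published_events_total")
	currentSize     = expvar.NewInt("queue.current_size")
)

func main() {
	// Simulate a few metric updates.
	eventsPublished.Add(42)
	currentSize.Set(7)

	// GET http://localhost:8080/debug/vars now returns these values as JSON.
	log.Fatal(http.ListenAndServe("localhost:8080", nil))
}
```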
The eventual goal is to expose these metrics on the agent details page in Fleet (for example, https://$kbana/app/fleet/agents/$agentID) via the agent monitoring metrics.