This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

Measure and expose shipper queue performance metrics #11

Closed
4 of 8 tasks
cmacknz opened this issue Mar 17, 2022 · 14 comments · Fixed by elastic/beats#33471
Assignees
Labels
estimation:Week (Task that represents a week of work.), Meta, Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), v8.6.0

Comments

@cmacknz
Member

cmacknz commented Mar 17, 2022

Add metrics counters and gauges for the following set of metrics, and ensure they are exposed on the shipper's HTTP monitoring interface under /debug/vars:

The eventual goal is to expose these metrics on the agent details page in Fleet (for example, https://$kbana/app/fleet/agents/$agentID) via the agent monitoring metrics.
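
For reference, a minimal sketch of how counters and gauges could be published under /debug/vars using Go's standard expvar package. The metric names, values, and port here are illustrative, not the shipper's actual ones.

```go
package main

import (
	"expvar"
	"net/http"
)

var (
	// Illustrative metric names; the shipper's real names may differ.
	queueCurrentSize = expvar.NewInt("queue.current_size") // events currently in the queue
	queueMaxSize     = expvar.NewInt("queue.max_size")     // configured queue capacity
	eventsPublished  = expvar.NewInt("queue.events_published")
)

func main() {
	queueMaxSize.Set(4096)

	// A real shipper would update these from the queue itself.
	eventsPublished.Add(1)
	queueCurrentSize.Set(1)

	// expvar registers itself on http.DefaultServeMux at /debug/vars,
	// so the variables above are served as JSON at that endpoint.
	http.ListenAndServe("localhost:5066", nil)
}
```

With this running, requesting http://localhost:5066/debug/vars returns the registered variables (plus the standard cmdline and memstats entries) as JSON.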

@jlind23 jlind23 added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Mar 21, 2022
@cmacknz cmacknz added the v8.4.0 label Mar 22, 2022
@cmacknz cmacknz changed the title [META][Feature] Improve data plane observability [Meta][Feature] Improve data plane observability Mar 23, 2022
@ph

ph commented Mar 24, 2022

@michel-laterman Would be good to have your input here too

@michel-laterman

As far as diagnostics go, we would like to be able to access recent metrics/tracing data in order to include it with our bundles.

If metrics/tracing are handled through the pipeline environment, then we can use a queue or buffer of some sort to capture recent data (similar to the ring buffer).
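
For illustration, a minimal sketch (names and types are hypothetical) of a fixed-size ring buffer that keeps only the most recent metric snapshots so a diagnostics bundle could include them:

```go
package diag

import "time"

// Snapshot is a hypothetical point-in-time metrics sample.
type Snapshot struct {
	Timestamp time.Time
	QueueSize int
}

// Ring keeps the last len(buf) snapshots, overwriting the oldest first.
type Ring struct {
	buf  []Snapshot
	next int
	full bool
}

func NewRing(n int) *Ring { return &Ring{buf: make([]Snapshot, n)} }

func (r *Ring) Add(s Snapshot) {
	r.buf[r.next] = s
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// Recent returns the stored snapshots in oldest-to-newest order,
// which is what a diagnostics bundle would serialize.
func (r *Ring) Recent() []Snapshot {
	if !r.full {
		return append([]Snapshot(nil), r.buf[:r.next]...)
	}
	return append(append([]Snapshot(nil), r.buf[r.next:]...), r.buf[:r.next]...)
}
```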

@fearful-symmetry
Contributor

fearful-symmetry commented Apr 14, 2022

Okay, so I'm just gonna use this to track the various intersecting issues needed for queue metrics:

@jlind23 jlind23 added estimation:Week Task that represents a week of work. and removed 8.4-candidate labels May 24, 2022
@jlind23 jlind23 assigned fearful-symmetry and faec and unassigned faec Jul 13, 2022
@cmacknz cmacknz changed the title [Meta][Feature] Improve data plane observability [Meta][Feature] Implement shipper observability features Jul 13, 2022
@cmacknz cmacknz added the v8.6.0 label Sep 7, 2022
@nimarezainia

@cmacknz @fearful-symmetry It sounds like the approach here is to add these stats to the diagnostics download from the agent; please correct me if I am wrong. Would it be possible to also have the stats displayed on the agent details page? (I don't think we need dashboards for agent output observability at this stage.)

User persona: anyone debugging throughput issues on an agent (support, platform operator, etc.).
Use case: a support engineer tuning the throughput of the agent's output will modify some of the queue and batching parameters, and then needs to observe the resulting changes in throughput on the agent.

Some additions to the above on what we should consider:

  • Details of the memory queue actually allocated
  • Utilization of the memory queue
  • Average batch size sent from the output queue
  • Transmission rates, and any stats associated with the variables that make up the output engine
  • Average bytes on each write: this may indicate that maximum_batch_size needs to be increased if the average is consistently near the configured maximum (see the sketch after this list)
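
As a sketch of that last point, here is one way the average bytes per write could be tracked for comparison against the configured maximum batch size. The type and method names are hypothetical:

```go
package outputmetrics

import "sync"

// BatchStats tracks bytes written per output batch.
type BatchStats struct {
	mu         sync.Mutex
	batches    int64
	totalBytes int64
}

// ObserveWrite records one batch write of the given size in bytes.
func (s *BatchStats) ObserveWrite(bytes int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.batches++
	s.totalBytes += int64(bytes)
}

// AvgBytesPerWrite returns the running average; if it stays close to the
// configured maximum batch size, the maximum may need to be raised.
func (s *BatchStats) AvgBytesPerWrite() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.batches == 0 {
		return 0
	}
	return float64(s.totalBytes) / float64(s.batches)
}
```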

@cmacknz cmacknz changed the title [Meta][Feature] Implement shipper observability features Measure and expose shipper performance metrics Oct 13, 2022
@cmacknz
Member Author

cmacknz commented Oct 26, 2022

Removed the age of the oldest item in the queue from the required metrics for now. Let's start simple, just report the current size of the queue periodically, and see if we need more information. Once we have the queue size over time in ES and Kibana, we should be able to compute queue lag from there, for example.

@cmacknz cmacknz changed the title Measure and expose shipper performance metrics Measure and expose shipper queue performance metrics Oct 31, 2022
@cmacknz
Member Author

cmacknz commented Oct 31, 2022

I am moving the output metrics to a separate issue; let's focus on just the queue metrics in this issue.

@cmacknz
Member Author

cmacknz commented Oct 31, 2022

#153

@leehinman
Contributor

Tried to think of things that would be useful for the disk queue; here is what I came up with.

  • Size on disk (the queue already tracks this)
  • Number of events (code needs to be added to track this)
  • Rate of events added to the queue over the last 1, 5, and 15 minutes (see the sketch after this list)
  • Rate of events removed from the queue over the last 1, 5, and 15 minutes
  • Date of the oldest segment (once a segment is full, we shouldn't write to it again, so this should show us how the tail of the queue is doing and whether we have at least one stuck event)
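
For the two rate items above, one common approach is a load-average-style exponentially weighted moving average per window. A minimal sketch, with hypothetical names and an arbitrarily chosen 5-second tick interval:

```go
package queuemetrics

import (
	"math"
	"sync/atomic"
)

const tickSeconds = 5.0

// EWMA is an exponentially weighted moving average of an event rate,
// in the style of Unix load averages.
type EWMA struct {
	alpha   float64
	rate    float64 // smoothed events per second
	pending int64   // events observed since the last tick
	started bool
}

// NewEWMA returns a meter smoothed over the given window in minutes (1, 5, 15, ...).
func NewEWMA(minutes float64) *EWMA {
	return &EWMA{alpha: 1 - math.Exp(-tickSeconds/(minutes*60))}
}

// Observe records n events (e.g. on every enqueue or dequeue).
func (e *EWMA) Observe(n int64) { atomic.AddInt64(&e.pending, n) }

// Tick must be called every tickSeconds, e.g. from a time.Ticker.
func (e *EWMA) Tick() {
	count := atomic.SwapInt64(&e.pending, 0)
	instant := float64(count) / tickSeconds
	if !e.started {
		e.rate = instant
		e.started = true
		return
	}
	e.rate += e.alpha * (instant - e.rate)
}

// Rate returns the current smoothed rate in events per second.
func (e *EWMA) Rate() float64 { return e.rate }
```

Usage would be three meters per direction (1, 5, and 15 minutes), each fed by Observe on every enqueue or dequeue and ticked from a single time.Ticker.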

@fearful-symmetry
Contributor

Looks like it got auto-closed. Since this is a meta-issue, reopening for now...

@nimarezainia

  • Date of the oldest segment (once a segment is full, we shouldn't write to it again, so this should show us how the tail of the queue is doing and whether we have at least one stuck event)

@leehinman What happens when an event is "stuck"? Does it block?

@leehinman
Contributor

@leehinman What happens when an event is "stuck"? Does it block?

This only happens if you have enabled infinite retries. In that case, an event that can't be delivered will be retried over and over again, so it will take up space in the queue, but other events will still be delivered. If you get enough events that can't be delivered, the queue will fill up, which would block reading of new events. This is expected behavior with infinite retries.

@nimarezainia

@leehinman What happens when an event is "stuck"? Does it block?

This only happens if you have enabled infinite retries. In that case, an event that can't be delivered will be retried over and over again, so it will take up space in the queue, but other events will still be delivered. If you get enough events that can't be delivered, the queue will fill up, which would block reading of new events. This is expected behavior with infinite retries.

Not sure if this is an issue we need to deal with right now, but could we have a retry queue for events that have reached some retry threshold? Yank them out of the main queue and use a separate thread to retry them.
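
A rough sketch of what that separation could look like, purely illustrative (all names and the threshold are hypothetical, and this is not how the shipper currently handles retries):

```go
package retry

import "time"

// Event is a hypothetical queue event with a retry counter.
type Event struct {
	Data     []byte
	Attempts int
}

const maxInlineRetries = 3

// dispatch tries to publish an event; once it has failed too many times,
// it is handed off to a side channel so the main queue keeps draining.
func dispatch(ev Event, publish func(Event) error, retryCh chan<- Event) {
	if err := publish(ev); err != nil {
		ev.Attempts++
		if ev.Attempts >= maxInlineRetries {
			retryCh <- ev // move it out of the main path
			return
		}
		// Below the threshold, leave it to the normal retry handling.
	}
}

// retryWorker drains the side channel on its own schedule,
// retrying each event with a simple backoff until it succeeds.
func retryWorker(retryCh <-chan Event, publish func(Event) error) {
	for ev := range retryCh {
		for publish(ev) != nil {
			time.Sleep(time.Second) // back off between attempts
		}
	}
}
```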

@leehinman
Contributor

Not sure if this is an issue we need to deal with right now, but could we have a retry queue for events that have reached some retry threshold? Yank them out of the main queue and use a separate thread to retry them.

Lots of ways to address this. I'm thinking we wait a bit to see how the queues behave from a performance perspective, and then we can work on retries and failure states.

@cmacknz
Member Author

cmacknz commented Feb 23, 2023

Closing; creating a new issue with the metrics that will be in the redesigned health dashboard.

@cmacknz cmacknz closed this as completed Feb 23, 2023