This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

Measure and expose shipper queue performance metrics #11

Closed
4 of 8 tasks
cmacknz opened this issue Mar 17, 2022 · 14 comments · Fixed by elastic/beats#33471
Assignees
Labels
estimation:Week (Task that represents a week of work.), Meta, Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), v8.6.0

Comments

@cmacknz
Member

cmacknz commented Mar 17, 2022

Add metrics counters and gauges for the following set of metrics, and ensure they are exposed on the shipper's HTTP monitoring interface under /debug/vars:

The eventual goal is to expose these metrics on the agent details page in Fleet (for example, https://$kbana/app/fleet/agents/$agentID) via the agent monitoring metrics.
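
For reference, a minimal sketch of how counters and gauges could be published under /debug/vars using Go's standard expvar package. The metric names, values, and port here are illustrative, not the shipper's actual ones.

```go
package main

import (
	"expvar"
	"net/http"
)

var (
	// Illustrative metric names; the shipper's real names may differ.
	queueCurrentSize = expvar.NewInt("queue.current_size") // events currently in the queue
	queueMaxSize     = expvar.NewInt("queue.max_size")     // configured queue capacity
	eventsPublished  = expvar.NewInt("queue.events_published")
)

func main() {
	queueMaxSize.Set(4096)

	// A real shipper would update these from the queue itself.
	eventsPublished.Add(1)
	queueCurrentSize.Set(1)

	// expvar registers itself on http.DefaultServeMux at /debug/vars,
	// so the variables above are served as JSON at that endpoint.
	http.ListenAndServe("localhost:5066", nil)
}
```

With this running, requesting http://localhost:5066/debug/vars returns the registered variables (plus the standard cmdline and memstats entries) as JSON.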

@jlind23 jlind23 added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Mar 21, 2022
@cmacknz cmacknz added the v8.4.0 label Mar 22, 2022
@cmacknz cmacknz changed the title [META][Feature] Improve data plane observability [Meta][Feature] Improve data plane observability Mar 23, 2022
@ph

ph commented Mar 24, 2022

@michel-laterman Would be good to have your input here too

@michel-laterman

As far as diagnostics go, we would like to be able to access recent metrics/tracing data in order to include it with our bundles.

If metrics/tracing are handled through the pipeline environment, then we can use a queue or buffer of some sort to capture recent data (similar to the ring buffer).
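
For illustration, a minimal sketch (names and types are hypothetical) of a fixed-size ring buffer that keeps only the most recent metric snapshots so a diagnostics bundle could include them:

```go
package diag

import "time"

// Snapshot is a hypothetical point-in-time metrics sample.
type Snapshot struct {
	Timestamp time.Time
	QueueSize int
}

// Ring keeps the last len(buf) snapshots, overwriting the oldest first.
type Ring struct {
	buf  []Snapshot
	next int
	full bool
}

func NewRing(n int) *Ring { return &Ring{buf: make([]Snapshot, n)} }

func (r *Ring) Add(s Snapshot) {
	r.buf[r.next] = s
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// Recent returns the stored snapshots in oldest-to-newest order,
// which is what a diagnostics bundle would serialize.
func (r *Ring) Recent() []Snapshot {
	if !r.full {
		return append([]Snapshot(nil), r.buf[:r.next]...)
	}
	return append(append([]Snapshot(nil), r.buf[r.next:]...), r.buf[:r.next]...)
}
```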

@fearful-symmetry
Contributor

fearful-symmetry commented Apr 14, 2022

Okay, so I'm just gonna use this to track the various intersecting issues needed for queue metrics:

@jlind23 jlind23 added estimation:Week Task that represents a week of work. and removed 8.4-candidate labels May 24, 2022
@jlind23 jlind23 assigned fearful-symmetry and faec and unassigned faec Jul 13, 2022
@cmacknz cmacknz changed the title [Meta][Feature] Improve data plane observability [Meta][Feature] Implement shipper observability features Jul 13, 2022
@cmacknz cmacknz added the v8.6.0 label Sep 7, 2022
@nimarezainia

@cmacknz @fearful-symmetry It sounds like the approach here is to add these stats to the diagnostics download from the agent; please correct me if I am wrong. Would it be possible to also have the stats displayed on the agent details page? (I don't think we need dashboards for agent output observability at this stage.)

User persona: anyone debugging throughput issues on an agent (support, platform operator, etc.).
Use case: a support engineer tuning the throughput of the agent's output will modify some of the queue and batching parameters, and then needs to observe the resulting changes in throughput on the agent.

Some additions to the above on what we should consider:

  • Details of the memory queue actually allocated
  • Utilization of the memory queue
  • Average batch size sent from the output queue
  • Transmission rates, and any stats associated with the variables that make up the output engine
  • Average bytes on each write: this may indicate that maximum_batch_size needs to be increased if the average is consistently near the configured maximum (see the sketch after this list)
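
As a sketch of that last point, here is one way the average bytes per write could be tracked for comparison against the configured maximum batch size. The type and method names are hypothetical:

```go
package outputmetrics

import "sync"

// BatchStats tracks bytes written per output batch.
type BatchStats struct {
	mu         sync.Mutex
	batches    int64
	totalBytes int64
}

// ObserveWrite records one batch write of the given size in bytes.
func (s *BatchStats) ObserveWrite(bytes int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.batches++
	s.totalBytes += int64(bytes)
}

// AvgBytesPerWrite returns the running average; if it stays close to the
// configured maximum batch size, the maximum may need to be raised.
func (s *BatchStats) AvgBytesPerWrite() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.batches == 0 {
		return 0
	}
	return float64(s.totalBytes) / float64(s.batches)
}
```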

@cmacknz cmacknz changed the title [Meta][Feature] Implement shipper observability features Measure and expose shipper performance metrics Oct 13, 2022
@cmacknz
Member Author

cmacknz commented Oct 26, 2022

Removed the age of the oldest item in the queue from the required metrics for now. Let's start simple, just report the current size of the queue periodically, and see if we need more information. Once we have the queue size over time in ES and Kibana, we should be able to compute queue lag from there, for example.

@cmacknz cmacknz changed the title Measure and expose shipper performance metrics Measure and expose shipper queue performance metrics Oct 31, 2022
@cmacknz
Member Author

cmacknz commented Oct 31, 2022

I am moving the output metrics to a separate issue; let's focus on just the queue metrics in this issue.

@cmacknz
Member Author

cmacknz commented Oct 31, 2022

#153

@leehinman
Contributor

Tried to think of things that would be useful for the disk queue; here is what I came up with.

  • Size on disk (the queue already tracks this)
  • Number of events (code needs to be added to track this)
  • Rate of events added to the queue over the last 1, 5, and 15 minutes (see the sketch after this list)
  • Rate of events removed from the queue over the last 1, 5, and 15 minutes
  • Date of the oldest segment (once a segment is full, we shouldn't write to it again, so this should show us how the tail of the queue is doing and whether we have at least one stuck event)
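
For the two rate items above, one common approach is a load-average-style exponentially weighted moving average per window. A minimal sketch, with hypothetical names and an arbitrarily chosen 5-second tick interval:

```go
package queuemetrics

import (
	"math"
	"sync/atomic"
)

const tickSeconds = 5.0

// EWMA is an exponentially weighted moving average of an event rate,
// in the style of Unix load averages.
type EWMA struct {
	alpha   float64
	rate    float64 // smoothed events per second
	pending int64   // events observed since the last tick
	started bool
}

// NewEWMA returns a meter smoothed over the given window in minutes (1, 5, 15, ...).
func NewEWMA(minutes float64) *EWMA {
	return &EWMA{alpha: 1 - math.Exp(-tickSeconds/(minutes*60))}
}

// Observe records n events (e.g. on every enqueue or dequeue).
func (e *EWMA) Observe(n int64) { atomic.AddInt64(&e.pending, n) }

// Tick must be called every tickSeconds, e.g. from a time.Ticker.
func (e *EWMA) Tick() {
	count := atomic.SwapInt64(&e.pending, 0)
	instant := float64(count) / tickSeconds
	if !e.started {
		e.rate = instant
		e.started = true
		return
	}
	e.rate += e.alpha * (instant - e.rate)
}

// Rate returns the current smoothed rate in events per second.
func (e *EWMA) Rate() float64 { return e.rate }
```

Usage would be three meters per direction (1, 5, and 15 minutes), each fed by Observe on every enqueue or dequeue and ticked from a single time.Ticker.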

@fearful-symmetry
Contributor

Looks like it got auto-closed. Since this is a meta-issue, reopening for now...

@nimarezainia

  • Date of the oldest segment (once a segment is full, we shouldn't write to it again, so this should show us how the tail of the queue is doing and whether we have at least one stuck event)

@leehinman What happens when an event is "stuck"? Does it block?

@leehinman
Contributor

@leehinman What happens when an event is "stuck"? Does it block?

This only happens if you have enabled infinite retries. In that case, an event that can't be delivered will be retried over and over again, so it will take up space in the queue, but other events will still be delivered. If you get enough events that can't be delivered, the queue will fill up, which would block reading of new events. This is expected behavior with infinite retries.

@nimarezainia

@leehinman What happens when an event is "stuck"? Does it block?

This only happens if you have enabled infinite retries. In that case, an event that can't be delivered will be retried over and over again, so it will take up space in the queue, but other events will still be delivered. If you get enough events that can't be delivered, the queue will fill up, which would block reading of new events. This is expected behavior with infinite retries.

Not sure if this is an issue we need to deal with right now, but could we have a retry queue for events that have reached some retry threshold? Yank them out of the main queue and use a separate thread to retry them.
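
A rough sketch of what that separation could look like, purely illustrative (all names and the threshold are hypothetical, and this is not how the shipper currently handles retries):

```go
package retry

import "time"

// Event is a hypothetical queue event with a retry counter.
type Event struct {
	Data     []byte
	Attempts int
}

const maxInlineRetries = 3

// dispatch tries to publish an event; once it has failed too many times,
// it is handed off to a side channel so the main queue keeps draining.
func dispatch(ev Event, publish func(Event) error, retryCh chan<- Event) {
	if err := publish(ev); err != nil {
		ev.Attempts++
		if ev.Attempts >= maxInlineRetries {
			retryCh <- ev // move it out of the main path
			return
		}
		// Below the threshold, leave it to the normal retry handling.
	}
}

// retryWorker drains the side channel on its own schedule,
// retrying each event with a simple backoff until it succeeds.
func retryWorker(retryCh <-chan Event, publish func(Event) error) {
	for ev := range retryCh {
		for publish(ev) != nil {
			time.Sleep(time.Second) // back off between attempts
		}
	}
}
```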

@leehinman
Contributor

Not sure if this is an issue we need to deal with right now, but could we have a retry queue for events that have reached some retry threshold? Yank them out of the main queue and use a separate thread to retry them.

Lots of ways to address this. I'm thinking we wait a bit to see how the queues behave from a performance perspective, and then we can work on retries and failure states.

@cmacknz
Member Author

cmacknz commented Feb 23, 2023

Closing; creating a new issue with the metrics that will be in the redesigned health dashboard.

@cmacknz cmacknz closed this as completed Feb 23, 2023