Swap events_processed_total for events in/out totals #6367

jszwedko · 2021-02-05T14:33:22Z

Extracted from comment: #6294 (review)

Currently we report processed_events_total in various places:

Some sources
Some sinks
All transforms as this is handled at the topology layer

However, there are a few issues with reporting a single metric for this:

Sources don't really "process" events; they just emit them.
It is difficult to know when a streaming transform has finished "processing" an event at the topology layer. All we really know is when it receives an event and when it emits an event. For the reduce transform, for example, when is an event processed? When it is received by the transform? Or when it has emitted a reduced event? For lua and wasm transforms, we never really know when an event has been "processed".
For sinks, it can be useful to know when a sink has received an event but has not yet sent it, as there is often a gap due to batching behavior.

For these reasons, I propose that instead of having a singular events_processed_total metric that each component reports, we report separate in and out metrics. These would be called:

events_in_total: counts the events that have been accepted by the component. For transform and sink components, we could emit this at the topology layer, similar to what you are doing now. For source components, I think this metric is less useful, but the sources themselves could emit it as they parse events, before they flush them downstream.
events_out_total: counts the events that have been emitted by the component. For source and transform components, we could emit this at the topology layer, similar to what you are doing now. For sink components, I think this metric is still useful: sinks commonly batch, so it helps to know the number of events that have actually been flushed to the external system. I think this would require instrumentation by each sink, or possibly in the batching layer.

And that we remove the events_processed_total metric.

I think this would clearly convey that these are simply events going in and out of components rather than when an event has been "processed", which can be ambiguous. It would also make it easy to see when transforms either reduce events by combining them or create a number of events from a single event.
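As a rough illustration of the proposal (not Vector's actual implementation, which would live at the topology layer), the sketch below wraps a toy transform and counts events on the way in and on the way out. The names `CountingTransform` and `make_reduce_pairs`, and the use of a plain `Counter` as the metrics store, are assumptions made for this example only.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical transform signature: takes one event, returns zero or more events.
Transform = Callable[[dict], List[dict]]


class CountingTransform:
    """Wraps a transform and tracks events_in_total / events_out_total."""

    def __init__(self, transform: Transform) -> None:
        self.transform = transform
        self.metrics = Counter()

    def push(self, event: dict) -> List[dict]:
        # Count the event as soon as the component accepts it.
        self.metrics["events_in_total"] += 1
        emitted = self.transform(event)
        # Count only what the component actually emits downstream.
        self.metrics["events_out_total"] += len(emitted)
        return emitted


def make_reduce_pairs() -> Transform:
    """Toy reduce-style transform: merges every two events into one."""
    pending: List[dict] = []

    def reduce_pairs(event: dict) -> List[dict]:
        pending.append(event)
        if len(pending) < 2:
            return []  # still buffering, nothing emitted yet
        merged = {"messages": [e["message"] for e in pending]}
        pending.clear()
        return [merged]

    return reduce_pairs


component = CountingTransform(make_reduce_pairs())
for i in range(4):
    component.push({"message": f"event {i}"})

print(dict(component.metrics))  # {'events_in_total': 4, 'events_out_total': 2}
```

With separate counters, the 4-in/2-out ratio of the reducing transform is visible directly, with no need to decide when an event counts as "processed".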

This has implications for vector top so I'm curious what @leebenson thinks. I think we could just have separate columns for in/out rather than one throughput column.

Ref: #5595


leebenson · 2021-02-05T15:13:07Z

Copied from #6294 (comment)

Good thoughts @jszwedko 👍

In the context of the API.. would you imagine both in/out metrics being available?

Do you have an opinion on which would be the more compelling stat for top? (thinking, given the limited screen real estate, we might prefer one over the other.)

jszwedko · 2021-02-05T15:21:45Z

@leebenson thanks for taking a look! For the API, I would imagine making both available. For vector top people will probably be more interested in events going out of each component than those going in. I think that more closely matches what events_processed_total currently represents for components. It'd still be nice to be able to make them visible in vector top (IMO), even if the column is hidden by default.

leebenson · 2021-02-05T15:34:12Z

Makes sense, thanks 👍

ktff · 2021-02-05T17:15:19Z

Source events_out_total, transform events_in_total/events_out_total, and sink events_in_total, are pretty well defined so that leaves us with defining source events_in_total and sink events_out_total.

It would be great for sink events_out_total to count acked events in the buffer layer. Paired with sink events_in_total placed before the buffers and the sink, the difference of the two would give the size of the backlog of events for that sink. A user would then be able to see at a glance which sinks/downstream services aren't able to keep up, and/or we could add the diff to the UI/top and give some kind of warning when the backlog crosses a certain threshold.
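A minimal sketch of the backlog idea, assuming events_in_total is counted before the buffer and events_out_total is counted on ack. `SinkCounters`, `backlog_warning`, the sink name, and the threshold values are all hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SinkCounters:
    """Hypothetical per-sink counters; not Vector's real data structure."""

    events_in_total: int = 0   # counted before the buffer and the sink
    events_out_total: int = 0  # counted when the buffer layer sees an ack

    def backlog(self) -> int:
        # Events accepted but not yet acknowledged by the downstream service.
        return self.events_in_total - self.events_out_total


def backlog_warning(name: str, counters: SinkCounters, threshold: int) -> Optional[str]:
    """Return a warning string when the backlog exceeds the threshold."""
    backlog = counters.backlog()
    if backlog > threshold:
        return f"sink {name!r} is falling behind: backlog of {backlog} events"
    return None


counters = SinkCounters()
counters.events_in_total += 1_000  # events received by the sink
counters.events_out_total += 750   # events acked after a batch flush

print(counters.backlog())  # 250
print(backlog_warning("my_sink", counters, threshold=100))
print(backlog_warning("my_sink", counters, threshold=500))  # None
```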

For source events_in_total, if we don't give it any meaning then we can just emit it at the same point as, or reuse, source events_out_total for the time being. That way all of the components will have both metrics.

To ease the transition, we can:

Add events_in_total and events_out_total. (enhancement(observability): Add events_in_total & events_out_total metrics #6433)
Then replace the real emits of processed_events_total by emitting it side by side with source events_out_total, transform events_in_total, and sink events_out_total (adding an alias for those three would also do the trick). EDIT: we can skip this.
Expose the new metrics through the API. (enhancement(graphql api): Expose events_in & events_out metrics #6888)
Then update top/UI and other things. (Delegated to Update top with events_in_total and events_out_total metrics #7257)
Alias processed_events_total with events_out_total. (chore(metrics): Remove emits & alias processed_events_total metric #7345)
And finally remove processed_events_total altogether. (Delegated to Remove processed_events_total metric #7346)

binarylogic · 2021-02-05T17:52:34Z

For additional context, events_processed_total - events_discarded_total = number of events output, but that doesn't make as much sense for transforms like reduce. I agree, events_in_total and events_out_total are clearer and more useful. They are also implemented at the topology level where they will always be accurate. I am willing to bet we don't emit events_discarded_total perfectly.

jszwedko · 2021-02-05T19:11:24Z

It would be great for sink events_out_total to count acked events in the buffer layer. Paired with sink events_in_total placed before the buffers and the sink, the difference of the two would give the size of the backlog of events for that sink. A user would then be able to see at a glance which sinks/downstream services aren't able to keep up, and/or we could add the diff to the UI/top and give some kind of warning when the backlog crosses a certain threshold.

Agreed 👍

For source events_in_total, if we don't give it any meaning then we can just emit it at the same point as, or reuse, source events_out_total for the time being. That way all of the components will have both metrics.

Also agreed 😄

The plan you outlined sounds good.

jszwedko · 2021-04-25T15:53:17Z

Just noting that we should probably delay this last step:

And finally remove processed_events_total altogether.

For at least a release or two after we introduce events_in_total / events_out_total (deprecating it in the docs). We may even want to keep it around until 1.0 since it seems like the maintenance overhead is minimal (it should be an alias for events_out_total).
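To illustrate why the maintenance overhead of an alias can stay small, here is a sketch under the assumption that metrics are exposed from a simple name-to-value mapping. `DEPRECATED_ALIASES` and `with_aliases` are hypothetical names for this example and do not describe how Vector implements the alias.

```python
from typing import Dict

# Hypothetical alias table: deprecated metric name -> metric it mirrors.
DEPRECATED_ALIASES: Dict[str, str] = {
    "processed_events_total": "events_out_total",
}


def with_aliases(metrics: Dict[str, int]) -> Dict[str, int]:
    """Return a copy of the metrics with deprecated names filled in from their targets."""
    exposed = dict(metrics)
    for old_name, new_name in DEPRECATED_ALIASES.items():
        if new_name in metrics:
            # The deprecated metric is never emitted on its own; it only mirrors the new one.
            exposed[old_name] = metrics[new_name]
    return exposed


snapshot = {"events_in_total": 4, "events_out_total": 2}
print(with_aliases(snapshot))
# {'events_in_total': 4, 'events_out_total': 2, 'processed_events_total': 2}
```

Removing the deprecated metric later would then amount to deleting one entry from the alias table.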

leebenson · 2021-04-25T16:32:42Z

Should we update top to break out events in/out?

ktff · 2021-04-27T11:30:00Z

Should we update top to break out events in/out?

Yes, that's the next step. I'll open a separate issue for it.

We may even want to keep it around until 1.0

It will also add some overhead when changing the GraphQL API, but I think not so much that we can't keep it until then.

(it should be an alias for events_out_total).

events_processed_total and events_out_total don't quite correspond one to one for some transforms, but it's a fair price for maintaining backward compatibility and removing its performance impact. So I'll add it to the to-do list.

jszwedko added the type: task label Feb 5, 2021

jszwedko mentioned this issue Feb 5, 2021

fix: Emit processed_events_total after transform has processed event #6294

Merged

ktff mentioned this issue Feb 5, 2021

Emit outputted_events_total for transforms #5595

Closed

binarylogic added the domain: observability and domain: metrics labels Feb 5, 2021

binarylogic assigned ktff Feb 5, 2021

binarylogic added this to the 2021-02-01 D-Fuel milestone Feb 5, 2021

ktff mentioned this issue Feb 12, 2021

enhancement(observability): Add events_in_total & events_out_total metrics #6433

Merged


ktff modified the milestones: 2021-02-01 D-Fuel, 2021-02-15 Scythe of Elune Feb 14, 2021

ktff mentioned this issue Mar 13, 2021

enhancement(observability): Add events_in_total to sources #6758

Merged

jszwedko mentioned this issue Mar 17, 2021

enhancement(prometheus_exporter sink): Add internal events #6790

Closed

ktff mentioned this issue Mar 25, 2021

enhancement(graphql api): Expose events_in & events_out metrics #6888

Merged

ktff mentioned this issue Apr 27, 2021

Update top with events_in_total and events_out_total metrics #7257

Closed

This was referenced May 5, 2021

chore(metrics): Remove emits & alias processed_events_total metric #7345

Merged

Remove processed_events_total metric #7346

Closed

ktff closed this as completed in #7345 May 7, 2021
