Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swap events_processed_total for events in/out totals #6367

Closed
jszwedko opened this issue Feb 5, 2021 · 9 comments · Fixed by #7345
Closed

Swap events_processed_total for events in/out totals #6367

jszwedko opened this issue Feb 5, 2021 · 9 comments · Fixed by #7345
Assignees
Labels
domain: metrics Anything related to Vector's metrics events domain: observability Anything related to monitoring/observing Vector type: task Generic non-code related tasks

Comments

@jszwedko
Copy link
Member

jszwedko commented Feb 5, 2021

Extracted from comment: #6294 (review)

Currently we report processed_events_total in various places:

  • Some sources
  • Some sinks
  • All transforms as this is handled at the topology layer

However, there are a few issues with reporting a single metric for this:

  • sources don't really "process" events, they just emit them
  • It is difficult to know when a streaming transform has finished "processing" an event at the topology layer. All we really know is when it receives an event and when it emits an event. For the reduce transform, for example, when is an event processed? When it is received by the transform? Or when it has emitted a reduced event? For lua and wasm transforms, we truly never really know when an event has been "processed".
  • For sinks, it can be useful to know when it has received an event but before it has sent it as there is often a gap due to batching behavior.

For both of these reasons, I propose that instead of having a singular events_processed_total metric that each component reports, we instead report separate in and out metrics. These would be called:

  • events_in_total: counts the events that been accepted by the components. For transform and sink components, we could emit at the topology layer similar as you are doing now, for source components, I think this metric is less useful, but the sources themselves could emit as they parse events, but before they flush them downstream.
  • events_out_total: counts the events that have been emitted by the component, For the source and transform components, we could emit at the topology layer similar as you are doing now. For sink components, I think this metric is still useful given that sinks commonly do batching so it is useful to know the number of events that have actually been flushed to the external system. I think this would require instrumentation by each sink or possibly in the batching layer.

And that we remove the events_processed_total metric.

I think this would clearly convey that these are simply events going in and out of components rather than when an event has been "processed", which can be ambiguous. It would also make it easy to see when transforms either reduce events by combining them or create a number of events from a single event.

This has implications for vector top so I'm curious what @leebenson thinks. I think we could just have separate columns for in/out rather than one throughput column.

Ref: #5595

@leebenson
Copy link
Member

Copied from #6294 (comment)

Good thoughts @jszwedko 👍

In the context of the API.. would you imagine both in/out metrics being available?

Do you have an opinion on which would be the more compelling stat for top? (thinking, given the limited screen real estate, we might prefer one over the other.)

@jszwedko
Copy link
Member Author

jszwedko commented Feb 5, 2021

@leebenson thanks for taking a look! For the API, I would imagine making both available. For vector top people will probably be more interested in events going out of each component than those going in. I think that more closely matches what events_processed_total currently represents for components. It'd still be nice to be able to make them visible in vector top (IMO), even if the column is hidden by default.

@leebenson
Copy link
Member

Makes sense, thanks 👍

@binarylogic binarylogic added domain: observability Anything related to monitoring/observing Vector domain: metrics Anything related to Vector's metrics events labels Feb 5, 2021
@binarylogic binarylogic added this to the 2021-02-01 D-Fuel milestone Feb 5, 2021
@ktff
Copy link
Contributor

ktff commented Feb 5, 2021

Source events_out_total, transform events_in_total/events_out_total, and sink events_in_total, are pretty well defined so that leaves us with defining source events_in_total and sink events_out_total.

It would be great for sink events_out_total to count acked events in buffer layer since then, paired with sink events_in_total placed before the buffers and the sink, the difference of those two would give the size of the backlog of events for that sink. So a user would be able to see on a glance, and/or we can add the diff to UI/top and give some kind of warning when the backlog crosses certain threshold, which sinks/downstream services aren't able to keep up.

For source events_in_total, if we don't give it any meaning then we can just emit at or reuse source events_out_total for the time. That way all of the components will have both metrics.

To ease the the transition, we can:

@binarylogic
Copy link
Contributor

For additional context, events_processed_total - events_discarded_total = number of events output, but that doesn't make as much sense for transforms like reduce. I agree, events_in_total and events_out_total are clearer and more useful. They are also implemented at the topology level where they will always be accurate. I am willing to bet we don't emit events_discarded_total perfectly.

@jszwedko
Copy link
Member Author

jszwedko commented Feb 5, 2021

It would be great for sink events_out_total to count acked events in buffer layer since then, paired with sink events_in_total placed before the buffers and the sink, the difference of those two would give the size of the backlog of events for that sink. So a user would be able to see on a glance, and/or we can add the diff to UI/top and give some kind of warning when the backlog crosses certain threshold, which sinks/downstream services aren't able to keep up.

Agreed 👍

For source events_in_total, if we don't give it any meaning then we can just emit at or reuse source events_out_total for the time. That way all of the components will have both metrics.

Also agreed 😄

The plan you outlined sounds good.

@jszwedko
Copy link
Member Author

Just noting that we should probably delay this last step:

And finally remove processed_events_total all together.

For at least a release or two after we introduce events_in_total / events_out_total (deprecating it in the docs). We may even want to keep it around until 1.0 since it seems like the maintenance overhead is minimal (it should be an alias for events_out_total).

@leebenson
Copy link
Member

Should we update top to break out events in/out?

@ktff
Copy link
Contributor

ktff commented Apr 27, 2021

Should we update top to break out events in/out?

Yes, that's the next step. I'll open an separate issue for it.

We may even want to keep it around until 1.0

It will add some overhead too changing the api of graphql api, but I think not so much that we can't keep it until then.

(it should be an alias for events_out_total).

events_processed_total and events_out_total don't quite correspond one to one for some transforms, but it's a fair price for maintaining backward compatibility and removing it's performance impact. So I'll add it to todo list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: metrics Anything related to Vector's metrics events domain: observability Anything related to monitoring/observing Vector type: task Generic non-code related tasks
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants