
EMF: Store the initial value for cumulative metrics #3425

Merged: 1 commit into open-telemetry:main on Jun 8, 2021

Conversation

@bjrara (Contributor) commented May 18, 2021

Why do we need it?
This is a follow-up PR of open-telemetry/opentelemetry-collector#3047.

In this PR, the EMF exporter changes its strategy for handling cumulative metric values such as counters and Prometheus summaries: it stores the first scraped record of each series in a cache as the initial value and skips sending it to the backend. As a result, the delta calculated from the second scrape becomes the first value that the EMF exporter exports.

For example, if counter metrics are received in the following order:

1st batch: request_total {code=400, value=100}
2nd batch: request_total {code=400, value=101}
3rd batch: request_total {code=400, value=102}

The metrics will be sent by EMF as:

1st batch: SKIP (stored as the initial value)
2nd batch: request_total {code=400, value=1}
3rd batch: request_total {code=400, value=1}
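
For illustration, here is a minimal sketch of this store-and-skip delta calculation. The types and names below are hypothetical and not the exporter's actual implementation; a real version would key on the full label set.

```go
package main

import "fmt"

// metricKey identifies one time series; a real implementation would key on
// the full label set, not a single label.
type metricKey struct {
	name string
	code string
}

// deltaCalculator caches the last cumulative value seen for each series.
type deltaCalculator struct {
	prev map[metricKey]float64
}

func newDeltaCalculator() *deltaCalculator {
	return &deltaCalculator{prev: map[metricKey]float64{}}
}

// next stores the current value and returns the delta against the previous
// one. For the very first observation there is no baseline, so ok is false
// and the caller is expected to skip exporting it.
func (d *deltaCalculator) next(k metricKey, value float64) (delta float64, ok bool) {
	last, seen := d.prev[k]
	d.prev[k] = value
	if !seen {
		return 0, false
	}
	return value - last, true
}

func main() {
	calc := newDeltaCalculator()
	k := metricKey{name: "request_total", code: "400"}
	for _, v := range []float64{100, 101, 102} {
		if delta, ok := calc.next(k, v); ok {
			fmt.Printf("send %s{code=%s} value=%v\n", k.name, k.code, delta)
		} else {
			fmt.Println("skip: initial value cached")
		}
	}
	// Output:
	// skip: initial value cached
	// send request_total{code=400} value=1
	// send request_total{code=400} value=1
}
```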

@bjrara bjrara requested a review from anuraaga as a code owner May 18, 2021 18:28
@bjrara bjrara requested a review from a team May 18, 2021 18:28
@Aneurysm9 (Member) commented:

I'm having trouble understanding why this is necessary. If the exporter needs to translate cumulative metrics to deltas, shouldn't the initial sample still be present? I'd expect the delta output stream to look something like this:

1st batch: request_total {code=400, value=100}
2nd batch: request_total {code=400, value=1}
3rd batch: request_total {code=400, value=2}

The alternative means that I can never know the actual magnitude of the tracked value since the baseline is missing.

Can some more prose be added to the README explaining what this adjustment process is doing and why?

@bjrara (Contributor, Author) commented May 18, 2021

> I'm having trouble understanding why this is necessary. If the exporter needs to translate cumulative metrics to deltas, shouldn't the initial sample still be present? I'd expect the delta output stream to look something like this:
>
> 1st batch: request_total {code=400, value=100}
> 2nd batch: request_total {code=400, value=1}
> 3rd batch: request_total {code=400, value=2}
>
> The alternative means that I can never know the actual magnitude of the tracked value since the baseline is missing.
>
> Can some more prose be added to the README explaining what this adjustment process is doing and why?

Just realized the example in the description was wrong; I've corrected the adjusted values in the EMF part. Sorry about the confusion.

Not sure if the issue becomes clearer now that the example is corrected. When handling cumulative metrics, the EMF exporter sends deltas (between the current value and its previous value) to the CloudWatch backend.
If we scrape metrics every minute, each delta is the amount the value increased during that minute.

Take the case above as an example: since request_total increases by 1 every minute, the deltas in the 2nd and 3rd minutes should each be 1.

@bjrara (Contributor, Author) commented May 18, 2021

> I'm having trouble understanding why this is necessary. If the exporter needs to translate cumulative metrics to deltas, shouldn't the initial sample still be present? I'd expect the delta output stream to look something like this:
>
> 1st batch: request_total {code=400, value=100}
> 2nd batch: request_total {code=400, value=1}
> 3rd batch: request_total {code=400, value=2}
>
> The alternative means that I can never know the actual magnitude of the tracked value since the baseline is missing.
>
> Can some more prose be added to the README explaining what this adjustment process is doing and why?

IMHO, since we export deltas computed from the two most recent records, we don't need the baseline; we only need to know the previous metric value. The delta concept is similar to rate() in Prometheus, but instead of the user defining [5m] in the expression, the time window is determined by the scrape interval.
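
To illustrate how the per-scrape delta relates to a Prometheus-style rate, here is a small sketch; the 60-second scrape interval is an assumption for the example, not something this PR configures.

```go
package main

import "fmt"

func main() {
	// Two consecutive scrapes of a cumulative counter. The 60-second
	// scrape interval is an assumption for this illustration.
	previous, current := 101.0, 102.0
	intervalSeconds := 60.0

	delta := current - previous // what the EMF exporter sends for this interval
	ratePerSecond := delta / intervalSeconds

	fmt.Println("delta:", delta)                   // 1
	fmt.Println("per-second rate:", ratePerSecond) // ~0.0167, roughly what rate(request_total[1m]) would report
}
```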

Previously, the Prometheus receiver sent the delta between the current metric value and the initial one, so the EMF exporter could take it for granted that the first record it received was already delta-adjusted (against its previous value, which was also the initial value). For instance:

Metrics received from the Prometheus receiver were:

1st batch: SKIP (EMF wouldn't receive it)
2nd batch: request_total {code=400, value=1}
3rd batch: request_total {code=400, value=2}

And EMF exporter would send them as:

1st batch: NOT RECEIVED
2nd batch: request_total {code=400, value=1}
3rd batch: request_total {code=400, value=1}

Now that the logic in the Prometheus receiver has changed, caching and skipping the initial values needs to be handled in the EMF exporter instead.

Metrics received from the Prometheus receiver are:

1st batch: request_total {code=400, value=100}
2nd batch: request_total {code=400, value=101}
3rd batch: request_total {code=400, value=102}

And EMF exporter will send them as:

1st batch: SAVE as initial and SKIP
2nd batch: request_total {code=400, value=1}
3rd batch: request_total {code=400, value=1}
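
A rough sketch of where this cache-and-skip step could sit in the exporter's datapoint translation, continuing the hypothetical deltaCalculator and metricKey types from the earlier sketch; the actual code in this PR may be organized differently.

```go
// dataPoint is a hypothetical, simplified cumulative datapoint.
type dataPoint struct {
	key   metricKey
	value float64
}

// adjustCumulative rewrites cumulative datapoints to deltas in place and
// drops the ones for which no baseline exists yet (the initial values).
func adjustCumulative(calc *deltaCalculator, points []dataPoint) []dataPoint {
	kept := points[:0]
	for _, p := range points {
		delta, ok := calc.next(p.key, p.value)
		if !ok {
			// First observation of this series: it was cached as the
			// initial value, so skip exporting it.
			continue
		}
		p.value = delta
		kept = append(kept, p)
	}
	return kept
}
```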

@bjrara force-pushed the delta branch 5 times, most recently from 93be710 to 4ff1746 on May 19, 2021 20:12
@mxiamxia (Member) left a comment:

LGTM

@bjrara force-pushed the delta branch 6 times, most recently from f49fa9e to df6ad6b on May 28, 2021 22:15
@bogdandrutu (Member) commented:

Please rebase to fix conflicts

@bjrara (Contributor, Author) commented Jun 3, 2021

> Please rebase to fix conflicts

Done. Thx!

@bogdandrutu bogdandrutu merged commit e6e5817 into open-telemetry:main Jun 8, 2021
alexperez52 referenced this pull request in open-o11y/opentelemetry-collector-contrib Aug 18, 2021
@mircohacker (Contributor) commented:

Hey @bjrara @Aneurysm9 @mxiamxia, this behaviour does not work in my use case. Is there a way to disable it, or a workaround?
