Proposal: Metric Aggregation Processor #4968

Open
huyan0 opened this issue Jul 22, 2020 · 38 comments
Labels
Accepted Component (new component has been sponsored), never stale (issues with this label are never marked stale)

Comments

@huyan0 (Member)

huyan0 commented Jul 22, 2020

Metric Aggregation Processor Proposal

Objectives

The objective of the metric aggregation processor is to provide OpenTelemetry (OT) Collector users with the ability to use exporters to send cumulative metrics to backend systems such as Prometheus.

Background

Currently, the Prometheus exporter for the OT Collector is not functioning as expected. This is because the OTLP exporter in the SDK is pass-through and the Collector currently has no aggregation functionality, so the Collector receives delta aggregations and instantaneous metric events and exports them directly without converting them to cumulative values (#1255). In the Go SDK, aggregation of delta values is performed by the Accumulator and the Integrator (SDK Processor), but there is no similar component in the Collector. Any cumulative exporter implemented in the future will encounter the same problem (a proposed exporter: #1150). A processor that maintains the state of metrics (time series) and applies incoming deltas to cumulative values solves this problem.

Requirements

The processor should convert instantaneous and delta OTLP metrics to cumulative OTLP metrics. The following table proposes a mapping from instantaneous and delta metric kinds to cumulative metric kinds, based on OTel-Proto PR open-telemetry/opentelemetry-collector#168. In the table, red entries are invalid type and kind combinations, white entries correspond to OTLP metrics that are already cumulative (or gauge) and should pass through the processor, and green entries correspond to OTLP metrics that the processor should aggregate.

[Image: table mapping instantaneous and delta metric kinds to cumulative metric kinds]
*Grouping scalar indicates the most recent value to occur in a collection interval.
**Discussions around PR open-telemetry/opentelemetry-collector#168 are still ongoing. This table reflects the current status and will need to be updated when that status changes.

For green entries, the processor should maintain the cumulative value as state. Metric state should be identified by an OTLP metric descriptor and a label set. For each metric of a stateful kind, the processor should aggregate cumulative values for every data point by labels.

Design Idea

The metric aggregation processor should be similar to the SDK Controller; it should contain:

  • Accumulator, which is the entry point for OTLP metric events and manages the incremental changes of each metric;
  • Integrator, which takes checkpoints of the incremental changes from the Accumulator and applies them to cumulative values;
  • Ticker, which initiates the collection process and sends metrics to the next metric consumer; and
  • Custom Aggregators, which maintain the state of metrics; a unique metric is defined by the combination of a metric name, a kind, a type, and a label set.

The metric aggregation processor should accept concurrent updates of the metric state by taking in OTLP metrics passed from the pipeline, and perform collection periodically to send cumulative metrics to exporters, acting as an SDK push Controller. The following diagram illustrates the workflow.

[Image: workflow diagram of the proposed metric aggregation processor]
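
To make the intended placement concrete, here is a hypothetical pipeline configuration. The metricaggregation processor name and its collection_interval setting are placeholders for the proposed component, not an existing processor; the receiver and exporter are just examples:

receivers:
  otlp:
    protocols:
      grpc:

processors:
  # Placeholder name and setting for the proposed component.
  metricaggregation:
    # Ticker period: how often accumulated cumulative values are collected
    # and pushed to the next consumer.
    collection_interval: 60s

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [metricaggregation]
      exporters: [prometheus]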

cc @alolita @danielbang907 @huyan0 @jmacd

@tigrannajaryan (Member)

We have the metricstransform processor, which does some aggregation. Moving this to after GA in case there is more work we need to do in this area.

@nabam (Contributor)

nabam commented Oct 23, 2020

I'm happy to help with implementing that functionality. Is someone already working on it? Should such a processor belong in the core repository or the contrib repository?
CC: @huyan0 @tigrannajaryan

@alolita (Member)

alolita commented Oct 24, 2020

Hi @nabam! We (@alolita @amanbrar1999 @JasonXZLiu) are working on this. You're welcome to code review and provide feedback when we submit a PR.

@nabam (Contributor)

nabam commented Oct 27, 2020

@alolita that's great! It's going to be very useful for streaming telemetry from AWS Lambdas. Do you have any time estimates for that? I want to understand whether we have to build a temporary custom solution while you are working on it.

@amanbrar1999 (Contributor)

amanbrar1999 commented Nov 2, 2020

I believe that the use cases for this issue have since been solved by changes to the metrics proto, such as the addition of cumulative aggregation temporality. I could be wrong; does anyone know of any specific use cases for this processor?
@tigrannajaryan @bogdandrutu @nabam

For example, I believe this is a use case that has since been resolved: open-telemetry/opentelemetry-collector#1541

In the last comment of that linked issue it is mentioned that temporality translation exists as of OTLP v0.5, which I believe is what this processor was intended for.

@nabam (Contributor)

nabam commented Nov 15, 2020

@amanbrar1999 one of the use cases would be collecting metrics from short-lived jobs such as AWS Lambdas. Exporting uniquely tagged cumulative metrics for every Lambda instance would cause high cardinality in the time series they produce.

@amitsaha

amitsaha commented Jan 7, 2021

Having something like this will make open-telemetry/opentelemetry-python#93 possible. Without it, Python WSGI applications using gunicorn, uWSGI, and the like cannot use the Prometheus exporter directly. In fact, the lack of a statsd exporter (open-telemetry/opentelemetry-specification#374) makes Prometheus unusable with OpenTelemetry for this category of applications.

@bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector Aug 30, 2021
@alolita added the comp:prometheus and enhancement labels Sep 2, 2021
@alolita self-assigned this Sep 24, 2021
@pragmaticivan

pragmaticivan commented Sep 28, 2021

Howdy! @huyan0 @alolita, was there any progress towards implementing that proposal?

I'm currently evaluating a standalone (multi-workload) OTel Collector for single-page app / client / StatsD metric collection, but found the path forward was not very clear.

I have taken a look at a few options and am currently investigating the approach in open-telemetry/opentelemetry-js#2118, although that will be affected by cardinality.

@andyzhangdialpad

andyzhangdialpad commented Mar 10, 2022

Thanks for this amazing proposal! @alolita, was there any progress on the implementation? We are considering contributing to this; we don't want to reinvent the wheel, of course.

@chenluo1234

@huyan0 @alolita @amanbrar1999 @JasonXZLiu Any progress on this issue?

I believe that the use cases for this issue have since been solved by changes to the metrics proto, such as the addition of cumulative aggregation temporality. I could be wrong, does anyone know any specific use cases for this processor?

One use case we have is using the statsd receiver and the prometheusremotewrite exporter together. Given that the statsd receiver produces delta metrics, we need this processor to convert them into cumulative metrics before handing them to the prometheusremotewrite exporter, which only accepts cumulative metrics.
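
For illustration, a minimal sketch of that pipeline, assuming a delta-to-cumulative aggregation processor of the kind proposed in this issue; the deltatocumulative name and the endpoints are placeholders:

receivers:
  statsd:
    endpoint: "0.0.0.0:8125"

processors:
  # Illustrative name for the proposed delta-to-cumulative aggregation processor.
  deltatocumulative:

exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus.example.com/api/v1/write"

service:
  pipelines:
    metrics:
      receivers: [statsd]
      processors: [deltatocumulative]
      exporters: [prometheusremotewrite]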

hex1848 pushed a commit to hex1848/opentelemetry-collector-contrib that referenced this issue Jun 2, 2022
@danielgblanco

danielgblanco commented Oct 5, 2022

Looking through some of the Metrics Data Model use cases in the specification, I understand that any mention of "Collector re-aggregates [e.g. 10s OTLP delta temporality exports] into 60 second resolution" would require a processor like the one in this proposal that can handle state, or a non-OTLP exporter that can aggregate delta sums into cumulative temporality as the prometheusexporter does (although support for delta histograms is pending on #9006). As far as I can tell, this wouldn't be possible to achieve in a single pipeline with an OTLP receiver and exporter.

We (Skyscanner) are currently PoC'ing a solution to generate http.*.duration and rpc.*.duration metrics from Envoy spans, to collect Istio metrics without Mixer (or Prometheus), which we would like to export as OTLP histograms with delta temporality. We use the spanmetricsprocessor for this, and to avoid exporting a huge number of data points (as the processor is stateless) we currently use cumulative temporality with a prometheusexporter to aggregate histograms, plus a separate pipeline that scrapes that endpoint, uses the cumulativetodeltaprocessor to turn them into delta histograms, and then exports them via OTLP.
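
For reference, a rough sketch of the scrape-and-convert half of that workaround (the spanmetrics/prometheusexporter side is omitted; endpoints, ports, and scrape settings are placeholders):

receivers:
  # Scrapes the cumulative histograms exposed by the prometheus exporter
  # that the spanmetrics pipeline writes to (not shown here).
  prometheus:
    config:
      scrape_configs:
        - job_name: 'spanmetrics'
          scrape_interval: 60s
          static_configs:
            - targets: ['0.0.0.0:8889']

processors:
  # Converts the scraped cumulative metrics back to delta temporality.
  cumulativetodelta:

exporters:
  otlp:
    endpoint: "backend.example.com:4317"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [cumulativetodelta]
      exporters: [otlp]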

Having something like this would make our pipelines much simpler. Is somebody already working on this? I've also read there may be plans for the transformprocessor to handle aggregations? Any comments/insight would be much appreciated. Thanks!

@mplachter

A little bump for an update here.

Something like this would reduce the complexity of the OTel config when using the Collector for cron jobs and Lambda functions.
Yes, this could be done with metricstransform by aggregating based on a label, but that requires extra implementation on the dev side or a more complex OTel config.
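
For instance, a minimal metricstransform sketch that sums a delta counter across short-lived instances by keeping only the labels you care about; the metric name and label are examples only:

processors:
  metricstransform:
    transforms:
      - include: my_cron_job_runs_total   # example metric name
        action: update
        operations:
          - action: aggregate_labels
            # Keep only these data point labels; series differing in any other
            # label (e.g. a per-run instance id) are summed together.
            label_set: [job.name]
            aggregation_type: sum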

@danielgblanco

The transformprocessor cannot aggregate metrics over consistent intervals in its current form, though, as it is stateless. It can only aggregate metrics handled as part of a single batch, not across batches (i.e. exports from different clients).

@mplachter

@alolita @amanbrar1999 @JasonXZLiu any update on this?

@ishworg

ishworg commented Jan 24, 2023

Bump. Folks, is there any guidance and/or timeline on this proposal, whether accepted or rejected? It has been more than two years now. Kindly advise what operators should expect in the future.

@AlexeiZenin

+1

@MovieStoreGuy (Contributor)

I can't speak for the maintainers; however, I myself see the benefit of a dedicated processor to perform aggregations, as a means of sparing other processors from having to reimplement this. I don't think this is strictly related to Prometheus, but I can see it would benefit greatly from it.

@MovieStoreGuy added the Sponsor Needed label and removed the enhancement and comp:prometheus labels Jan 26, 2023
@rbizos

rbizos commented Jan 30, 2023

This feature would be highly beneficial to my team as well; we would be happy to give a hand with the implementation.

@bitomaxsp

@kovrus Hi Ruslan. What does it mean that you are sponsoring? Are you doing all the implementation work, or are you responsible for finding someone who will work on it?

@kovrus (Member)

kovrus commented May 2, 2023

Hi @bitomaxsp, please take a look at the first point of Adding new components. It provides a good explanation of sponsorship.

@MovieStoreGuy (Contributor)

Any way that we can help get this started?

It was brought up in the APAC end user group as something that was needed.

@devurandom

Would the proposed solution to this issue allow aggregation of metrics data points in user-defined intervals?

We gather metrics using the OpenTelemetry Java Agent, batch them using the OpenTelemetry Collector, and export them to Grafana Cloud. Grafana Cloud plans have a maximum included "data points per minute" (DPM, cf. https://grafana.com/docs/grafana-cloud/billing-and-usage/active-series-and-dpm/#data-points-per-minute-dpm), which we would like not to exceed. If the OpenTelemetry Collector supported aggregating each time series in intervals of 60s, that would be a great help.

Related: open-telemetry/opentelemetry-java#5393
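
For what it's worth, a sketch in the spirit of the processor chain shared later in this thread, assuming delta temporality and that collapsing data point labels is acceptable (series identity then comes from the resource attributes preserved by groupbyattrs); the 60s interval and component names are illustrative:

receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:
    timeout: 60s
  groupbyattrs:
  transform/truncate_time:
    metric_statements:
      - context: datapoint
        statements:
          # Align data point timestamps to 60-second boundaries so points
          # from the same window can be merged.
          - set(time, TruncateTime(time, Duration("60s")))
  metricstransform/aggregate:
    transforms:
      - include: ".*"
        match_type: regexp
        action: update
        operations:
          - action: aggregate_labels
            aggregation_type: sum
            label_set: []   # drop data point labels and sum what remains

exporters:
  otlphttp:
    endpoint: "https://otlp-gateway.example.com/otlp"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, groupbyattrs, transform/truncate_time, metricstransform/aggregate]
      exporters: [otlphttp]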

@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions bot added the Stale label Aug 28, 2023
@bitomaxsp

There is also a bigger issue when it comes to accumulating intervals from "lambdas". That, I think, is not covered by the OTel SDK at all, IIRC. It would be nice to come up with some guidance on how to deal with that and include it in this processor as well.

@github-actions bot removed the Stale label Aug 29, 2023
@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions bot added the Stale label Oct 30, 2023
@gouthamve added the never stale label and removed the Stale label Nov 7, 2023
djaglowski pushed a commit that referenced this issue Jan 8, 2024
…#23790)

**Description:**
This continues the work done in the now-closed [PR](#20530).
I have addressed issues raised in the original PR by
- Adding logic to handle timestamp misalignments
- Adding a fix for an out-of-bounds bug

In addition, I have performed end-to-end testing in a local setup and confirmed that accumulated histogram time series are correct.

**Link to tracking Issue:**
#4968
#9006
#19153

**Testing:**
Added tests for timestamp misalignment and the out-of-bounds bug discovered in the previous PR.
End-to-end testing to ensure histogram bucket counts exported to Prometheus are correct.

---------

Signed-off-by: Loc Mai <locmai0201@gmail.com>
Signed-off-by: xchen <xchen@axon.com>
Signed-off-by: stephenchen <x.chen1016@gmail.com>
Co-authored-by: Lev Popov <nabam@nabam.net>
Co-authored-by: Lev Popov <leo@nabam.net>
Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
Co-authored-by: Loc Mai <locmai0201@gmail.com>
Co-authored-by: Alex Boten <aboten@lightstep.com>
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this issue Jan 10, 2024
@yuri-rs (Contributor)

yuri-rs commented Mar 16, 2024

I've achieved my goal of Metric Aggregation by chaining several processors in the following order:
batch -> groupbyattrs -> transform/TruncateTime -> metricstransform/aggregate_labels

Although the process might seem extensive, it effectively gets the job done.
This approach could be useful for anyone tackling a similar challenge.

@axaxs

axaxs commented Mar 27, 2024

I've achieved my goal of Metric Aggregation by chaining several processors in the following order: batch -> groupbyattrs -> transform/TruncateTime -> metricstransform/aggregate_labels

Although the process might seem extensive, it effectively gets the job done. This approach could be useful for anyone tackling a similar challenge.

Any chance you could expand on your settings? Did you use delta or cumulative counters? Our use case is multiple short-lived processes, all writing the same metrics and attributes but with different counts. Our expectation is to have OTel treat these as cumulative and be able to export to Prometheus, but we've hit a bit of a block, as others have stated.

@diranged

diranged commented Jun 5, 2024

Agreed - @yuri-rs would you mind posting your config?

@yuri-rs (Contributor)

yuri-rs commented Jun 5, 2024

Agreed - @yuri-rs would you mind posting your config?

Sure @diranged @axaxs. I did collector-side aggregation for delta metrics.
The example below aggregates the count connector counters (deltas) into 5s buckets (from 1s individual measurements) and aggregates multiple series into one, keeping only the service.name label:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 1s
          static_configs:
            - targets: ['0.0.0.0:8888']

exporters:
  logging:
    verbosity: detailed

processors:
  groupbyattrs:
  batch:
    timeout: 5s
  resource/personal_metrics:
    attributes:
      - action: upsert
        value: "local"
        key: service.name
  transform/count_metric:
    metric_statements:
      - context: resource
        statements:
          - keep_keys(attributes, ["service.name"])
  metricstransform/count_metric:
    transforms:
      - include: "otlp.*"
        match_type: regexp
        action: update
        operations:
          - action: aggregate_labels
            aggregation_type: sum
            label_set: []
  transform/TruncateTime:
    metric_statements:
      - context: datapoint
        statements:
          - set(time, TruncateTime(time, Duration("5s")))

connectors:
  count:
    metrics:
      otlp.collector.metric.count:
    datapoints:
      otlp.collector.metric.data_point.count:

service:
  pipelines:
    metrics/internal:
      receivers: [prometheus]
      processors: [resource/personal_metrics]
      exporters: [count]
    metrics/count:
      receivers: [count]
      processors: [transform/count_metric, batch, groupbyattrs, transform/TruncateTime, metricstransform/count_metric]
      exporters: [logging]

  telemetry:
    logs:
      level: debug

@robincw-gr

Hey @yuri-rs, thanks for sharing, that is great. What kind of scale have you tested this at, e.g. active series cardinality and rate of data, and what were the memory/CPU requirements like?

@azunna1

azunna1 commented Oct 24, 2024

@yuri-rs Ideally you should place your batch processor last in your pipelines. Any reason for the current placement?

@yuri-rs (Contributor)

yuri-rs commented Oct 25, 2024

@robincw-gr I'm running a similar config on a 100+ pod (2 CPU, 2 GB each) k8s deployment that provides an endpoint for all OTel telemetry.
I use the count connector to monitor the number of logs/metrics/traces per service.name, so I keep very few tags (essentially just service.name); cardinality for the count metric is therefore not that high, and it aggregates well because of that.
Ingestion rates for the deployment are: 200k metrics, 70k logs, 100k traces, all per second.
The scale doesn't matter much, since the change applies per pod/config.

@yuri-rs (Contributor)

yuri-rs commented Oct 25, 2024

@azunna1
Please note that groupbyattrs and metricstransform -> aggregate_labels can only operate on the telemetry inside a given batch, so I have to batch the telemetry before these transformers.
Of course, you can add one more batch processor at the end of the chain for more efficiency, but without the batch in its current place the whole "aggregation" config would not work.

@azunna1

azunna1 commented Oct 25, 2024

Thanks for the explanation @yuri-rs

@SrinidhiK22

@yuri-rs I want to do a similar aggregation for cumulative metrics. Any solutions or ideas? Can someone suggest which set of processors could be used?
