[Feature Request] [Extension] ECS Service Discovery for Prometheus #1395
Comments
@jrcamp can you please comment and propose a solution here?
Take a look at: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/extension/observer — I think you should be able to add an ECS observer that works with receiver_creator to dynamically create a prometheus_simple receiver. Would this do what you want? Also take a look at the proposal here for adding additional resource attributes: #1333. Would there be resource attributes you would want to add by default from the observed ECS endpoint as well?
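For illustration, a rough sketch of the kind of configuration being suggested, assuming a hypothetical `ecs_observer` extension (no such observer existed at this point; the extension name, its settings, and the rule are placeholders, while the receiver_creator wiring mirrors the existing observers):

```yaml
extensions:
  # Hypothetical ECS observer -- the name and settings here are assumptions for illustration.
  ecs_observer:
    refresh_interval: 60s

receivers:
  receiver_creator:
    # Watch the observer extension for endpoints to start and stop receivers on.
    watch_observers: [ecs_observer]
    receivers:
      prometheus_simple:
        # Example rule; a real rule would match whatever endpoints the ECS observer reports.
        rule: type.port && port == 9090
        config:
          metrics_path: /metrics

exporters:
  logging:

service:
  extensions: [ecs_observer]
  pipelines:
    metrics:
      receivers: [receiver_creator]
      exporters: [logging]
```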
Thanks @jrcamp for the suggestion! I looked into this and it looks like it might suit our use case. How expensive would it be to dynamically create new receivers based on each newly discovered scrape target? Our original proposal would only require one Prometheus receiver where we would just update its scrape targets in the config file. Is this overhead significant?
@kohrapha In general these dynamic receivers should not have much overhead. I can't say for sure about the Prometheus receiver because it's quite complex. In the future we will probably simplify the prometheus_simple receiver to only support the Prometheus protocol instead of wrapping the prometheus receiver, which has a lot of functionality and overhead we don't need. (Prometheus is basically an agent unto itself, but for our purposes we mainly care about the protocol.) This is a primary use case of ours, so if there are any performance issues we will have a vested interest in making sure they are addressed. Do you know how many Prometheus targets you will be scraping?
@jrcamp the number of Prometheus targets is highly dependent on the customer's configuration. We can expect a single agent to retrieve metrics from 2000 targets. Having one receiver per target might be a bit overkill and resource-intensive.
@jrcamp thinking more about this approach, would it be possible instead to just spawn one Prometheus receiver at init time and then only perform updates to that receiver via the update endpoint API? This method requires that a Prometheus receiver can have multiple targets, but it seems that currently we can only configure one target per endpoint/receiver?
receiver_creator is built around having a single instance per target for purposes of starting and stopping scraping, as well as for adding the additional resource attributes linked above. Changing that assumption will require non-trivial changes. Before we go down that path we should validate whether 2000 targets are actually a problem. It's not clear to me that 1 receiver scraping 2000 hosts will be much more efficient than 2000 receivers scraping 1 host each; I think it needs to be tested to say for sure. If the current prometheus receiver has too much overhead we can add one that only does the Prometheus protocol. Can you gather some hard data to remove some of the unknowns?
Hi @jrcamp, I spent some time this week putting together an experiment to test the above hypothesis. The experiment has 2 setups. Setup 1 uses the k8s_observer with receiver_creator, so one prometheus_simple receiver is created per discovered target:
```yaml
extensions:
  k8s_observer:
    auth_type: serviceAccount

receivers:
  receiver_creator:
    # Name of the extensions to watch for endpoints to start and stop.
    watch_observers: [k8s_observer]
    receivers:
      prometheus_simple:
        rule: type.pod && labels["app"] == "prom-app"
        config:
          metrics_path: /metrics
          endpoint: '`endpoint`:8081'

exporters:
  logging:

service:
  extensions: [k8s_observer]
  pipelines:
    metrics:
      receivers: [receiver_creator]
      exporters: [logging]
```
Setup 2 uses a single Prometheus receiver with Kubernetes service discovery handling all targets:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: performance-test
          metrics_path: '/metrics'
          kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names:
                  - eks-aoc

exporters:
  logging:

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging]
```

I tested this on EKS and ran the setup with 1/10/30/69 pods (69 pods was the max number of allowed pods in my cluster). I also used the CloudWatch Agent to monitor and track CPU/memory usage of the collector. We got the following results:

CPU Utilization (%): [results table]

Memory Utilization (%): [results table]
It looks like the overhead of using multiple receivers grows with the number of scrape targets, and this could get really hairy with larger numbers of targets. If we do still want to use the observer API, then maybe an option is to create a new receiver that listens to the observer and writes the discovered scrape targets to a config file that the Prometheus receiver can pick up?
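As a concrete illustration of that last idea, a sketch of what the written-out targets file and the Prometheus receiver configuration reading it could look like (the file path, job name, and targets below are made up for the example):

```yaml
# /etc/ecs_sd_targets.yaml -- file the discovery component would write and keep up to date
# (hypothetical path and targets).
- targets: ['10.0.1.23:9090', '10.0.2.45:9090']
  labels:
    job: ecs-tasks
```

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: ecs-sd-test
          metrics_path: '/metrics'
          file_sd_configs:
            # Prometheus re-reads this file when it changes (and at the refresh interval),
            # so targets can be updated without restarting the receiver.
            - files: ['/etc/ecs_sd_targets.yaml']
              refresh_interval: 60s
```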
@kohrapha Thanks for running the numbers. I think if we want to optimize this, the right thing to do is write a "native" OpenTelemetry Collector receiver for the Prometheus protocol. It would then behave like all of our other scraping receivers. I think it should be even more performant than Setup 2 because it will be able to do less translation. While I understand that having the single Prometheus receiver work with the observer would be more performant today, it's not the direction things are going. Would you be able to implement the native Prometheus receiver for the collector and rerun your benchmarks? I don't think it'll take very long if we can ignore summaries and histograms for testing purposes. I have the skeleton of it written, but it still needs the expfmt-to-pdata translation implemented. Happy to assist. Here's the skeleton of what I'm talking about: https://github.com/jrcamp/opentelemetry-collector-contrib/blob/prom-proto/receiver/simpleprometheusreceiver/factory.go
@jrcamp I took a look at your skeleton and wrote a minimal POC for the native Prometheus receiver that exports only gauges and counters. You can check it out here. I ran the same experiment as before on this setup and got the following results, where "Setup 3" contains the native Prometheus receiver POC:

CPU Utilization (%): [results table]

Memory Utilization (%): [results table]
We expected the new native receiver (Setup 3) to be more performant than Setup 2, since it performs less translation and does not offer the same features and capabilities as the Prometheus receiver. However, we see that it has worse CPU usage, scaling worse than Setup 2 as the number of targets increases. Although memory usage is better, this will likely change once we start implementing more features on top of this POC. It does seem that spawning one receiver per endpoint performs worse than one receiver for all endpoints. Thus, our team prefers to implement a service discovery extension that dynamically updates the SD file, as initially proposed. Since this does not seem like the direction this repository is headed, we are thinking of implementing this on our end.
@kohrapha thanks for running the numbers. Can you join next week's SIG to discuss?
@jrcamp sure, see you there!
Also, following yesterday's discussion during the SIG meeting, our team will contribute a new ECS observer that will eventually perform service discovery using the observer/receiver creator framework that @jrcamp outlined above. However, for our short-term needs, we will first be implementing the proposal I outlined earlier (i.e. writing to an SD config file for Prometheus to read from). The steps forward look something like this:
cc: @mxiamxia
This is the first (divided) implementation PR for #1395. In order to reduce the size of the PR, the feature is incomplete:
- A mock API client is used, no actual AWS API calls
- Only the Docker label based filter/matcher is included
- The extension impl under `ecsobserver` is excluded

The excluded logic is tested in other branches, so it works with the real API end to end (eventually). The package is located under internal/aws/ecssd because:
- it can be used by other AWS-specific packages
- we may extract it out to a standalone repo in the future, as it can run as a binary and support other Prometheus-compatible collectors as a sidecar
Adding a feature - These changes introduce a new receiver creator-compatible watch observer for ECS tasks. It differs from the existing ECS observer in that it supports only sidecar deployments and doesn't have a hard dependency on the Prometheus receiver, instead being able to sync with a receiver creator instance by listing [container endpoints](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/957eb18037f000f5740c4044fdf5ffeee4311ebc/extension/observer/endpoints.go#L166). This approach was landed on after reviewing the existing ECS observer config and implementation, and from following the guidance of the initial observer's author: #5316 (comment). It does not resolve [the outstanding item](#1395 (comment)) for #1395, as I think observer endpoint creation should be added to the existing observer as well, but that should be done in unrelated changes.

**Link to tracking Issue:** #5316

**Testing:** ~~No tests at this time since the changes just include the initial types and readme.~~ Basic factory test to help ensure receiver creator type compatibility.

**Documentation:** Adding a readme with a high-level plan and config documentation generated by the project's configschema tool.
Is your feature request related to a problem? Please describe.
Customers might want to perform ECS Service Discovery for Prometheus metric scraping in their pipeline. Currently, there is no way to do this in the collector.
Describe the solution you'd like
Our proposal is to create a new extension that performs this ECS Service Discovery logic for the Prometheus receiver. This extension would invoke a separate process that queries the ECS API for task metadata and outputs the discovered list of scrape targets to a defined output file that the Prometheus receiver can scrape from (configured via `file_sd_configs`). This functionality would be similar to the ECS Service Discovery logic in AWS CloudWatch. More info on how this is done in the CloudWatch Agent can be found in its README and the online documentation.

Describe alternatives you've considered