Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] [Extension] ECS Service Discovery for Prometheus #1395

Closed
kohrapha opened this issue Oct 27, 2020 · 14 comments · Fixed by #6121
Closed

[Feature Request] [Extension] ECS Service Discovery for Prometheus #1395

kohrapha opened this issue Oct 27, 2020 · 14 comments · Fixed by #6121
Labels
comp:prometheus Prometheus related issues enhancement New feature or request

Comments

@kohrapha
Copy link
Contributor

kohrapha commented Oct 27, 2020

Is your feature request related to a problem? Please describe.
Customers might want to perform ECS Service Discovery for Prometheus metric scraping in their pipeline. Currently, there is no option to perform this functionality.

Describe the solution you'd like
Our proposal is to create a new extension that performs this ECS Service Discovery logic for the Prometheus receiver. This extension would invoke a separate process that would query ECS API for task metadata and output the discovered list of scrape targets to a defined output file that the Prometheus receiver can scrape from (configured via file_sd_configs). This functionality would be similar to the ECS Service Discovery logic in AWS CloudWatch. More info of how this is done in the CloudWatch Agent can be found in this README and the online documentation.

Describe alternatives you've considered

  • Implement as part of Prometheus receiver to support ECS service discovery. There was an earlier proposal for this that was rejected because the ECS API can be throttled and is expensive to call.
  • Implement this service discovery in a wrapper around Otel Collector. This would not make the functionality available to all customers using the collector.
@bogdandrutu
Copy link
Member

@jrcamp can you please comment and propose a solution here?

@jrcamp
Copy link
Contributor

jrcamp commented Nov 5, 2020

Take a look at:

https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/extension/observer
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/receiver/receivercreator

I think you should be able to add an ecs observer that works with receiver_creator to dynamically create a prometheus_simple receiver. Would this do what you want?

Also take a look at the proposal here for adding additional resource attributes: #1333. Would there be resource attributes you would want to add by default from the observed ecs endpoint as well?

@kohrapha
Copy link
Contributor Author

kohrapha commented Nov 5, 2020

Thanks @jrcamp for the suggestion! I looked into this and it looks like it might suit our use case. How expensive would it be to dynamically create new receivers based on each newly discovered scrape target? Our original proposal would only require one Prometheus receiver where we would just update its scrape targets in the config file. Is this overhead significant?

@jrcamp
Copy link
Contributor

jrcamp commented Nov 6, 2020

@kohrapha In general these dynamic receivers should not have much overhead. I can't say for sure about the Prometheus receiver because it's quite complex. In the future we will probably simplify the prometheus_simple receiver to only support the Prometheus protocol instead of wrapping the prometheus receiver which has a lot of functionality and overhead we don't need. (Prometheus is basically an agent into itself, but for our purposes we mainly care about the protocol.)

This is a primary use case of ours so if there are any performance issues we will have a vested interest in making sure they are addressed.

Do you know how many Prometheus targets you will be scraping?

@kohrapha
Copy link
Contributor Author

kohrapha commented Nov 6, 2020

@jrcamp the number of Prometheus targets is highly dependent on the customer's configuration. We can expect a single agent to retrieve metrics from 2000 targets. Having one receiver per target might be a bit overkill and resource-intensive.

@kohrapha
Copy link
Contributor Author

kohrapha commented Nov 9, 2020

@jrcamp thinking more about this approach, would it be possible instead to just spawn one Prometheus receiver at init time and then only perform updates to that receiver via the update endpoint API? This method requires that a Prometheus receiver can have multiple targets, but it seems like currently we can only configure one target per endpoint/receiver?

@jrcamp
Copy link
Contributor

jrcamp commented Nov 9, 2020

receiver_creator is built around having a single instance per-target for purposes of starting and stoping scraping as well as for adding additional resource attributes linked above. Changing that assumption will require non-trivial changes.

Before we go down that path we should validate whether 2000 targets are a problem. It's not clear to me that whether you have 2000 receivers scraping 1 host each or 1 receiver scraping 2000 hosts that the later will be way more efficient. I think it needs to be tested to say for sure. If the current prometheus receiver has too much overhead we can add one that only does the prometheus protocol.

Can you gather some hard data to remove some of the unknowns?

@kohrapha
Copy link
Contributor Author

Hi @jrcamp, I spent some time this week to come up with an experiment to test the above hypothesis. In this experiment, I have 2 set ups:

  1. Setup 1 has one receiver per endpoint, in which I used the k8sobserver + receivercreator
extensions:
  k8s_observer:
    auth_type: serviceAccount

receivers:
  receiver_creator:
    # Name of the extensions to watch for endpoints to start and stop.
    watch_observers: [k8s_observer]
    receivers:
      prometheus_simple:
        rule: type.pod && labels["app"] == "prom-app"
        config:
          metrics_path: /metrics
          endpoint: '`endpoint`:8081'

exporters:
  logging:

service:
  extensions: [k8s_observer]
  pipelines:
    metrics:
      receivers: [receiver_creator]
      exporters: [logging]
  1. Setup 2 has one receiver for all endpoints, where I used the in-built k8s service discovery in the Prometheus receiver
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: performance-test
        metrics_path: '/metrics'
        kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names:
              - eks-aoc

exporters:
  logging:

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging]

I tested this on EKS and ran the set up on 1/10/30/69 pods (69 pods was the max number of allowed pods in my cluster). I also used the CloudWatch Agent to monitor and track CPU/memory usage of the collector. We got the following results:

CPU Utilization (%)

Num Targets Setup 1 Setup 2 Difference
1 0.02 0.02 -0
10 0.112 0.033 -0.079
30 0.33 0.059 -0.271
69 0.705 0.105 -0.6

Memory Utilization (%)

Num Targets Setup 1 Setup 2 Difference
1 0.196 0.182 -0.014
10 0.311 0.23 -0.081
30 0.468 0.332 -0.136
69 0.773 0.468 -0.305

It looks like the overhead in using multiple receivers scales as the number of scrape targets grows. This might get really hairy with larger numbers of scrape targets.

If we do still want to use the observer API, then maybe an option is to create a new receiver that listens in on the observer and writes the scraped targets to a config file which the Prometheus receiver can pick up on?

@jrcamp
Copy link
Contributor

jrcamp commented Nov 23, 2020

@kohrapha Thanks for running the numbers. I think if we want to optimize it the right thing to do is write a "native" OpenTelemetry collector receiver for the Prometheus protocol. It would then behave like all of our other scraping receivers. I think it should be even more performant than Setup 2 because it will be able to do less translation.

While I understand that having the single Prometheus receiver work with the observer would be more performant today, it's not in the direction things are going.

Would you be able to implement the native Prometheus receiver for the collector and rerun your benchmarks? I don't think it'll take very long if we can ignore summaries and histograms for testing purposes. I have the skeleton of it written but it still needs the translations of expfmt to pdata implemented. Happy to assist. Here's the skeleton of what I'm talking about: https://github.com/jrcamp/opentelemetry-collector-contrib/blob/prom-proto/receiver/simpleprometheusreceiver/factory.go

@kohrapha
Copy link
Contributor Author

kohrapha commented Dec 11, 2020

@jrcamp I took a look at your skeleton and wrote a minimal POC for the native Prometheus receiver that exports only gauges and counters. You can check it out here.

I ran the same experiment as before on this setup and got the following results, where "Setup 3" contains the native Prometheus receiver POC:

CPU Utilization (%)

Num Targets Setup 3 Setup 2 Difference
1 0.016 0.02 +0.004
10 0.071 0.033 -0.038
30 0.212 0.059 -0.153
69 0.466 0.105 -0.361

Memory Utilization (%)

Num Targets Setup 3 Setup 2 Difference
1 0.179 0.182 +0.003
10 0.231 0.23 -0.001
30 0.259 0.332 +0.073
69 0.328 0.468 +0.14

We expect that the new native receiver (Setup 3) should be more performant than Setup 2 since it performs less translation and also does not offer the same features and capabilities as the Prometheus receiver. However, we see that it suffers a worse CPU Usage, scaling worse than Setup 2 as the number of targets increases. Although memory usage is better, this will likely change once we start implementing more features on top of this POC.

It does seem that spawning one receiver per endpoint has worse performance than one receiver for all endpoints. Thus, our team prefers to implement a service discovery extension that dynamically updates the SD file, as initially proposed. Since this does not seem like the direction that this repository is headed, we are thinking of implementing this on our end.

@jrcamp
Copy link
Contributor

jrcamp commented Dec 11, 2020

@kohrapha thanks for running the numbers. Can you join next week's SIG to discuss?

@kohrapha
Copy link
Contributor Author

@jrcamp sure, see you there!

@kohrapha
Copy link
Contributor Author

Made a draft PR #1793 for the native simple Prometheus receiver POC I implemented based off of @jrcamp's template above. In case anyone wants to develop and iterate on this idea further.

@kohrapha
Copy link
Contributor Author

kohrapha commented Dec 17, 2020

Also, following yesterday's discussion during the SIG meeting, our team will contribute a new ECS observer that will eventually perform service discovery using the observer/receiver creator framework that @jrcamp outlined above. However, for our short-term needs, we will first be implementing the proposal I outlined earlier (i.e. writing to an SD config file for Prometheus to read from).

The steps forward look something like this:

  1. AWS will contribute an ECS observer based on the proposal I outlined above. It will query the ECS API and output the targets to a file. AWS will be using this method for our distro of the collector.
  2. Implement spawning receivers based on scrape targets, but include a config option to still output to a file.
  3. Figure out why there was a performance loss for the observer/receiver creator method as seen above.
  4. Optimize the observer/receiver creator method to get a better performance.
  5. Once we've resolved the performance issue, we will then swap over to use the recommended method of spawning a receiver for each scrape target.

cc: @mxiamxia

@andrewhsu andrewhsu added enhancement New feature or request and removed feature request labels Jan 6, 2021
dyladan referenced this issue in dynatrace-oss-contrib/opentelemetry-collector-contrib Jan 29, 2021
Bumps [go.opencensus.io](https://github.com/census-instrumentation/opencensus-go) from 0.22.3 to 0.22.4.
- [Release notes](https://github.com/census-instrumentation/opencensus-go/releases)
- [Commits](census-instrumentation/opencensus-go@v0.22.3...v0.22.4)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
pingleig referenced this issue in pingleig/opentelemetry-collector-contrib Mar 18, 2021
This is the first (divided) implementation PR for #1395.
In order to reduce the size of the pr, the feature is incomplete.

- mock api client is used, no actual aws api calls
- Only docker label based filter/matcher is included
- Extension impl under `ecsobserver` is excluded

Those excluded logic are tested in other branches, so it works with real
API end to end (eventually).

The package is located under internal/aws/ecssd because

- it can be used by other aws specific packages
- we may extract it out to a standalone repo in the future
  as it can run as a binary and support other prometheus
  compatiable collectors as a sidecar
@alolita alolita added the comp:prometheus Prometheus related issues label Sep 2, 2021
tigrannajaryan pushed a commit that referenced this issue Nov 11, 2021
Adding a feature - These changes introduce a new receiver creator-compatible watch observer for ECS tasks.  It differs from the existing ECS observer in that it supports only sidecar deployments and doesn't have a hard dependency on the Prometheus receiver, instead being able to sync with a receiver creator instance by listing [container endpoints](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/957eb18037f000f5740c4044fdf5ffeee4311ebc/extension/observer/endpoints.go#L166).  This approach was landed on after reviewing the existing and ECS observer config and implementation and from following the guidance of the initial observer's author: #5316 (comment).  It does not resolve [the outstanding item](#1395 (comment)) for #1395, as I think observer endpoint creation should be added to the existing observer as well but should be done in unrelated changes.

**Link to tracking Issue:**
#5316

**Testing:**
~No tests as this time since the changes just include the initial types and readme.~ Basic factory test to help ensure receiver creator type compatibility.*

**Documentation:**
Adding a readme with high level plan and config documentation generated by project configschema tool.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:prometheus Prometheus related issues enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants