[Feature Request] [Extension] ECS Service Discovery for Prometheus #1395
Comments
@jrcamp can you please comment and propose a solution here?
Take a look at: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/extension/observer — I think you should be able to add an ECS observer that works with receiver_creator to dynamically create a prometheus_simple receiver. Would this do what you want? Also take a look at the proposal here for adding additional resource attributes: #1333. Would there be resource attributes you would want to add by default from the observed ECS endpoint as well?
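For illustration, a rough sketch of the kind of configuration being suggested, assuming a hypothetical `ecs_observer` extension (no such observer existed at this point; the extension name, its settings, and the rule are placeholders, while the receiver_creator wiring mirrors the existing observers):

```yaml
extensions:
  # Hypothetical ECS observer -- the name and settings here are assumptions for illustration.
  ecs_observer:
    refresh_interval: 60s

receivers:
  receiver_creator:
    # Watch the observer extension for endpoints to start and stop receivers on.
    watch_observers: [ecs_observer]
    receivers:
      prometheus_simple:
        # Example rule; a real rule would match whatever endpoints the ECS observer reports.
        rule: type.port && port == 9090
        config:
          metrics_path: /metrics

exporters:
  logging:

service:
  extensions: [ecs_observer]
  pipelines:
    metrics:
      receivers: [receiver_creator]
      exporters: [logging]
```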
Thanks @jrcamp for the suggestion! I looked into this and it looks like it might suit our use case. How expensive would it be to dynamically create new receivers based on each newly discovered scrape target? Our original proposal would only require one Prometheus receiver where we would just update its scrape targets in the config file. Is this overhead significant?
@kohrapha In general these dynamic receivers should not have much overhead. I can't say for sure about the Prometheus receiver because it's quite complex. In the future we will probably simplify the prometheus_simple receiver to only support the Prometheus protocol instead of wrapping the prometheus receiver, which has a lot of functionality and overhead we don't need. (Prometheus is basically an agent unto itself, but for our purposes we mainly care about the protocol.) This is a primary use case of ours, so if there are any performance issues we will have a vested interest in making sure they are addressed. Do you know how many Prometheus targets you will be scraping?
@jrcamp the number of Prometheus targets is highly dependent on the customer's configuration. We can expect a single agent to retrieve metrics from 2000 targets. Having one receiver per target might be a bit overkill and resource-intensive.
@jrcamp thinking more about this approach, would it be possible instead to just spawn one Prometheus receiver at init time and then only perform updates to that receiver via the update endpoint API? This method requires that a Prometheus receiver can have multiple targets, but it seems that currently we can only configure one target per endpoint/receiver?
receiver_creator is built around having a single instance per target for purposes of starting and stopping scraping, as well as for adding the additional resource attributes linked above. Changing that assumption will require non-trivial changes. Before we go down that path we should validate whether 2000 targets are actually a problem. It's not clear to me that 1 receiver scraping 2000 hosts will be much more efficient than 2000 receivers scraping 1 host each; I think it needs to be tested to say for sure. If the current prometheus receiver has too much overhead we can add one that only does the Prometheus protocol. Can you gather some hard data to remove some of the unknowns?
Hi @jrcamp, I spent some time this week putting together an experiment to test the above hypothesis. The experiment has 2 setups. Setup 1 uses the k8s_observer with receiver_creator, so one prometheus_simple receiver is created per discovered target:
```yaml
extensions:
  k8s_observer:
    auth_type: serviceAccount

receivers:
  receiver_creator:
    # Name of the extensions to watch for endpoints to start and stop.
    watch_observers: [k8s_observer]
    receivers:
      prometheus_simple:
        rule: type.pod && labels["app"] == "prom-app"
        config:
          metrics_path: /metrics
          endpoint: '`endpoint`:8081'

exporters:
  logging:

service:
  extensions: [k8s_observer]
  pipelines:
    metrics:
      receivers: [receiver_creator]
      exporters: [logging]
```
Setup 2 uses a single Prometheus receiver with Kubernetes service discovery handling all targets:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: performance-test
          metrics_path: '/metrics'
          kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names:
                  - eks-aoc

exporters:
  logging:

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [logging]
```

I tested this on EKS and ran the setup with 1/10/30/69 pods (69 pods was the max number of allowed pods in my cluster). I also used the CloudWatch Agent to monitor and track CPU/memory usage of the collector. We got the following results:

CPU Utilization (%): [results table]

Memory Utilization (%): [results table]
It looks like the overhead of using multiple receivers grows with the number of scrape targets, and this could get really hairy with larger numbers of targets. If we do still want to use the observer API, then maybe an option is to create a new receiver that listens to the observer and writes the discovered scrape targets to a config file that the Prometheus receiver can pick up?
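As a concrete illustration of that last idea, a sketch of what the written-out targets file and the Prometheus receiver configuration reading it could look like (the file path, job name, and targets below are made up for the example):

```yaml
# /etc/ecs_sd_targets.yaml -- file the discovery component would write and keep up to date
# (hypothetical path and targets).
- targets: ['10.0.1.23:9090', '10.0.2.45:9090']
  labels:
    job: ecs-tasks
```

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: ecs-sd-test
          metrics_path: '/metrics'
          file_sd_configs:
            # Prometheus re-reads this file when it changes (and at the refresh interval),
            # so targets can be updated without restarting the receiver.
            - files: ['/etc/ecs_sd_targets.yaml']
              refresh_interval: 60s
```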
@kohrapha Thanks for running the numbers. I think if we want to optimize this, the right thing to do is write a "native" OpenTelemetry Collector receiver for the Prometheus protocol. It would then behave like all of our other scraping receivers. I think it should be even more performant than Setup 2 because it will be able to do less translation. While I understand that having the single Prometheus receiver work with the observer would be more performant today, it's not the direction things are going. Would you be able to implement the native Prometheus receiver for the collector and rerun your benchmarks? I don't think it'll take very long if we can ignore summaries and histograms for testing purposes. I have the skeleton of it written, but it still needs the expfmt-to-pdata translation implemented. Happy to assist. Here's the skeleton of what I'm talking about: https://github.com/jrcamp/opentelemetry-collector-contrib/blob/prom-proto/receiver/simpleprometheusreceiver/factory.go
@jrcamp I took a look at your skeleton and wrote a minimal POC for the native Prometheus receiver that exports only gauges and counters. You can check it out here. I ran the same experiment as before on this setup and got the following results, where "Setup 3" contains the native Prometheus receiver POC:

CPU Utilization (%): [results table]

Memory Utilization (%): [results table]
We expected the new native receiver (Setup 3) to be more performant than Setup 2, since it performs less translation and does not offer the same features and capabilities as the Prometheus receiver. However, we see that it has worse CPU usage, scaling worse than Setup 2 as the number of targets increases. Although memory usage is better, this will likely change once we start implementing more features on top of this POC. It does seem that spawning one receiver per endpoint performs worse than one receiver for all endpoints. Thus, our team prefers to implement a service discovery extension that dynamically updates the SD file, as initially proposed. Since this does not seem like the direction this repository is headed, we are thinking of implementing this on our end.
@kohrapha thanks for running the numbers. Can you join next week's SIG to discuss?
@jrcamp sure, see you there!
Also, following yesterday's discussion during the SIG meeting, our team will contribute a new ECS observer that will eventually perform service discovery using the observer/receiver creator framework that @jrcamp outlined above. However, for our short-term needs, we will first be implementing the proposal I outlined earlier (i.e. writing to an SD config file for Prometheus to read from). The steps forward look something like this:
cc: @mxiamxia
This is the first (divided) implementation PR for #1395. In order to reduce the size of the PR, the feature is incomplete:
- A mock API client is used, no actual AWS API calls
- Only the Docker label based filter/matcher is included
- The extension impl under `ecsobserver` is excluded

The excluded logic is tested in other branches, so it works with the real API end to end (eventually). The package is located under internal/aws/ecssd because:
- it can be used by other AWS-specific packages
- we may extract it out to a standalone repo in the future, as it can run as a binary and support other Prometheus-compatible collectors as a sidecar
Adding a feature - These changes introduce a new receiver creator-compatible watch observer for ECS tasks. It differs from the existing ECS observer in that it supports only sidecar deployments and doesn't have a hard dependency on the Prometheus receiver, instead being able to sync with a receiver creator instance by listing [container endpoints](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/957eb18037f000f5740c4044fdf5ffeee4311ebc/extension/observer/endpoints.go#L166). This approach was landed on after reviewing the existing ECS observer config and implementation, and from following the guidance of the initial observer's author: #5316 (comment). It does not resolve [the outstanding item](#1395 (comment)) for #1395, as I think observer endpoint creation should be added to the existing observer as well, but that should be done in unrelated changes.

**Link to tracking Issue:** #5316

**Testing:** ~~No tests at this time since the changes just include the initial types and readme.~~ Basic factory test to help ensure receiver creator type compatibility.

**Documentation:** Adding a readme with a high-level plan and config documentation generated by the project's configschema tool.
Is your feature request related to a problem? Please describe.
Customers might want to perform ECS Service Discovery for Prometheus metric scraping in their pipeline. Currently, there is no way to do this in the collector.
Describe the solution you'd like
Our proposal is to create a new extension that performs this ECS Service Discovery logic for the Prometheus receiver. This extension would invoke a separate process that queries the ECS API for task metadata and outputs the discovered list of scrape targets to a defined output file that the Prometheus receiver can scrape from (configured via `file_sd_configs`). This functionality would be similar to the ECS Service Discovery logic in AWS CloudWatch. More info on how this is done in the CloudWatch Agent can be found in its README and the online documentation.

Describe alternatives you've considered