Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Extension] Implement ECS Observer #1920

Closed
wants to merge 19 commits into from

Conversation

theRoughCode
Copy link

@theRoughCode theRoughCode commented Dec 29, 2020

Description

This PR implements the ECS Observer that queries the ECS/EC2 API to discover scrape targets within a configured ECS cluster.

There are 2 modes to discover the targets based on the config:

  • Mode 1: Docker label-Based: Add docker labels to the containers to indicate the port and metric path. The ECS Observer can then be configured to discover targets with the matching docker label and container port.
  • Mode 2: ECS Task Definition ARN-based: The ECS Observer can be configured to discover targets when the target's ECS task definition ARN matches the configured regex and it matches the configured container ports.

Two modes can be enabled together and the ECS Observer will de-duplicate the discovered targets based on: {private_ip}:{port}/{metrics_path}

Implementation Overview

  1. The ECS Observer's discovery is triggered by EndpointsListers:ListEndpoints.
  2. The TaskRetrievalProcessor will query the ECS:ListTasks API to retrieve the list of current running ECS task ARNs for the specified ECS cluster, and each task will be retrieved using ECS:DescribeTasks.
  3. The TaskDefinitionProcessor will get the ECS Task Definition from LRU cache or, if there is none in cache, call ECS:DescribeTaskDefinition and cache. LRU cache size (2000) based on ECS service quota
  4. The TaskFilterProcessor will filter out tasks based on whether they match the configured docker label-based SD config or task definition ARN-based SD config.
  5. The MetadataProcessor will get the containerInstance/ec2 instance info from LRU cache if the tasks is running on EC2 launch type. LRU cache size (2000) based on ECS service quota. If not cached, call ECS: DescribeContainerInstances / EC2:DescribeInstances with batch size = 100.
  6. If sd_result_file is configured, the list of discovered targets will be written to the configured file path. Otherwise, EndpointsListers:ListEndpoints will return the list of discovered endpoints and all subscribers will be notified of the updated endpoints.

Note: The main reason for enabling writing to a static file is because the current implementation of the observer/receiver creator framework is less efficient than using a single receiver to scrape from multiple targets. The related discussion can be found in #1395.

Link to tracking Issue: #1395

Testing:
Unit tests were added and ran successfully. A test setup in AWS ECS was also created to verify the implementation.
Note: Functions involving the ECS/EC2 API don't have unit tests. A next step would be to add mocked APIs to test these functions.

Documentation:
A new README doc was added for the ecs_observer, and the observer/receiver_creator READMEs were modified.

@theRoughCode theRoughCode requested a review from jrcamp as a code owner December 29, 2020 23:34
@theRoughCode theRoughCode requested a review from a team December 29, 2020 23:34
@theRoughCode theRoughCode marked this pull request as draft December 29, 2020 23:34
@theRoughCode
Copy link
Author

cc: @mxiamxia for feedback

@github-actions
Copy link
Contributor

github-actions bot commented Jan 6, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@tigrannajaryan
Copy link
Member

@jrcamp please review this.

@theRoughCode theRoughCode marked this pull request as ready for review January 7, 2021 21:31
@codecov
Copy link

codecov bot commented Jan 8, 2021

Codecov Report

Merging #1920 (d55f5cf) into main (fe1b43a) will decrease coverage by 0.66%.
The diff coverage is 57.84%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1920      +/-   ##
==========================================
- Coverage   89.90%   89.24%   -0.67%     
==========================================
  Files         380      389       +9     
  Lines       18335    18723     +388     
==========================================
+ Hits        16484    16709     +225     
- Misses       1385     1535     +150     
- Partials      466      479      +13     
Flag Coverage Δ
integration 69.82% <ø> (ø)
unit 87.97% <57.84%> (-0.64%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
receiver/receivercreator/observerhandler.go 60.37% <0.00%> (ø)
receiver/receivercreator/rules.go 64.28% <ø> (ø)
...xtension/observer/ecsobserver/metadataprocessor.go 9.21% <9.21%> (ø)
...on/observer/ecsobserver/taskdefinitionprocessor.go 17.85% <17.85%> (ø)
...ion/observer/ecsobserver/taskretrievalprocessor.go 29.62% <29.62%> (ø)
extension/observer/ecsobserver/extension.go 40.00% <40.00%> (ø)
extension/observer/ecsobserver/servicediscovery.go 73.68% <73.68%> (ø)
extension/observer/ecsobserver/ecstask.go 88.88% <88.88%> (ø)
extension/observer/ecsobserver/config.go 100.00% <100.00%> (ø)
extension/observer/ecsobserver/factory.go 100.00% <100.00%> (ø)
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe1b43a...d55f5cf. Read the comment docs.

Copy link
Contributor

@jrcamp jrcamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theRoughCode sorry for delayed review.

Can you follow https://github.com/open-telemetry/opentelemetry-collector/blob/master/CONTRIBUTING.md#how-to-structure-prs-to-get-expedient-reviews to break this up into more manageable pieces? Overall it's on a good track though.

I'd suggest doing a PR with just README.md changes first so we can iterate there first if needed to avoid unnecessary code churn.

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jan 21, 2021
@bogdandrutu
Copy link
Member

@theRoughCode friendly ping :)

@github-actions github-actions bot removed the Stale label Jan 26, 2021
Base automatically changed from master to main January 28, 2021 00:57
@pingleig
Copy link
Contributor

pingleig commented Feb 1, 2021

Hi @bogdandrutu @jrcamp sorry for not responding the comments. I am from @theRoughCode team at AWS, he has finished his internship and I got this task from @mxiamxia (prometheus service discovery using ECS API).

Just want to have some clarification before I start working on it

@theRoughCode
Copy link
Author

@bogdandrutu @jrcamp apologies for not responding earlier, I've been super busy with work the past month and was hoping to get to this once stuff settled. But it looks like @pingleig is taking over, so I'll hand it over to him!

@pingleig that's right, from our discussion a month back, we decided not to use the observer/creator framework for now until its performance is up to par with the "one receiver" method.

I've added some documentation in this PR already that should help in drafting up the README for the initial PR that @jrcamp proposed. Feel free to ping me if you have any questions!

@github-actions
Copy link
Contributor

github-actions bot commented Feb 9, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 9, 2021
@github-actions
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Feb 16, 2021
@jrcamp jrcamp reopened this Feb 17, 2021
@tigrannajaryan
Copy link
Member

@pingleig do you plan to continue working on this?

Copy link
Contributor

@jrcamp jrcamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

|-------------|--------------------------------------------------------------------|
| type.task | `true` |
| port | port number |
| metricsPath | metrics path |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Endpoint variables are usually of the form metrics_path


| Variable | Description |
|-------------|--------------------------------------------------------------------|
| type.task | `true` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would call this ecstask. The existing ones are kind of generic naming (e.g. port) but there's an issue to make them more type-specific.

| type.task | `true` |
| port | port number |
| metricsPath | metrics path |
| labels | labels generated by the ECS Observer. Mainly used with Prometheus. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these docker labels? I think they'd probably be used in discovery rules as well like:

type.ecstask && labels["app"] == "redis"

See related comment below.

Comment on lines +96 to +101
LaunchType: EC2
SubnetId: subnet-0347624eeea6c5969
TaskDefinitionFamily: demo-jar-ec2-bridge-dynamic-port-subset-b
TaskGroup: family:demo-jar-ec2-bridge-dynamic-port-subset-b
TaskRevision: "7"
VpcId: vpc-033b021cd7ecbcedb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like these should be individual entries in the Endpoint struct so a user can use them easily in a discovery rule and are documented.

| Name | | Description |
|---------------------|-------------|---------------------------------------------------------------|
| task_definition_arn_pattern | Mandatory | Regex pattern to match against ECS task definition ARN |
| metrics_ports | Mandatory | container ports separated by semicolon. Only containers that expose these ports will be discovered |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not a list?

@pingleig
Copy link
Contributor

Hi @jrcamp I just created a new PR that only contains the doc #2377 I will address the comment you have in current PR in the new one.

@github-actions github-actions bot removed the Stale label Feb 19, 2021
@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 26, 2021
@pingleig
Copy link
Contributor

@theRoughCode I think we can close this one as the content is being ported into other (smaller) PRs

@theRoughCode
Copy link
Author

@pingleig sounds good! Thanks ☺️

ljmsc referenced this pull request in ljmsc/opentelemetry-collector-contrib Feb 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants