Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CloudWatch Metric Input Plugin #935

Closed
wants to merge 1 commit into from

Conversation

joshhardy
Copy link

We want to use the TICK stack for monitoring and alerting across our system.
We are very happy with the breadth of inputs for telegraf, but found that we would be unable to monitor and alert on statistics from AWS Elastic Load Balancers - a critical components for us.

This plugin allows us to pull Metrics Statistics from CloudWatch for our ELBs.
Although we built this plugin to monitor ELBs, the plugin call pull any CloudWatch Metric.


func (c *CloudWatch) SampleConfig() string {
return `
[[inputs.cloudwatch]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the [[inputs.cloudwatch]] line, it will be auto-added

@sparrc
Copy link
Contributor

sparrc commented Mar 28, 2016

overall this looks great, thank you @joshhardy

@titilambert
Copy link
Contributor

@joshhardy how did you configure Telegraf to monitor CloudWatch ? Do you have only one Telegraf to pull data ? If yes, this is a singke point of failure, isn't it ?

@joshhardy
Copy link
Author

@titilambert - we are just getting started with this (thus, open to suggestions), but right now we are running a single container of Telegraf on our Docker cluster specifically for monitoring CloudWatch. Failures in the container or loss of the host will cause the container to be spun up elsewhere in the cluster within a few seconds. Given the 1m interval we are running with, we are currently ok with that.

@ljosa
Copy link
Contributor

ljosa commented Mar 29, 2016

@joshhardy, @sparrc: I wrote a Cloudwatch input plugin yesterday as well, and submitted a PR before I noticed that you had just submitted one. We should unify them.

It looks like the main difference is that my plugin only requires the user to configure the AWS region and CloudWatch namespace; it uses ListMetrics to discover the metrics and dimensions. I think that is valuable, wdyt?

@sparrc
Copy link
Contributor

sparrc commented Mar 29, 2016

seems to me that both approaches have their advantages. I like that @ljosa's converts all strings to snake_case, which makes the metrics fit into the influxdb style a bit better.

In this PR I like that the plugin has the ability to select period and delay, and I think it's a good idea to put in a longer default interval, since this plugin will incure fees.

Perhaps there could simply be an option: all_metrics = true, and if set to false then the user will need to specify the metrics that they want as in @joshhardy's PR?

@sparrc
Copy link
Contributor

sparrc commented Mar 29, 2016

so I would say, keep the following from @ljosa:

  1. snake_case metrics
  2. default to fetching all metrics with a boolean flag (fetch_all_metrics = true?)

keep the following from @joshhardy:

  1. keep period, delay, and interval as default options in the config.
  2. keep the ability to specify specific metrics that the user wants fetched, but only if fetch_all_metrics = false

what do you guys think of that as a unified plugin?

@joshhardy
Copy link
Author

@sparrc @ljosa - sounds good to me.
For AWS accounts of our scale, monitoring all metrics in a CloudWatch Namespace is not feasible (we have a lot of resources, it would cost a fortune), but I can see the value in other circumstances where you really just want to monitor everything.
For us, ability to specify more granular metric filters is critical.

IMO, the best way to handle it would be to make more of the config options here optional.

That way, users can be prescriptive on the aspects they want to scope and be wide open on the stuff they don't care about filtering down.

The other thing #936 allows is multiple namespaces in the same input... To me, it makes sense to define multiple cloudwatch inputs for different namespaces like that, but I don't have a strong opinion on it. What is the recommended approach there? Each namespace would have to provide the ability to drill into specific statistics/metrics within the namespace.

Also, what is the recommended way for @ljosa and I to collaborate on implementing these changes? (apologies if this is a stupid question - I'm a bit new to the process)

@sparrc
Copy link
Contributor

sparrc commented Mar 29, 2016

One possibility for the two of you collaborating would be for one person to give the other access to their fork, and then you can both work off the same branch (and don't do any force pushes). Another would be one person works in master of the fork while the other rebases those changes into their branch.

@joshhardy
Copy link
Author

@sparrc @ljosa - please review updates.
I updated the required config so that explicitly defined Metrics are not required.
If a list of Metrics is provided, it will be used. If not, the list of all available Metrics is pulled for the Namespace (refreshed once per hour so you don't have to bounce the service on account changes).

I also updated the recorded Measurements and Tags to use more standard snake casing and updated the format a bit... I'm pleased with the output now, thanks for the input!

There were some complications with pulling very large numbers of metrics - exhausting user connection limit when fetching with unbridled parallelism, taking forever with no parallelism, handling ListMetrics results across multiple pages, etc...
I think I solved all of those in a good way. This approach works well for our use case where we want to monitor a few specific resources, but also works well to monitor all the things.

@sparrc
Copy link
Contributor

sparrc commented Apr 1, 2016

thanks @joshhardy, I'll take a look tomorrow

@sparrc
Copy link
Contributor

sparrc commented Apr 1, 2016

this looks good, but can you remove the azer/snakecase dependency? I've put the SnakeCase function into telegraf/internal (https://github.com/influxdata/telegraf/blob/master/internal/internal.go#L128) so you don't need to add an external dependency.

## Metrics to Pull (optional)
## Defaults to all Metrics in Namespace if nothing is provided
## Refreshes Namespace available metrics every 1h
[[inputs.cloudwatch.metrics]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we comment these by default since they're optional?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, updating.

@sparrc
Copy link
Contributor

sparrc commented Apr 1, 2016

@ljosa what do you think?


```
$ ./telegraf -config telegraf.conf -input-filter cloudwatch -test
> cloudwatch_aws_elb,load_balancer_name=p-motd,region=us-east-1,unit=seconds,latency_average=0.0012776487302826093,latency_maximum=0.07484889030456543,latency_minimum=0.0006072521209716797,latency_sample_count=13493,latency_sum=17.239314317703247 1459474260000000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line should have a space somewhere? from this it looks like everything is a tag

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - there is a space between unit=seconds and latency_average, but I somehow messed it up in the copy+paste to the README. Fixing.

@joshhardy
Copy link
Author

@sparrc - I updated as per your feedback and removed the dependency on azer/snakecase. I also found the handy internal.Duration and utilized that instead of the parsing I was doing manually.

@sparrc
Copy link
Contributor

sparrc commented Apr 1, 2016

@joshhardy it looks like you may have merged your fork. I'd prefer if contributors didn't do this because it creates a messy PR and git commit history (you can see that you now have 34 commits included here).

Could you re-apply your changes as a single commit rebased off of HEAD?

@ljosa
Copy link
Contributor

ljosa commented Apr 1, 2016

I meant to look at this today but got tied up, and won't have a chance to look at it in detail until Monday. But based on @joshhardy's comment it sounds great, and I don't see a reason to wait to merge it.

@sparrc
Copy link
Contributor

sparrc commented Apr 2, 2016

@joshhardy looks like there was a problem with CircleCI's output, here is the error message it's getting from the unit tests:

# github.com/influxdata/telegraf/plugins/inputs/cloudwatch
plugins/inputs/cloudwatch/cloudwatch_test.go:27: cannot use "1m" (type string) as type internal.Duration in field value
plugins/inputs/cloudwatch/cloudwatch_test.go:28: cannot use "1m" (type string) as type internal.Duration in field value
plugins/inputs/cloudwatch/cloudwatch_test.go:47: cannot use "1mins" (type string) as type internal.Duration in field value

@joshglympse
Copy link
Contributor

@sparrc - apologies, I missed that behavior change for the test and did that wonky fork merge... I will fix both of those with a single commit rebased from HEAD as you suggested ASAP.

@sparrc
Copy link
Contributor

sparrc commented Apr 2, 2016

no rush, I'll be cutting a 0.12.0 release soon and this won't be making it in either way :)

Rebased commit of previously reviewed branch.
Added cloudwatch client Mock and more rich unit tests.
@joshglympse
Copy link
Contributor

@sparrc - please take a look, I rebased into a single commit. I also added some CloudWatch mocks, so the UTs cover much more than before.
The CircleCI check failed with the Kafka error below that seems unrelated (the cloudwatch tests passed)... the Kafka error does not repro for me locally.

--- FAIL: TestConnectAndWrite (1.17s)
    Error Trace:    kafka_test.go:30
    Error:      Received unexpected error "FAILED to send kafka message: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.\n"


FAIL

@sparrc
Copy link
Contributor

sparrc commented Apr 4, 2016

I'll kick a rebuild

@sparrc
Copy link
Contributor

sparrc commented Apr 6, 2016

great, thanks @joshhardy!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants