Add CloudWatch Metric Input Plugin #935

joshhardy · 2016-03-28T18:18:05Z

We want to use the TICK stack for monitoring and alerting across our system.
We are very happy with the breadth of inputs for telegraf, but found that we would be unable to monitor and alert on statistics from AWS Elastic Load Balancers - a critical components for us.

This plugin allows us to pull Metrics Statistics from CloudWatch for our ELBs.
Although we built this plugin to monitor ELBs, the plugin call pull any CloudWatch Metric.

sparrc · 2016-03-28T19:42:33Z

plugins/inputs/cloudwatch/cloudwatch.go

+
+func (c *CloudWatch) SampleConfig() string {
+	return `
+[[inputs.cloudwatch]]


you don't need the [[inputs.cloudwatch]] line, it will be auto-added

sparrc · 2016-03-28T21:57:55Z

overall this looks great, thank you @joshhardy

titilambert · 2016-03-29T12:02:24Z

@joshhardy how did you configure Telegraf to monitor CloudWatch ? Do you have only one Telegraf to pull data ? If yes, this is a singke point of failure, isn't it ?

joshhardy · 2016-03-29T14:30:14Z

@titilambert - we are just getting started with this (thus, open to suggestions), but right now we are running a single container of Telegraf on our Docker cluster specifically for monitoring CloudWatch. Failures in the container or loss of the host will cause the container to be spun up elsewhere in the cluster within a few seconds. Given the 1m interval we are running with, we are currently ok with that.

ljosa · 2016-03-29T15:08:50Z

@joshhardy, @sparrc: I wrote a Cloudwatch input plugin yesterday as well, and submitted a PR before I noticed that you had just submitted one. We should unify them.

It looks like the main difference is that my plugin only requires the user to configure the AWS region and CloudWatch namespace; it uses ListMetrics to discover the metrics and dimensions. I think that is valuable, wdyt?

sparrc · 2016-03-29T16:49:35Z

seems to me that both approaches have their advantages. I like that @ljosa's converts all strings to snake_case, which makes the metrics fit into the influxdb style a bit better.

In this PR I like that the plugin has the ability to select period and delay, and I think it's a good idea to put in a longer default interval, since this plugin will incure fees.

Perhaps there could simply be an option: all_metrics = true, and if set to false then the user will need to specify the metrics that they want as in @joshhardy's PR?

sparrc · 2016-03-29T16:53:17Z

so I would say, keep the following from @ljosa:

snake_case metrics
default to fetching all metrics with a boolean flag (fetch_all_metrics = true?)

keep the following from @joshhardy:

keep period, delay, and interval as default options in the config.
keep the ability to specify specific metrics that the user wants fetched, but only if fetch_all_metrics = false

what do you guys think of that as a unified plugin?

joshhardy · 2016-03-29T18:19:37Z

@sparrc @ljosa - sounds good to me.
For AWS accounts of our scale, monitoring all metrics in a CloudWatch Namespace is not feasible (we have a lot of resources, it would cost a fortune), but I can see the value in other circumstances where you really just want to monitor everything.
For us, ability to specify more granular metric filters is critical.

IMO, the best way to handle it would be to make more of the config options here optional.

statistics could be optional and default to all (the way AWS Cloudwatch input plugin #936 behaves)
Metric name could be optional and default to all available Metrics for the namespace
Entire metrics array could be optional and default to all (by calling ListMetrics like AWS Cloudwatch input plugin #936)

That way, users can be prescriptive on the aspects they want to scope and be wide open on the stuff they don't care about filtering down.

The other thing #936 allows is multiple namespaces in the same input... To me, it makes sense to define multiple cloudwatch inputs for different namespaces like that, but I don't have a strong opinion on it. What is the recommended approach there? Each namespace would have to provide the ability to drill into specific statistics/metrics within the namespace.

Also, what is the recommended way for @ljosa and I to collaborate on implementing these changes? (apologies if this is a stupid question - I'm a bit new to the process)

sparrc · 2016-03-29T18:26:37Z

One possibility for the two of you collaborating would be for one person to give the other access to their fork, and then you can both work off the same branch (and don't do any force pushes). Another would be one person works in master of the fork while the other rebases those changes into their branch.

joshhardy · 2016-04-01T02:39:44Z

@sparrc @ljosa - please review updates.
I updated the required config so that explicitly defined Metrics are not required.
If a list of Metrics is provided, it will be used. If not, the list of all available Metrics is pulled for the Namespace (refreshed once per hour so you don't have to bounce the service on account changes).

I also updated the recorded Measurements and Tags to use more standard snake casing and updated the format a bit... I'm pleased with the output now, thanks for the input!

There were some complications with pulling very large numbers of metrics - exhausting user connection limit when fetching with unbridled parallelism, taking forever with no parallelism, handling ListMetrics results across multiple pages, etc...
I think I solved all of those in a good way. This approach works well for our use case where we want to monitor a few specific resources, but also works well to monitor all the things.

sparrc · 2016-04-01T02:59:55Z

thanks @joshhardy, I'll take a look tomorrow

sparrc · 2016-04-01T17:50:45Z

this looks good, but can you remove the azer/snakecase dependency? I've put the SnakeCase function into telegraf/internal (https://github.com/influxdata/telegraf/blob/master/internal/internal.go#L128) so you don't need to add an external dependency.

sparrc · 2016-04-01T17:51:42Z

plugins/inputs/cloudwatch/cloudwatch.go

+  ## Metrics to Pull (optional)
+  ## Defaults to all Metrics in Namespace if nothing is provided
+  ## Refreshes Namespace available metrics every 1h
+  [[inputs.cloudwatch.metrics]]


should we comment these by default since they're optional?

Sure, updating.

sparrc · 2016-04-01T17:52:02Z

@ljosa what do you think?

sparrc · 2016-04-01T17:55:42Z

plugins/inputs/cloudwatch/README.md

+
+```
+$ ./telegraf -config telegraf.conf -input-filter cloudwatch -test
+> cloudwatch_aws_elb,load_balancer_name=p-motd,region=us-east-1,unit=seconds,latency_average=0.0012776487302826093,latency_maximum=0.07484889030456543,latency_minimum=0.0006072521209716797,latency_sample_count=13493,latency_sum=17.239314317703247 1459474260000000000


this line should have a space somewhere? from this it looks like everything is a tag

Good catch - there is a space between unit=seconds and latency_average, but I somehow messed it up in the copy+paste to the README. Fixing.

joshhardy · 2016-04-01T20:47:11Z

@sparrc - I updated as per your feedback and removed the dependency on azer/snakecase. I also found the handy internal.Duration and utilized that instead of the parsing I was doing manually.

sparrc · 2016-04-01T20:53:24Z

@joshhardy it looks like you may have merged your fork. I'd prefer if contributors didn't do this because it creates a messy PR and git commit history (you can see that you now have 34 commits included here).

Could you re-apply your changes as a single commit rebased off of HEAD?

ljosa · 2016-04-01T23:00:51Z

I meant to look at this today but got tied up, and won't have a chance to look at it in detail until Monday. But based on @joshhardy's comment it sounds great, and I don't see a reason to wait to merge it.

sparrc · 2016-04-02T00:06:29Z

@joshhardy looks like there was a problem with CircleCI's output, here is the error message it's getting from the unit tests:

# github.com/influxdata/telegraf/plugins/inputs/cloudwatch
plugins/inputs/cloudwatch/cloudwatch_test.go:27: cannot use "1m" (type string) as type internal.Duration in field value
plugins/inputs/cloudwatch/cloudwatch_test.go:28: cannot use "1m" (type string) as type internal.Duration in field value
plugins/inputs/cloudwatch/cloudwatch_test.go:47: cannot use "1mins" (type string) as type internal.Duration in field value

joshglympse · 2016-04-02T00:15:18Z

@sparrc - apologies, I missed that behavior change for the test and did that wonky fork merge... I will fix both of those with a single commit rebased from HEAD as you suggested ASAP.

sparrc · 2016-04-02T00:55:04Z

no rush, I'll be cutting a 0.12.0 release soon and this won't be making it in either way :)

Rebased commit of previously reviewed branch. Added cloudwatch client Mock and more rich unit tests.

joshglympse · 2016-04-04T04:14:35Z

@sparrc - please take a look, I rebased into a single commit. I also added some CloudWatch mocks, so the UTs cover much more than before.
The CircleCI check failed with the Kafka error below that seems unrelated (the cloudwatch tests passed)... the Kafka error does not repro for me locally.

--- FAIL: TestConnectAndWrite (1.17s)
    Error Trace:    kafka_test.go:30
    Error:      Received unexpected error "FAILED to send kafka message: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.\n"


FAIL

sparrc · 2016-04-04T04:25:42Z

I'll kick a rebuild

sparrc · 2016-04-06T19:45:53Z

great, thanks @joshhardy!

sparrc reviewed Mar 28, 2016
View reviewed changes

ljosa mentioned this pull request Mar 29, 2016

AWS Cloudwatch input plugin #936

Closed

sparrc reviewed Apr 1, 2016
View reviewed changes

Add CloudWatch input plugin

fec902b

Rebased commit of previously reviewed branch. Added cloudwatch client Mock and more rich unit tests.

joshhardy force-pushed the master branch from 62077e1 to fec902b Compare April 4, 2016 04:04

sparrc closed this in d5b9e00 Apr 6, 2016

sparrc mentioned this pull request Apr 6, 2016

Consume AWS metrics (EC2, ELB) #346

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CloudWatch Metric Input Plugin #935

Add CloudWatch Metric Input Plugin #935

joshhardy commented Mar 28, 2016

sparrc Mar 28, 2016

sparrc commented Mar 28, 2016

titilambert commented Mar 29, 2016

joshhardy commented Mar 29, 2016

ljosa commented Mar 29, 2016

sparrc commented Mar 29, 2016

sparrc commented Mar 29, 2016

joshhardy commented Mar 29, 2016

sparrc commented Mar 29, 2016

joshhardy commented Apr 1, 2016

sparrc commented Apr 1, 2016

sparrc commented Apr 1, 2016

sparrc Apr 1, 2016

joshhardy Apr 1, 2016

sparrc commented Apr 1, 2016

sparrc Apr 1, 2016

joshhardy Apr 1, 2016

joshhardy commented Apr 1, 2016

sparrc commented Apr 1, 2016

ljosa commented Apr 1, 2016

sparrc commented Apr 2, 2016

joshglympse commented Apr 2, 2016

sparrc commented Apr 2, 2016

joshglympse commented Apr 4, 2016

sparrc commented Apr 4, 2016

sparrc commented Apr 6, 2016

Add CloudWatch Metric Input Plugin #935

Add CloudWatch Metric Input Plugin #935

Conversation

joshhardy commented Mar 28, 2016

sparrc Mar 28, 2016

Choose a reason for hiding this comment

sparrc commented Mar 28, 2016

titilambert commented Mar 29, 2016

joshhardy commented Mar 29, 2016

ljosa commented Mar 29, 2016

sparrc commented Mar 29, 2016

sparrc commented Mar 29, 2016

joshhardy commented Mar 29, 2016

sparrc commented Mar 29, 2016

joshhardy commented Apr 1, 2016

sparrc commented Apr 1, 2016

sparrc commented Apr 1, 2016

sparrc Apr 1, 2016

Choose a reason for hiding this comment

joshhardy Apr 1, 2016

Choose a reason for hiding this comment

sparrc commented Apr 1, 2016

sparrc Apr 1, 2016

Choose a reason for hiding this comment

joshhardy Apr 1, 2016

Choose a reason for hiding this comment

joshhardy commented Apr 1, 2016

sparrc commented Apr 1, 2016

ljosa commented Apr 1, 2016

sparrc commented Apr 2, 2016

joshglympse commented Apr 2, 2016

sparrc commented Apr 2, 2016

joshglympse commented Apr 4, 2016

sparrc commented Apr 4, 2016

sparrc commented Apr 6, 2016