
Collect cloudwatch stats in a more efficient and cost effective manner #5544

Merged 24 commits into master on Apr 23, 2019

Conversation

@glinton (Contributor) commented Mar 7, 2019

Resolves #5420
Resolves #5609
Closes #3255

Due to the structure of the data returned by GetMetricData (as opposed to GetMetricStatistics), metrics will no longer be tagged with the unit type.

This version makes a single call to the AWS API and creates metrics based on the dimensions and the metric names specified in the config file.

Config:

  [[inputs.cloudwatch.metrics]]
    names = ["NetworkIn"]
    [[inputs.cloudwatch.metrics.dimensions]]
      name = "AutoScalingGroupName"
      value = "k8s-ci"
  [[inputs.cloudwatch.metrics]]
    names = ["NetworkOut"]
    [[inputs.cloudwatch.metrics.dimensions]]
      name = "AutoScalingGroupName"
      value = "k8s-frontline"

Output:

cloudwatch_aws_ec2,AutoScalingGroupName=k8s-frontline,region=us-east-1 network_out_sum=6625229694,network_out_average=3312614847,network_out_maximum=3345284533,network_out_minimum=3279945161 1551919140000000000
cloudwatch_aws_ec2,AutoScalingGroupName=k8s-ci,region=us-east-1 network_in_minimum=1167141,network_in_sum=45580787,network_in_average=15193595.666666666,network_in_maximum=42813872 1551919140000000000

Todo:

  • tests
  • handle pagination
  • handle window better

@glinton added the feat (improvement on an existing feature, such as adding a new setting/mode to an existing plugin) and area/aws (AWS plugins including cloudwatch, ecs, kinesis) labels and the 1.11.0 milestone Mar 7, 2019
@glinton toggled a [WIP] prefix on the title (added Mar 7, removed and re-added Mar 8, 2019)
@CpuID commented Mar 22, 2019

Due to the structure of the data returned by GetMetricData (as opposed to GetMetricStatistics), metrics will no longer be tagged with the unit type.

❤️ I planned to rename this out anyway tbh

@danielnelson (Contributor)

@glinton regarding the unit type: I'm not super attached to it, but it seems we still know that information when we make the query. Are you still thinking of removing it?

@CpuID left a comment

@glinton minor tweaks :)

2 review threads on plugins/inputs/cloudwatch/README.md (resolved)
@CpuID commented Mar 22, 2019

@glinton regarding the unit type: I'm not super attached to it, but it seems we still know that information when we make the query. Are you still thinking of removing it?

I think the main upside to retaining it would be namespace compatibility for existing users, i.e. a user upgrades and things "just work" for other subsystems/dashboards that expect the pre-existing metric namespace.

Maybe having a feature flag to disable the units would be extremely nice though...

@glinton self-assigned this Mar 25, 2019
@glinton (Contributor, Author) commented Mar 25, 2019

I'm not certain that we know the unit type at query time. What about adding a config option to set a unit type, as the user may know what units a metric/namespace uses? If it's not set, there is no unit type tag.

@danielnelson (Contributor)

Don't worry about the unit tag then; we will add it to the release notes. There are other ways one could manually attach tags.

@CpuID commented Mar 25, 2019

@glinton can you drop some "napkin math" in here re: how you foresee the "cost effective manner" part playing out?

Background: I've been running this branch in "protoduction" for a few days, reading a reasonable volume of metrics from CloudWatch (4 inputs.cloudwatch stanzas: EBS (5m), ECS (1m), EC2 (5m), RDS (5m); most configured selectively, ECS with all dimensions/statistics):

2019-03-25T19:00:03Z D! [outputs.graphite] wrote batch of 352 metrics in 72.25408ms
2019-03-25T19:00:03Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:01:00Z D! [outputs.graphite] wrote batch of 593 metrics in 106.623415ms
2019-03-25T19:01:00Z D! [outputs.graphite] buffer fullness: 0 / 10000 metrics. 
2019-03-25T19:02:03Z D! [outputs.graphite] wrote batch of 664 metrics in 178.746237ms
2019-03-25T19:02:03Z D! [outputs.graphite] buffer fullness: 3 / 10000 metrics. 
2019-03-25T19:03:03Z D! [outputs.graphite] wrote batch of 636 metrics in 109.464365ms
2019-03-25T19:03:03Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:04:01Z D! [outputs.graphite] wrote batch of 376 metrics in 63.310413ms
2019-03-25T19:04:01Z D! [outputs.graphite] buffer fullness: 3 / 10000 metrics. 
2019-03-25T19:05:01Z D! [outputs.graphite] wrote batch of 260 metrics in 199.167616ms
2019-03-25T19:05:01Z D! [outputs.graphite] buffer fullness: 5 / 10000 metrics. 
2019-03-25T19:06:04Z D! [outputs.graphite] wrote batch of 772 metrics in 115.374741ms
2019-03-25T19:06:04Z D! [outputs.graphite] buffer fullness: 2 / 10000 metrics. 
2019-03-25T19:07:01Z D! [outputs.graphite] wrote batch of 481 metrics in 52.644945ms
2019-03-25T19:07:01Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:08:03Z D! [outputs.graphite] wrote batch of 689 metrics in 212.339554ms
2019-03-25T19:08:03Z D! [outputs.graphite] buffer fullness: 4 / 10000 metrics. 
2019-03-25T19:09:03Z D! [outputs.graphite] wrote batch of 467 metrics in 86.771158ms
2019-03-25T19:09:03Z D! [outputs.graphite] buffer fullness: 3 / 10000 metrics. 
2019-03-25T19:10:01Z D! [outputs.graphite] wrote batch of 160 metrics in 39.880934ms
2019-03-25T19:10:01Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:11:01Z D! [outputs.graphite] wrote batch of 697 metrics in 163.696045ms
2019-03-25T19:11:01Z D! [outputs.graphite] buffer fullness: 4 / 10000 metrics. 
2019-03-25T19:12:02Z D! [outputs.graphite] wrote batch of 592 metrics in 171.005482ms
2019-03-25T19:12:02Z D! [outputs.graphite] buffer fullness: 5 / 10000 metrics. 
2019-03-25T19:13:04Z D! [outputs.graphite] wrote batch of 687 metrics in 207.710244ms
2019-03-25T19:13:04Z D! [outputs.graphite] buffer fullness: 4 / 10000 metrics. 
2019-03-25T19:14:02Z D! [outputs.graphite] wrote batch of 407 metrics in 47.990755ms
2019-03-25T19:14:02Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 
2019-03-25T19:15:03Z D! [outputs.graphite] wrote batch of 329 metrics in 75.403892ms
2019-03-25T19:15:03Z D! [outputs.graphite] buffer fullness: 1 / 10000 metrics. 

When I enabled ECS only last week on telegraf master, it took our CloudWatch usage from $2.12/day (unrelated baseline usage) to ~$9/day (~$6.88/day in telegraf usage, ~$206.40/mth). I then enabled the other 3 stanzas using this branch, left it going over the weekend, and checked our daily usage: $10.26/day (~$8.14/day in telegraf usage, ~$244.20/mth).

I'm wondering if my assumptions about how GetMetricData is billed are wrong here... how do you interpret the pricing model?

My equivalent "napkin math", which I feel may be off:

I threw some print debugging into my local build of the #5544 branch to see how many GetMetricData calls were fired off:

You can request up to five statistics for the same metric in a single GetMetricData API request. Additional statistics are billed as an additional metric.
$0.01/1,000 metrics requested using GetMetricData
$0.01/1,000 GetMetricStatistics, ListMetrics, PutMetricData, GetDashboard, ListDashboards, PutDashboard and DeleteDashboards requests

  • in the code, the breakout comment says // loop through metrics and send groups of 100 at a time to gatherMetrics
  • so the code sends 100 metrics per GetMetricData call, vs a single metric per GetMetricStatistics call in the currently released code
  • some math:
    • ~235 metrics per minute divided by 5 = 47 billable GetMetricData API calls per minute. Could be slightly higher depending on the quantity of data queries per API call, but let's go with that for napkin math's sake
    • that's 2,030,400 billable GetMetricData API calls per month
    • 2,030,400 / 1,000 x $0.01 = $20.304/mth
  • the ECS metrics are per minute; the EBS + RDS + EC2 metrics are per 5 minutes, so retrieved 80% less often, making them a minority of the overall usage

What we are seeing right now:

  • ~391 GetMetricStatistics calls per minute
    • 40% higher due to no statistic filtering, so we get sum/min/max/avg/sample_count instead of just avg/max/sample_count
  • that's 16,891,200 billable GetMetricStatistics API calls per month, for ECS metrics alone
  • $168.912/mth, or $5.6304/day
  • this lines up pretty closely with the ~$7/day I was seeing for ECS + EC2 + RDS + EBS metrics (last 24hrs ~$9 vs a couple of days ago ~$2; the $2 is mostly cloudwatch-to-graphite)

And after deploying with the 5-minute metrics added as well (3 extra stanzas):

  • all the other 5-minute AWS metrics enabled come to about 665,280 stats per month; the 5-minute metric count is something like ~1k, and some napkin math tells me ~$17.28/mth for those

I'd say it's pretty safe to say that #5544 will bring our telegraf CloudWatch cost down to something like ~$37/mth, instead of ~$206/mth for the latest released telegraf.
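For reference, here's the same napkin math as a runnable Go sketch. The inputs (235 metrics/min on GetMetricData, 391 calls/min on GetMetricStatistics) and the assumption that five statistics of one metric bill as a single metric are taken from the bullets above, not from AWS documentation:

  package main

  import "fmt"

  const (
      pricePerThousand = 0.01         // $0.01 per 1,000 billable units on both APIs
      minutesPerMonth  = 60 * 24 * 30 // 43,200
  )

  func main() {
      // Released telegraf: one GetMetricStatistics request per metric, per interval.
      statsCallsPerMinute := 391.0
      fmt.Printf("GetMetricStatistics: $%.3f/mth\n",
          statsCallsPerMinute*minutesPerMonth/1000*pricePerThousand) // $168.912

      // This branch: GetMetricData, assuming five statistics bill as one metric.
      metricsPerMinute := 235.0
      fmt.Printf("GetMetricData:       $%.3f/mth\n",
          metricsPerMinute/5*minutesPerMonth/1000*pricePerThousand) // $20.304
  }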

@glinton (Contributor, Author) commented Mar 26, 2019

When I started, I hadn't read the pricing docs, but they make it sound like the cost will be the same either way:

$0.01/1,000 metrics requested using GetMetricData
$0.01/1,000 GetMetricStatistics, ListMetrics, PutMetricData, GetDashboard, ListDashboards, PutDashboard and DeleteDashboards requests

Sure, we make up to 20-100x fewer calls at best using GetMetricData.

Say you want stats for 100 metrics. Using the GetMetricStatistics API, it would be one call per metric, each returning the 5 stats (max, min, avg, sum, count): 100 API calls, since the statistics API is billed per request.
Using GetMetricData with no filtering, it would be 5 calls, because you can request 100 data points per call (20 metrics x 5 stats per request = 100 data points).
Using GetMetricData and requesting only the average of all 100 metrics, it would be 1 request.
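As a quick sketch of that arithmetic (a hypothetical helper, not code from this branch; it assumes the 100-data-points-per-call limit described above):

  package main

  import "fmt"

  // getMetricDataCalls returns how many GetMetricData requests are needed,
  // assuming each request carries at most 100 data queries and each
  // metric+statistic pair counts as one query.
  func getMetricDataCalls(metrics, statsPerMetric int) int {
      queries := metrics * statsPerMetric
      return (queries + 99) / 100 // integer ceiling of queries/100
  }

  func main() {
      fmt.Println(getMetricDataCalls(100, 5)) // 5 calls for all five statistics
      fmt.Println(getMetricDataCalls(100, 1)) // 1 call for averages only
      // versus 100 GetMetricStatistics calls (one per metric) in either case
  }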

What has me confused now is that the pricing for GetMetricData is per metric requested. Sure, it's more efficient, but the pricing doc makes it sound like it won't actually save any money.

In their knowledge-center, it says:

GetMetricData is cheaper. You can ask for more data points for a single metric by using GetMetricData rather than by using GetMetricStatistics. To get the same number of data points, you must perform 70 GetMetricStatistics calls.

That makes it sound not only computationally cheaper, but financially cheaper as well.

I know docs are eventually consistent, so there's a chance (and it seems you've experienced it) that this new API call does save money while making fewer requests.

@CpuID commented Mar 26, 2019

I know docs are eventually consistent, so there's a chance (and it seems you've experienced it) that this new API call does save money while making fewer requests.

I wonder if it's really only a saving for someone retrieving bulk metrics, i.e. more than 1 data point per interval? One route there, if you're willing to accept delayed retrieval, is to tune the configuration (still thinking through the exact combination) to pull, say, 5 x 1-minute data points once per 5 minutes using GetMetricData, since that API call permits it. That would reduce the retrieval cost substantially, at the expense of retrieval "latency".
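Roughly this kind of config, as a sketch (assuming the plugin's period/interval/delay settings behave as described later in this thread, i.e. backfilled 1-minute points are returned when the interval is longer than the period):

  [[inputs.cloudwatch]]
    region    = "us-east-1"
    namespace = "AWS/ECS"
    period    = "1m"  # keep 1-minute granularity in CloudWatch
    interval  = "5m"  # but only issue GetMetricData calls every 5 minutes
    delay     = "5m"  # give CloudWatch time to finalize the data points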

I think I concur with your thoughts, from my initial reading. Admittedly I did just wake up though, so if this opinion changes... :)

@glinton (Contributor, Author) commented Mar 26, 2019

That's a great point... I'll think about a way to preserve the timestamps if and when multiple data points are returned.

@goller (Contributor) left a comment

Hey @glinton, I can't wait to use this plugin. I like what you have done. I've found a few bugs and put in a variety of nit-picks. Let me know what you think!

20 review threads on plugins/inputs/cloudwatch/cloudwatch.go, cloudwatch_test.go, and README.md (resolved)
@glinton (Contributor, Author) commented Apr 10, 2019

@druchoo @CpuID can you test this build? There are some caching improvements and modifications to how metrics are created:

@druchoo commented Apr 11, 2019

Initial testing of the build looks good. I plan on taking a closer look at each namespace tomorrow. We're doing some cost analysis first. Notes I have now:

  • It's been running ~1 day and we're not really seeing favorable results for cost vs the GetMetricStatistics version; Cost Explorer is showing roughly the same cost. Granted, there's no filtering in any of the configs, but I would have thought there would be savings from just switching API calls. I was hopeful, at least from the discussion and napkin math above.
  • I believe the ratelimit section in the README needs to be updated: GetMetricData is limited to 50 TPS vs 400 TPS for GetMetricStatistics.
  • Can you elaborate a bit on the caching (cache_ttl)?

@glinton (Contributor, Author) commented Apr 11, 2019

  • Since the napkin math, I've been worried this was a waste of time. I do believe cost savings can be had at the cost of not having immediate stats. Can you try increasing the interval to 5-10x what you currently have it set to, but leave the period the same (I assume it's 1m)? This should limit the number of metrics requested, but will still return all backfilled data at 1-minute intervals.

  • Thanks, I'll make that update to the README.

  • The caching applies to how frequently metrics are listed from a namespace. If you don't have any metrics defined for a namespace, this plugin lists the available metrics no more frequently than cache_ttl. This setting also serves as the TTL for other caches used in the plugin to limit processing.
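In config terms, a minimal sketch: with no metrics blocks defined, discovery via ListMetrics runs no more often than cache_ttl:

  [[inputs.cloudwatch]]
    region    = "us-east-1"
    namespace = "AWS/EC2"
    period    = "5m"
    interval  = "5m"
    cache_ttl = "1h"  # re-list available metrics at most once per hour
    # No [[inputs.cloudwatch.metrics]] blocks: all available metrics are
    # discovered via ListMetrics and the result is cached for cache_ttl.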

@druchoo commented Apr 11, 2019

@glinton Totally not a waste of time :-). The performance/speed benefit alone is insane. Before, it wasn't practical to consume larger namespaces due to the time it took to complete.

Increasing the interval showed cost savings with GetMetricStatistics, and I imagine it will be the same for GetMetricData. I think we'll start with actually filtering our metrics and dimensions; depending on how that goes, we can play around with the interval.

Maybe worth mentioning: in many cases, not all available metrics are useful. Similarly, although multiple statistics are available for a given metric, there may be only one meaningful one. Unfortunately, you need to read the CloudWatch docs for each namespace.

Regardless of cost savings, thank you for working on this!

@danielnelson (Contributor)

@druchoo Is this helping with #3255 at all?

@CpuID commented Apr 12, 2019

@druchoo @CpuID can you test this build? There are some caching improvements and modifications to how metrics are created:

OOO, I can test next week if required.

@CpuID commented Apr 12, 2019

Yeah, I think the filtering of metrics helps a lot for $ reasons over the prior implementation. And I think with longer intervals, $ savings are possible.

This method is definitely the more efficient way to bulk-retrieve from CloudWatch, so I think it's definitely not a waste.

@druchoo commented Apr 15, 2019

@danielnelson, although TCP connections are way down, I don't think the underlying issue is resolved. Connections are still left open, but it looks very unlikely this would be an issue now. Just another plus for GetMetricData :-)

Not sure what to do with #3255. Close I guess?

7 review threads on plugins/inputs/cloudwatch/README.md and cloudwatch.go (resolved)
@danielnelson (Contributor)

@druchoo I marked this PR to close issue #3255 when merged.

@danielnelson merged commit e334830 into master Apr 23, 2019
@danielnelson deleted the feature/5420 branch April 23, 2019 00:36
@CpuID commented Apr 23, 2019

🎉

hwaastad pushed a commit to hwaastad/telegraf that referenced this pull request Jun 13, 2019
bitcharmer pushed a commit to bitcharmer/telegraf that referenced this pull request Oct 18, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020