Change Log

v2.4.3 (2019-06-24)

Full Changelog

Closed issues:

  • Tag v2.4.2 release #396
  • Tag v2.4.0 release #393
  • Predict disk checks tmpfs #391
  • Tag v2.3.1 release #388
  • Tag v2.3.1 release #385
  • Tag v2.3.0 release #382
  • Tag v2.3.0 release #379
  • Tag v2.3.0 release #376
  • Prometheus startup script calls a consul kv value that doesn't exist #374
  • Fix metadata error on boot #372
  • Tag v2.3.0 release #369
  • Tag v2.3.0 release #366
  • Tag v2.3.0 release #362
  • Tag v2.3.0 release #359
  • Scrape Redis CloudWatch metrics #357
  • Support gracefully reloading configuration for services that support it #351
  • Categorize other alerts #349
  • [notifications] Categorize cron alerts as non-critical #346
  • Slack notification to all alert routes #344
  • Upgrade Traefik to 1.5.4 #342
  • Upgrade blackbox_exporter to 0.12.0 #341
  • Upgrade AlertManager to 0.14.0 #340
  • Upgrade to Prometheus 2.2.1 #339
  • Tag v2.2.0 release #336
  • [TF] Cleanup for 0.11.x #334
  • Rename 'sink' alert route to something more descriptive #330
  • Tag project with platform tag #328
  • [VarnishCacheHitRateTooLow] Don't alert if overall traffic is minimal #324
  • Cleanup old pagerduty integration #322
  • [monitoring] TimeSanity: 15 minutes too short of a time for NTP state to settle #320
  • [pagerduty] Add more pagerduty integration #318
  • Tag v2.1.0 release #314
  • Fixing prometheus rules #313
  • Tag v2.1.0 release #310
  • Use base's swap support in favor of our own solution #308
  • Tag v2.1.0 release #305
  • [pagerduty] handle <UNSET> as magical value #303
  • Tag v2.1.0 release #300
  • Improve Monitoring Coverage #298
  • [traefik] Upgrade to 1.4.6 #294

Merged pull requests:

v2.4.2 (2019-06-20)

Full Changelog

Closed issues:

  • Tag v2.4.0 release #393
  • Predict disk checks tmpfs #391
  • Tag v2.3.1 release #388
  • Tag v2.3.1 release #385
  • Tag v2.3.0 release #382
  • Tag v2.3.0 release #379
  • Tag v2.3.0 release #376
  • Prometheus startup script calls a consul kv value that doesn't exist #374
  • Fix metadata error on boot #372
  • Tag v2.3.0 release #369
  • Tag v2.3.0 release #366
  • Tag v2.3.0 release #362
  • Tag v2.3.0 release #359
  • Scrape Redis CloudWatch metrics #357
  • Support gracefully reloading configuration for services that support it #351
  • Categorize other alerts #349
  • [notifications] Categorize cron alerts as non-critical #346
  • Slack notification to all alert routes #344
  • Upgrade Traefik to 1.5.4 #342
  • Upgrade blackbox_exporter to 0.12.0 #341
  • Upgrade AlertManager to 0.14.0 #340
  • Upgrade to Prometheus 2.2.1 #339
  • Tag v2.2.0 release #336
  • [TF] Cleanup for 0.11.x #334
  • Rename 'sink' alert route to something more descriptive #330
  • Tag project with platform tag #328
  • [VarnishCacheHitRateTooLow] Don't alert if overall traffic is minimal #324
  • Cleanup old pagerduty integration #322
  • [monitoring] TimeSanity: 15 minutes too short of a time for NTP state to settle #320
  • [pagerduty] Add more pagerduty integration #318
  • Tag v2.1.0 release #314
  • Fixing prometheus rules #313
  • Tag v2.1.0 release #310
  • Use base's swap support in favor of our own solution #308
  • Tag v2.1.0 release #305
  • [pagerduty] handle <UNSET> as magical value #303
  • Tag v2.1.0 release #300
  • Improve Monitoring Coverage #298
  • [traefik] Upgrade to 1.4.6 #294
  • [backup] S3 sync doesn't support empty files #292
  • Tag v2.0.4 release #290

Merged pull requests:

v2.4.0 (2019-03-06)

Full Changelog

Closed issues:

  • Predict disk checks tmpfs #391
  • Tag v2.3.1 release #388
  • Tag v2.3.1 release #385
  • Tag v2.3.0 release #382
  • Tag v2.3.0 release #379
  • Tag v2.3.0 release #376
  • Prometheus startup script calls a consul kv value that doesn't exist #374
  • Fix metadata error on boot #372
  • Tag v2.3.0 release #369
  • Tag v2.3.0 release #366
  • Tag v2.3.0 release #362
  • Tag v2.3.0 release #359
  • Scrape Redis CloudWatch metrics #357
  • Support gracefully reloading configuration for services that support it #351
  • Categorize other alerts #349
  • [notifications] Categorize cron alerts as non-critical #346
  • Slack notification to all alert routes #344
  • Upgrade Traefik to 1.5.4 #342
  • Upgrade blackbox_exporter to 0.12.0 #341
  • Upgrade AlertManager to 0.14.0 #340
  • Upgrade to Prometheus 2.2.1 #339
  • Tag v2.2.0 release #336
  • [TF] Cleanup for 0.11.x #334
  • Rename 'sink' alert route to something more descriptive #330
  • Tag project with platform tag #328
  • [VarnishCacheHitRateTooLow] Don't alert if overall traffic is minimal #324
  • Cleanup old pagerduty integration #322
  • [monitoring] TimeSanity: 15 minutes too short of a time for NTP state to settle #320
  • [pagerduty] Add more pagerduty integration #318
  • Tag v2.1.0 release #314
  • Fixing prometheus rules #313
  • Tag v2.1.0 release #310
  • Use base's swap support in favor of our own solution #308
  • Tag v2.1.0 release #305
  • [pagerduty] handle <UNSET> as magical value #303
  • Tag v2.1.0 release #300
  • Improve Monitoring Coverage #298
  • [traefik] Upgrade to 1.4.6 #294
  • [backup] S3 sync doesn't support empty files #292
  • Tag v2.0.4 release #290
  • [alerts] Group by project #287

Merged pull requests:

v2.3.1 (2018-08-21)

Full Changelog

Closed issues:

  • Tag v2.3.0 release #382
  • Tag v2.3.1 release #385

Merged pull requests:

v2.3.0 (2018-08-01)

Full Changelog

Closed issues:

  • Prometheus startup script calls a consul kv value that doesn't exist #374
  • Fix metadata error on boot #372
  • Scrape Redis CloudWatch metrics #357
  • Support gracefully reloading configuration for services that support it #351
  • Categorize other alerts #349
  • [notifications] Categorize cron alerts as non-critical #346
  • Slack notification to all alert routes #344
  • Upgrade Traefik to 1.5.4 #342
  • Upgrade blackbox_exporter to 0.12.0 #341
  • Upgrade AlertManager to 0.14.0 #340
  • Upgrade to Prometheus 2.2.1 #339
  • Tag v2.2.0 release #336
  • Tag v2.3.0 release #379
  • Tag v2.3.0 release #376
  • Tag v2.3.0 release #369
  • Tag v2.3.0 release #366
  • Tag v2.3.0 release #362
  • Tag v2.3.0 release #359

Merged pull requests:

v2.2.0 (2018-04-06)

Full Changelog

Closed issues:

  • [TF] Cleanup for 0.11.x #334
  • Rename 'sink' alert route to something more descriptive #330
  • Tag project with platform tag #328
  • [VarnishCacheHitRateTooLow] Don't alert if overall traffic is minimal #324
  • Cleanup old pagerduty integration #322
  • [monitoring] TimeSanity: 15 minutes too short of a time for NTP state to settle #320
  • [pagerduty] Add more pagerduty integration #318
  • Fixing prometheus rules #313

Merged pull requests:

  • Fix #334 #335 (gozer)
  • Cleanup alert routing #333 (limed)
  • All these alerts are for fluentd-elasticsearch so we just tag them as nubis #332 (limed)
  • Fixing up all rules and fixing up alert routes #331 (limed)
  • Tag project as a platform component #329 (limed)
  • Fixing broken rule #327 (limed)
  • Limit alerting to Varnish if it's seeing at least a non-trivial amount of overall traffic #325 (gozer)
  • Remove old pagerduty integration key #323 (limed)
  • Give NTPd 30 minutes to stabilize #321 (gozer)
  • Added support for pagerduty integration #319 (limed)
  • Update nubis-travis #317 (tinnightcap)

v2.1.0 (2018-02-23)

Full Changelog

Closed issues:

  • Use base's swap support in favor of our own solution #308
  • [pagerduty] handle <UNSET> as magical value #303
  • Improve Monitoring Coverage #298
  • [traefik] Upgrade to 1.4.6 #294
  • [backup] S3 sync doesn't support empty files #292
  • [rds] Scrape rds metrics from cloudwatch #284
  • Tag v2.1.0 release #314
  • Tag v2.1.0 release #310
  • Tag v2.1.0 release #305
  • Tag v2.1.0 release #300
  • [cloudwatch] Add AWS/Lambda Throttles metric #163
  • [dashboard] Update rules for consul_exporter 0.3.0 #161
  • [cloudwatch] Scrape EFS metrics #148

Merged pull requests:

v2.0.4 (2017-12-08)

Full Changelog

Implemented enhancements:

  • [cloudwatch] Evaluate data we are scraping from cloudwatch #260
  • [cloudwatch] Cloudwatch scraping not working for ASG #258

Fixed bugs:

  • [blackbox] Blackbox exporter not starting up #271
  • Traefik does not start up after v1.4 update #269

Closed issues:

  • [alerts] Group by project #287
  • [consul] consul_catalog_service_node_healthy service label is now service_id #280
  • Remove wildcard *.mon. DNS entry #277
  • [varnish] Add Varnish Dashboard #272
  • [prometheus] Probe /-/healthy for liveness #264
  • Make instance_type tunable #252
  • Detect when EFS mount does not come up on boot #207
  • Tag v2.0.4 release #290

Merged pull requests:

v2.0.3 (2017-11-06)

Full Changelog

Closed issues:

  • [traefik] Upgrade traefik to 1.4.1 #249
  • [rules] Get rid of IpForwardingEnabledNonNAT #248
  • [memory] Create some swap on startup #246
  • Tag v2.0.3 release #265
  • Tag v2.0.3 release #261

Merged pull requests:

v2.0.2 (2017-10-25)

Full Changelog

Fixed bugs:

  • Fix broken squid alerts #230

Closed issues:

  • Scrape squid exporter metrics #228
  • Tag v2.0.2 release #241
  • Tag v2.0.2 release #237
  • Tag v2.0.2 release #234

Merged pull requests:

v2.0.1 (2017-10-18)

Full Changelog

Closed issues:

  • Limit heap size to 75% of available RAM #219
  • Cronjob alerts are not very specific #217
  • Upgrade to Prometheus 1.8.0 #215
  • Disable backups in favor of snapshots #142
  • [backups] Enable some swap #110
  • [backups] Make the in-progress page expose a backup metric of some sort ? #106
  • [duplicity] Cleanup orphaned lockfiles #66
  • Tag v2.0.1 release #225
  • Tag v2.0.1 release #221

Merged pull requests:

v2.0.0 (2017-10-06)

Full Changelog

Closed issues:

  • [unicreds] Cleanup resources on destruction #190
  • Use persistent storage #184
  • [dashboard] Add ES grafana dashboard #182
  • [traefik] Move traefik port to 9100 range #176
  • [traefik] Configure traefik to expose metrics endpoint as well #172
  • Scrape traefik metrics #171
  • Update packages #168
  • Upgrade traefik to v1.3.8 #166
  • [grafana] Upgrade grafana to stable #160
  • Switch from atlas to using terraform image search #158
  • Add IAM permission #157
  • [dashboard] Update grafana json file #150
  • [dashboard] Add EFS dashboard #147
  • Tag v2.0.0 release #212
  • Tag v2.0.0 release #208
  • Tag v2.0.0 release #203
  • Tag v2.0.0 release #199
  • Tag v2.0.0 release #195
  • Tag v2.0.0 release #192
  • Tag v2.0.0 release #186

Merged pull requests:

v1.5.1 (2017-08-18)

Full Changelog

Closed issues:

  • [security] Close up access to unneeded ports #145
  • Disable external services routing #143
  • [traefik] Upgrade to 1.3.4 #140
  • Tag v1.5.1 release #151

Merged pull requests:

  • Merge v1.5.1 release into develop. [skip ci] #154 (tinnightcap)
  • Update CHANGELOG for v1.5.1 release [skip ci] #153 (tinnightcap)
  • Close down unnecessary open tcp ports #146 (gozer)
  • Don't route for *.mon... anymore, these services are exposed via SSO now #144 (gozer)
  • Upgrade to traefik v1.3.4 #141 (gozer)

v1.5.0 (2017-06-24)

Full Changelog

Closed issues:

  • [grafana] Enable ProxyAuth #135
  • Upgrade Prometheus to 1.7.1 and Alertmanager to 0.7.1 #131
  • Allow discovery of custom scraping targets #130
  • [datadog] Remove support #128
  • [blackbox] Upgrade to 0.5.0 #122
  • [alertmanager] Upgrade to 0.6.2 #121
  • [prometheus] Upgrade to 1.6.2 #120
  • [traefik] Upgrade to v1.2.3 #119
  • Tag v1.5.0 release #137

Merged pull requests:

  • Merge v1.5.0 release into develop. [skip ci] #139 (tinnightcap)
  • Update CHANGELOG for v1.5.0 release [skip ci] #138 (tinnightcap)
  • Use OIDC_CLAIM_email as logged-in user in Grafana #136 (gozer)
  • Increasing disk space for prometheus federators - Bug 1367263 #134 (kfferrando)
  • Version upgrades #133 (gozer)
  • Implement custom scrape target discovery via Consul service tags #132 (gozer)
  • Remove support for DataDog #129 (gozer)
  • Enable SSO for Prometheus #127 (gozer)
  • Upgrade blackbox exporter to 0.5.0 #126 (gozer)
  • Upgrade alertmanager to 0.6.2 #125 (gozer)
  • Upgrade prometheus to v1.6.2 #124 (gozer)
  • Upgrade Traefik to 1.2.3 #123 (gozer)

v1.4.2 (2017-05-05)

Full Changelog

Closed issues:

  • Add nubis/builder/artifacts/AMIs.json to .gitignore #111
  • Tag v1.4.2 release #116
  • Tag v1.4.2 release #113

Merged pull requests:

v1.4.1 (2017-04-11)

Full Changelog

Closed issues:

  • [backups] Make sure in progress page is parseable by Prometheus #104
  • [typo] curl -retry instead of --retry in prometheus-onboot #102
  • Tag v1.4.1 release #107

Merged pull requests:

  • Merge v1.4.1 release into develop. [skip ci] #109 (tinnightcap)
  • Update CHANGELOG for v1.4.1 release [skip ci] #108 (tinnightcap)
  • Make sure the backup in progress landing page is parseable by prometheus #105 (gozer)
  • Fix curl -retry tyop #103 (gozer)

v1.4.0 (2017-03-31)

Full Changelog

Closed issues:

  • Upgrade Traefik to v1.2.0 #85
  • Add a configurable live_app label #83
  • Allow sink alerts (apps) destination to be configured #81
  • Add backup in progress landing page #79
  • [labels] Add technical_owner and account_number labels #77
  • Disable detailed monitoring #75
  • [mysql] Discover mysqld-exporter #73
  • Apache alerts are application alerts, don't consider them platform alerts #71
  • Alert only for platform alerts, leave application alerting up to upstream federators #69
  • [upgrade] Prometheus 1.5.2 #67
  • [cron] Setup random delay on intensive jobs #61
  • [cloudwatch] Filter metrics on VPCs #59
  • [billing] Currently reporting in triplicate #58
  • [bug] Can't scrape metrics less frequently than every 5 minutes #56
  • [cloudwatch] Billing only exposed in us-east-1 #54
  • Add support for ingesting cloudwatch metrics #53
  • Upgrade blackbox exporter to 0.4.0 #50
  • Convert storage type to gp2 #47
  • Upgrade Prometheus to 1.5.0 #46
  • Tag v1.4.0 release #99
  • Tag v1.4.0 release #95
  • Tag v1.4.0 release #91
  • Tag v1.4.0 release #45

Merged pull requests:

  • Merge v1.4.0 release into develop. [skip ci] #101 (tinnightcap)
  • Update CHANGELOG for v1.4.0 release [skip ci] #100 (tinnightcap)
  • Don't expose Consul to the internet, because #98 (gozer)
  • Merge v1.4.0 release into develop. [skip ci] #97 (tinnightcap)
  • Update CHANGELOG for v1.4.0 release [skip ci] #96 (tinnightcap)
  • Fixups to pass Travis lint checks #94 (tinnightcap)
  • Merge v1.4.0 release into develop. [skip ci] #93 (tinnightcap)
  • Update CHANGELOG for v1.4.0 release [skip ci] #92 (tinnightcap)
  • Fix typo, missing $ #90 (gozer)
  • Merge v1.4.0 release into develop. [skip ci] #89 (tinnightcap)
  • Update CHANGELOG for v1.4.0 release [skip ci] #88 (tinnightcap)
  • Upgrade duplicity to 0.7.12-0ubuntu0ppa1276~ubuntu14.04.1 #87 (gozer)
  • Upgrade Traefik to v1.2.0 #86 (gozer)
  • Add a configurable live_app label #84 (gozer)
  • Allow notification configuration of alert sink (app alerts) #82 (gozer)
  • Add a "Backup in progress..." landing page during backup runs #80 (gozer)
  • Show technical_owner and account_id in federated metrics #78 (gozer)
  • Disable detailed EC2 monitoring #76 (gozer)
  • Detect and scrape mysqld-exporter instances #74 (gozer)
  • Remove the platform=nubis tag from Apache alerts #72 (gozer)
  • Ignore non-platform alerts #70 (gozer)
  • Upgrade to Prometheus 1.5.2 #68 (gozer)
  • Update builder artifacts for v1.4.0 release [skip ci] #65 (tinnightcap)
  • Terraform 0.8 Upgrade #64 (gozer)
  • Add 10 minute jitter to the backup jobs #63 (gozer)
  • Only scrape our CloudWatch Billing metrics from the admin VPC #62 (gozer)
  • Filter all CloudWatch resources with an environment filter: #60 (gozer)
  • Set our billing scrape interval to the maximum allowed of 5 minutes #57 (gozer)
  • Create a separate cloudwatch_exporter_billing for just AWS/Billing metrics #55 (gozer)
  • Upgrade Blackbox Exporter to 0.4.0 #52 (gozer)
  • Upgrade Prometheus to 1.5.0 #51 (gozer)
  • Fix alert description comment for accuracy #49 (gozer)
  • Switch root storage to gp2(SSD) #48 (gozer)

v1.3.0 (2017-01-18)

Closed issues:

  • Randomize cron::hourly, to avoid concurrent backup runs everywhere #43
  • Apache Dashboard alert overlay is wrong #41
  • Backup to S3 with duply/duplicity #38
  • Lower default metrics retention #36
  • Upgrade to Traefik 1.1.2 #29
  • Provision the monitoring password from TF #20
  • Expose nubis_sudo_groups and nubis_user_groups userdata #18
  • Move API secrets/keys to nubis-secret #15
  • [alertmanager] Add PagerDuty support #14
  • [alertmanager] Send notification of resolved alerts #12
  • [squid] Pull telemetry from snmp_exporter #10
  • Fix small upstart tyop #7
  • Increase instance size, t2.nano is probably too small #6
  • [cleanup] Productionize Prometheus #3
  • Backup regularly to S3 #2
  • Set the Consul environments/<env>/global/node_exporter/config/enabled boolean on startup #1
  • Tag v1.3.0 release #31

Merged pull requests:

  • Little improvements for Prometheus Backups #44 (gozer)
  • Limit displayed alerts to the firing ones #42 (gozer)
  • Update builder artifacts for v1.4.0-dev release #40 (tinnightcap)
  • Use duply & duplicity to drive backups to S3 #39 (gozer)
  • lower metrics retention to 14 days #37 (gozer)
  • Update CHANGELOG for v1.3.0 release #35 (tinnightcap)
  • Update CHANGELOG for v1.3.0 release #34 (tinnightcap)
  • Update builder artifacts for v1.3.0 release #33 (tinnightcap)
  • Update CHANGELOG for v1.3.0 release #32 (tinnightcap)
  • Upgrade to Traefik 1.1.2 (includes our reported fix) #30 (gozer)
  • Fix Links #28 (tinnightcap)
  • Add Documentation #27 (tinnightcap)
  • update to nubis-travis v0.1.3 #26 (gozer)
  • use nubis-cron wrapper #25 (gozer)
  • Scrape ES exporters if present #24 (gozer)
  • enable ES in Grafana #23 (gozer)
  • tell Traefik about the admin password #22 (gozer)
  • Massive Prometheus merge of current state #21 (gozer)
  • Exposing ldap group userdata #19 (limed)
  • fix tyop #17 (gozer)
  • Add PagerDuty notification support #16 (gozer)
  • Send resolved notification to both email and slack notifiers #13 (gozer)
  • Add snmp target for proxies #11 (gozer)
  • Bump size to t2.small #9 (gozer)
  • Fix tyop #8 (gozer)
  • Don't alert on Consul services down if consul itself is not healthy #5 (gozer)
  • Refactor most of everything to ready for deployment with nubis-deploy #4 (gozer)

* This Change Log was automatically generated by github_changelog_generator