metricdog: anonymous bottlerocket metrics #1006

webern · 2020-07-31T22:42:01Z

Issue Number

Description

Adds a program and systemd service that checks whether certain critical services are running. Reports this to a URL via GET request.

Testing

Unit Testing

I organized the code in such a way as to facilitate unit testing everything except the calls to systemctl.

Happy Integ

I set up an S3 bucket and confirmed that metricdog requests are working. I ran metricdog manually from the command line under various scenarious and traced the URL that was produced.

Healthy Health Ping

metricdog --log-level trace send-health-ping

Result:

https://somedomain.com/metricstest?sender=metricdog&event=health-ping&version=0.4.1&variant=aws-k8s-1.15&arch=x86_64&region=us-west-2&seed=1712&version-lock=latest&ignore-waves=false&failed_services=&is_healthy=true

Boot Success

metricdog --log-level trace send-boot-success

https://somedomain.com/metricstest?sender=metricdog&event=boot-success&version=0.4.1&variant=aws-k8s-1.15&arch=x86_64&region=us-west-2&seed=1712&version-lock=latest&ignore-waves=false

One Failed Service

I 'broke' kubelet and ran a health ping, and received:

https://somedomain.com/metricstest?sender=metricdog&event=health-ping&version=0.4.1&variant=aws-k8s-1.15&arch=x86_64&region=us-west-2&seed=1712&version-lock=latest&ignore-waves=false&failed_services=kubelet%3A255&is_healthy=false

The failed services list de-escapes to kubelet:255

Two Failed Services

I 'broke' an additional service:

https://somedomain.com/metricstest?sender=metricdog&event=health-ping&version=0.4.1&variant=aws-k8s-1.15&arch=x86_64&region=us-west-2&seed=1605&version-lock=latest&ignore-waves=false&failed_services=containerd%3A1%2Ckubelet%3A255&is_healthy=false

The failed services list de-escapes to containerd:1,kubelet:255.

Bad URL Integ

To make sure that a bad URL doesn't cause problems, I ran an instance with a fake URL. I found that that, send-boot-success and send-health-ping each caused an error to be reported in the system journal, but no other ill affects were observed.

Systemd

I ran hosts and observed that the systemd unit and timer work as expected.

Edit 8/28/2020: I tested an ECS variant with ab6930c and found the the URLs are still being constructed properly, including the reporting of failed services.

Migration

I tested the migration through upgrade and downgrade, confirming that settings were added/removed and the system was healthy at all three waypoints.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

tjkirch

Some comments from the surrounding bits; still need to review healthdog itself.

sources/models/src/aws-k8s-1.15/mod.rs

packages/os/healthdog-toml

packages/os/healthdog.service

packages/os/healthdog.timer

sources/models/defaults.toml

sources/models/src/aws-k8s-1.15/override-defaults.toml

sources/healthdog/README.md

webern · 2020-08-04T00:15:16Z

webern force-pushed the webern:healthdog branch from ec0abbf to f18d841

renamed settings.healthdog to settings.metrics.
renamed region to aws-region.
added (and tested) settings for ecs and dev variants.
made suggested changes to systemd configs (and tested them)

tjkirch

Not quite done with service_check, but have to stop for the moment...

sources/healthdog/Cargo.toml

sources/healthdog/src/main_test.rs

sources/models/src/aws-dev/override-defaults.toml

sources/healthdog/src/healthdog.rs

sources/healthdog/src/service_check.rs

sources/healthdog/src/main.rs

sources/models/src/aws-dev/override-defaults.toml

sources/models/defaults.toml

sources/models/src/aws-ecs-1/override-defaults.toml

tjkirch · 2020-08-07T23:38:16Z

Maybe also chrony, containerd, host-containerd, since they're all related to default settings.

host-containerd sure, but containerd might not be used in all variants..?

webern · 2020-08-24T21:42:09Z

webern force-pushed the webern:healthdog branch from f18d841 to 4401e25

First attempt at a slightly precarious rebase.

webern · 2020-08-24T22:42:25Z

webern force-pushed the webern:healthdog branch from 4401e25 to 809f98f

Fix horrible things from the first rebase attempt.

webern · 2020-08-24T22:46:39Z

webern force-pushed the webern:healthdog branch from 809f98f to 3fae608

Undo a tiny re-order that was a rebase artifact.

webern · 2020-08-25T19:53:33Z

webern force-pushed the webern:healthdog branch from 3fae608 to 1083d9b

Use aws_region instead of region and normalize on snake_case for params.

webern · 2020-08-25T20:01:36Z

webern force-pushed the webern:healthdog branch from 1083d9b to d62f75d

Rebase the swift-moving develop again.

webern · 2020-08-25T22:04:24Z

webern force-pushed the webern:healthdog branch from d62f75d to f515d38

Create a join-strings helper in schnauzer to simplify the healthdog-toml template.

webern · 2020-08-26T19:35:40Z

webern force-pushed the webern:healthdog branch from f515d38 to 2a5aef0

Fixes most of the lowest-hanging fruit in the PR comments.

webern · 2020-08-26T19:49:17Z

webern force-pushed the webern:healthdog branch from 2a5aef0 to 9dc047c

Weeee... rebase the migration versions again from develop! :)

webern · 2020-08-26T21:28:25Z

webern force-pushed the webern:healthdog branch from 9dc047c to a58c5e2

Use structopt.

webern · 2020-08-26T23:36:33Z

webern force-pushed the webern:healthdog branch from a58c5e2 to cbc5d1d

Rename the services setting to service-checks.

webern · 2020-08-28T00:02:38Z

Maybe also chrony, containerd, host-containerd, since they're all related to default settings.

host-containerd sure, but containerd might not be used in all variants..?

It's in all current variants. I think it would be good to add these to the top-level defaults.toml. A variant that doesn't use containerd can still override.

P.S. We're actually overriding for all current variants now (on my WIP branch), so defaults.toml more like an example anyway.

webern · 2020-08-28T01:09:08Z

webern force-pushed the webern:healthdog branch from cbc5d1d to 11d0d7b

Enormous improvement to the systemctl exit code parsing (thank you, Tom)
Add containerd and chronyd to the default service checks
fix some issues with my renaming of the service-checks setting

sources/healthdog/src/error.rs

jamieand

Looking specifically at the interactions with systemd:

It'd be nice to eventually add support for non-lingering oneshot services, but I think this is a good place to start.

webern · 2020-08-28T21:02:47Z

webern force-pushed the webern:healthdog branch from 328b475 to 3909ee7

Fix remaining issues, esp. providing construction defaults in two places.

webern · 2020-08-28T22:34:09Z

webern force-pushed the webern:healthdog branch from 3909ee7 to ab6930c

Fix a bug with the exit code parsing.

tjkirch

I think we should have the - prefix to not block startup, but otherwise these are basically nits. Thanks!

packages/os/mark-successful-boot.service

sources/Cargo.toml

sources/api/schnauzer/src/helpers.rs

sources/models/defaults.toml

sources/metricdog/src/metricdog.rs

sources/metricdog/src/service_check.rs

webern · 2020-10-05T20:16:42Z

webern force-pushed the webern:healthdog branch from 153acff to b8b5c3a

Addresses some of the latest comments.

webern · 2020-10-06T23:07:45Z

webern force-pushed the webern:healthdog branch from b8b5c3a to c966f19

Addresses more PR comments. Everything is addressed except for the version bumping of the migration. Will do that after our next release and rebase then.

sources/api/schnauzer/src/lib.rs

sources/metricdog/src/main.rs

sources/api/schnauzer/src/helpers.rs

tjkirch

Sorry, I wasn't trying to block this, I forgot to come back after the SimpleLogger discovery. Just a couple suggestions above, and we need to update the version now that 1.0.2 is released.

webern · 2020-10-20T00:32:36Z

webern force-pushed the webern:healthdog branch from c966f19 to 7404d48

rename schnauzer helper from join-array to join_array
change schnauzer join_arrayhelper to not add spaces
rebase
Change metrics URL to include the v1/metrics path, which the back-end expects and enforces.

Note: I wish I had done the rebase separately. I will remember to do rebases in their own push in the future to make diffing easier.

webern · 2020-10-20T20:57:30Z

webern force-pushed the webern:healthdog branch from c966f19 to 7404d48

Rebase, fix conflict in Cargo.toml

bcressey

Nicely done!

bcressey · 2020-10-20T22:01:36Z

packages/os/metricdog.timer

+# Don't fire at exactly the same second across machines started together.
+RandomizedDelaySec=10


We probably want more jitter than this, maybe as much as an hour.

Added jitter in this push 11b0927

Oops set jitter to 3600 this time for one hour 92230f5

sources/metricdog/README.md

sources/metricdog/src/metricdog.rs

packages/os/metricdog.service

zmrow · 2020-10-21T15:16:48Z

sources/metricdog/README.md

+* `version`: the Bottlerocket version.
+* `variant`: the Bottlerocket variant.
+* `arch`: the machine architecture, e.g.'x86_64' or 'arm'.
+* `aws_region`: the AWS region the machine is running in.


Does this need to be AWS specific? As far as I can tell, we could just leave this as "region" and be a bit more agnostic so other platforms can make use of this field as well

If it were running on another platform that had the same region name, we'd need some way to tell the difference, and I don't think the other fields would help. Another option would be a "platform" field set to "aws", and a "region" field?

TL;DR: I agree region is better than aws_region, and I would prefer not to add a field such as platform=aws.

I initially called this field region and I strongly agree with changing it back to region. My thinking is that region is general-enough of a concept to be potentially useful with any cloud provider, and even for on-prem deployments, since it really just means 'general location'. In fact, I dislike calling aws_region so much that the current version of the backend accepts either region or aws_region! I should have spoken up sooner.

My disinclination to include platform=aws is less strong, but it feels unnecessary since the variant value conveys that information, and especially because we have not yet decided how we plan to codify such 'attributes of variant' in the system itself.

I don't think the other fields would help

Adding platform=aws is currently no different than substr(variant, 3).

P.S. Another way of stating my objection for platform=aws is, where does the value come from at runtime or compile time? Basically if variant in (aws-ecs-1, aws-k8s.1.15, aws-k8s.1.16, aws-k8s.1.17, aws-k8s.1.18) then platform=aws? Doesn't seem any cleaner to code that logic into the build/runtime of Bottlerocket right now than to do so when interpreting the metrics.

I don't think variant will convey all of the information we need. aws-dev can run anywhere, but may only be broken on one platform, for example.

Re: the region field - let's change it to be region; we all seem to agree on that.

Thinking about the platform field - I think it's a good idea, but it raises some questions like "How do we know what platform we're on?, etc"

I don't think we need to block on this, and if we choose to ship without the changes, let's open an issue to track fixing this for other platforms.

zmrow

🥼

webern · 2020-11-03T20:35:54Z

webern force-pushed the webern:healthdog branch from a938b1f to 7c89fe2

Changes the name of the aws_region query parameter back to region (as discussed above).

webern · 2020-11-03T21:22:24Z

webern force-pushed the webern:healthdog branch from 7c89fe2 to 518f4c6

Rebase.

webern · 2021-01-13T21:24:55Z

webern force-pushed the webern:healthdog branch from 518f4c6 to 520e55d

A substantial rebase.

bcressey · 2021-01-28T20:38:55Z

metricdog should also respect settings.network.https-proxy and settings.network.no-proxy.

webern · 2021-01-28T20:58:08Z

metricdog should also respect settings.network.https-proxy and settings.network.no-proxy.

Maybe as simple as adding .proxy(reqwest::Proxy::http("https://my.prox")?) to the reqwest client when settings.network.https-proxy is set? I'm not sure I understand what settings.network.no-proxy means at the moment.

webern · 2021-02-04T20:12:25Z

webern force-pushed the webern:healthdog branch from 520e55d to b7a6dfe

rebase

webern · 2021-02-09T23:37:46Z

webern force-pushed the webern:healthdog branch from b7a6dfe to 10a32f7

rebase

webern · 2021-02-10T23:09:31Z

webern force-pushed the webern:healthdog branch from 10a32f7 to c6e68bf

Add migration to Release.toml

webern · 2021-02-11T21:53:16Z

webern force-pushed the webern:healthdog branch from c6e68bf to 3db70e2

Fixed the migration so it does the right things. I tested it through upgrade and downgrade, will add that to the PR notes.

webern · 2021-02-12T19:13:26Z

webern force-pushed the webern:healthdog branch from 3db70e2 to 11b0927

Increase jitter to one hour.

webern · 2021-02-12T19:34:26Z

webern force-pushed the webern:healthdog branch from 560f336 to 92230f5

Correct the Jitter to 3600 (one hour)
Add a TimeoutStartSec to the service

Metricdog is a new command-line program that sends anonymous information about a Bottlerocket instance to a metrics endpoint. A report of boot success is sent by the mark-boot-success service and at 2 minutes after boot, and every 6 hours thereafter, a report of service health is sent.

webern · 2021-02-12T21:13:07Z

webern force-pushed the webern:healthdog branch from 92230f5 to 312eff5

The last rebase?

webern requested review from tjkirch, iliana, bcressey and jamieand July 31, 2020 22:42

tjkirch reviewed Aug 3, 2020

View reviewed changes

samuelkarp reviewed Aug 3, 2020

View reviewed changes

sources/healthdog/README.md Outdated Show resolved Hide resolved

tjkirch reviewed Aug 4, 2020

View reviewed changes

sources/healthdog/src/service_check.rs Outdated Show resolved Hide resolved

etungsten reviewed Aug 5, 2020

View reviewed changes

sources/healthdog/src/main.rs Outdated Show resolved Hide resolved

bcressey reviewed Aug 7, 2020

View reviewed changes

sources/models/src/aws-dev/override-defaults.toml Outdated Show resolved Hide resolved

sources/models/defaults.toml Outdated Show resolved Hide resolved

sources/models/src/aws-ecs-1/override-defaults.toml Outdated Show resolved Hide resolved

webern marked this pull request as draft August 10, 2020 18:05

webern commented Aug 28, 2020

View reviewed changes

sources/healthdog/src/error.rs Outdated Show resolved Hide resolved

jamieand approved these changes Aug 28, 2020

View reviewed changes

webern requested a review from etungsten August 28, 2020 22:41

tjkirch reviewed Oct 5, 2020

View reviewed changes

tjkirch reviewed Oct 6, 2020

View reviewed changes

sources/api/schnauzer/src/lib.rs Outdated Show resolved Hide resolved

sources/metricdog/src/main.rs Show resolved Hide resolved

sources/api/schnauzer/src/helpers.rs Outdated Show resolved Hide resolved

tjkirch approved these changes Oct 13, 2020

View reviewed changes

tjkirch mentioned this pull request Oct 13, 2020

Switch from TermLogger to SimpleLogger #1157

Closed

bcressey approved these changes Oct 20, 2020

View reviewed changes

zmrow reviewed Oct 21, 2020

View reviewed changes

zmrow approved these changes Oct 22, 2020

View reviewed changes

gregdek added the status/needs-triage Pending triage or re-evaluation label Jan 7, 2021

tjkirch removed the status/needs-triage Pending triage or re-evaluation label Jan 7, 2021

webern mentioned this pull request Feb 2, 2021

metrics: respect proxy settings #1298

Closed

webern requested a review from tjkirch February 12, 2021 16:14

webern merged commit b51b3b2 into bottlerocket-os:develop Feb 12, 2021

webern deleted the healthdog branch February 12, 2021 23:44

		# Don't fire at exactly the same second across machines started together.
		RandomizedDelaySec=10

metricdog: anonymous bottlerocket metrics #1006

metricdog: anonymous bottlerocket metrics #1006

Conversation

webern commented Jul 31, 2020 • edited Loading

Issue Number

Description

Testing

Unit Testing

Happy Integ

Healthy Health Ping

Boot Success

One Failed Service

Two Failed Services

Bad URL Integ

Systemd

Migration

Terms of contribution:

tjkirch left a comment

Choose a reason for hiding this comment

webern commented Aug 4, 2020

tjkirch left a comment

Choose a reason for hiding this comment

tjkirch commented Aug 7, 2020

webern commented Aug 24, 2020

webern commented Aug 24, 2020

webern commented Aug 24, 2020

webern commented Aug 25, 2020 • edited Loading

webern commented Aug 25, 2020

webern commented Aug 25, 2020

webern commented Aug 26, 2020

webern commented Aug 26, 2020

webern commented Aug 26, 2020 • edited Loading

webern commented Aug 26, 2020

webern commented Aug 28, 2020 • edited Loading

webern commented Aug 28, 2020

jamieand left a comment

Choose a reason for hiding this comment

webern commented Aug 28, 2020

webern commented Aug 28, 2020

tjkirch left a comment

Choose a reason for hiding this comment

webern commented Oct 5, 2020

webern commented Oct 6, 2020 • edited Loading

tjkirch left a comment

Choose a reason for hiding this comment

webern commented Oct 20, 2020 • edited Loading

webern commented Oct 20, 2020

bcressey left a comment

Choose a reason for hiding this comment

bcressey Oct 20, 2020

Choose a reason for hiding this comment

webern Feb 12, 2021

Choose a reason for hiding this comment

webern Feb 12, 2021

Choose a reason for hiding this comment

zmrow Oct 21, 2020

Choose a reason for hiding this comment

tjkirch Oct 21, 2020

Choose a reason for hiding this comment

webern Oct 21, 2020 • edited Loading

Choose a reason for hiding this comment

tjkirch Oct 21, 2020

Choose a reason for hiding this comment

zmrow Oct 22, 2020

Choose a reason for hiding this comment

zmrow left a comment

Choose a reason for hiding this comment

webern commented Nov 3, 2020

webern commented Nov 3, 2020

webern commented Jan 13, 2021

bcressey commented Jan 28, 2021

webern commented Jan 28, 2021

webern commented Feb 4, 2021

webern commented Feb 9, 2021

webern commented Feb 10, 2021

webern commented Feb 11, 2021

webern commented Feb 12, 2021

webern commented Feb 12, 2021

webern commented Feb 12, 2021

webern commented Jul 31, 2020 •

edited

Loading

webern commented Aug 25, 2020 •

edited

Loading

webern commented Aug 26, 2020 •

edited

Loading

webern commented Aug 28, 2020 •

edited

Loading

webern commented Oct 6, 2020 •

edited

Loading

webern commented Oct 20, 2020 •

edited

Loading

webern Oct 21, 2020 •

edited

Loading