-
Notifications
You must be signed in to change notification settings - Fork 524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metricdog: anonymous bottlerocket metrics #1006
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments from the surrounding bits; still need to review healthdog itself.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite done with service_check, but have to stop for the moment...
host-containerd sure, but containerd might not be used in all variants..? |
First attempt at a slightly precarious rebase. |
Fix horrible things from the first rebase attempt. |
Undo a tiny re-order that was a rebase artifact. |
Use aws_region instead of region and normalize on snake_case for params. |
Rebase the swift-moving develop again. |
Create a |
Fixes most of the lowest-hanging fruit in the PR comments. |
Weeee... rebase the migration versions again from develop! :) |
Use |
Rename the |
It's in all current variants. I think it would be good to add these to the top-level defaults.toml. A variant that doesn't use containerd can still override. P.S. We're actually overriding for all current variants now (on my WIP branch), so |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking specifically at the interactions with systemd:
It'd be nice to eventually add support for non-lingering oneshot services, but I think this is a good place to start.
Fix remaining issues, esp. providing construction defaults in two places. |
Fix a bug with the exit code parsing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have the -
prefix to not block startup, but otherwise these are basically nits. Thanks!
Addresses some of the latest comments. |
Addresses more PR comments. Everything is addressed except for the version bumping of the migration. Will do that after our next release and rebase then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I wasn't trying to block this, I forgot to come back after the SimpleLogger discovery. Just a couple suggestions above, and we need to update the version now that 1.0.2 is released.
Note: I wish I had done the rebase separately. I will remember to do rebases in their own push in the future to make diffing easier. |
Rebase, fix conflict in Cargo.toml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done!
packages/os/metricdog.timer
Outdated
# Don't fire at exactly the same second across machines started together. | ||
RandomizedDelaySec=10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want more jitter than this, maybe as much as an hour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added jitter in this push 11b0927
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops set jitter to 3600 this time for one hour 92230f5
sources/metricdog/README.md
Outdated
* `version`: the Bottlerocket version. | ||
* `variant`: the Bottlerocket variant. | ||
* `arch`: the machine architecture, e.g.'x86_64' or 'arm'. | ||
* `aws_region`: the AWS region the machine is running in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be AWS specific? As far as I can tell, we could just leave this as "region" and be a bit more agnostic so other platforms can make use of this field as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it were running on another platform that had the same region name, we'd need some way to tell the difference, and I don't think the other fields would help. Another option would be a "platform" field set to "aws", and a "region" field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: I agree region
is better than aws_region
, and I would prefer not to add a field such as platform=aws
.
I initially called this field region
and I strongly agree with changing it back to region
. My thinking is that region
is general-enough of a concept to be potentially useful with any cloud provider, and even for on-prem deployments, since it really just means 'general location'. In fact, I dislike calling aws_region
so much that the current version of the backend accepts either region
or aws_region
! I should have spoken up sooner.
My disinclination to include platform=aws
is less strong, but it feels unnecessary since the variant
value conveys that information, and especially because we have not yet decided how we plan to codify such 'attributes of variant' in the system itself.
I don't think the other fields would help
Adding platform=aws
is currently no different than substr(variant, 3)
.
P.S. Another way of stating my objection for platform=aws
is, where does the value come from at runtime or compile time? Basically if variant in (aws-ecs-1, aws-k8s.1.15, aws-k8s.1.16, aws-k8s.1.17, aws-k8s.1.18) then platform=aws
? Doesn't seem any cleaner to code that logic into the build/runtime of Bottlerocket right now than to do so when interpreting the metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think variant will convey all of the information we need. aws-dev can run anywhere, but may only be broken on one platform, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: the region field - let's change it to be region
; we all seem to agree on that.
Thinking about the platform
field - I think it's a good idea, but it raises some questions like "How do we know what platform we're on?, etc"
I don't think we need to block on this, and if we choose to ship without the changes, let's open an issue to track fixing this for other platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥼
Changes the name of the |
Rebase. |
A substantial rebase. |
metricdog should also respect |
Maybe as simple as adding |
rebase |
rebase |
Add migration to Release.toml |
Fixed the migration so it does the right things. I tested it through upgrade and downgrade, will add that to the PR notes. |
Increase jitter to one hour. |
|
Metricdog is a new command-line program that sends anonymous information about a Bottlerocket instance to a metrics endpoint. A report of boot success is sent by the mark-boot-success service and at 2 minutes after boot, and every 6 hours thereafter, a report of service health is sent.
The last rebase? |
Issue Number
Closes #1000
Description
Adds a program and systemd service that checks whether certain critical services are running. Reports this to a URL via GET request.
Testing
Unit Testing
I organized the code in such a way as to facilitate unit testing everything except the calls to systemctl.
Happy Integ
I set up an S3 bucket and confirmed that metricdog requests are working. I ran metricdog manually from the command line under various scenarious and traced the URL that was produced.
Healthy Health Ping
Result:
Boot Success
One Failed Service
I 'broke' kubelet and ran a health ping, and received:
The failed services list de-escapes to
kubelet:255
Two Failed Services
I 'broke' an additional service:
The failed services list de-escapes to
containerd:1,kubelet:255
.Bad URL Integ
To make sure that a bad URL doesn't cause problems, I ran an instance with a fake URL. I found that that, send-boot-success and send-health-ping each caused an error to be reported in the system journal, but no other ill affects were observed.
Systemd
I ran hosts and observed that the systemd unit and timer work as expected.
Edit 8/28/2020: I tested an ECS variant with
ab6930c
and found the the URLs are still being constructed properly, including the reporting of failed services.Migration
I tested the migration through upgrade and downgrade, confirming that settings were added/removed and the system was healthy at all three waypoints.
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.