-
Notifications
You must be signed in to change notification settings - Fork 487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Grafana Agent Crow #773
Grafana Agent Crow #773
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the overall approach. Since generating testing metrics depends on crow being scraped, I think it would be good to alert on not being scraped.
|
||
c := &Crow{ | ||
cfg: cfg, | ||
m: newMetrics(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have missed it, but are these metrics registered at any point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, I'm trying a different pattern here. Instead of registering metrics directly, Crow exposes its two collectors (internal state & test sample generator) via a method. This means the caller has to register Crow's metrics, but it allows the caller to arbitrarily register and unregister the metrics at runtime.
CGO_ENABLED=1 go test $(CGO_FLAGS) -race -cover -coverprofile=cover.out -p=4 ./... -- -online | ||
CGO_ENABLED=1 go test $(CGO_FLAGS) -cover -coverprofile=cover-norace.out -p=4 ./pkg/integrations/node_exporter ./pkg/logs -- -online |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out flags after --
aren't read and you can't omit the --
otherwise -online
will be passed to every package, even packages that don't have them defined. Build tags are the best choice here.
labels: | ||
- svc | ||
- status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The labels
field was removed at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once docker hub is ready.
@@ -1,12 +1,11 @@ | |||
// These tests depend on test assets from controller-runtime which don't work on Windows. | |||
|
|||
// +build !windows | |||
// +build !windows,has_network |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to execute despite me not setting has_network
during a build…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be reproduced by running nix-build -A grafana-agent
from the root of a checkout of https://github.com/flokli/nixpkgs, grafana-agent-remove-patch
branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 That's strange, I don't see the same behavior:
$ go test -v ./pkg/operator
? github.com/grafana/agent/pkg/operator [no test files]
$ go test -tags=has_network -v ./pkg/operator
=== RUN TestEnqueueRequestForSelector
# ...
What version of Go are you using?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(We had to change this because of this comment: turns out I was wrong about how the flags passing worked and the build tags are the only way to do this sanely, otherwise we'd need to declare the network flag in every package that has tests defined)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's go 1.16.7
. If you grok at https://github.com/NixOS/nixpkgs/blob/master/pkgs/development/go-modules/generic/default.nix#L168, that's how we invoke the build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks, I was struggling to find that.
I see what the problem is now: v0.18.2 is a cherry-picked release containing neither your flag change or this PR: https://github.com/grafana/agent/blob/v0.18.2/pkg/operator/selector_eventhandler_test.go.
We've been having to do a few cherry-pick releases since main contains unfinished work that we can't release yet. We thinks it's confusing (leads to situations like this) and also it's a pain to manage. We're currently working on unblocking main and coming up with a better defined branching and release strategy to avoid getting in this situation again. The current plan is to have a full release of main prior to the 10th of September.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I can reproduce the commands above do the same. It might be something specific to our test phase - we might not honor tags there properly. I'll check back on this, sorry for the noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, apparently this landed in main after the v0.18.2
tag, my checkout was a recent main, but the nixpkgs was pointing at v0.18.2
, so I can debug this all day 🤦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I can confirm this works just fine on current main. Sorry for the noise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, sorry for the confusion with v0.18.2!
* basic working example * increase default sample backoff to 1.25s * increase check frequency * tests for ready * review feedback * finish initial implementation * sign drone * fix lint issues * Update pkg/crow/samples.go
PR Description
Croooow!
This PR creates a metrics validation tool called Grafana Agent Crow.
This is similar to Cortex's text-exporter, which wasn't working well for me. text-exporter assumes that the process is always being scraped from process boot time, which can cause false positive errors when rolling out new agents.
Crow instead will queue samples for validation at scrape time, querying against Prometheus looking for those samples a configurable number of times before marking them as failed. Crow is not a load tester. It isn't designed to generate a significant amount of series. It is only used to make sure that:
This will be used alongside loki-canary and tempo-vulture in an automated smoke testing environment.
Which issue(s) this PR fixes
Notes to the Reviewer
I'm opening this in draft to get some early feedback on the approach.
PR Checklist