Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grafana Agent Crow #773

Merged
merged 9 commits into from
Aug 12, 2021
Merged

Grafana Agent Crow #773

merged 9 commits into from
Aug 12, 2021

Conversation

rfratto
Copy link
Member

@rfratto rfratto commented Jul 22, 2021

PR Description

Croooow!

This PR creates a metrics validation tool called Grafana Agent Crow.

This is similar to Cortex's text-exporter, which wasn't working well for me. text-exporter assumes that the process is always being scraped from process boot time, which can cause false positive errors when rolling out new agents.

Crow instead will queue samples for validation at scrape time, querying against Prometheus looking for those samples a configurable number of times before marking them as failed. Crow is not a load tester. It isn't designed to generate a significant amount of series. It is only used to make sure that:

  1. Scraped samples get sent.
  2. Scraped samples get sent with the proper value.

This will be used alongside loki-canary and tempo-vulture in an automated smoke testing environment.

Which issue(s) this PR fixes

Notes to the Reviewer

I'm opening this in draft to get some early feedback on the approach.

PR Checklist

  • CHANGELOG updated
  • Documentation added
  • Tests updated

pkg/crow/samples.go Outdated Show resolved Hide resolved
Copy link
Member

@mapno mapno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the overall approach. Since generating testing metrics depends on crow being scraped, I think it would be good to alert on not being scraped.

pkg/crow/crow.go Outdated Show resolved Hide resolved
pkg/crow/crow.go Outdated Show resolved Hide resolved

c := &Crow{
cfg: cfg,
m: newMetrics(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have missed it, but are these metrics registered at any point?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I'm trying a different pattern here. Instead of registering metrics directly, Crow exposes its two collectors (internal state & test sample generator) via a method. This means the caller has to register Crow's metrics, but it allows the caller to arbitrarily register and unregister the metrics at runtime.

Comment on lines -224 to -225
CGO_ENABLED=1 go test $(CGO_FLAGS) -race -cover -coverprofile=cover.out -p=4 ./... -- -online
CGO_ENABLED=1 go test $(CGO_FLAGS) -cover -coverprofile=cover-norace.out -p=4 ./pkg/integrations/node_exporter ./pkg/logs -- -online
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out flags after -- aren't read and you can't omit the -- otherwise -online will be passed to every package, even packages that don't have them defined. Build tags are the best choice here.

Comment on lines -91 to -93
labels:
- svc
- status
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The labels field was removed at some point.

@rfratto rfratto marked this pull request as ready for review August 12, 2021 17:48
@rfratto rfratto changed the title Draft: Grafana Agent Crow Grafana Agent Crow Aug 12, 2021
@rfratto
Copy link
Member Author

rfratto commented Aug 12, 2021

This is ready for review. I let it run for 10 minutes in docker-compose and there's a 100% success rate:

image

pkg/crow/samples.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@mattdurham mattdurham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once docker hub is ready.

@rfratto rfratto merged commit cbde019 into grafana:main Aug 12, 2021
@rfratto rfratto deleted the crow branch August 12, 2021 23:22
@@ -1,12 +1,11 @@
// These tests depend on test assets from controller-runtime which don't work on Windows.

// +build !windows
// +build !windows,has_network
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to execute despite me not setting has_network during a build…

Copy link
Contributor

@flokli flokli Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be reproduced by running nix-build -A grafana-agent from the root of a checkout of https://github.com/flokli/nixpkgs, grafana-agent-remove-patch branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 That's strange, I don't see the same behavior:

$ go test -v ./pkg/operator
?       github.com/grafana/agent/pkg/operator   [no test files]
$ go test -tags=has_network -v ./pkg/operator
=== RUN   TestEnqueueRequestForSelector
# ... 

What version of Go are you using?

Copy link
Member Author

@rfratto rfratto Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We had to change this because of this comment: turns out I was wrong about how the flags passing worked and the build tags are the only way to do this sanely, otherwise we'd need to declare the network flag in every package that has tests defined)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's go 1.16.7. If you grok at https://github.com/NixOS/nixpkgs/blob/master/pkgs/development/go-modules/generic/default.nix#L168, that's how we invoke the build.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks, I was struggling to find that.

I see what the problem is now: v0.18.2 is a cherry-picked release containing neither your flag change or this PR: https://github.com/grafana/agent/blob/v0.18.2/pkg/operator/selector_eventhandler_test.go.

We've been having to do a few cherry-pick releases since main contains unfinished work that we can't release yet. We thinks it's confusing (leads to situations like this) and also it's a pain to manage. We're currently working on unblocking main and coming up with a better defined branching and release strategy to avoid getting in this situation again. The current plan is to have a full release of main prior to the 10th of September.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I can reproduce the commands above do the same. It might be something specific to our test phase - we might not honor tags there properly. I'll check back on this, sorry for the noise.

Copy link
Contributor

@flokli flokli Aug 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, apparently this landed in main after the v0.18.2 tag, my checkout was a recent main, but the nixpkgs was pointing at v0.18.2, so I can debug this all day 🤦

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I can confirm this works just fine on current main. Sorry for the noise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, sorry for the confusion with v0.18.2!

@mattdurham mattdurham mentioned this pull request Sep 7, 2021
3 tasks
mattdurham pushed a commit that referenced this pull request Nov 11, 2021
* basic working example

* increase default sample backoff to 1.25s

* increase check frequency

* tests for ready

* review feedback

* finish initial implementation

* sign drone

* fix lint issues

* Update pkg/crow/samples.go
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Apr 9, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants