
Add pkg/prom/cluster package with membership management #461

Merged
4 commits merged into grafana:main from clustering-node on Mar 11, 2021

Conversation

@rfratto (Member) commented Mar 10, 2021

PR Description

This is the first of a few PRs to migrate the pkg/prom/ha logic to a pkg/prom/cluster package. It re-implements cluster membership as a dedicated node type, which will be used to ensure the Agent maintains membership in the proper cluster as the config changes.

Reshards are triggered during the membership cycle:

  • All Agents will be told to reshard when a node joins (including the joining Agent)
  • All Agents will be told to reshard when a node leaves (excluding the leaving Agent)
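For illustration, here is a minimal sketch of those two rules. The peer interface, the peers discovery callback, and the Reshard call are hypothetical stand-ins for the ring/lifecycler wiring the PR actually uses:

```go
package cluster

import "context"

// peer is a hypothetical handle to another Agent in the cluster.
type peer interface {
	Addr() string
	Reshard(ctx context.Context) error
}

type node struct {
	self  string        // this Agent's own address
	peers func() []peer // hypothetical discovery of healthy members, including self
}

// handleJoin runs when this Agent joins the ring: every member,
// including the joiner, reshards so hash ownership is redistributed.
func (n *node) handleJoin(ctx context.Context) error {
	for _, p := range n.peers() {
		if err := p.Reshard(ctx); err != nil {
			return err
		}
	}
	return nil
}

// handleLeave runs when this Agent leaves the ring: every member
// except the leaver reshards, since the leaver gives up its shards.
func (n *node) handleLeave(ctx context.Context) error {
	for _, p := range n.peers() {
		if p.Addr() == n.self {
			continue // the leaving Agent is excluded
		}
		if err := p.Reshard(ctx); err != nil {
			return err
		}
	}
	return nil
}
```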

Which issue(s) this PR fixes

Believe it or not, still working on #147.

Notes to the Reviewer

This isn't hooked into anything yet, and won't be until all the pieces are migrated. Tests are still needed, so I'm opening this as a draft.

PR Checklist

  • CHANGELOG updated
  • Documentation added
  • Tests updated

@rfratto (Member Author) commented Mar 10, 2021

The tests are pretty flaky, so CI might fail here. I'll look into why they flake tomorrow.

@rfratto (Member Author) commented Mar 10, 2021

Ah, it's flaking because it takes a little while for the lifecycler to join the ring in a separate goroutine. ApplyConfig should probably wait until it can verify that the node is in the ring.
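A hedged sketch of what such a wait could look like; the joined callback is a hypothetical stand-in for checking the ring for this node's instance:

```go
package cluster

import (
	"context"
	"fmt"
	"time"
)

// waitJoined polls until this node's instance shows up in the ring or
// the context expires. joined is a hypothetical callback; the real
// check would query the ring for the node's instance.
func waitJoined(ctx context.Context, id string, joined func(string) bool) error {
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()
	for {
		if joined(id) {
			return nil // the lifecycler has registered us in the ring
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for %s to join ring: %w", id, ctx.Err())
		case <-tick.C:
		}
	}
}
```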

```go
Enabled         bool          `yaml:"enabled"`
ReshardInterval time.Duration `yaml:"reshard_interval"`
ReshardTimeout  time.Duration `yaml:"reshard_timeout"`
Client          client.Config `yaml:"client"`
```
@rfratto (Member Author) commented on this diff:
I moved the client config here, which is a change I'll need to handle when this is wired in.

@rfratto (Member Author) commented Mar 11, 2021

Ready for review. I don't think I can make the implementation any simpler than it is now. I tried finding a better way to handle the tests, but I'm not sure there's a ton of room for improvement, even if they are a little ugly.

@rfratto marked this pull request as ready for review March 11, 2021 17:23
@rfratto requested a review from mattdurham March 11, 2021 17:24
@rfratto (Member Author) commented Mar 11, 2021

cc @56quarters since this will eventually be the new scraping service package.

pkg/prom/cluster/node.go — two review threads (outdated, resolved)
```go
// TransferOut implements ring.FlushTransferer. It connects to all other healthy agents and
// tells them to reshard. TransferOut should NOT be called manually unless the mutex is
// held.
func (n *node) TransferOut(ctx context.Context) error {
```
A Collaborator commented on this diff:
Is there a reason we don't hold the mutex here?

rfratto (Member Author):
Yep, if we hold the mutex here then it deadlocks with ApplyConfig when stopping the previous lifecycler 🙃. I'm not thrilled about this, but I can't think of a better way of handling it.
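To make the deadlock concrete, here is a toy, hypothetical reduction. In the real code the lifecycler may invoke TransferOut from its own goroutine while ApplyConfig blocks on shutdown; the sketch compresses that to one goroutine, but the effect is the same since the lock is never released:

```go
package main

import "sync"

type node struct {
	mu sync.Mutex
}

// transferOut stands in for ring.FlushTransferer's TransferOut.
func (n *node) transferOut() {
	// n.mu.Lock() // <- uncommenting this deadlocks applyConfig below,
	// defer n.mu.Unlock() // because Go mutexes are not reentrant.
}

// applyConfig mirrors the shape of the real ApplyConfig: it holds the
// mutex while stopping the old lifecycler, and stopping the lifecycler
// ends up calling transferOut while the lock is still held.
func (n *node) applyConfig() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.transferOut() // called with the lock already held
}

func main() {
	var n node
	n.applyConfig()
}
```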

Collaborator:
Brainstorming: since this looks like it's only called when leaving the ring and not called again under normal use, could we use sync.Once to help in any way?

rfratto (Member Author):
Nah, that would normally help, but not in this case: every time you call ApplyConfig you leave the old ring from the old config before joining the new ring with the new config, so sync.Once would break here.
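A toy demonstration of that failure mode (names are illustrative): if the peer notification were guarded by sync.Once, the second config reload would silently skip the reshard:

```go
package main

import (
	"fmt"
	"sync"
)

var once sync.Once

// transferOut notifies peers at most once if guarded by sync.Once.
func transferOut(reload int) {
	notified := false
	once.Do(func() { notified = true })
	fmt.Printf("reload %d: peers notified = %v\n", reload, notified)
}

func main() {
	transferOut(1) // reload 1: peers notified = true
	transferOut(2) // reload 2: peers notified = false <- reshard lost
}
```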

@rfratto merged commit 12e1372 into grafana:main Mar 11, 2021
@rfratto deleted the clustering-node branch March 11, 2021 21:06
@rfratto mentioned this pull request Mar 11, 2021
@mattdurham mentioned this pull request Sep 7, 2021
mattdurham pushed a commit that referenced this pull request Nov 11, 2021
* initial commit of pkg/prom/cluster package

* add tests

* fix test / race condition problems

* address review feedback