
Add pkg/prom/cluster package with membership management #461

Merged
4 commits merged into grafana:main from clustering-node on Mar 11, 2021

Conversation

@rfratto (Member) commented Mar 10, 2021

PR Description

This is the first of a few PRs to migrate the pkg/prom/ha logic to a pkg/prom/cluster package. It re-implements cluster membership as a dedicated node type, which will be used to ensure the Agent maintains membership in the proper cluster as the config changes.

Reshards are triggered during the membership cycle:

  • All Agents will be told to reshard when a node joins (including the joining Agent)
  • All Agents will be told to reshard when a node leaves (excluding the leaving Agent)
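For illustration, here is a minimal sketch of those two rules. The peer interface, the peers discovery callback, and the Reshard call are hypothetical stand-ins for the ring/lifecycler wiring the PR actually uses:

```go
package cluster

import "context"

// peer is a hypothetical handle to another Agent in the cluster.
type peer interface {
	Addr() string
	Reshard(ctx context.Context) error
}

type node struct {
	self  string        // this Agent's own address
	peers func() []peer // hypothetical discovery of healthy members, including self
}

// handleJoin runs when this Agent joins the ring: every member,
// including the joiner, reshards so hash ownership is redistributed.
func (n *node) handleJoin(ctx context.Context) error {
	for _, p := range n.peers() {
		if err := p.Reshard(ctx); err != nil {
			return err
		}
	}
	return nil
}

// handleLeave runs when this Agent leaves the ring: every member
// except the leaver reshards, since the leaver gives up its shards.
func (n *node) handleLeave(ctx context.Context) error {
	for _, p := range n.peers() {
		if p.Addr() == n.self {
			continue // the leaving Agent is excluded
		}
		if err := p.Reshard(ctx); err != nil {
			return err
		}
	}
	return nil
}
```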

Which issue(s) this PR fixes

Believe it or not, still working on #147.

Notes to the Reviewer

This isn't hooked into anything yet, and won't be until all the pieces are migrated. Tests are still needed, so I'm opening this as a draft.

PR Checklist

  • CHANGELOG updated
  • Documentation added
  • Tests updated

@rfratto (Member Author) commented Mar 10, 2021

The tests are pretty flaky, so CI might fail here. I'll look into why they flake tomorrow.

@rfratto (Member Author) commented Mar 10, 2021

Ah, it's flaking because it takes a little while for the lifecycler to join the ring in a separate goroutine. ApplyConfig should probably wait until it can verify that the node is in the ring.
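A hedged sketch of what such a wait could look like; the joined callback is a hypothetical stand-in for checking the ring for this node's instance:

```go
package cluster

import (
	"context"
	"fmt"
	"time"
)

// waitJoined polls until this node's instance shows up in the ring or
// the context expires. joined is a hypothetical callback; the real
// check would query the ring for the node's instance.
func waitJoined(ctx context.Context, id string, joined func(string) bool) error {
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()
	for {
		if joined(id) {
			return nil // the lifecycler has registered us in the ring
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for %s to join ring: %w", id, ctx.Err())
		case <-tick.C:
		}
	}
}
```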

```go
Enabled         bool          `yaml:"enabled"`
ReshardInterval time.Duration `yaml:"reshard_interval"`
ReshardTimeout  time.Duration `yaml:"reshard_timeout"`
Client          client.Config `yaml:"client"`
```
@rfratto (Member Author) commented on this diff:
I moved the client config here, which is a change I'll need to handle when this is wired in.

@rfratto (Member Author) commented Mar 11, 2021

Ready for review. I don't think I can make the implementation any simpler than it is now. I tried finding a better way to handle the tests, but I'm not sure there's a ton of room for improvement, even if they are a little ugly.

@rfratto marked this pull request as ready for review March 11, 2021 17:23
@rfratto requested a review from mattdurham March 11, 2021 17:24
@rfratto (Member Author) commented Mar 11, 2021

cc @56quarters since this will eventually be the new scraping service package.

pkg/prom/cluster/node.go — two review threads (outdated, resolved)
```go
// TransferOut implements ring.FlushTransferer. It connects to all other healthy agents and
// tells them to reshard. TransferOut should NOT be called manually unless the mutex is
// held.
func (n *node) TransferOut(ctx context.Context) error {
```
A Collaborator commented on this diff:
Is there a reason we don't hold the mutex here?

rfratto (Member Author):
Yep, if we hold the mutex here then it deadlocks with ApplyConfig when stopping the previous lifecycler 🙃. I'm not thrilled about this, but I can't think of a better way of handling it.
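To make the deadlock concrete, here is a toy, hypothetical reduction. In the real code the lifecycler may invoke TransferOut from its own goroutine while ApplyConfig blocks on shutdown; the sketch compresses that to one goroutine, but the effect is the same since the lock is never released:

```go
package main

import "sync"

type node struct {
	mu sync.Mutex
}

// transferOut stands in for ring.FlushTransferer's TransferOut.
func (n *node) transferOut() {
	// n.mu.Lock() // <- uncommenting this deadlocks applyConfig below,
	// defer n.mu.Unlock() // because Go mutexes are not reentrant.
}

// applyConfig mirrors the shape of the real ApplyConfig: it holds the
// mutex while stopping the old lifecycler, and stopping the lifecycler
// ends up calling transferOut while the lock is still held.
func (n *node) applyConfig() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.transferOut() // called with the lock already held
}

func main() {
	var n node
	n.applyConfig()
}
```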

Collaborator:
Brainstorming: since this looks like it's only called when leaving the ring and not called again under normal use, could we use sync.Once to help in any way?

rfratto (Member Author):
Nah, that would normally help, but not in this case: every time you call ApplyConfig you leave the old ring from the old config before joining the new ring with the new config, so sync.Once would break here.
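A toy demonstration of that failure mode (names are illustrative): if the peer notification were guarded by sync.Once, the second config reload would silently skip the reshard:

```go
package main

import (
	"fmt"
	"sync"
)

var once sync.Once

// transferOut notifies peers at most once if guarded by sync.Once.
func transferOut(reload int) {
	notified := false
	once.Do(func() { notified = true })
	fmt.Printf("reload %d: peers notified = %v\n", reload, notified)
}

func main() {
	transferOut(1) // reload 1: peers notified = true
	transferOut(2) // reload 2: peers notified = false <- reshard lost
}
```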

@rfratto merged commit 12e1372 into grafana:main Mar 11, 2021
@rfratto deleted the clustering-node branch March 11, 2021 21:06
@rfratto mentioned this pull request Mar 11, 2021
@mattdurham mentioned this pull request Sep 7, 2021
mattdurham pushed a commit that referenced this pull request Nov 11, 2021
* initial commit of pkg/prom/cluster package

* add tests

* fix test / race condition problems

* address review feedback