Add pkg/prom/cluster package with membership management #461
Conversation
The tests are pretty flaky, so CI might fail here. I'll look into why it flakes tomorrow.
Ah, it's flaking because it takes a little bit of time for the lifecycler to join the ring in a separate goroutine. ApplyConfig should probably find a way to wait and verify that the node is in the ring.
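A minimal sketch of the kind of wait being suggested here, not the PR's actual fix: poll with a deadline after starting the lifecycler. The checkInRing callback is a hypothetical stand-in for however ApplyConfig would actually look the instance up in the ring.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForRingMembership polls until checkInRing reports that this instance is
// in the ring, or the context expires. checkInRing is a hypothetical stand-in
// for a real ring lookup.
func waitForRingMembership(ctx context.Context, checkInRing func() (bool, error)) error {
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()

	for {
		ok, err := checkInRing()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}

		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for the node to join the ring")
		case <-tick.C:
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Pretend the lifecycler registers this instance after a short delay.
	joined := time.After(300 * time.Millisecond)

	err := waitForRingMembership(ctx, func() (bool, error) {
		select {
		case <-joined:
			return true, nil
		default:
			return false, nil
		}
	})
	fmt.Println("in ring:", err == nil)
}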
Enabled         bool          `yaml:"enabled"`
ReshardInterval time.Duration `yaml:"reshard_interval"`
ReshardTimeout  time.Duration `yaml:"reshard_timeout"`
Client          client.Config `yaml:"client"`
I moved the client config here, which is a change I'll need to handle when this is wired in.
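For illustration, a minimal sketch of how the YAML keys from those struct tags might look when parsed. The struct below is a simplified stand-in, not the package's actual config type: it omits the client block and uses prometheus/common's model.Duration (which unmarshals values like "1m" from YAML) in place of time.Duration.

package main

import (
	"fmt"

	"github.com/prometheus/common/model"
	"gopkg.in/yaml.v2"
)

// clusterConfig mirrors the yaml tags shown in the diff above, minus the
// client section, purely for illustration.
type clusterConfig struct {
	Enabled         bool           `yaml:"enabled"`
	ReshardInterval model.Duration `yaml:"reshard_interval"`
	ReshardTimeout  model.Duration `yaml:"reshard_timeout"`
}

func main() {
	raw := []byte(`
enabled: true
reshard_interval: 1m
reshard_timeout: 30s
`)
	var cfg clusterConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}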
Ready for review. I don't think I can make the implementation any simpler than it is now. I tried finding a better way to handle the tests, but I'm not sure there's a ton of room for improvement, even if they are a little ugly.
cc @56quarters since this will eventually be the new scraping service package.
// TransferOut implements ring.FlushTransferer. It connects to all other healthy agents and
// tells them to reshard. TransferOut should NOT be called manually unless the mutex is
// held.
func (n *node) TransferOut(ctx context.Context) error {
Is there a reason we don't hold the mutex here?
Yep, if we hold the mutex here then it deadlocks with ApplyConfig when stopping the previous lifecycler 🙃. I'm not thrilled about this, but I can't think of a better way of handling it.
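A simplified repro of that deadlock shape, using hypothetical stand-in types rather than the actual agent or Cortex ones: ApplyConfig holds the mutex while stopping the old lifecycler, and stopping the lifecycler calls back into TransferOut, so TransferOut taking the same mutex would block forever.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// node is a stripped-down stand-in for the PR's type, only here to show the
// locking shape; it is not the actual implementation.
type node struct {
	mut sync.Mutex

	// stopLifecycler stands in for shutting down the previous lifecycler,
	// which invokes TransferOut on its way out.
	stopLifecycler func(ctx context.Context)
}

// TransferOut mirrors the contract in the diff above: it must not take n.mut
// itself, because it runs while ApplyConfig already holds it.
func (n *node) TransferOut(ctx context.Context) error {
	// Locking n.mut here would block forever; see ApplyConfig below.
	fmt.Println("telling other agents to reshard")
	return nil
}

// ApplyConfig holds the mutex for the whole config swap, including stopping
// the old lifecycler, which is what calls back into TransferOut.
func (n *node) ApplyConfig(ctx context.Context) {
	n.mut.Lock()
	defer n.mut.Unlock()

	if n.stopLifecycler != nil {
		n.stopLifecycler(ctx)
	}
	fmt.Println("started lifecycler for the new config")
}

func main() {
	n := &node{}
	n.stopLifecycler = func(ctx context.Context) { _ = n.TransferOut(ctx) }

	done := make(chan struct{})
	go func() {
		n.ApplyConfig(context.Background())
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("finished: TransferOut did not take the mutex")
	case <-time.After(time.Second):
		fmt.Println("deadlocked: TransferOut tried to take the mutex")
	}
}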
Brainstorming: since this looks like it's only called when leaving the ring and not called again under normal use, can we use Once to help in any way?
Nah, that would normally help but not in this case. Every time you call ApplyConfig you will leave the old ring from the old config before joining the new ring with the new config, so sync.Once would break here.
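A tiny illustration of that point, using stand-in names rather than the actual code: if TransferOut were guarded by a sync.Once, only the first config reload would ever trigger a reshard.

package main

import (
	"fmt"
	"sync"
)

// transferOut is a stand-in for node.TransferOut; it just records which
// reload triggered it.
func transferOut(reload int) {
	fmt.Printf("reshard triggered by config reload %d\n", reload)
}

func main() {
	var once sync.Once

	// Simulate three config reloads. Each reload leaves the ring built from
	// the previous config, which is exactly when TransferOut needs to run.
	for reload := 1; reload <= 3; reload++ {
		once.Do(func() { transferOut(reload) })
	}
	// Only reload 1 prints anything: reloads 2 and 3 are silently skipped,
	// which is why sync.Once can't guard TransferOut here.
}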
* initial commit of pkg/prom/cluster package
* add tests
* fix test / race condition problems
* address review feedback
PR Description
This is the first of a few PRs to migrate the pkg/prom/ha logic to a pkg/prom/cluster package.
This first PR re-implements membership within the cluster in a dedicated node type. The node type will be used for ensuring the Agent maintains appropriate membership in the proper cluster as the config changes. Reshards are triggered during the membership cycle.
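A rough sketch of what that reshard broadcast could look like; the reshardClient interface, Reshard call, and broadcastReshard function below are hypothetical stand-ins built around the described behavior (contact every other healthy agent and ask it to reshard, bounded by the reshard timeout), not the package's actual API.

package main

import (
	"context"
	"fmt"
	"time"
)

// reshardClient is a stand-in for a per-agent client built from the client
// config block; Reshard here is illustrative, not a real generated RPC.
type reshardClient interface {
	Reshard(ctx context.Context) error
}

// logClient fakes a remote agent by logging the request.
type logClient struct{ addr string }

func (c *logClient) Reshard(ctx context.Context) error {
	fmt.Println("reshard requested on", c.addr)
	return nil
}

// broadcastReshard asks every healthy agent (here, a plain list of addresses)
// to reshard, bounded by a single timeout for the whole broadcast.
func broadcastReshard(ctx context.Context, addrs []string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	for _, addr := range addrs {
		var c reshardClient = &logClient{addr: addr}
		if err := c.Reshard(ctx); err != nil {
			return fmt.Errorf("reshard %s: %w", addr, err)
		}
	}
	return nil
}

func main() {
	addrs := []string{"agent-0:9090", "agent-1:9090"}
	if err := broadcastReshard(context.Background(), addrs, 30*time.Second); err != nil {
		fmt.Println("reshard error:", err)
	}
}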
Which issue(s) this PR fixes
Believe it or not, still working on #147.
Notes to the Reviewer
This isn't hooked into anything yet, and won't be hooked up until all pieces are migrated. Tests still needed, so I'm opening this in draft.
PR Checklist