
ApplyConfig for integrations #486

Merged: 5 commits into grafana:main on Mar 25, 2021

Conversation

@rfratto (Member) commented Mar 24, 2021

PR Description

Implements ApplyConfig for integrations.

This requires a change to the integration interface: integrations no longer register routes directly with mux, because mux is unable to unregister routes. Instead, each integration exposes the HTTP handler that serves its metrics, and the handler is looked up at request time.
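A minimal sketch of the lookup-at-request-time idea (the Integration, Manager, and WireAPI names here are illustrative, and it uses the standard library's ServeMux rather than the Agent's actual router):

    package integrations

    import (
        "net/http"
        "strings"
        "sync"
    )

    // Integration is a hypothetical, pared-down version of the new interface:
    // rather than registering routes on the shared mux, an integration hands
    // back the handler that serves its metrics.
    type Integration interface {
        MetricsHandler() (http.Handler, error)
    }

    // Manager owns the running integrations. A single route is registered once;
    // the concrete handler is looked up per request, so integrations can be
    // added, replaced, or removed by ApplyConfig without touching the router.
    type Manager struct {
        mut          sync.RWMutex
        integrations map[string]Integration
    }

    // WireAPI registers the one static route used for every integration.
    func (m *Manager) WireAPI(mux *http.ServeMux) {
        mux.HandleFunc("/integrations/", func(w http.ResponseWriter, r *http.Request) {
            // Expect paths of the form /integrations/<name>/metrics.
            name := strings.TrimSuffix(strings.TrimPrefix(r.URL.Path, "/integrations/"), "/metrics")

            m.mut.RLock()
            i, ok := m.integrations[name]
            m.mut.RUnlock()
            if !ok {
                http.NotFound(w, r) // integration was removed by a config reload
                return
            }

            h, err := i.MetricsHandler()
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            h.ServeHTTP(w, r)
        })
    }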

Which issue(s) this PR fixes

Related to #147.

Notes to the Reviewer

Everything still seems to work after local testing. The new approach here is pretty similar, but stores the goroutines in a map (kind of like the basic instance manager) so they can be stopped independently.

One interesting detail here: when "updating" an integration (stopping the old goroutine and starting a new one), the new goroutine does not wait for the previous one to stop. I don't believe this will cause any problems, but we might have to change this behavior later.
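A rough sketch of that map-of-goroutines pattern (all names here are hypothetical, not the Agent's actual types): each integration gets a cancelable context stored in a map keyed by name, and an update cancels the old entry and starts the new goroutine without waiting for the old one to exit.

    package integrations

    import (
        "context"
        "sync"
    )

    // runner tracks a single integration's goroutine so it can be stopped on
    // its own, similar in spirit to the basic instance manager.
    type runner struct {
        cancel context.CancelFunc
    }

    type processManager struct {
        mut     sync.Mutex
        running map[string]runner // keyed by integration name
    }

    // update stops any existing goroutine for name and starts a new one. As
    // noted above, the new goroutine does not wait for the old one to finish
    // shutting down.
    func (p *processManager) update(name string, run func(ctx context.Context)) {
        p.mut.Lock()
        defer p.mut.Unlock()

        if p.running == nil {
            p.running = make(map[string]runner)
        }
        if old, ok := p.running[name]; ok {
            old.cancel() // signal the previous goroutine to stop
        }

        ctx, cancel := context.WithCancel(context.Background())
        p.running[name] = runner{cancel: cancel}
        go run(ctx)
    }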

PR Checklist

  • CHANGELOG updated
  • Documentation added
  • Tests updated

@rfratto requested a review from @mattdurham on March 24, 2021 at 21:14
pkg/integrations/manager.go
if err != nil {
-	return nil, err
+	return fmt.Errorf("error initializing integration %q: %w", ic.Name(), err)
Contributor commented:

What happens when this returns an error? Is it a problem that the new configuration has been applied to some integrations and not others? Or is the idea that the agent would exit if this happened?

@rfratto (Member, Author) commented Mar 25, 2021:

I've been thinking about this as I've been implementing these, and I see what you're getting at. It is a problem that this returns an error early, since it leaves the Agent in a hard-to-predict state: all valid changes up to that point were applied, but all remaining (potentially valid) pending changes were aborted once an invalid change was hit. I can change this to log each error and return the first error after iterating through all of them.

An error returned here would eventually turn into a 4xx/5xx HTTP response for the /-/reload caller.
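A minimal sketch of that "log and continue, return the first error" shape (identifiers here are illustrative, not the actual ApplyConfig code):

    package integrations

    import "log"

    // Config is a stand-in for a single integration's config (hypothetical).
    type Config interface {
        Name() string
    }

    // applyAll applies every integration config, logging each failure but
    // continuing on, and returns the first error so the /-/reload caller
    // still sees a non-2xx response.
    func applyAll(cfgs []Config, apply func(Config) error) error {
        var firstErr error
        for _, ic := range cfgs {
            if err := apply(ic); err != nil {
                log.Printf("error updating integration %q: %v", ic.Name(), err)
                if firstErr == nil {
                    firstErr = err
                }
            }
        }
        return firstErr
    }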

Given we're going to call /-/reload a beta feature, I think we should commit to saying it always updates the state of components that received a valid configuration, but any invalid configuration will leave that component in an undefined state (either running with the last known config or shut down and no longer running).

It's hard to commit to anything more than this, especially since it's not always possible to bring up the component with the new settings before shutting down the old one. One example of this is the Tempo receivers, which open up TCP/UDP ports and need to be shut down so the new config can take its place.

I hacked up some quick docs below; let me know what you think of the (beta feature) commitment for what we can and can't guarantee /-/reload will do.

/-/reload is an endpoint that updates the various components of the Agent to reflect the newest config on disk. It is currently in beta, so please open issues for any problems you see or ways to improve the functionality, especially around handling runtime errors.

When /-/reload is invoked, the config file provided to the Agent's -config.file flag will be re-read. Invalid YAML, or YAML that doesn't pass basic validation, will be immediately rejected. Otherwise, all components will be updated to match the state of the new config file.

When updating component state, runtime errors can happen that wouldn't be caught when reading in the YAML. For example, setting the WAL to an invalid directory would be caught at runtime, not when the config file is read. Runtime errors that prevent a component from being updated will leave that component in an undefined state, where it is either completely shut down or running with its last known config. Runtime errors must be fixed by addressing the problem in the configuration file and invoking /-/reload again.

Runtime errors encountered while updating the state do not stop the update from completing, and all valid changes will always be applied. The Agent has many "components" that receive updates:

  1. The Server config
  2. Each Prometheus instance
  3. Each Loki instance
  4. Each Tempo instance
  5. Each Integration
  6. The scraping service (which is the clustering mechanism + the key value store)

If a call to /-/reload returns an error, always examine the logs to learn exactly what went wrong and which components have been shut down as a result of the runtime error.
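To make the split between parse-time and runtime errors concrete, here is a rough sketch of what a reload handler along these lines could look like (hypothetical names, not the Agent's actual implementation): re-read the file, reject invalid YAML before touching anything, then apply the config to each component and surface the first runtime error as an HTTP error without stopping the remaining updates.

    package agent

    import (
        "fmt"
        "net/http"
        "os"

        "gopkg.in/yaml.v2"
    )

    // Config is a placeholder for the Agent's full configuration structure.
    type Config struct{}

    // component is anything that can have a new config applied at runtime
    // (the Server, each Prometheus/Loki/Tempo instance, each integration, ...).
    type component interface {
        ApplyConfig(c *Config) error
    }

    // reloadHandler re-reads the config file. Invalid YAML is rejected before
    // any component is touched; runtime errors from one component do not stop
    // the remaining components from being updated.
    func reloadHandler(path string, components []component) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            raw, err := os.ReadFile(path)
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            var cfg Config
            if err := yaml.Unmarshal(raw, &cfg); err != nil {
                http.Error(w, fmt.Sprintf("invalid config: %v", err), http.StatusBadRequest)
                return
            }

            var firstErr error
            for _, c := range components {
                if err := c.ApplyConfig(&cfg); err != nil && firstErr == nil {
                    firstErr = err
                }
            }
            if firstErr != nil {
                http.Error(w, firstErr.Error(), http.StatusInternalServerError)
                return
            }
            w.WriteHeader(http.StatusOK)
        }
    }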

Contributor commented:

I think that's a good compromise. "the component with invalid configuration is in an undefined state but other things are fine" feels like the right tradeoff when a user provides a bad config. As long as users can fix the config and call /-/reload again this seems like very straightforward behavior.

Collaborator commented:

Is it possible/wise to reject the update entirely and re-apply the old config? I.e., should we treat this as a transactional update? If server config is in an invalid state, can that propagate into all the other components being in an undefined state, even if their individual config is correct?

@rfratto (Member, Author) commented:

It's possible, and I can see arguments on both sides here. I've been using Prometheus as the reference implementation, and it does the same thing we're doing now. Transactional reloads would be an implementation detail of /-/reload itself, though: it would just re-apply the last valid config before returning the error back to the user.

I have some concerns about transactional reloads being noisy at the cluster level if that's being used, but this is something that can be discussed later.
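As a sketch of what a transactional /-/reload could look like if it were ever added (hypothetical, not something this PR implements): keep the last config that applied cleanly and re-apply it when the new one fails, while still returning the original error to the caller.

    package agent

    // Config is a placeholder for the Agent's full configuration structure.
    type Config struct{}

    // reloader sketches a transactional /-/reload: it remembers the last config
    // that applied cleanly and rolls back to it when applying a new one fails.
    type reloader struct {
        apply    func(*Config) error // applies a full config to every component
        lastGood *Config
    }

    func (r *reloader) reload(next *Config) error {
        if err := r.apply(next); err != nil {
            if r.lastGood != nil {
                // Best-effort rollback; errors from re-applying the known-good
                // config are deliberately ignored in this sketch.
                _ = r.apply(r.lastGood)
            }
            return err // the caller still sees the original failure
        }
        r.lastGood = next
        return nil
    }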

> If server config is in an invalid state, can that propagate into all the other components being in an undefined state, even if their individual config is correct?

Kind of. It's hard to do this in a strict way, but other components depend on the listen address / port being valid and open when the server starts listening. So if the server config is invalid, stuff will just subtly break at runtime.

Collaborator commented:

Sounds reasonable, especially if we let it chill in beta for a bit and get feedback, making it clear to users that the semantics could change.

@rfratto merged commit 7727df2 into grafana:main on Mar 25, 2021
@rfratto deleted the integrations-applyconfig branch on March 25, 2021 at 15:38
@mattdurham mentioned this pull request on Sep 7, 2021
mattdurham pushed a commit that referenced this pull request Nov 11, 2021
* dynamic metrics endpoint

* add applyconfig

* finish writing code, updating tests

* use common function for converting integration name to key

* no early returns
@github-actions bot added the frozen-due-to-age label and locked this conversation to collaborators on Apr 15, 2024.