ApplyConfig for integrations #486
Conversation
pkg/integrations/manager.go (outdated)

```diff
 if err != nil {
-	return nil, err
+	return fmt.Errorf("error initializing integration %q: %w", ic.Name(), err)
```
What happens when this returns an error? Is it a problem that new configuration has been applied to some integrations and not others? Or is the idea that the agent would exit if this happened?
I've been thinking about this as I've been implementing these changes, and I see what you're getting at. Returning an early error here is a problem, since it leaves the Agent in a hard-to-predict state: all valid changes were applied, but all (potentially valid) pending changes were aborted after coming across an invalid change. I can change this to log each error and return the first error after iterating through all of them.
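The log-and-continue approach described above could be sketched roughly like this. All names here (`integration`, `apply`, `applyAll`) are illustrative stand-ins, not the PR's actual identifiers:

```go
package main

import (
	"fmt"
	"log"
)

// integration stands in for the Agent's integration interface.
type integration struct {
	name string
	cfg  string
}

// apply pretends to re-configure one integration, failing on empty config.
func (i integration) apply() error {
	if i.cfg == "" {
		return fmt.Errorf("no config for %q", i.name)
	}
	return nil
}

// applyAll tries every integration, logging each failure, and only
// returns the first error after the whole list has been processed,
// so a single invalid change doesn't abort later valid ones.
func applyAll(integrations []integration) error {
	var firstErr error
	for _, i := range integrations {
		if err := i.apply(); err != nil {
			log.Printf("error applying config to integration %q: %v", i.name, err)
			if firstErr == nil {
				firstErr = err
			}
		}
	}
	return firstErr
}

func main() {
	err := applyAll([]integration{
		{name: "node_exporter", cfg: "enabled: true"},
		{name: "broken", cfg: ""},
		{name: "agent", cfg: "enabled: true"}, // still applied despite the earlier failure
	})
	fmt.Println("first error:", err)
}
```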
An error returned here would eventually turn into a 4xx/5xx HTTP response for the /-/reload caller.
Given we're going to call /-/reload a beta feature, I think we should commit to saying it always updates the state of components that received a valid configuration, but any invalid configuration will leave that component in an undefined state (either running with the last known config or shut down and no longer running).
It's hard to commit to anything more than this, especially since it's not always possible to bring up the component with the new settings before shutting down the old one. One example of this is the Tempo receivers, which open up TCP/UDP ports and need to be shut down so the new config can take its place.
I hacked up some quick docs below; let me know what you think of the (beta feature) commitment for what we can guarantee /-/reload can and can't do.
/-/reload is an endpoint that updates the various components of the Agent to reflect the newest config on disk. It is currently in beta, so please open issues for any problems you see or ways to improve the functionality, especially around handling runtime errors.

When /-/reload is invoked, the config file provided to the Agent's -config.file flag will be re-read. Invalid YAML, or YAML that doesn't pass basic validation, will be immediately rejected. Otherwise, all components will be updated to match the state of the new config file.

When updating component state, runtime errors can happen that wouldn't be caught when reading in the YAML. For example, setting the WAL to an invalid directory would be caught at runtime, not when the config file is read. Runtime errors that prevent a component from being updated will leave that component in an undefined state, where it is either completely shut down or running with its last known config. Runtime errors must be fixed by addressing the problem in the configuration file and invoking /-/reload again.

Runtime errors encountered while updating the state do not stop the update from completing, and all valid changes will always be applied. The Agent has many "components" that receive updates:
- The Server config
- Each Prometheus instance
- Each Loki instance
- Each Tempo instance
- Each Integration
- The scraping service (which is the clustering mechanism + the key value store)
If a call to /-/reload returns an error, always examine the logs to learn exactly what went wrong and which components have been shut down as a result of the runtime error.
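Invoking the endpoint might look like the following. The port and HTTP method here are assumptions for illustration; the actual listen address comes from the Agent's server config:

```shell
# Ask a locally running Agent to re-read its config file.
# Adjust host/port to match your server config.
curl -X POST http://localhost:12345/-/reload

# On failure, check the Agent's logs to see which components
# hit runtime errors and may have been shut down.
```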
I think that's a good compromise. "The component with invalid configuration is in an undefined state, but other things are fine" feels like the right tradeoff when a user provides a bad config. As long as users can fix the config and call /-/reload again, this seems like very straightforward behavior.
Is it possible/wise to reject the update entirely and re-apply the old config? That is, should we treat this as a transactional update? If server config is in an invalid state, can that propagate into all the other components being in an undefined state, even if their individual config is correct?
It's possible, and I can see arguments on both sides here. I've been using Prometheus as the reference implementation which does the same thing as what we're doing now. Having transactional reloads will be an implementation detail of /-/reload itself though, where it just re-applies the last valid config before returning the error back to the user.
I have some concerns about transactional reloads being noisy at the cluster level if that's being used, but this is something that can be discussed later.
> If server config is in an invalid state, can that propagate into all the other components being in an undefined state, even if their individual config is correct?
Kind of. It's hard to do this in a strict way, but other components depend on the listen address / port being valid and open when the server starts listening. So if the server config is invalid, stuff will just subtly break at runtime.
Sounds reasonable, especially if we let it chill in Beta for a bit and get feedback, with the knowledge to users that the semantics could change.
* dynamic metrics endpoint
* add applyconfig
* finish writing code, updating tests
* use common function for converting integration name to key
* no early returns
PR Description
Implements ApplyConfig for integrations.
This requires a change to the integration interface to no longer register routes directly to mux. This is because mux is unable to unregister routes. The new approach is to expose the HTTP handlers that expose metrics, and look up the handler at request time.
Which issue(s) this PR fixes
Related to #147.
Notes to the Reviewer
Everything still seems to work after local testing. The new approach here is pretty similar, but stores the goroutines in a map (kind of like the basic instance manager) so they can be stopped independently.
One interesting detail here: when "updating" an integration (stopping the old goroutine and starting a new one), the new goroutine does not wait for the previous one to stop. I don't believe this will cause any problems, but we might have to change this behavior later.
PR Checklist