
ApplyConfig for integrations #486

Merged: 5 commits into grafana:main on Mar 25, 2021

Conversation

@rfratto (Member) commented Mar 24, 2021

PR Description

Implements ApplyConfig for integrations.

This requires a change to the integration interface: integrations no longer register routes directly with mux, because mux is unable to unregister routes. Instead, each integration exposes the HTTP handler that serves its metrics, and the handler is looked up at request time.
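A minimal sketch of the lookup-at-request-time idea (the Integration, Manager, and WireAPI names here are illustrative, and it uses the standard library's ServeMux rather than the Agent's actual router):

    package integrations

    import (
        "net/http"
        "strings"
        "sync"
    )

    // Integration is a hypothetical, pared-down version of the new interface:
    // rather than registering routes on the shared mux, an integration hands
    // back the handler that serves its metrics.
    type Integration interface {
        MetricsHandler() (http.Handler, error)
    }

    // Manager owns the running integrations. A single route is registered once;
    // the concrete handler is looked up per request, so integrations can be
    // added, replaced, or removed by ApplyConfig without touching the router.
    type Manager struct {
        mut          sync.RWMutex
        integrations map[string]Integration
    }

    // WireAPI registers the one static route used for every integration.
    func (m *Manager) WireAPI(mux *http.ServeMux) {
        mux.HandleFunc("/integrations/", func(w http.ResponseWriter, r *http.Request) {
            // Expect paths of the form /integrations/<name>/metrics.
            name := strings.TrimSuffix(strings.TrimPrefix(r.URL.Path, "/integrations/"), "/metrics")

            m.mut.RLock()
            i, ok := m.integrations[name]
            m.mut.RUnlock()
            if !ok {
                http.NotFound(w, r) // integration was removed by a config reload
                return
            }

            h, err := i.MetricsHandler()
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            h.ServeHTTP(w, r)
        })
    }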

Which issue(s) this PR fixes

Related to #147.

Notes to the Reviewer

Everything still seems to work after local testing. The new approach here is pretty similar, but stores the goroutines in a map (kind of like the basic instance manager) so they can be stopped independently.

One interesting detail here: when "updating" an integration (stopping the old goroutine and starting a new one), the new goroutine does not wait for the previous one to stop. I don't believe this will cause any problems, but we might have to change this behavior later.
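A rough sketch of that map-of-goroutines pattern (all names here are hypothetical, not the Agent's actual types): each integration gets a cancelable context stored in a map keyed by name, and an update cancels the old entry and starts the new goroutine without waiting for the old one to exit.

    package integrations

    import (
        "context"
        "sync"
    )

    // runner tracks a single integration's goroutine so it can be stopped on
    // its own, similar in spirit to the basic instance manager.
    type runner struct {
        cancel context.CancelFunc
    }

    type processManager struct {
        mut     sync.Mutex
        running map[string]runner // keyed by integration name
    }

    // update stops any existing goroutine for name and starts a new one. As
    // noted above, the new goroutine does not wait for the old one to finish
    // shutting down.
    func (p *processManager) update(name string, run func(ctx context.Context)) {
        p.mut.Lock()
        defer p.mut.Unlock()

        if p.running == nil {
            p.running = make(map[string]runner)
        }
        if old, ok := p.running[name]; ok {
            old.cancel() // signal the previous goroutine to stop
        }

        ctx, cancel := context.WithCancel(context.Background())
        p.running[name] = runner{cancel: cancel}
        go run(ctx)
    }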

PR Checklist

  • CHANGELOG updated
  • Documentation added
  • Tests updated

@rfratto requested a review from @mattdurham on March 24, 2021 at 21:14
pkg/integrations/manager.go
if err != nil {
-	return nil, err
+	return fmt.Errorf("error initializing integration %q: %w", ic.Name(), err)
Contributor commented:

What happens when this returns an error? Is it a problem that the new configuration has been applied to some integrations and not others? Or is the idea that the agent would exit if this happened?

@rfratto (Member, Author) commented Mar 25, 2021:

I've been thinking about this as I've been implementing these, and I see what you're getting at. It is a problem that this returns an error early, since it leaves the Agent in a hard-to-predict state: all valid changes up to that point were applied, but all remaining (potentially valid) pending changes were aborted once an invalid change was hit. I can change this to log each error and return the first error after iterating through all of them.

An error returned here would eventually turn into a 4xx/5xx HTTP response for the /-/reload caller.
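A minimal sketch of that "log and continue, return the first error" shape (identifiers here are illustrative, not the actual ApplyConfig code):

    package integrations

    import "log"

    // Config is a stand-in for a single integration's config (hypothetical).
    type Config interface {
        Name() string
    }

    // applyAll applies every integration config, logging each failure but
    // continuing on, and returns the first error so the /-/reload caller
    // still sees a non-2xx response.
    func applyAll(cfgs []Config, apply func(Config) error) error {
        var firstErr error
        for _, ic := range cfgs {
            if err := apply(ic); err != nil {
                log.Printf("error updating integration %q: %v", ic.Name(), err)
                if firstErr == nil {
                    firstErr = err
                }
            }
        }
        return firstErr
    }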

Given we're going to call /-/reload a beta feature, I think we should commit to saying it always updates the state of components that received a valid configuration, but any invalid configuration will leave that component in an undefined state (either running with the last known config or shut down and no longer running).

It's hard to commit to anything more than this, especially since it's not always possible to bring up the component with the new settings before shutting down the old one. One example of this is the Tempo receivers, which open up TCP/UDP ports and need to be shut down so the new config can take its place.

I hacked up some quick docs below; let me know what you think of the (beta feature) commitment for what we can and can't guarantee /-/reload will do.

/-/reload is an endpoint that updates the various components of the Agent to reflect the newest config on disk. It is currently in beta, so please open issues for any problems you see or ways to improve the functionality, especially around handling runtime errors.

When /-/reload is invoked, the config file provided to the Agent's -config.file flag will be re-read. Invalid YAML, or YAML that doesn't pass basic validation, will be immediately rejected. Otherwise, all components will be updated to match the state of the new config file.

When updating component state, runtime errors can happen that wouldn't be caught when reading in the YAML. For example, setting the WAL to an invalid directory would be caught at runtime, not when the config file is read. Runtime errors that prevent a component from being updated will leave that component in an undefined state, where it is either completely shut down or running with its last known config. Runtime errors must be fixed by addressing the problem in the configuration file and invoking /-/reload again.

Runtime errors encountered while updating the state do not stop the update from completing, and all valid changes will always be applied. The Agent has many "components" that receive updates:

  1. The Server config
  2. Each Prometheus instance
  3. Each Loki instance
  4. Each Tempo instance
  5. Each Integration
  6. The scraping service (which is the clustering mechanism + the key value store)

If a call to /-/reload returns an error, always examine the logs to learn exactly what went wrong and which components have been shut down as a result of the runtime error.
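To make the split between parse-time and runtime errors concrete, here is a rough sketch of what a reload handler along these lines could look like (hypothetical names, not the Agent's actual implementation): re-read the file, reject invalid YAML before touching anything, then apply the config to each component and surface the first runtime error as an HTTP error without stopping the remaining updates.

    package agent

    import (
        "fmt"
        "net/http"
        "os"

        "gopkg.in/yaml.v2"
    )

    // Config is a placeholder for the Agent's full configuration structure.
    type Config struct{}

    // component is anything that can have a new config applied at runtime
    // (the Server, each Prometheus/Loki/Tempo instance, each integration, ...).
    type component interface {
        ApplyConfig(c *Config) error
    }

    // reloadHandler re-reads the config file. Invalid YAML is rejected before
    // any component is touched; runtime errors from one component do not stop
    // the remaining components from being updated.
    func reloadHandler(path string, components []component) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            raw, err := os.ReadFile(path)
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadRequest)
                return
            }
            var cfg Config
            if err := yaml.Unmarshal(raw, &cfg); err != nil {
                http.Error(w, fmt.Sprintf("invalid config: %v", err), http.StatusBadRequest)
                return
            }

            var firstErr error
            for _, c := range components {
                if err := c.ApplyConfig(&cfg); err != nil && firstErr == nil {
                    firstErr = err
                }
            }
            if firstErr != nil {
                http.Error(w, firstErr.Error(), http.StatusInternalServerError)
                return
            }
            w.WriteHeader(http.StatusOK)
        }
    }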

Contributor commented:

I think that's a good compromise. "the component with invalid configuration is in an undefined state but other things are fine" feels like the right tradeoff when a user provides a bad config. As long as users can fix the config and call /-/reload again this seems like very straightforward behavior.

Collaborator commented:

Is it possible/wise to reject the update entirely and re-apply the old config? I.e., should we treat this as a transactional update? If server config is in an invalid state, can that propagate into all the other components being in an undefined state, even if their individual config is correct?

@rfratto (Member, Author) commented:

It's possible, and I can see arguments on both sides here. I've been using Prometheus as the reference implementation, and it does the same thing we're doing now. Transactional reloads would be an implementation detail of /-/reload itself, though: it would just re-apply the last valid config before returning the error back to the user.

I have some concerns about transactional reloads being noisy at the cluster level if that's being used, but this is something that can be discussed later.
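As a sketch of what a transactional /-/reload could look like if it were ever added (hypothetical, not something this PR implements): keep the last config that applied cleanly and re-apply it when the new one fails, while still returning the original error to the caller.

    package agent

    // Config is a placeholder for the Agent's full configuration structure.
    type Config struct{}

    // reloader sketches a transactional /-/reload: it remembers the last config
    // that applied cleanly and rolls back to it when applying a new one fails.
    type reloader struct {
        apply    func(*Config) error // applies a full config to every component
        lastGood *Config
    }

    func (r *reloader) reload(next *Config) error {
        if err := r.apply(next); err != nil {
            if r.lastGood != nil {
                // Best-effort rollback; errors from re-applying the known-good
                // config are deliberately ignored in this sketch.
                _ = r.apply(r.lastGood)
            }
            return err // the caller still sees the original failure
        }
        r.lastGood = next
        return nil
    }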

> If server config is in an invalid state, can that propagate into all the other components being in an undefined state, even if their individual config is correct?

Kind of. It's hard to do this in a strict way, but other components depend on the listen address / port being valid and open when the server starts listening. So if the server config is invalid, stuff will just subtly break at runtime.

Collaborator commented:

Sounds reasonable, especially if we let it chill in beta for a bit and get feedback, making it clear to users that the semantics could change.

@rfratto merged commit 7727df2 into grafana:main on Mar 25, 2021
@rfratto deleted the integrations-applyconfig branch on March 25, 2021 at 15:38
@mattdurham mentioned this pull request on Sep 7, 2021
mattdurham pushed a commit that referenced this pull request Nov 11, 2021
* dynamic metrics endpoint

* add applyconfig

* finish writing code, updating tests

* use common function for converting integration name to key

* no early returns
@github-actions bot added the frozen-due-to-age label and locked this conversation to collaborators on Apr 15, 2024.