Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Input reload not working as expected under Elastic-Agent #33653

Closed
belimawr opened this issue Nov 11, 2022 · 22 comments · Fixed by #35250
Closed

Input reload not working as expected under Elastic-Agent #33653

belimawr opened this issue Nov 11, 2022 · 22 comments · Fixed by #35250
Assignees
Labels
8.8-candidate bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@belimawr
Copy link
Contributor

Description

When there is a policy change on Elastic-Agent, it will send the new configuration to Filebeat witch will trigger the input reload code path that stops the necessary inputs and starts the new ones.

wg := sync.WaitGroup{}
// Stop removed runners
for hash, runner := range stopList {
wg.Add(1)
r.logger.Debugf("Stopping runner: %s", runner)
delete(r.runners, hash)
go func(runner Runner) {
defer wg.Done()
runner.Stop()
r.logger.Debugf("Runner: '%s' has stopped", runner)
}(runner)
moduleStops.Add(1)
}
// Wait for all runners to stop before starting new ones
wg.Wait()
// Start new runners
for hash, config := range startList {
runner, err := createRunner(r.factory, r.pipeline, config)
if err != nil {
if _, ok := err.(*common.ErrInputNotFinished); ok {
// error is related to state, we should not log at error level
r.logger.Debugf("Error creating runner from config: %s", err)
} else {
r.logger.Errorf("Error creating runner from config: %s", err)
}
errs = append(errs, errors.Wrap(err, "Error creating runner from config"))
continue
}
r.logger.Debugf("Starting runner: %s", runner)
r.runners[hash] = runner
runner.Start()
moduleStarts.Add(1)
}

The inputs are correctly stopped, the harvesters/reader is also correctly shutdown before new inputs are started which will set the harvester status to 'finalised'. However if there are still events in the publishing pipeline that belong to that input as soon as they're acked the harvester status will be set back to 'not finalised'. If the one of the new inputs will harvest the same file (which is common on a policy update, e.g: when adding a new integration to the policy, the 'old ones' will have their inputs stopped/started) this will lead to the following error when starting the new input:

{"log.level":"info","@timestamp":"2022-11-09T18:55:33.554+0100","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":162},"message":"Status change to Failed: 1 error occurred:\n\t* 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::149908-33, Finished: false, Fileinfo: &{flog.log 7722175 420 {729849109 63803613311 0x55d878467f20} {33 149908 1 33188 1000 1000 0 0 7722175 4096 15088 {1668016405 809851333} {1668016511 729849109} {1668016511 729849109} [0 0 0]}}, Source: /tmp/flog.log, Offset: 7724092, Timestamp: 2022-11-09 18:55:32.698690878 +0100 CET m=+20.865936562, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 149908-33}\n\n","service.name":"filebeat","ecs.version":"1.6.0"}

On an ideal world we'd flush the publishing pipeline before trying to start new inputs, that will make all harvesters states consisten.

Another option is for the Elastic-Agent to re-start Filebeat on every policy change.

@belimawr belimawr added the bug label Nov 11, 2022
@belimawr belimawr self-assigned this Nov 11, 2022
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 11, 2022
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 11, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 11, 2022
@cmacknz
Copy link
Member

cmacknz commented Nov 14, 2022

My suspicion is that just restarting Filebeat will cause more problems. We will stop inputs whose configuration did not actually change.

There is an existing shutdown.timeout option that we may need to implement on a per input basis.

I feel this would also benefit from implementing per input status reporting back to the agent, so that we could correctly reporting the input that is waiting for shutdown as STOPPING or CONFIGURING while the rest continue to work.

If the one of the new inputs will harvest the same file (which is common on a policy update, e.g: when adding a new integration to the policy, the 'old ones' will have their inputs stopped/started) this will lead to the following error when starting the new input:

One last thought, can we detect that the input configuration has not changed and leave the inputs alone in this case? This won't fix the problem but it might make it much rarer.

@belimawr
Copy link
Contributor Author

We will stop inputs whose configuration did not actually change.

I'm almost sure Filebeat is already stopping/starting all inputs on a policy change.

One last thought, can we detect that the input configuration has not changed and leave the inputs alone in this case? This won't fix the problem but it might make it much rarer.

Good question, I didn't dig that deep, also this bit of the codebase is shared by a standalone Filebeat and Filebeat running under Elastic-Agent.

My guess (I'll confirm it tomorrow) is that there will be some small differences on the input configuration, like it's ID.

@cmacknz
Copy link
Member

cmacknz commented Nov 14, 2022

I'm almost sure Filebeat is already stopping/starting all inputs on a policy change.

Ah you are probably right, we could test just having Filebeat respond to configuration changes by gracefully shutting itself down. We could also modify the spec file to set the shutdown.timeout option: https://github.com/elastic/elastic-agent/blob/main/specs/filebeat.spec.yml#L20

@cmacknz
Copy link
Member

cmacknz commented Nov 14, 2022

The one problem with Filebeat shutting down without some special coordination with agent is that it is likely to be reported as unhealthy while the shutdown happens. Really we should just make live configuration reload work properly.

@belimawr
Copy link
Contributor Author

The one problem with Filebeat shutting down without some special coordination with agent is that it is likely to be reported as unhealthy while the shutdown happens. Really we should just make live configuration reload work properly.

I meant that the Elastic-Agent should re-start Filebeat. I just created the issue here on the Beats repo because the issue is within the Beats codebase.

@belimawr
Copy link
Contributor Author

Really we should just make live configuration reload work properly.

The current implementation for the config reload on a standalone Filebeat just keeps re-trying until it works, it retries every filebeat.config.inputs.reload.period, if this value is too high, there could be some periods without data being harvested.

Things start to get complicated if the output is down or if there is back pressure and the queue/publishing pipeline is not emptying.

We can add a retry logic with some backoff and a timeout. If the timeout occurs Filebeat will be restarted anyways.

@belimawr
Copy link
Contributor Author

I looked the code and the input configs are converted into a map[string]interface{} and then hashed by github.com/mitchellh/hashstructure, so even the extra fields the V2 protocol is sending are taken into account. This includes the input ID that comes from the policy/integration 'instance', those will always be different.

The current state of things already ends up re-starting Filebeat, but the Elastic-Agent reports as unhealthy for a while, if we always restart Filebeat at least the Elastic-Agent won't report as unhealthy.

@belimawr
Copy link
Contributor Author

I have confirmed this is still happening after the V2 merge.

@belimawr
Copy link
Contributor Author

Waiting for #34066 to be merged before moving on with this task.

@jlind23
Copy link
Collaborator

jlind23 commented Jan 11, 2023

@belimawr do you have all informations to move on with this?

@belimawr
Copy link
Contributor Author

@belimawr do you have all informations to move on with this?

Yes, I just need to find some time to go back to this task. I still need to verify if it is still happening in the 8.6 release. I should have an update by the end of the week.

@belimawr
Copy link
Contributor Author

I've just confirmed that this issue is not reproducible any more on the latest main branch.

Currently when there is a policy change, the output also changes, which already triggers a re-start of Filebeat. This restart solves the 'input reload' issue described here.

#34066 seems to have solved this issue.

@cmacknz
Copy link
Member

cmacknz commented Jan 16, 2023

Currently when there is a policy change, the output also changes

This should only apply when the output unit changes, not the input unit. The Beat should not actually be restarting if the only thing that changed was the input unit configuration.

If changing a log or filestream input is causing the Beat to restart, to me that is a separate bug that is unintentionally fixing this one.

@belimawr do you know what in the output is changing here? Is it the API keys perhaps? We ideally shouldn't restart the Beats at all since any events in the memory queue will be lost, if we are observing restarts on any policy change that is something we should fix.

@belimawr
Copy link
Contributor Author

I did some testing and the API keys are not changing. The output config seems to be exactly the same.

I agree we should avoid unnecessary re-starts.

Looking at the code there is always a Beat restart on output change:

if cm.stopOnOutputReload && cm.outputIsConfigured {
cm.logger.Info("beat is restarting because output changed")
_ = unit.UpdateState(client.UnitStateStopping, "Restarting", nil)
cm.Stop()
return nil
}

I did a bit of digging on V1 and found some logic to identify if the output had changed:
https://github.com/elastic/elastic-agent/blob/006784462a006cc31fbba70ec43522500c1816bb/internal/pkg/core/plugin/common.go#L23-L50

And the reload on output change is enabled for the Beats: https://github.com/elastic/elastic-agent/blob/fb6c0f1017cfb9b30e8b7ad99b1c0ecee1f00325/specs/filebeat.spec.yml#L33

I'm just not sure (didn't have time to test, but I can test tomorrow) if a change in policy would be enough to trigger this restart.

Maybe what we need to do is to re-introduce this "output change detection logic" and fix the input reload issue as well.

@cmacknz
Copy link
Member

cmacknz commented Jan 16, 2023

Yeah I think the behaviour you are observing will be fixed as part of #34178, and then this bug will come back.

I am going to reopen this as there is some work in flight to undo the unnecessary restarts here.

@cmacknz cmacknz reopened this Jan 16, 2023
@cmacknz
Copy link
Member

cmacknz commented Jan 16, 2023

We should have a similar level of filtering in the Elastic agent client itself here https://github.com/elastic/elastic-agent-client/blob/4477e3ace394ef1abfec55afa5cc5e1f6f87980c/pkg/client/unit.go#L253-L274 to prevent triggering restarts unnecessarily

I think the logic to try to group/debounce unit changes is calling the reload function even when the unit hasn't changed at all:

case change := <-cm.client.UnitChanges():
switch change.Type {
// Within the context of how we send config to beats, I'm not sure there is a difference between
// A unit add and a unit change, since either way we can't do much more than call the reloader
case client.UnitChangedAdded:
cm.addUnit(change.Unit)
// reset can be called here because `<-t.C` is handled in the same select
t.Reset(changeDebounce)
case client.UnitChangedModified:
cm.modifyUnit(change.Unit)
// reset can be called here because `<-t.C` is handled in the same select
t.Reset(changeDebounce)
case client.UnitChangedRemoved:
cm.deleteUnit(change.Unit)
}
case <-t.C:
// a copy of the units is used for reload to prevent the holding of the `cm.mx`.
// it could be possible that sending the configuration to reload could cause the `UpdateStatus`
// to be called on the manager causing it to try and grab the `cm.mx` lock, causing a deadlock.
cm.mx.Lock()
units := make(map[unitKey]*client.Unit, len(cm.units))
for k, u := range cm.units {
units[k] = u
}
cm.mx.Unlock()
cm.reload(units)

@cmacknz
Copy link
Member

cmacknz commented Jan 16, 2023

Confirmed by testing locally, any unit change causes the Beat to restart: #34178 (comment)

This is a bug, once that bug is fixed this issue will need to be fixed per the original plan by retrying failed reloads just like a standalone Filebeat instance does.

@belimawr
Copy link
Contributor Author

Confirmed by testing locally, any unit change causes the Beat to restart: #34178 (comment)

This is a bug, once that bug is fixed this issue will need to be fixed per the original plan by retrying failed reloads just like a standalone Filebeat instance does.

That makes sense. We should also align to make sure they're released together otherwise we might bring back the issue of Elastic-Agent going unhealthy on a policy change.

@jlind23
Copy link
Collaborator

jlind23 commented Jan 18, 2023

Moving this to a later sprint until #34178 (comment) lands

@belimawr
Copy link
Contributor Author

belimawr commented Apr 6, 2023

I tested it today and I can confirm this issue is still happening on main (fresh build from Elastic-Agent and Beats).

@belimawr
Copy link
Contributor Author

belimawr commented Apr 6, 2023

After a bit of testing it seems I got to the current root cause of this issue: The Elastic-Agent sends Unit updates on any order, so it's possible that Filebeat first receives the UnitAdded then UnitRemoved, that effectively tries to start a second input (log input on my tests) on the same file, which will always fails.

All the issues described on #34388 (comment) are very likely also relevant here.

The first fix should be ensuring first all Units are stopped, then updated and only then new ones can be started.

I'll start working on this next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
8.8-candidate bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants