Input reload not working as expected under Elastic-Agent #33653
Comments
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
My suspicion is that just restarting Filebeat will cause more problems. We will stop inputs whose configuration did not actually change. There is an existing shutdown.timeout option that we may need to implement on a per-input basis. I feel this would also benefit from implementing per-input status reporting back to the agent, so that we could correctly report the input that is waiting for shutdown as STOPPING or CONFIGURING while the rest continue to work.
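Loosely sketching that per-input status reporting idea in Go; the state names and the reporter type below are illustrative assumptions, not the actual agent protocol types:

```go
package main

import "fmt"

// inputState mirrors the coarse states mentioned above; the names and the
// reporter type are illustrative, not the actual agent protocol types.
type inputState string

const (
	stateHealthy     inputState = "HEALTHY"
	stateConfiguring inputState = "CONFIGURING"
	stateStopping    inputState = "STOPPING"
)

// statusReporter tracks one state per input so that a slow shutdown of a
// single input can be reported without marking the whole Beat as unhealthy.
type statusReporter struct {
	states map[string]inputState
}

func (r *statusReporter) set(inputID string, s inputState) {
	r.states[inputID] = s
	fmt.Printf("input %s -> %s\n", inputID, s)
}

func main() {
	r := &statusReporter{states: map[string]inputState{}}
	r.set("filestream-a", stateHealthy)
	r.set("filestream-b", stateStopping) // waiting on its per-input shutdown timeout
	r.set("filestream-a", stateHealthy)  // unaffected inputs keep reporting healthy
}
```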
One last thought: can we detect that the input configuration has not changed and leave the inputs alone in this case? This won't fix the problem, but it might make it much rarer.
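A minimal sketch of that "leave unchanged inputs alone" idea, assuming the reloader can compute a stable fingerprint of each input's configuration; the `hashConfig` and `diffConfigs` helpers here are hypothetical, not existing Beats APIs:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// inputConfig stands in for one decoded input configuration block.
type inputConfig map[string]interface{}

// hashConfig builds a stable fingerprint of one input configuration so
// unchanged inputs can be detected across policy updates.
func hashConfig(cfg inputConfig) (string, error) {
	// json.Marshal sorts map keys, so equal configs produce equal bytes.
	raw, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%x", sum), nil
}

// diffConfigs splits the incoming configuration set into inputs that must be
// (re)started and inputs that can be left running untouched.
func diffConfigs(running map[string]bool, incoming []inputConfig) (toStart []inputConfig, keep []string) {
	for _, cfg := range incoming {
		h, err := hashConfig(cfg)
		if err != nil {
			toStart = append(toStart, cfg) // be conservative on error: restart it
			continue
		}
		if running[h] {
			keep = append(keep, h)
			continue
		}
		toStart = append(toStart, cfg)
	}
	return toStart, keep
}

func main() {
	running := map[string]bool{}
	old := inputConfig{"type": "filestream", "paths": []string{"/var/log/a.log"}}
	h, _ := hashConfig(old)
	running[h] = true

	incoming := []inputConfig{
		old, // unchanged: should be kept running
		{"type": "filestream", "paths": []string{"/var/log/b.log"}}, // new: should start
	}
	toStart, keep := diffConfigs(running, incoming)
	fmt.Printf("start %d input(s), keep %d running\n", len(toStart), len(keep))
}
```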
I'm almost sure Filebeat is already stopping/starting all inputs on a policy change.
Good question, I didn't dig that deep; also, this bit of the codebase is shared by standalone Filebeat and Filebeat running under Elastic-Agent. My guess (I'll confirm it tomorrow) is that there will be some small differences in the input configuration, like its ID.
Ah, you are probably right. We could test just having Filebeat respond to configuration changes by gracefully shutting itself down. We could also modify the spec file to set the
The one problem with Filebeat shutting down without some special coordination with the agent is that it is likely to be reported as unhealthy while the shutdown happens. Really, we should just make live configuration reload work properly.
I meant that the Elastic-Agent should restart Filebeat. I just created the issue here on the Beats repo because the issue is within the Beats codebase.
The current implementation of the config reload on a standalone Filebeat just keeps retrying until it works, retrying on an interval. Things start to get complicated if the output is down or if there is back pressure and the queue/publishing pipeline is not emptying. We can add retry logic with some backoff and a timeout. If the timeout occurs, Filebeat will be restarted anyway.
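A rough sketch of that retry-with-backoff-and-timeout approach for applying a new input configuration; the `applyInputs` callback is a stand-in, not an actual Beats API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// reloadWithRetry keeps trying to apply the new input configuration, doubling
// the wait between attempts, until it succeeds or the overall timeout expires.
// If the timeout expires, the caller can fall back to restarting Filebeat.
func reloadWithRetry(ctx context.Context, applyInputs func() error, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	backoff := time.Second
	for {
		err := applyInputs()
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up on input reload after %s: %w", timeout, err)
		case <-time.After(backoff):
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	attempts := 0
	apply := func() error {
		attempts++
		if attempts < 3 {
			return errors.New("harvester for the file has not finished yet")
		}
		return nil
	}
	if err := reloadWithRetry(context.Background(), apply, time.Minute); err != nil {
		fmt.Println("reload failed, restarting Filebeat:", err)
		return
	}
	fmt.Printf("reload succeeded after %d attempt(s)\n", attempts)
}
```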
I looked at the code and the input configs are converted into a… The current state of things already ends up restarting Filebeat, but the Elastic-Agent reports as unhealthy for a while; if we always restart Filebeat, at least the Elastic-Agent won't report as unhealthy.
I have confirmed this is still happening after the V2 merge.
Waiting for #34066 to be merged before moving on with this task.
@belimawr do you have all the information to move on with this?
Yes, I just need to find some time to go back to this task. I still need to verify if it is still happening in the 8.6 release. I should have an update by the end of the week.
I've just confirmed that this issue is not reproducible any more on the latest… Currently, when there is a policy change, the output also changes, which already triggers a restart of Filebeat. This restart solves the 'input reload' issue described here. #34066 seems to have solved this issue.
This should only apply when the output unit changes, not the input unit. The Beat should not actually be restarting if the only thing that changed was the input unit configuration. If changing a log or filestream input is causing the Beat to restart, to me that is a separate bug that is unintentionally fixing this one. @belimawr do you know what in the output is changing here? Is it the API keys perhaps? We ideally shouldn't restart the Beats at all, since any events in the memory queue will be lost; if we are observing restarts on any policy change, that is something we should fix.
I did some testing and the API keys are not changing. The output config seems to be exactly the same. I agree we should avoid unnecessary restarts. Looking at the code, there is always a Beat restart on output change: beats/x-pack/libbeat/management/managerV2.go, lines 502 to 507 in 65fb69c
I did a bit of digging on V1 and found some logic to identify whether the output had changed. And the reload on output change is enabled for the Beats: https://github.com/elastic/elastic-agent/blob/fb6c0f1017cfb9b30e8b7ad99b1c0ecee1f00325/specs/filebeat.spec.yml#L33 I'm just not sure (I didn't have time to test, but I can test tomorrow) whether a change in policy would be enough to trigger this restart. Maybe what we need to do is reintroduce this "output change detection logic" and fix the input reload issue as well.
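A hedged sketch of reintroducing that output-change detection: compare the previously applied output configuration with the incoming one and only restart when they actually differ. The `restartBeat` hook and the `outputManager` type are placeholders, not the real manager code:

```go
package main

import (
	"fmt"
	"reflect"
)

// outputManager remembers the last applied output configuration so a policy
// change that leaves the output untouched does not force a Beat restart.
type outputManager struct {
	lastOutput map[string]interface{}
}

func (m *outputManager) onOutputUnitChanged(newOutput map[string]interface{}, restartBeat func()) {
	if reflect.DeepEqual(m.lastOutput, newOutput) {
		// Same output config: nothing to do, inputs can be reloaded in place.
		fmt.Println("output unchanged, skipping restart")
		return
	}
	m.lastOutput = newOutput
	restartBeat()
}

func main() {
	m := &outputManager{}
	out := map[string]interface{}{"type": "elasticsearch", "hosts": []string{"https://localhost:9200"}}

	m.onOutputUnitChanged(out, func() { fmt.Println("restarting Beat for new output") })
	m.onOutputUnitChanged(out, func() { fmt.Println("restarting Beat for new output") }) // no-op
}
```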
Yeah, I think the behaviour you are observing will be fixed as part of #34178, and then this bug will come back. I am going to reopen this, as there is some work in flight to undo the unnecessary restarts here.
We should have a similar level of filtering in the Elastic Agent client itself, here: https://github.com/elastic/elastic-agent-client/blob/4477e3ace394ef1abfec55afa5cc5e1f6f87980c/pkg/client/unit.go#L253-L274 to prevent triggering restarts unnecessarily. I think the logic to group/debounce unit changes is calling the reload function even when the unit hasn't changed at all: beats/x-pack/libbeat/management/managerV2.go, lines 354 to 379 in 65fb69c
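A minimal illustration of that missing filtering: before firing the reload/debounce path, compare the unit's stored expected config revision with the incoming one and drop no-op updates. The `Unit` struct and the `unitFilter` type are simplified placeholders, not the real elastic-agent-client types:

```go
package main

import "fmt"

// Unit is a simplified stand-in for an elastic-agent-client unit.
type Unit struct {
	ID             string
	ConfigRevision uint64
}

// unitFilter drops unit updates whose config revision did not change, so the
// manager's group/debounce logic never sees a no-op "change".
type unitFilter struct {
	seen map[string]uint64
}

func (f *unitFilter) changed(u Unit) bool {
	if rev, ok := f.seen[u.ID]; ok && rev == u.ConfigRevision {
		return false // nothing new, do not trigger a reload
	}
	f.seen[u.ID] = u.ConfigRevision
	return true
}

func main() {
	f := &unitFilter{seen: map[string]uint64{}}
	updates := []Unit{
		{ID: "filestream-unit", ConfigRevision: 1},
		{ID: "filestream-unit", ConfigRevision: 1}, // duplicate, should be filtered out
		{ID: "filestream-unit", ConfigRevision: 2},
	}
	for _, u := range updates {
		if f.changed(u) {
			fmt.Printf("triggering reload for %s (revision %d)\n", u.ID, u.ConfigRevision)
		}
	}
}
```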
Confirmed by testing locally: any unit change causes the Beat to restart: #34178 (comment). This is a bug; once that bug is fixed, this issue will need to be fixed per the original plan, by retrying failed reloads just like a standalone Filebeat instance does.
That makes sense. We should also align to make sure they're released together, otherwise we might bring back the issue of Elastic-Agent going unhealthy on a policy change.
Moving this to a later sprint until #34178 (comment) lands
I tested it today and I can confirm this issue is still happening on…
After a bit of testing, it seems I got to the current root cause of this issue: the Elastic-Agent sends Unit updates in any order, so it's possible that Filebeat first receives the UnitAdded and then the UnitRemoved, which effectively tries to start a second input (a log input in my tests) on the same file, which will always fail. All the issues described on #34388 (comment) are very likely also relevant here. The first fix should be ensuring that all Units are stopped first, then updated, and only then new ones started. I'll start working on this next week.
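A sketch of that ordering fix: apply all removals first, then modifications, and only then additions, so an input holding a file is fully stopped before its replacement starts. The event and handler names here are illustrative, not the actual control-protocol types:

```go
package main

import "fmt"

type unitEventKind int

const (
	unitRemoved unitEventKind = iota
	unitModified
	unitAdded
)

type unitEvent struct {
	Kind   unitEventKind
	UnitID string
}

// applyInOrder re-orders a batch of unit events so stops always happen before
// starts, regardless of the order in which the events arrived.
func applyInOrder(events []unitEvent, stop, update, start func(string)) {
	for _, e := range events {
		if e.Kind == unitRemoved {
			stop(e.UnitID)
		}
	}
	for _, e := range events {
		if e.Kind == unitModified {
			update(e.UnitID)
		}
	}
	for _, e := range events {
		if e.Kind == unitAdded {
			start(e.UnitID)
		}
	}
}

func main() {
	// The batch arrives with the add before the remove, which is the
	// problematic order: both units want to harvest the same file.
	events := []unitEvent{
		{Kind: unitAdded, UnitID: "log-input-new"},
		{Kind: unitRemoved, UnitID: "log-input-old"},
	}
	applyInOrder(events,
		func(id string) { fmt.Println("stopping", id) },
		func(id string) { fmt.Println("updating", id) },
		func(id string) { fmt.Println("starting", id) },
	)
}
```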
Description
When there is a policy change on Elastic-Agent, it will send the new configuration to Filebeat, which will trigger the input reload code path that stops the necessary inputs and starts the new ones.
beats/libbeat/cfgfile/list.go
Lines 84 to 119 in 3e928b7
The inputs are correctly stopped, and the harvesters/readers are also correctly shut down before new inputs are started, which sets the harvester status to 'finalised'. However, if there are still events in the publishing pipeline that belong to that input, as soon as they're acked the harvester status will be set back to 'not finalised'. If one of the new inputs harvests the same file (which is common on a policy update, e.g. when adding a new integration to the policy, the 'old ones' will have their inputs stopped/started), this will lead to the following error when starting the new input:
In an ideal world we'd flush the publishing pipeline before trying to start new inputs, which would make all harvester states consistent.
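A rough sketch of that "flush before start" idea, under the assumption that the reloader can observe a counter of in-flight events: after stopping the old inputs, wait until the pipeline has acknowledged everything (or a deadline passes) before starting the new inputs. The `pendingEvents` counter is a hypothetical hook, not an existing Beats API:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitForPipelineDrain polls an in-flight event counter until it reaches zero
// or the deadline passes, so harvester states stay finalised before new
// inputs are started on the same files.
func waitForPipelineDrain(pendingEvents *int64, deadline time.Duration) bool {
	timeout := time.After(deadline)
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()

	for {
		if atomic.LoadInt64(pendingEvents) == 0 {
			return true
		}
		select {
		case <-timeout:
			return false // give up; the caller may fall back to restarting Filebeat
		case <-tick.C:
		}
	}
}

func main() {
	var pending int64 = 3
	// Simulate the queue acknowledging events while we wait.
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(150 * time.Millisecond)
			atomic.AddInt64(&pending, -1)
		}
	}()
	if waitForPipelineDrain(&pending, 5*time.Second) {
		fmt.Println("pipeline drained, safe to start new inputs")
	} else {
		fmt.Println("timed out waiting for ACKs")
	}
}
```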
Another option is for the Elastic-Agent to restart Filebeat on every policy change.