Input reload not working as expected under Elastic-Agent #33653
Comments
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
My suspicion is that just restarting Filebeat will cause more problems. We will stop inputs whose configuration did not actually change. There is an existing shutdown.timeout option that we may need to implement on a per-input basis. I feel this would also benefit from implementing per-input status reporting back to the agent, so that we could correctly report the input that is waiting for shutdown as STOPPING or CONFIGURING while the rest continue to work.
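Loosely sketching that per-input status reporting idea in Go; the state names and the reporter type below are illustrative assumptions, not the actual agent protocol types:

```go
package main

import "fmt"

// inputState mirrors the coarse states mentioned above; the names and the
// reporter type are illustrative, not the actual agent protocol types.
type inputState string

const (
	stateHealthy     inputState = "HEALTHY"
	stateConfiguring inputState = "CONFIGURING"
	stateStopping    inputState = "STOPPING"
)

// statusReporter tracks one state per input so that a slow shutdown of a
// single input can be reported without marking the whole Beat as unhealthy.
type statusReporter struct {
	states map[string]inputState
}

func (r *statusReporter) set(inputID string, s inputState) {
	r.states[inputID] = s
	fmt.Printf("input %s -> %s\n", inputID, s)
}

func main() {
	r := &statusReporter{states: map[string]inputState{}}
	r.set("filestream-a", stateHealthy)
	r.set("filestream-b", stateStopping) // waiting on its per-input shutdown timeout
	r.set("filestream-a", stateHealthy)  // unaffected inputs keep reporting healthy
}
```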
One last thought: can we detect that the input configuration has not changed and leave the inputs alone in this case? This won't fix the problem, but it might make it much rarer.
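A minimal sketch of that "leave unchanged inputs alone" idea, assuming the reloader can compute a stable fingerprint of each input's configuration; the `hashConfig` and `diffConfigs` helpers here are hypothetical, not existing Beats APIs:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// inputConfig stands in for one decoded input configuration block.
type inputConfig map[string]interface{}

// hashConfig builds a stable fingerprint of one input configuration so
// unchanged inputs can be detected across policy updates.
func hashConfig(cfg inputConfig) (string, error) {
	// json.Marshal sorts map keys, so equal configs produce equal bytes.
	raw, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%x", sum), nil
}

// diffConfigs splits the incoming configuration set into inputs that must be
// (re)started and inputs that can be left running untouched.
func diffConfigs(running map[string]bool, incoming []inputConfig) (toStart []inputConfig, keep []string) {
	for _, cfg := range incoming {
		h, err := hashConfig(cfg)
		if err != nil {
			toStart = append(toStart, cfg) // be conservative on error: restart it
			continue
		}
		if running[h] {
			keep = append(keep, h)
			continue
		}
		toStart = append(toStart, cfg)
	}
	return toStart, keep
}

func main() {
	running := map[string]bool{}
	old := inputConfig{"type": "filestream", "paths": []string{"/var/log/a.log"}}
	h, _ := hashConfig(old)
	running[h] = true

	incoming := []inputConfig{
		old, // unchanged: should be kept running
		{"type": "filestream", "paths": []string{"/var/log/b.log"}}, // new: should start
	}
	toStart, keep := diffConfigs(running, incoming)
	fmt.Printf("start %d input(s), keep %d running\n", len(toStart), len(keep))
}
```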
I'm almost sure Filebeat is already stopping/starting all inputs on a policy change.
Good question, I didn't dig that deep; also, this bit of the codebase is shared by standalone Filebeat and Filebeat running under Elastic-Agent. My guess (I'll confirm it tomorrow) is that there will be some small differences in the input configuration, like its ID.
Ah, you are probably right. We could test just having Filebeat respond to configuration changes by gracefully shutting itself down. We could also modify the spec file to set the
The one problem with Filebeat shutting down without some special coordination with the agent is that it is likely to be reported as unhealthy while the shutdown happens. Really, we should just make live configuration reload work properly.
I meant that the Elastic-Agent should restart Filebeat. I just created the issue here on the Beats repo because the issue is within the Beats codebase.
The current implementation of the config reload on a standalone Filebeat just keeps retrying until it works, retrying on an interval. Things start to get complicated if the output is down or if there is back pressure and the queue/publishing pipeline is not emptying. We can add retry logic with some backoff and a timeout. If the timeout occurs, Filebeat will be restarted anyway.
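A rough sketch of that retry-with-backoff-and-timeout approach for applying a new input configuration; the `applyInputs` callback is a stand-in, not an actual Beats API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// reloadWithRetry keeps trying to apply the new input configuration, doubling
// the wait between attempts, until it succeeds or the overall timeout expires.
// If the timeout expires, the caller can fall back to restarting Filebeat.
func reloadWithRetry(ctx context.Context, applyInputs func() error, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	backoff := time.Second
	for {
		err := applyInputs()
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up on input reload after %s: %w", timeout, err)
		case <-time.After(backoff):
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	attempts := 0
	apply := func() error {
		attempts++
		if attempts < 3 {
			return errors.New("harvester for the file has not finished yet")
		}
		return nil
	}
	if err := reloadWithRetry(context.Background(), apply, time.Minute); err != nil {
		fmt.Println("reload failed, restarting Filebeat:", err)
		return
	}
	fmt.Printf("reload succeeded after %d attempt(s)\n", attempts)
}
```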
I looked at the code and the input configs are converted into a… The current state of things already ends up restarting Filebeat, but the Elastic-Agent reports as unhealthy for a while; if we always restart Filebeat, at least the Elastic-Agent won't report as unhealthy.
I have confirmed this is still happening after the V2 merge.
Waiting for #34066 to be merged before moving on with this task.
@belimawr do you have all the information to move on with this?
Yes, I just need to find some time to go back to this task. I still need to verify if it is still happening in the 8.6 release. I should have an update by the end of the week.
I've just confirmed that this issue is not reproducible any more on the latest… Currently, when there is a policy change, the output also changes, which already triggers a restart of Filebeat. This restart solves the 'input reload' issue described here. #34066 seems to have solved this issue.
This should only apply when the output unit changes, not the input unit. The Beat should not actually be restarting if the only thing that changed was the input unit configuration. If changing a log or filestream input is causing the Beat to restart, to me that is a separate bug that is unintentionally fixing this one. @belimawr do you know what in the output is changing here? Is it the API keys perhaps? We ideally shouldn't restart the Beats at all, since any events in the memory queue will be lost; if we are observing restarts on any policy change, that is something we should fix.
I did some testing and the API keys are not changing. The output config seems to be exactly the same. I agree we should avoid unnecessary restarts. Looking at the code, there is always a Beat restart on output change: beats/x-pack/libbeat/management/managerV2.go, lines 502 to 507 in 65fb69c
I did a bit of digging on V1 and found some logic to identify whether the output had changed. And the reload on output change is enabled for the Beats: https://github.com/elastic/elastic-agent/blob/fb6c0f1017cfb9b30e8b7ad99b1c0ecee1f00325/specs/filebeat.spec.yml#L33 I'm just not sure (I didn't have time to test, but I can test tomorrow) whether a change in policy would be enough to trigger this restart. Maybe what we need to do is reintroduce this "output change detection logic" and fix the input reload issue as well.
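A hedged sketch of reintroducing that output-change detection: compare the previously applied output configuration with the incoming one and only restart when they actually differ. The `restartBeat` hook and the `outputManager` type are placeholders, not the real manager code:

```go
package main

import (
	"fmt"
	"reflect"
)

// outputManager remembers the last applied output configuration so a policy
// change that leaves the output untouched does not force a Beat restart.
type outputManager struct {
	lastOutput map[string]interface{}
}

func (m *outputManager) onOutputUnitChanged(newOutput map[string]interface{}, restartBeat func()) {
	if reflect.DeepEqual(m.lastOutput, newOutput) {
		// Same output config: nothing to do, inputs can be reloaded in place.
		fmt.Println("output unchanged, skipping restart")
		return
	}
	m.lastOutput = newOutput
	restartBeat()
}

func main() {
	m := &outputManager{}
	out := map[string]interface{}{"type": "elasticsearch", "hosts": []string{"https://localhost:9200"}}

	m.onOutputUnitChanged(out, func() { fmt.Println("restarting Beat for new output") })
	m.onOutputUnitChanged(out, func() { fmt.Println("restarting Beat for new output") }) // no-op
}
```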
Yeah, I think the behaviour you are observing will be fixed as part of #34178, and then this bug will come back. I am going to reopen this, as there is some work in flight to undo the unnecessary restarts here.
We should have a similar level of filtering in the Elastic Agent client itself, here: https://github.com/elastic/elastic-agent-client/blob/4477e3ace394ef1abfec55afa5cc5e1f6f87980c/pkg/client/unit.go#L253-L274 to prevent triggering restarts unnecessarily. I think the logic to group/debounce unit changes is calling the reload function even when the unit hasn't changed at all: beats/x-pack/libbeat/management/managerV2.go, lines 354 to 379 in 65fb69c
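A minimal illustration of that missing filtering: before firing the reload/debounce path, compare the unit's stored expected config revision with the incoming one and drop no-op updates. The `Unit` struct and the `unitFilter` type are simplified placeholders, not the real elastic-agent-client types:

```go
package main

import "fmt"

// Unit is a simplified stand-in for an elastic-agent-client unit.
type Unit struct {
	ID             string
	ConfigRevision uint64
}

// unitFilter drops unit updates whose config revision did not change, so the
// manager's group/debounce logic never sees a no-op "change".
type unitFilter struct {
	seen map[string]uint64
}

func (f *unitFilter) changed(u Unit) bool {
	if rev, ok := f.seen[u.ID]; ok && rev == u.ConfigRevision {
		return false // nothing new, do not trigger a reload
	}
	f.seen[u.ID] = u.ConfigRevision
	return true
}

func main() {
	f := &unitFilter{seen: map[string]uint64{}}
	updates := []Unit{
		{ID: "filestream-unit", ConfigRevision: 1},
		{ID: "filestream-unit", ConfigRevision: 1}, // duplicate, should be filtered out
		{ID: "filestream-unit", ConfigRevision: 2},
	}
	for _, u := range updates {
		if f.changed(u) {
			fmt.Printf("triggering reload for %s (revision %d)\n", u.ID, u.ConfigRevision)
		}
	}
}
```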
Confirmed by testing locally: any unit change causes the Beat to restart: #34178 (comment). This is a bug; once that bug is fixed, this issue will need to be fixed per the original plan, by retrying failed reloads just like a standalone Filebeat instance does.
That makes sense. We should also align to make sure they're released together, otherwise we might bring back the issue of Elastic-Agent going unhealthy on a policy change.
Moving this to a later sprint until #34178 (comment) lands
I tested it today and I can confirm this issue is still happening on…
After a bit of testing, it seems I got to the current root cause of this issue: the Elastic-Agent sends Unit updates in any order, so it's possible that Filebeat first receives the UnitAdded and then the UnitRemoved, which effectively tries to start a second input (a log input in my tests) on the same file, which will always fail. All the issues described on #34388 (comment) are very likely also relevant here. The first fix should be ensuring that all Units are stopped first, then updated, and only then new ones started. I'll start working on this next week.
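A sketch of that ordering fix: apply all removals first, then modifications, and only then additions, so an input holding a file is fully stopped before its replacement starts. The event and handler names here are illustrative, not the actual control-protocol types:

```go
package main

import "fmt"

type unitEventKind int

const (
	unitRemoved unitEventKind = iota
	unitModified
	unitAdded
)

type unitEvent struct {
	Kind   unitEventKind
	UnitID string
}

// applyInOrder re-orders a batch of unit events so stops always happen before
// starts, regardless of the order in which the events arrived.
func applyInOrder(events []unitEvent, stop, update, start func(string)) {
	for _, e := range events {
		if e.Kind == unitRemoved {
			stop(e.UnitID)
		}
	}
	for _, e := range events {
		if e.Kind == unitModified {
			update(e.UnitID)
		}
	}
	for _, e := range events {
		if e.Kind == unitAdded {
			start(e.UnitID)
		}
	}
}

func main() {
	// The batch arrives with the add before the remove, which is the
	// problematic order: both units want to harvest the same file.
	events := []unitEvent{
		{Kind: unitAdded, UnitID: "log-input-new"},
		{Kind: unitRemoved, UnitID: "log-input-old"},
	}
	applyInOrder(events,
		func(id string) { fmt.Println("stopping", id) },
		func(id string) { fmt.Println("updating", id) },
		func(id string) { fmt.Println("starting", id) },
	)
}
```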
Description
When there is a policy change on Elastic-Agent, it will send the new configuration to Filebeat, which will trigger the input reload code path that stops the necessary inputs and starts the new ones.
beats/libbeat/cfgfile/list.go
Lines 84 to 119 in 3e928b7
The inputs are correctly stopped, and the harvesters/readers are also correctly shut down before new inputs are started, which sets the harvester status to 'finalised'. However, if there are still events in the publishing pipeline that belong to that input, as soon as they're acked the harvester status will be set back to 'not finalised'. If one of the new inputs harvests the same file (which is common on a policy update, e.g. when adding a new integration to the policy, the 'old ones' will have their inputs stopped/started), this will lead to the following error when starting the new input:
In an ideal world we'd flush the publishing pipeline before trying to start new inputs, which would make all harvester states consistent.
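A rough sketch of that "flush before start" idea, under the assumption that the reloader can observe a counter of in-flight events: after stopping the old inputs, wait until the pipeline has acknowledged everything (or a deadline passes) before starting the new inputs. The `pendingEvents` counter is a hypothetical hook, not an existing Beats API:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitForPipelineDrain polls an in-flight event counter until it reaches zero
// or the deadline passes, so harvester states stay finalised before new
// inputs are started on the same files.
func waitForPipelineDrain(pendingEvents *int64, deadline time.Duration) bool {
	timeout := time.After(deadline)
	tick := time.NewTicker(100 * time.Millisecond)
	defer tick.Stop()

	for {
		if atomic.LoadInt64(pendingEvents) == 0 {
			return true
		}
		select {
		case <-timeout:
			return false // give up; the caller may fall back to restarting Filebeat
		case <-tick.C:
		}
	}
}

func main() {
	var pending int64 = 3
	// Simulate the queue acknowledging events while we wait.
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(150 * time.Millisecond)
			atomic.AddInt64(&pending, -1)
		}
	}()
	if waitForPipelineDrain(&pending, 5*time.Second) {
		fmt.Println("pipeline drained, safe to start new inputs")
	} else {
		fmt.Println("timed out waiting for ACKs")
	}
}
```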
Another option is for the Elastic-Agent to restart Filebeat on every policy change.