Windows Agent Left Unhealthy After Removing Endpoint Integration #1262

crowens · 2022-09-21T18:39:59Z

Version: 8.4.1
Operating System: Windows Server 2012 R2 Datacenter
Steps to Reproduce:

1. Set up an Elastic Agent Policy with the Elastic Endpoint and Cloud Integration.
2. Install the Elastic Agent on a Windows machine (I used Windows Server 2012 R2 Datacenter).
3. Wait to ensure it installs and shows Healthy in the UI.
4. From the Agent Policy integrations page, choose to delete the Endpoint Integration.
5. Soon after, the Agent will go Unhealthy and stay that way seemingly forever. (The unhealthy status is cleared by an Agent upgrade or by restarting the Agent from the host.)

Error messages in the logs from Agent:

elastic_agent
[elastic_agent][error] Elastic Agent status changed to "error": "app endpoint-security--8.4.1-07ad9ca0: failed to stop after 30s: application stopping timed out"
12:15:33.178
elastic_agent
[elastic_agent][error] 2022-09-21T16:15:33Z - message: Application: endpoint-security--8.4.1[a55ff671-6a94-41b6-98ef-951a38ceb7a3]: State changed to FAILED: failed to stop after 30s: application stopping timed out - type: 'ERROR' - sub_type: 'FAILED'
12:15:33.178
elastic_agent
[elastic_agent][error] Elastic Agent status changed to "error": "app endpoint-security--8.4.1-07ad9ca0: failed to stop after 30s: application stopping timed out"
12:15:33.178
elastic_agent
[elastic_agent][error] 2022-09-21T16:15:33Z - message: Application: endpoint-security--8.4.1[a55ff671-6a94-41b6-98ef-951a38ceb7a3]: State changed to FAILED: failed to stop after 30s: application stopping timed out - type: 'ERROR' - sub_type: 'FAILED'

On the host, the Endpoint is no longer running and is no longer on disk (there are three empty directories left behind Endpoint->State->Logs).

Running agent status produces:

I have waited up to 24 hours and the Unhealthy status does not resolve. It does get resolved by restarting the Agent from the host.


crowens · 2022-09-21T18:40:44Z

@ferullo @joshdover

ghost · 2022-09-22T06:57:51Z

Hi Team
We have revalidated this issue on the latest 8.5 Snapshot Kibana Cloud environment and found it still reproducible.

Windows Agent went unhealthy after deleting Endpoint integration

Build Details:
BUILD: 56489
COMMIT: ac5064b771f6edc2a2ea71da9f68e1afcd08a539

Screenshots:

We will revalidate this issue once it is fixed.
Thanks

aleksmaus · 2022-09-22T18:05:20Z

I talked to @ferullo today, and he mentioned that we need to give Endpoint at least 90 seconds to stop. I'm looking into how to customize this setting with the endpoint spec. We might need to add some code that keeps checking the Endpoint status for longer and recovers the agent state to healthy.

aleksmaus · 2022-09-24T01:36:14Z

I played a bit with different approaches to improve the situation here. I think it would be good to note some caveats.

The initial thinking was to increase the timeout and, if we still time out, use the platform service APIs to stop the service, monitor its status, and eventually set it to stopped, either based on the observed status or after some grace period, thus allowing the agent to recover to the healthy state.

1. Found out that we actually can't stop the service, because Endpoint implements protection on Windows (stopping works on other platforms). This is a good piece of knowledge, since my current PR for the V2 Service runtime doesn't expect this. I will need to adjust it.
2. The service is not always removed; sometimes it is left in Disabled and is only removed after a system restart.

I don't know whether this can affect subsequent service reinstalls without rebooting the system (@ferullo or anybody on the endpoint team?).

So the plan is:

1. Increase the stop timeout value.
2. If we time out, attempt to stop through the platform service APIs. Ignore errors such as "Access denied" and log them at debug level.
3. Monitor the service status through the platform service APIs for some grace period, let's say 3 minutes(?), and change the status from failed to stopped either when the "stopped" status is detected or when the grace "watch" period ends.

If there are any objections let me know.
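
The plan above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the agent's actual implementation: `query_status` and `platform_stop` are hypothetical callables standing in for the platform service APIs, the 120-second stop timeout is a placeholder value, and the 180-second grace period reflects the tentative 3 minutes mentioned above.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("stop_sketch")


@dataclass
class StopResult:
    state: str      # state the agent would report for the component
    observed: bool  # True only if the service was actually seen as stopped


def _poll_until_stopped(query_status, deadline, poll_interval, clock, sleep) -> bool:
    """Poll the (hypothetical) status callable until "stopped" or the deadline."""
    while True:
        if query_status() == "stopped":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


def stop_with_grace(
    query_status: Callable[[], str],
    platform_stop: Callable[[], None],
    stop_timeout: float = 120.0,   # assumed placeholder value
    grace_period: float = 180.0,   # the tentative "3 mins" from the plan
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> StopResult:
    # Step 1: wait for the normal stop, with the increased timeout.
    if _poll_until_stopped(query_status, clock() + stop_timeout, poll_interval, clock, sleep):
        return StopResult("stopped", True)

    # Step 2: on timeout, try the platform service API; "Access denied"
    # (modelled here as PermissionError) is ignored and logged at debug level.
    try:
        platform_stop()
    except PermissionError as err:
        log.debug("platform stop refused, continuing to watch: %s", err)

    # Step 3: watch for the grace period, then report "stopped" either way.
    observed = _poll_until_stopped(
        query_status, clock() + grace_period, poll_interval, clock, sleep
    )
    return StopResult("stopped", observed)
```

The `observed` flag is an addition for illustration only: it lets a caller distinguish a stop that was actually detected from one that was assumed when the grace period ran out.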

ferullo · 2022-09-26T13:22:41Z

@bjmcnic do you know the answer to @aleksmaus's question?

bjmcnic · 2022-09-26T13:48:14Z

@aleksmaus ...

I'm not sure 90 seconds would work yet, although maybe we should find a way to make it work, because that's certainly a long time. Presently our uninstaller can loop, polling for up to 100 seconds (https://github.com/elastic/endpoint-dev/blob/82449dd2660d54a181c7c81723a84e83184babb2/Libraries/InstallLib/Lib/Internal.cpp#L411-L413), in the error mode where the service cannot be started to receive the unprotect command that allows shutdown. We then relaunch as PPL and succeed. For Agent to accommodate all of that, its timeout likely needs to be close to 110 or 120 seconds. I do think that's a good idea though.
There exists a way to stop us right now: elastic-endpoint.exe stop. It would stop us whether we're protected or not. The behavior is subject to change pending password protection plans. However, the uninstaller is already trying to unprotect us and then stop us. And if it finds us not running, it attempts to start us in order to send the unprotect command. It's unclear to me in what circumstances Agent issuing its own stop to our service could help. There would definitely be some risk if the stop came during an ongoing attempt to start and unprotect.
From the current perspective of Endpoint's verify command, stopped without an error is a valid state, and if Agent can report that independently from failed, that might be useful to a user. Could they request a start? If this merges, I'd maybe recommend using the verify command to check on a stopped Endpoint, as it will newly report whether the Endpoint stopped cleanly, never started, or crashed. For all but the clean stop (i.e. verify returns non-zero), I'd hope Agent would install --upgrade Endpoint.
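
The suggested handling of a stopped Endpoint can be illustrated with a small Python sketch. It is only a reading of the comment above, assuming the proposed verify behavior (zero exit code for a clean stop, non-zero for never started or crashed); the `Action` names and the function are hypothetical and do not correspond to any existing Agent or Endpoint API.

```python
from enum import Enum


class Action(Enum):
    REPORT_STOPPED = "report stopped (not failed)"
    REINSTALL = "install --upgrade"


def action_for_stopped_endpoint(verify_exit_code: int) -> Action:
    """Map the exit code of verify, run against a stopped Endpoint, to a next step.

    Assumption: 0 means a clean stop; any non-zero code means the Endpoint
    never started or crashed.
    """
    if verify_exit_code == 0:
        return Action.REPORT_STOPPED
    return Action.REINSTALL
```

The sketch deliberately does not distinguish between the non-zero cases, since the comment treats them the same way.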

bjmcnic · 2022-09-26T13:59:37Z

@aleksmaus ...

Regarding the Disabled Endpoint state that requires a reboot: that's likely the result of a failed install. It most commonly occurs when we try to install a test-signed Endpoint in protect mode on a host without test signing enabled. The installer is unable to start the service as PPL and attempts to clean up the failed install. But because the service has been marked as PPL with the SCM, and we weren't able to execute as PPL, we aren't able to uninstall the service at run time and require a reboot to clean that up. The service being left in that state does indeed interfere with subsequent install attempts prior to reboot, but those attempts were likely going to fail for the same reasons anyway. If you're aware of another way to cause this state, please let me know and we can likely find a way to prevent it.

ghost · 2022-10-31T12:16:21Z

Hi Team,
We have revalidated this issue on the latest 8.5 Snapshot Kibana Cloud environment and found it fixed now.

Windows Agent is healthy after deleting Endpoint integration.

Build Details:
BUILD: 57077
COMMIT: d9d2438aeed3b60a06aa7ab548c886ffcfb5e76a

Screenshot:

Thanks

amolnater-qasource · 2022-11-15T08:53:02Z

Hi Team

We have revalidated this issue on 8.5.1 BC1 Kibana cloud-production environment and found it fixed now.

Observations:

Windows agent remains Healthy after removing Endpoint Security Integration.

Build details:
BUILD: 57136
COMMIT: 87149bfd06f4fe41dbfa7e95461294e9dadfb1d8

Screenshots:

Hence marking this issue as QA:Validated.
Thanks

crowens added the bug Something isn't working label Sep 21, 2022

cmacknz added the v8.5.0 label Sep 21, 2022

cmacknz assigned aleksmaus Sep 22, 2022

ferullo mentioned this issue Sep 22, 2022

[BUG] 8.4 and 7.17.5/7.17.6 Windows Endpoints may wind up in a non-running state elastic/endpoint#29

Open

cmacknz added the v8.4.0 label Sep 22, 2022

cmacknz mentioned this issue Sep 23, 2022

Agent goes unhealthy on assigning to agent policy not having Endpoint integration. #1282

Closed

aleksmaus mentioned this issue Sep 24, 2022

Fix: Windows Agent Left Unhealthy After Removing Endpoint Integration #1286

Merged

2 tasks

aleksmaus closed this as completed in #1286 Oct 24, 2022

amolnater-qasource added the QA:Validated Validated by the QA Team label Nov 15, 2022
