Windows Agent Left Unhealthy After Removing Endpoint Integration #1262

Closed
crowens opened this issue Sep 21, 2022 · 9 comments · Fixed by #1286
Labels
bug Something isn't working QA:Validated Validated by the QA Team v8.4.0 v8.5.0

Comments

@crowens

crowens commented Sep 21, 2022

Version: 8.4.1
Operating System: Windows Server 2012 R2 Datacenter
Steps to Reproduce:

  1. Set up an Elastic Agent Policy with the Elastic Endpoint and Cloud Integration
  2. Install the Elastic Agent on a Windows machine (I used Windows Server 2012 R2 Datacenter)
  3. Wait to ensure it installs and shows Healthy in the UI
  4. From the Agent Policy integrations page, choose to delete the Endpoint Integration
  5. Soon after, the Agent goes Unhealthy and stays that way, seemingly indefinitely. (The unhealthy status is cleared by an Agent upgrade or by restarting the Agent from the host.)

Error messages in the logs from Agent:

elastic_agent
[elastic_agent][error] Elastic Agent status changed to "error": "app endpoint-security--8.4.1-07ad9ca0: failed to stop after 30s: application stopping timed out"
12:15:33.178
elastic_agent
[elastic_agent][error] 2022-09-21T16:15:33Z - message: Application: endpoint-security--8.4.1[a55ff671-6a94-41b6-98ef-951a38ceb7a3]: State changed to FAILED: failed to stop after 30s: application stopping timed out - type: 'ERROR' - sub_type: 'FAILED'
12:15:33.178
elastic_agent
[elastic_agent][error] Elastic Agent status changed to "error": "app endpoint-security--8.4.1-07ad9ca0: failed to stop after 30s: application stopping timed out"
12:15:33.178
elastic_agent
[elastic_agent][error] 2022-09-21T16:15:33Z - message: Application: endpoint-security--8.4.1[a55ff671-6a94-41b6-98ef-951a38ceb7a3]: State changed to FAILED: failed to stop after 30s: application stopping timed out - type: 'ERROR' - sub_type: 'FAILED'

On the host, the Endpoint is no longer running and is no longer on disk (there are three empty directories left behind Endpoint->State->Logs).

Running agent status produces:

image

I have waited up to 24 hours and the Unhealthy status does not resolve. It does get resolved by restarting the Agent from the host.

@crowens crowens added the bug Something isn't working label Sep 21, 2022
@crowens
Author

crowens commented Sep 21, 2022

@ferullo @joshdover

@ghost

ghost commented Sep 22, 2022

Hi Team
We have revalidated this issue on the latest 8.5 Snapshot Kibana Cloud environment and found it still reproducible.

  • Windows Agent went unhealthy after deleting Endpoint integration

Build Details:
BUILD: 56489
COMMIT: ac5064b771f6edc2a2ea71da9f68e1afcd08a539

Screenshots:
image
image

We will revalidate this issue once it is fixed.
Thanks

@aleksmaus
Member

I talked to @ferullo today, and he mentioned that we need to give Endpoint at least 90 seconds to stop. I'm looking into how to customize this setting with the endpoint spec. We might need to add some code that keeps checking the Endpoint status for longer and recovers the agent state to healthy.

@aleksmaus
Member

aleksmaus commented Sep 24, 2022

I played a bit with different approaches to improve the situation here. I think it would be good to note some caveats.

My initial thinking was to increase the timeout and, if we still time out, use the platform service APIs to stop the service, then monitor the service status and eventually set it to stopped, either based on the observed status or after some grace period. That would allow the agent to recover to the healthy state.

  1. Found out that we actually can't stop the service, because Endpoint implements protection on Windows (stopping works on other platforms). This is a good piece of knowledge, since my current PR for the V2 Service runtime doesn't expect this. Will need to adjust.
  2. The service does not always get removed. Sometimes it is left in the Disabled state and is only removed after a system restart.

Screen Shot 2022-09-23 at 9 19 40 PM

I don't know whether this can affect subsequent service reinstalls without rebooting the system (@ferullo or anybody on the endpoint team?)

So the plan is:

  1. Increase the stop timeout value.
  2. We could attempt to stop through the platform service APIs if we time out. Ignore errors such as "Access denied", for example, and log them (debug level).
  3. Monitor the service status through the platform service APIs for some grace period of, let's say, 3 mins(?) and change the status from failed to stopped either when the "stopped" status is detected or when the grace "watch" period has ended.

If there are any objections let me know.
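
A minimal sketch of steps 2 and 3 of the plan above, for illustration only. It is not the Agent's implementation: `recover_after_stop_timeout`, `try_stop`, `query_status`, and the status strings are hypothetical names, and the two callables stand in for whatever platform service API calls would actually be used. The 180-second default mirrors the tentative "3 mins(?)" grace period.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("stop_recovery_sketch")


def recover_after_stop_timeout(
    try_stop: Callable[[], None],        # hypothetical: asks the platform to stop the service
    query_status: Callable[[], str],     # hypothetical: returns e.g. "running" or "stopped"
    grace_seconds: float = 180.0,        # assumed from the tentative 3-minute grace period
    poll_interval: float = 5.0,          # assumed polling cadence
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return why the app may be moved from FAILED to STOPPED.

    "status_detected"      -> the service reported "stopped" within the grace period
    "grace_period_elapsed" -> the watch period ended without seeing "stopped"
    """
    # Step 2: best-effort stop. Errors such as "Access denied" are expected
    # when the service is protected, so they are logged at debug level and ignored.
    try:
        try_stop()
    except OSError as exc:  # PermissionError is a subclass of OSError
        log.debug("stop request ignored: %s", exc)

    # Step 3: watch the service status until it stops or the grace period ends.
    deadline = clock() + grace_seconds
    while clock() < deadline:
        if query_status() == "stopped":
            return "status_detected"
        sleep(poll_interval)
    return "grace_period_elapsed"
```

In both outcomes the caller would change the reported state from failed to stopped; the return value only records which condition ended the watch. The clock and sleep functions are injectable so the loop can be exercised without real waiting.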

@ferullo

ferullo commented Sep 26, 2022

@bjmcnic do you know the answer to @aleksmaus's question?

@bjmcnic
Contributor

bjmcnic commented Sep 26, 2022

@aleksmaus ...

  1. I'm not sure 90 seconds would work yet, although maybe we should find a way to make it work because that's sure a long time. Presently our uninstaller can loop polling for 100 seconds (https://github.com/elastic/endpoint-dev/blob/82449dd2660d54a181c7c81723a84e83184babb2/Libraries/InstallLib/Lib/Internal.cpp#L411-L413) in the error mode where the service cannot be started to receive the unprotect command that allows shutdown. We then relaunch as PPL and succeed. But for Agent to accommodate all of that, its timeout likely needs to be close to 110 or 120 seconds. I do think that's a good idea though.
  2. There exists a way to stop us right now: elastic-endpoint.exe stop. It would stop us whether we're protected or not. The behavior is subject to change pending password protection plans. However, the uninstaller is already trying to unprotect us and then stop us. And if it finds us not running, it attempts to start us in order to send the unprotect command. It's unclear to me in what circumstances Agent issuing its own stop to our service could help. There would definitely be some risk if the stop came during an ongoing attempt to start and unprotect.
  3. From the current perspective of Endpoint's verify command, stopped without an error is a valid state, and if Agent can report that independently from failed, that might be useful to a user. Could they request a start? If this merges, I'd maybe recommend using the verify command to check on a stopped Endpoint, as it will newly report whether it looks like it was a clean stop, it had never started, or it had crashed. For all but the clean stop (i.e. verify returns non-zero), I'd hope Agent would install --upgrade Endpoint.
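
The arithmetic in point 1 and the decision in point 3 can be sketched as follows. This is only an illustration under stated assumptions: the function names and action labels are hypothetical, the 100-second figure is the uninstaller polling window quoted above, and the exit-code rule (zero means a clean stop, non-zero means never started or crashed) depends on the pending change to the verify command mentioned in point 3.

```python
# Polling window of the Endpoint uninstaller, as quoted in point 1.
UNINSTALLER_POLL_SECONDS = 100


def agent_stop_timeout(margin_seconds: int = 20) -> int:
    """Stop timeout Agent would need: uninstaller polling window plus a margin.

    The margin is an assumption; 10 or 20 seconds gives the 110 or 120
    seconds suggested in the comment.
    """
    return UNINSTALLER_POLL_SECONDS + margin_seconds


def action_for_verify_result(exit_code: int) -> str:
    """Map a verify exit code for a stopped Endpoint to a follow-up action.

    Hypothetical labels:
      "report_stopped"    -> clean stop, report it separately from failed
      "reinstall_upgrade" -> never started or crashed, re-run install --upgrade
    """
    return "report_stopped" if exit_code == 0 else "reinstall_upgrade"


if __name__ == "__main__":
    print(agent_stop_timeout(10), agent_stop_timeout(20))
    print(action_for_verify_result(0), action_for_verify_result(1))
```

Keeping the decision in a pure function separates it from how verify is actually invoked, which the thread does not specify.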

@bjmcnic
Contributor

bjmcnic commented Sep 26, 2022

@aleksmaus ...

Regarding the Disabled Endpoint state requiring a reboot: that's likely the result of a failed install. It most commonly occurs when we try to install a test-signed Endpoint in protect mode on a host without test signing enabled. The installer is unable to start the service as PPL and attempts to clean up the failed install. But because the service has been marked as PPL with the SCM, and we weren't able to execute as PPL, we aren't able to uninstall the service at run time and require a reboot to clean that up. The service being left in that state does indeed interfere with subsequent install attempts prior to reboot, but it's likely they were going to fail for the same reasons. If you're aware of another way to cause this state, please let me know and we can likely find a way to prevent it.

@ghost

ghost commented Oct 31, 2022

Hi Team,
We have revalidated this issue on the latest 8.5 Snapshot Kibana Cloud environment and found it fixed now.

  • Windows Agent is healthy after deleting Endpoint integration.

Build Details:
BUILD: 57077
COMMIT: d9d2438aeed3b60a06aa7ab548c886ffcfb5e76a

Screenshot:
image
image

Thanks

@amolnater-qasource

Hi Team

We have revalidated this issue on 8.5.1 BC1 Kibana cloud-production environment and found it fixed now.

Observations:

  • Windows agent remains Healthy after removing Endpoint Security Integration.

Build details:
BUILD: 57136
COMMIT: 87149bfd06f4fe41dbfa7e95461294e9dadfb1d8

Screenshots:
6
7
8

Hence marking this issue as QA:Validated.
Thanks

@amolnater-qasource amolnater-qasource added the QA:Validated Validated by the QA Team label Nov 15, 2022