[elastic-agent] Elastic Agent shuts down when Fleet Server is unhealthy #28209

Closed
simitt opened this issue Oct 1, 2021 · 3 comments · Fixed by #28260

@simitt
Contributor

simitt commented Oct 1, 2021

Current Behavior

When the Agent is started with Fleet Server via the setup command and the Fleet Server cannot be set up correctly, the Elastic Agent does not expose the health check endpoint and eventually shuts down. This happens, for example, when the package registry is (temporarily) unavailable, causing issues for the Fleet setup, or when trying to enroll into an Agent Policy without a Fleet Server package.
The Cloud setup uses the Fleet preconfiguration API to set up the Cloud agent policy. When the package registry is not reachable, the agent policy is still created, but it doesn't contain a Fleet Server package policy.

Expected Behavior

The Elastic Agent should always expose the health check endpoint /processes and listen on the configured port. It should not shut down because of issues in one of its subprocesses, not even when that subprocess is the Fleet Server. The health check is designed to always return a 200 for the Elastic Agent itself, along with a list of subprocesses that are expected to be running. For every expected subprocess a pid is returned, indicating whether it is up or not.
While the agent is not usable when the Fleet Server is not up, this might be a temporary issue and therefore should not shut down the agent. It should be up to the orchestrator to decide whether to shut down the whole agent based on its health check response.
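
To make the expected contract concrete, here is a minimal Python sketch of the idea, not the Elastic Agent's actual implementation. The JSON shape (`processes`, `name`, `pid`, `up`), the `Subprocess` class, and the helper names are assumptions made up for illustration; only the /processes path, the always-200 status, and the per-subprocess pid come from the description above.

```python
import json
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler
from typing import List, Optional, Tuple


@dataclass
class Subprocess:
    """An expected subprocess (hypothetical type); pid is None while it is not running."""
    name: str
    pid: Optional[int] = None


def processes_response(expected: List[Subprocess]) -> Tuple[int, str]:
    """Build the (status, body) pair for GET /processes.

    The status describes the agent itself, so it is 200 even when
    every expected subprocess is down. The body layout is an assumption.
    """
    body = {
        "processes": [
            {"name": p.name, "pid": p.pid, "up": p.pid is not None}
            for p in expected
        ]
    }
    return 200, json.dumps(body)


def make_handler(expected: List[Subprocess]):
    """Create an HTTP handler class that serves /processes for the given list."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/processes":
                self.send_error(404)
                return
            status, body = processes_response(expected)
            payload = body.encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return Handler


def down_processes(body: str) -> List[str]:
    """Orchestrator side: names of expected subprocesses that are not up."""
    return [p["name"] for p in json.loads(body)["processes"] if not p["up"]]
```

With this shape, an orchestrator could poll the endpoint, see that Fleet Server has no pid while APM Server does, and decide for itself whether to restart the agent. Serving it would look like `HTTPServer(("127.0.0.1", port), make_handler(expected)).serve_forever()` from `http.server`, where `port` stands in for the configured port.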

Why is this a problem

This behavior causes problems on Cloud when the initial setup fails and the agent shuts down, for example on all ECE air-gapped deployments >= 7.14. An unhealthy Fleet Server should not impact a standalone APM Server, but with the behavior described above it does.

simitt added the bug, discussion, and Team:Elastic-Agent labels on Oct 1, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@andresrc
Contributor

andresrc commented Oct 1, 2021

ping @blakerouse

@simitt
Contributor Author

simitt commented Oct 4, 2021

@blakerouse is working on a fix to always expose the HTTP endpoint, including during the bootstrapping process.
