[elastic-agent] Elastic Agent shuts down when Fleet Server is unhealthy #28209

simitt · 2021-10-01T06:40:22Z

Current Behavior

When the Agent is started with Fleet Server through the setup command and the Fleet Server cannot be set up correctly, the Elastic Agent does not expose the health check endpoint and eventually shuts down. This happens, for example, when the package registry is (temporarily) unavailable, which causes issues for the Fleet setup, or when trying to enroll into an Agent Policy without a Fleet Server package.
The Cloud setup uses the Fleet preconfiguration API to set up the Cloud agent policy. When the package registry is not reachable, the agent policy is still created, but it doesn't contain a Fleet Server package policy.

Expected Behavior

The Elastic Agent should always expose the health check endpoint /processes and listen on the configured port. It should not shut down because of issues in one of its subprocesses, not even when that subprocess is the Fleet Server. The health check is designed to always return a 200 for the Elastic Agent itself, along with a list of subprocesses that are expected to be running. For every expected subprocess a pid is returned, indicating whether it is up or not.
While the agent is not usable when the Fleet Server is not up, this might be a temporary issue and therefore should not shut down the agent. It should be up to the orchestrator to decide whether to shut down the whole agent based on its health check response.

Why is this a problem

This behavior causes problems on Cloud when the initial setup fails and the agent shuts down, for example on all air-gapped ECE deployments >= 7.14. An unhealthy Fleet Server should not impact a standalone APM Server, but with the behavior described above it does.

elasticmachine · 2021-10-01T06:40:23Z

Pinging @elastic/agent (Team:Agent)

andresrc · 2021-10-01T14:55:16Z

ping @blakerouse

simitt · 2021-10-04T13:33:35Z

@blakerouse is working on a fix to always expose the HTTP endpoint, including during the bootstrapping process.

simitt added the bug, discussion, and Team:Elastic-Agent labels Oct 1, 2021

blakerouse self-assigned this Oct 5, 2021

blakerouse mentioned this issue Oct 5, 2021

Allow HTTP metrics to run in bootstrap mode. Add ability to adjust timeouts for Fleet Server. #28260 (Merged, 3 tasks)

blakerouse closed this as completed in #28260 Oct 14, 2021

simitt mentioned this issue Nov 21, 2022

V2: 8.6.0-SNAPSHOT not starting in ESS elastic/elastic-agent#1731 (Closed)
