[Elastic Agent] Set status Failed if configuration applying fails #23537

blakerouse · 2021-01-15T18:31:26Z

What does this PR do?

Adjusts libbeat to report a failed configuration reload with the status Failed.

Why is it important?

Without this, the running beat stays degraded until the next configuration reload. A failure to apply configuration is a real error, so Elastic Agent should kill and restart the beat, which it does with this change.
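A minimal Go sketch of the behavior described above, for illustration only. The `Status`, `Reporter`, `recorder`, and `applyConfig` names are hypothetical and are not libbeat's actual API; the sketch only shows a reload error being reported as Failed instead of leaving the process Degraded.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical health state. The names mirror the states
// mentioned in this PR, not libbeat's real types.
type Status int

const (
	Healthy Status = iota
	Degraded
	Failed
)

func (s Status) String() string {
	switch s {
	case Healthy:
		return "HEALTHY"
	case Degraded:
		return "DEGRADED"
	case Failed:
		return "FAILED"
	}
	return "UNKNOWN"
}

// Reporter is a hypothetical sink for status updates sent to the supervisor.
type Reporter interface {
	Report(s Status, msg string)
}

// recorder keeps the last reported status so the example can be inspected.
type recorder struct {
	last Status
	msg  string
}

func (r *recorder) Report(s Status, msg string) { r.last, r.msg = s, msg }

// errReload stands in for any error returned while applying configuration.
var errReload = errors.New("error creating runner from config")

// applyConfig runs reload and reports Failed (not Degraded) when it errors,
// so that a supervisor can decide to restart the process.
func applyConfig(r Reporter, reload func() error) error {
	if err := reload(); err != nil {
		r.Report(Failed, fmt.Sprintf("applying configuration failed: %v", err))
		return err
	}
	r.Report(Healthy, "configuration applied")
	return nil
}

func main() {
	r := &recorder{}
	_ = applyConfig(r, func() error { return errReload })
	fmt.Println(r.last, r.msg)
}
```

In this sketch the decision to restart stays with whoever consumes the reported status; the reload path only makes the failure visible.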

Checklist

My code follows the style guidelines of this project
~~[ ] I have commented my code, particularly in hard-to-understand areas~~
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
~~[ ] I have added tests that prove my fix is effective or that my feature works~~
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

Relates [Agent] [Filebeat] when Agent changes policy Filebeat config can trip up and Agent gets stuck on 'unhealthy' #23518

elasticmachine · 2021-01-15T18:32:44Z

Pinging @elastic/agent (Team:Agent)

ph

LGTM. I haven't tested it or reproduced the mentioned issue with Filebeat.

elasticmachine · 2021-01-15T18:34:43Z

Pinging @elastic/ingest-management (Team:Ingest Management)

elasticmachine · 2021-01-15T18:35:25Z

💚 Build Succeeded

Build stats

Build Cause: Pull request #23537 updated
Start Time: 2021-01-19T20:49:31.909+0000
Duration: 50 min 31 sec
Commit: ccd8bea

Test stats 🧪

Test	Results
Failed	0
Passed	5468
Skipped	358
Total	5826

💚 Flaky test report

Tests succeeded.

michalpristas · 2021-01-19T09:39:08Z

LGTM.
Small question though: can this lead to a restart loop when the beat is incapable of recognizing the config?

blakerouse · 2021-01-19T13:05:34Z

@michalpristas Yes, that would be the case.
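To illustrate the concern raised here, a toy Go model of a supervisor that restarts a process every time it fails to start. The `supervise` function and `errBadConfig` are hypothetical and do not describe Elastic Agent's implementation. The `maxAttempts` cap exists only so the example terminates; it is an assumption of the sketch, not something the Agent is stated to do.

```go
package main

import (
	"errors"
	"fmt"
)

// errBadConfig stands in for a configuration the process can never accept.
var errBadConfig = errors.New("config not recognized")

// supervise calls start until it succeeds or maxAttempts is reached.
// It returns the number of attempts made and the last error, if any.
func supervise(start func() error, maxAttempts int) (attempts int, err error) {
	for attempts < maxAttempts {
		attempts++
		if err = start(); err == nil {
			return attempts, nil
		}
	}
	return attempts, err
}

func main() {
	// A start function that always rejects its config uses up every attempt.
	attempts, err := supervise(func() error { return errBadConfig }, 5)
	fmt.Println(attempts, err)
}
```

Without the cap, a start function that always returns an error would keep the loop running indefinitely, which is the restart loop being asked about.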

EricDavisX · 2021-01-19T15:27:50Z

/package

blakerouse · 2021-01-19T20:35:02Z

This seems to cause a restart loop when Filebeat reports the failure back.

[elastic_agent][warn] Elastic Agent status changed to: 'degraded'
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to RESTARTING: Restarting
[elastic_agent][error] 2021-01-19T14:34:40-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to FAILED: 1 error: 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::25689473-64768, Finished: false, Fileinfo: &{secure 2537 384 {230053202 63746680987 0x6827760} {64768 25689473 1 33152 0 0 0 0 2537 4096 8 {1600454102 163846939} {1611084187 230053202} {1611084187 230053202} [0 0 0]}}, Source: /var/log/secure, Offset: 5301, Timestamp: 2021-01-19 14:30:01.136606369 -0500 EST m=+401.357986507, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 25689473-64768}
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to STARTING: Starting
[elastic_agent][error] 2021-01-19T14:34:40-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to CRASHED: exited with code: 1
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to STARTING: Starting
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to RESTARTING: Restarting
[elastic_agent][error] 2021-01-19T14:34:40-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to CRASHED: exited with code: 1
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to STARTING: Starting
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to RESTARTING: Restarting
[elastic_agent][error] 2021-01-19T14:34:40-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to CRASHED: exited with code: 1
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to STARTING: Starting
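A small Go sketch for reading a log like the one above: it pulls the `State changed to ...` transitions out of each line so repeated CRASHED entries are easy to count. The helper names `transitions` and `countState` are hypothetical, and the sample lines in `main` are abbreviated stand-ins for the log lines, not real Agent output.

```go
package main

import (
	"fmt"
	"regexp"
)

// stateRe captures the state name that follows "State changed to".
var stateRe = regexp.MustCompile(`State changed to ([A-Z]+):`)

// transitions returns the state names found in the given log lines, in order.
// Lines without a state change are skipped.
func transitions(lines []string) []string {
	var out []string
	for _, line := range lines {
		if m := stateRe.FindStringSubmatch(line); m != nil {
			out = append(out, m[1])
		}
	}
	return out
}

// countState reports how many times state appears in states.
func countState(states []string, state string) int {
	n := 0
	for _, s := range states {
		if s == state {
			n++
		}
	}
	return n
}

func main() {
	// Abbreviated sample lines; the application ID and timestamps are omitted.
	lines := []string{
		"Application: filebeat: State changed to RESTARTING: Restarting",
		"Application: filebeat: State changed to CRASHED: exited with code: 1",
		"Application: filebeat: State changed to STARTING: Starting",
		"Application: filebeat: State changed to CRASHED: exited with code: 1",
	}
	states := transitions(lines)
	fmt.Println(states, "crashes:", countState(states, "CRASHED"))
}
```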

EricDavisX · 2021-01-19T20:36:45Z

This error shows up when changing the Agent policy:
{"log.level":"error","@timestamp":"2021-01-19T15:06:02.971-0500","log.origin":{"file.name":"instance/beat.go","file.line":952},"message":"Exiting: could not start the HTTP server for the API: listen unix /tmp/elastic-agent/default/filebeat/filebeat.sock: bind: no such file or directory","ecs.version":"1.6.0"}

Testing on a clean system, the Default Agent config was up and running on the centos Agent and it was healthy and had all logs monitoring in place as expected.

After changing policy to one with Endpoint included, the connection to ES seemed to drop for one of the Filebeats and got the host into a bad state.
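For context on the `bind: no such file or directory` message: binding a Unix socket fails this way when the socket's parent directory does not exist. The Go sketch below shows one way a process could guard against that by recreating the directory before listening. `listenUnix` and `demo` are hypothetical helpers written for this example; this is not presented as the change made in this PR.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// listenUnix is a hypothetical helper: it makes sure the socket's parent
// directory exists, removes a stale socket file if present, then listens.
func listenUnix(path string) (net.Listener, error) {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return nil, fmt.Errorf("creating socket directory: %w", err)
	}
	if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
		return nil, fmt.Errorf("removing stale socket: %w", err)
	}
	return net.Listen("unix", path)
}

// demo listens on a socket whose parent directory does not exist yet and
// returns the listener's network name.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "sock")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	// "missing" is not created here; listenUnix is expected to create it.
	l, err := listenUnix(filepath.Join(dir, "missing", "fb.sock"))
	if err != nil {
		return "", err
	}
	defer l.Close()
	return l.Addr().Network(), nil
}

func main() {
	network, err := demo()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	fmt.Println("listening on network:", network)
}
```

The alternative is to never remove the directory while the process is expected to come back, which avoids the need to recreate it at start.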

ph · 2021-01-19T20:47:36Z

@EricDavisX @blakerouse Well, it seems that without moving to filestream we cannot fix that problem?

blakerouse · 2021-01-19T20:48:35Z

/package

blakerouse · 2021-01-19T21:57:55Z

@ph No, I think there was another issue in the code that, combined with the restart, caused a restart loop. I think with that fixed this will work correctly.

@EricDavisX going to give it a run through in the AM.

mdelapenya · 2021-01-19T22:10:06Z

Just in case you need to manually re-run the e2e tests for a PR that broke them with potential flakiness: https://github.com/elastic/e2e-testing/tree/master/e2e#running-tests-for-a-beats-pull-request

Besides that, if you need to run them locally:

$> git clone https://github.com/elastic/e2e-testing.git
$> cd e2e-testing
# TAGS is optional and filters by scenario/test suite.
# BEATS_USE_CI_SNAPSHOTS=true consumes CI artifacts from the GCP bucket.
# ELASTIC_AGENT_VERSION takes the form pr-ID.
# DEVELOPER_MODE=true does not destroy services after the tests run, to allow SSH'ing into them for logs.
# TIMEOUT_FACTOR is applied when waiting for resources, number of hits, or processes (default: 1 * 3 minutes).
$> SUITE="fleet" \
    TAGS="fleet_mode_agent" \
    BEATS_USE_CI_SNAPSHOTS=true \
    ELASTIC_AGENT_VERSION="pr-23537" \
    DEVELOPER_MODE=true \
    TIMEOUT_FACTOR=3 \
    LOG_LEVEL=TRACE \
    make -C e2e functional-test

Let's get that tested. I will remove my review.

ph · 2021-01-20T21:06:54Z

@EricDavisX Can you approve this PR?

EricDavisX

I pulled the Agent file generated by beats-ci from GCP and tested it on a Linux CentOS system. The Fleet UI always shows Healthy when I think it should still be healthy, so it is working in this regard. Other issues are logged separately and are being triaged. This one is good to go; it is being released in concert with the newer System package, which has the conditional inputs needed.

…astic#23537) * Set status to Failed if configuration applying fails. * Add changelog. * Don't cleanup paths on crash, as it will be restart. Fix ownership. (cherry picked from commit e0881de)

…3537) (#23600) * Set status to Failed if configuration applying fails. * Add changelog. * Don't cleanup paths on crash, as it will be restart. Fix ownership. (cherry picked from commit e0881de)

…3537) (#23601) * Set status to Failed if configuration applying fails. * Add changelog. * Don't cleanup paths on crash, as it will be restart. Fix ownership. (cherry picked from commit e0881de)

blakerouse added the Team:Elastic-Agent label Jan 15, 2021

blakerouse self-assigned this Jan 15, 2021

botelastic bot added and removed the needs_team label Jan 15, 2021

blakerouse marked this pull request as ready for review January 15, 2021 18:32

ph previously approved these changes Jan 15, 2021

botelastic bot added the Team:Ingest Management label Jan 15, 2021

EricDavisX mentioned this pull request Jan 19, 2021

[Agent] [Filebeat] when Agent changes policy Filebeat config can trip up and Agent gets stuck on 'unhealthy' #23518

Closed

blakerouse added 3 commits January 19, 2021 15:48

Set status to Failed if configuration applying fails.

b996f7a

Add changelog.

7e0d2fc

Don't cleanup paths on crash, as it will be restart. Fix ownership.

ccd8bea

blakerouse force-pushed the fix-libbeat-agent-degraded branch from 95e588a to ccd8bea Compare January 19, 2021 20:48

ph requested a review from EricDavisX January 20, 2021 21:06

ph approved these changes Jan 20, 2021

EricDavisX approved these changes Jan 20, 2021

blakerouse merged commit e0881de into elastic:master Jan 20, 2021

blakerouse deleted the fix-libbeat-agent-degraded branch January 20, 2021 21:18

blakerouse mentioned this pull request Jan 20, 2021

Cherry-pick #23537 to 7.x: [Elastic Agent] Set status Failed if configuration applying fails #23600

Merged

2 tasks

blakerouse added the v7.12.0 label Jan 20, 2021

blakerouse mentioned this pull request Jan 20, 2021

Cherry-pick #23537 to 7.11: [Elastic Agent] Set status Failed if configuration applying fails #23601

Merged

2 tasks

blakerouse added the v7.11.0 label Jan 20, 2021
