[Elastic Agent] Set status Failed if configuration applying fails #23537
Pinging @elastic/agent (Team:Agent)
LGTM. I haven't tested it or reproduced the mentioned issue with Filebeat.
Pinging @elastic/ingest-management (Team:Ingest Management)
💚 Build Succeeded
💚 Flaky test report: Tests succeeded.
LGTM
@michalpristas Yes, that would be the case.
/package
This seems to cause a restart loop when Filebeat reports a failure back.
shows this error when changing Agent policy: Testing on a clean system, the Default Agent config was up and running on the CentOS Agent; it was healthy and had all logs monitoring in place as expected. After changing the policy to one with Endpoint included, the connection to ES seemed to drop for one of the Filebeats, which got the host into a bad state.
@EricDavisX @blakerouse So it seems that we cannot fix that problem without moving to filestream?
95e588a to ccd8bea
/package
@ph No, I think there was another issue in the code that, together with the restart, caused a restart loop. I think with that fixed this will work correctly. @EricDavisX going to give it a run through in the AM.
Just in case you need to manually re-run the e2e tests for a PR that broke them with potential flakiness: https://github.com/elastic/e2e-testing/tree/master/e2e#running-tests-for-a-beats-pull-request
Besides that, if you need to run them locally:
$> git clone https://github.com/elastic/e2e-testing.git
$> cd e2e-testing
# TAGS: optional, allows you to filter by scenario/test suite
# BEATS_USE_CI_SNAPSHOTS: will consume CI artifacts from GCP bucket
# ELASTIC_AGENT_VERSION: pr-ID
# DEVELOPER_MODE: do not destroy services after tests run, to allow SSH'ing into them for logs
# TIMEOUT_FACTOR: factor to be applied when waiting for resources or number of hits or processes (default: 1 * 3 minutes)
$> SUITE="fleet" \
TAGS="fleet_mode_agent" \
BEATS_USE_CI_SNAPSHOTS=true \
ELASTIC_AGENT_VERSION="pr-23537" \
DEVELOPER_MODE=true \
TIMEOUT_FACTOR=3 \
LOG_LEVEL=TRACE \
make -C e2e functional-test
Let's get that tested. I will remove my review.
@EricDavisX Can you approve this PR?
I pulled the Agent file generated by beats-ci from GCP, tested it on a Linux CentOS system, and found that the Fleet UI shows Healthy whenever I think it should be healthy, so it is working in this regard. Other issues are logged separately and are being triaged. This one is good to go; it is being released in concert with the newer System package, which has the conditional inputs needed.
…astic#23537) * Set status to Failed if configuration applying fails. * Add changelog. * Don't cleanup paths on crash, as it will be restart. Fix ownership. (cherry picked from commit e0881de)
What does this PR do?
Adjusted libbeat to report a failed configuration reload with the Failed status.
Why is it important?
Without this the running beat will stay degraded until the next configuration reload. If applying configuration fails then it is really an error and Elastic Agent should kill it and restart the beat (which it will do with this change).
Checklist
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Related issues