Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay the restart of application when a status report of failure is given #25339

Merged
merged 5 commits into from
Apr 28, 2021

Conversation

blakerouse
Copy link
Contributor

@blakerouse blakerouse commented Apr 27, 2021

What does this PR do?

With the change in Fleet Server to report failure on error, this helps cleanup the flow to only restart if it stays failed for longer than 10 seconds. This allows a temporary failure to occur, before Elastic Agent would just force kill it and then restart it.

Why is it important?

It is very possible that a user gets the connection information or authentication information to elasticsearch wrong when bootstrapping with Fleet Server. In that case the Elastic Agent should show a clear error message versus just spamming the logs with constant restarts.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@blakerouse blakerouse added Team:Elastic-Agent Label for the Agent team backport-v7.13.0 Automated backport with mergify backport-v7.14.0 Automated backport with mergify labels Apr 27, 2021
@blakerouse blakerouse self-assigned this Apr 27, 2021
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Apr 27, 2021
@blakerouse blakerouse marked this pull request as ready for review April 27, 2021 15:55
@elasticmachine
Copy link
Collaborator

Pinging @elastic/agent (Team:Agent)

@elasticmachine
Copy link
Collaborator

elasticmachine commented Apr 27, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: blakerouse commented: /test

  • Start Time: 2021-04-28T14:04:48.883+0000

  • Duration: 106 min 12 sec

  • Commit: 377e45e

Test stats 🧪

Test Results
Failed 0
Passed 1698
Skipped 4
Total 1702

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 1698
Skipped 4
Total 1702

@ruflin ruflin requested review from urso and scunningham April 27, 2021 19:59
@mergify
Copy link
Contributor

mergify bot commented Apr 28, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b delay-restart-on-failure-report upstream/delay-restart-on-failure-report
git merge upstream/master
git push upstream delay-restart-on-failure-report

Copy link
Member

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM. What I wonder how to test this manually and in an automated way.

}
ctx := a.startContext
tag := a.tag

// it was marshalled to pass into the state, so unmarshall will always succeed
var cfg map[string]interface{}
_ = yaml.Unmarshal([]byte(s.Config()), &cfg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR but I don't think we should swallow the errors here.


err := a.start(ctx, tag, cfg)
if err != nil {
a.setState(state.Crashed, fmt.Sprintf("failed to restart: %s", err), nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add here a bit more info which process (name?) failed to restart?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is there, that is managed inside of the setState.

Copy link

@scunningham scunningham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems ok. A bit hard to understand. Is there a simpler way to write this?

ctx, cancel := context.WithCancel(a.startContext)
a.restartCanceller = cancel
a.restartConfig = cfg
t := time.NewTimer(a.processConfig.FailureTimeout)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this start quick with exponential backoff to limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel like that would make it even more complicated, and harder to understand the time interval in log messages. A constant time allows for log messages to be clear that its every 10 seconds (or whatever setting value set) it is restarting.

@blakerouse
Copy link
Contributor Author

Seems ok. A bit hard to understand. Is there a simpler way to write this?

Based on how the application state interacts with the process applications in Elastic Agent, not really. Open to ideas.

@blakerouse
Copy link
Contributor Author

/test

@blakerouse blakerouse merged commit 371871e into elastic:master Apr 28, 2021
@blakerouse blakerouse deleted the delay-restart-on-failure-report branch April 28, 2021 16:31
mergify bot pushed a commit that referenced this pull request Apr 28, 2021
…iven (#25339)

* Delay the restart of application when a status report of failure is given.

* Add changelog.

* Fix test and make it configurable.

* Run mage check

(cherry picked from commit 371871e)
mergify bot pushed a commit that referenced this pull request Apr 28, 2021
…iven (#25339)

* Delay the restart of application when a status report of failure is given.

* Add changelog.

* Fix test and make it configurable.

* Run mage check

(cherry picked from commit 371871e)
blakerouse added a commit that referenced this pull request Apr 28, 2021
…iven (#25339) (#25398)

* Delay the restart of application when a status report of failure is given.

* Add changelog.

* Fix test and make it configurable.

* Run mage check

(cherry picked from commit 371871e)

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
blakerouse added a commit that referenced this pull request Apr 28, 2021
…iven (#25339) (#25397)

* Delay the restart of application when a status report of failure is given.

* Add changelog.

* Fix test and make it configurable.

* Run mage check

(cherry picked from commit 371871e)

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v7.13.0 Automated backport with mergify backport-v7.14.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants