Delay the restart of application when a status report of failure is given #25339

blakerouse · 2021-04-27T15:54:16Z

What does this PR do?

Now that Fleet Server reports failure on error, this PR cleans up the flow so that Elastic Agent restarts the application only if it stays failed for longer than 10 seconds. This allows a temporary failure to occur without Elastic Agent immediately force-killing and restarting the application.
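To illustrate the delayed-restart idea, here is a minimal sketch, not the code from this PR. The type `delayedRestarter`, its methods, and the helper `simulate` are all hypothetical names, and the sketch uses `time.AfterFunc` for brevity. A failure report starts a countdown, and a healthy report before the timeout cancels the pending restart.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// delayedRestarter is a hypothetical helper, not a type from Elastic Agent.
// It restarts an application only if it stays failed for the whole timeout.
type delayedRestarter struct {
	mu      sync.Mutex
	timeout time.Duration
	pending *time.Timer // non-nil while a restart is scheduled
	restart func()
}

func newDelayedRestarter(timeout time.Duration, restart func()) *delayedRestarter {
	return &delayedRestarter{timeout: timeout, restart: restart}
}

// ReportFailed schedules a restart unless one is already pending.
func (r *delayedRestarter) ReportFailed() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.pending != nil {
		return
	}
	var t *time.Timer
	t = time.AfterFunc(r.timeout, func() {
		r.mu.Lock()
		// Only fire if this timer is still the pending one, so a
		// cancellation that raced with the timeout wins.
		fire := r.pending == t
		if fire {
			r.pending = nil
		}
		r.mu.Unlock()
		if fire {
			r.restart()
		}
	})
	r.pending = t
}

// ReportHealthy cancels a pending restart, if any.
func (r *delayedRestarter) ReportHealthy() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.pending != nil {
		r.pending.Stop()
		r.pending = nil
	}
}

// simulate reports a failure, waits failedForMs milliseconds, reports
// healthy, and returns whether a restart fired in the meantime.
func simulate(timeoutMs, failedForMs int) bool {
	var restarts int32
	r := newDelayedRestarter(time.Duration(timeoutMs)*time.Millisecond, func() {
		atomic.AddInt32(&restarts, 1)
	})
	r.ReportFailed()
	time.Sleep(time.Duration(failedForMs) * time.Millisecond)
	r.ReportHealthy()
	return atomic.LoadInt32(&restarts) > 0
}

func main() {
	fmt.Println("temporary failure restarted:", simulate(200, 10))
	fmt.Println("persistent failure restarted:", simulate(20, 200))
}
```

The timeouts here are milliseconds only to keep the demo fast; the description above talks about 10 seconds.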

Why is it important?

It is very possible that a user gets the connection or authentication information for Elasticsearch wrong when bootstrapping with Fleet Server. In that case Elastic Agent should show a clear error message rather than spamming the logs with constant restarts.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
~~[ ] I have added tests that prove my fix is effective or that my feature works~~
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

[Windows and Linux ARM64]: Unable to install agent with Fleet Server URL. fleet-server#235

elasticmachine · 2021-04-27T15:55:16Z

Pinging @elastic/agent (Team:Agent)

elasticmachine · 2021-04-27T16:37:33Z

💚 Build Succeeded

Build stats

Build Cause: blakerouse commented: /test
Start Time: 2021-04-28T14:04:48.883+0000
Duration: 106 min 12 sec
Commit: 377e45e

Test stats 🧪

| Test | Results |
| --- | --- |
| Failed | 0 |
| Passed | 1698 |
| Skipped | 4 |
| Total | 1702 |

💚 Flaky test report

Tests succeeded.

mergify · 2021-04-28T06:04:32Z

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b delay-restart-on-failure-report upstream/delay-restart-on-failure-report
git merge upstream/master
git push upstream delay-restart-on-failure-report

ruflin

Code LGTM. What I wonder is how to test this manually and in an automated way.

ruflin · 2021-04-28T06:15:31Z

x-pack/elastic-agent/pkg/core/plugin/process/status.go

-		}
-		ctx := a.startContext
-		tag := a.tag
-
 		// it was marshalled to pass into the state, so unmarshall will always succeed
 		var cfg map[string]interface{}
 		_ = yaml.Unmarshal([]byte(s.Config()), &cfg)


Unrelated to this PR but I don't think we should swallow the errors here.
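As a rough sketch of what not swallowing the error could look like: the helper `parseConfig` below is hypothetical, and it uses `encoding/json` from the standard library as a stand-in for the `yaml.Unmarshal` call in the diff so that the example is self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseConfig is a hypothetical helper. It uses encoding/json as a
// standard-library stand-in for the yaml.Unmarshal call shown in the diff.
// Instead of discarding the error with `_ =`, it wraps and returns it.
func parseConfig(raw string) (map[string]interface{}, error) {
	var cfg map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return nil, fmt.Errorf("failed to unmarshal stored config: %w", err)
	}
	return cfg, nil
}

func main() {
	if _, err := parseConfig(`{"fleet": {"enabled": true}}`); err != nil {
		fmt.Println("unexpected error:", err)
	}
	// A truncated document surfaces an error instead of a silently empty map.
	if _, err := parseConfig(`{"fleet": `); err != nil {
		fmt.Println("error surfaced:", err)
	}
}
```

Even when the input is expected to always be valid, returning the error keeps a future regression from turning into a silently empty configuration.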

ruflin · 2021-04-28T06:17:51Z

x-pack/elastic-agent/pkg/core/plugin/process/status.go

+
+	err := a.start(ctx, tag, cfg)
+	if err != nil {
+		a.setState(state.Crashed, fmt.Sprintf("failed to restart: %s", err), nil)


Can we add a bit more info here about which process (name?) failed to restart?

That is already there; it is handled inside of setState.

scunningham

Seems ok. A bit hard to understand. Is there a simpler way to write this?

scunningham · 2021-04-28T13:10:37Z

x-pack/elastic-agent/pkg/core/plugin/process/status.go

+	ctx, cancel := context.WithCancel(a.startContext)
+	a.restartCanceller = cancel
+	a.restartConfig = cfg
+	t := time.NewTimer(a.processConfig.FailureTimeout)


Should this start quickly and use exponential backoff up to a limit?

I feel like that would make it even more complicated, and make the time interval in log messages harder to understand. A constant interval lets the log messages make clear that it is restarting every 10 seconds (or whatever value is configured).
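To make the trade-off in this thread concrete, here is a small sketch comparing the two schedules. The functions `constantDelay` and `exponentialDelay` are hypothetical and are not part of Elastic Agent; the 1 second base and 10 second cap are assumed values for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// constantDelay returns the same delay for every attempt, so a log line
// can always say "restarting in <base>".
func constantDelay(base time.Duration, attempt int) time.Duration {
	return base
}

// exponentialDelay doubles the delay on each attempt and caps it at max.
func exponentialDelay(base, max time.Duration, attempt int) time.Duration {
	d := base
	if d >= max {
		return max
	}
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: constant=%v exponential=%v\n",
			attempt,
			constantDelay(10*time.Second, attempt),
			exponentialDelay(time.Second, 10*time.Second, attempt))
	}
}
```

With the constant schedule the interval in every log message is the same, whereas the exponential schedule needs the current delay included in each message to stay readable.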

blakerouse · 2021-04-28T13:18:48Z

> Seems ok. A bit hard to understand. Is there a simpler way to write this?

Based on how the application state interacts with the process applications in Elastic Agent, not really. Open to ideas.

blakerouse · 2021-04-28T14:04:27Z

/test

…iven (#25339) * Delay the restart of application when a status report of failure is given. * Add changelog. * Fix test and make it configurable. * Run mage check (cherry picked from commit 371871e)

…iven (#25339) (#25398) * Delay the restart of application when a status report of failure is given. * Add changelog. * Fix test and make it configurable. * Run mage check (cherry picked from commit 371871e) Co-authored-by: Blake Rouse <blake.rouse@elastic.co>

…iven (#25339) (#25397) * Delay the restart of application when a status report of failure is given. * Add changelog. * Fix test and make it configurable. * Run mage check (cherry picked from commit 371871e) Co-authored-by: Blake Rouse <blake.rouse@elastic.co>

Delay the restart of application when a status report of failure is g…

1097324

…iven.

blakerouse added the Team:Elastic-Agent, backport-v7.13.0, and backport-v7.14.0 labels Apr 27, 2021

blakerouse self-assigned this Apr 27, 2021

botelastic bot added and removed the needs_team label Apr 27, 2021

Add changelog.

2f89ae9

blakerouse marked this pull request as ready for review April 27, 2021 15:55

ruflin requested review from urso and scunningham April 27, 2021 19:59

ruflin approved these changes Apr 28, 2021

blakerouse added 3 commits April 28, 2021 07:37

Fix test and make it configurable.

f754066

Merge branch 'master' into delay-restart-on-failure-report

c6977d2

Run mage check

377e45e

scunningham approved these changes Apr 28, 2021

blakerouse merged commit 371871e into elastic:master Apr 28, 2021

blakerouse deleted the delay-restart-on-failure-report branch April 28, 2021 16:31

mergify bot mentioned this pull request Apr 28, 2021

Delay the restart of application when a status report of failure is given (backport #25339) #25397

Merged

mergify bot mentioned this pull request Apr 28, 2021

Delay the restart of application when a status report of failure is given (backport #25339) #25398

Merged
