Delay the restart of application when a status report of failure is given #25339
Pinging @elastic/agent (Team:Agent)
💚 Build Succeeded
This pull request is now in conflicts. Could you fix it? 🙏
Code LGTM. What I wonder is how to test this, both manually and in an automated way.
}
ctx := a.startContext
tag := a.tag

// it was marshalled to pass into the state, so unmarshall will always succeed
var cfg map[string]interface{}
_ = yaml.Unmarshal([]byte(s.Config()), &cfg)
Unrelated to this PR but I don't think we should swallow the errors here.
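One way to avoid swallowing the error is sketched below. This is only an illustration under stated assumptions: `decodeConfig` is a hypothetical helper that does not appear in the PR, and the standard library's `encoding/json` stands in for the YAML decoder so that the example is self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeConfig is a hypothetical helper (not from the PR). It decodes a
// serialized config and returns any error to the caller instead of
// discarding it. encoding/json stands in for the YAML decoder here.
func decodeConfig(raw string) (map[string]interface{}, error) {
	var cfg map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		// Wrap the error so the caller can report why the restart was skipped.
		return nil, fmt.Errorf("failed to decode stored config: %w", err)
	}
	return cfg, nil
}

func main() {
	// A truncated document triggers the error path.
	if _, err := decodeConfig(`{"hosts": [`); err != nil {
		fmt.Println("restart aborted:", err)
	}
}
```

With this shape, the caller could route the error into the same state update that is used for a failed restart, rather than continuing with a nil config.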
err := a.start(ctx, tag, cfg)
if err != nil {
	a.setState(state.Crashed, fmt.Sprintf("failed to restart: %s", err), nil)
Can we add a bit more info here about which process (name?) failed to restart?
That is there; it is managed inside of setState.
Seems ok. A bit hard to understand. Is there a simpler way to write this?
ctx, cancel := context.WithCancel(a.startContext)
a.restartCanceller = cancel
a.restartConfig = cfg
t := time.NewTimer(a.processConfig.FailureTimeout)
Should this start quickly and use exponential backoff up to a limit?
I feel like that would make it even more complicated, and make the time interval in log messages harder to understand. A constant interval lets the log messages make clear that it is restarting every 10 seconds (or whatever value is configured).
Based on how the application state interacts with the process applications in Elastic Agent, not really. Open to ideas.
/test
What does this PR do?
With the change in Fleet Server to report failure on error, this helps clean up the flow so that the application is only restarted if it stays failed for longer than 10 seconds. This tolerates a temporary failure instead of having Elastic Agent immediately force kill the application and then restart it.
Why is it important?
It is very possible that a user gets the connection information or authentication information for Elasticsearch wrong when bootstrapping with Fleet Server. In that case, Elastic Agent should show a clear error message rather than just spamming the logs with constant restarts.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Related issues