gracefully exit brupop agent when rebooting system #218
agent/src/apiclient.rs
// reboot sends signal to kill the agent; instead of that, we should gracefully exit brupop agent.
_ => {
    event!(
I'm a bit surprised that we can handle reboots this deep in the abstraction layer, and not higher up when we actually call the API to trigger a reboot.
I think more specifically, I'm not sure I understand how a reboot ultimately causes this codepath to be executed. Can you elaborate?
@cbgbt Yeah! When I looked into the error thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1' and the code, I found that the root cause of the agent panic is here.
When apiclient successfully called reboot, I expected output.status.success() == true. However, it returned false and stderr was empty, even though the reboot worked as expected. The status of the reboot command was Exit Status(unix wait status(15)), stderr='', which means the program received a request to terminate. Therefore, I think this situation should be treated as success when calling reboot: add an extra match condition there to handle the signal that is sent to kill the agent, and gracefully exit the agent.
Would it make sense to only exit() if we receive that exit status? Reading this code, it's hard to understand the interaction you mentioned here.
You mean it's better to do it like this, exiting on a specific exit status?
match error_statuscode {
    ... => ....,
    15 => std::process::exit(0)
}
The reason I didn't do that is that the logic here retrieves the exit status by extracting it from stderr. However, the reboot call reports failure with an empty stderr, so the exit status can't be extracted. Therefore, the logic here is: if the command fails but stderr is empty, it should be the reboot, and we need to exit the agent.
Yeah, I agree that it's not straightforward and is hard to read. Maybe it would be better to add more description here? Is that a good approach, or is there a better way to handle this?
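For illustration only, here is a minimal sketch of the alternative being discussed: classifying a finished command by its ExitStatus rather than by parsing stderr. This is not the code in this PR. CommandOutcome, classify, and run_and_classify are hypothetical names, and the sketch assumes a Unix target where std::os::unix::process::ExitStatusExt is available.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, Output};

/// Hypothetical outcome type; not part of the brupop agent.
#[derive(Debug, PartialEq, Eq)]
enum CommandOutcome {
    /// The process exited with status code 0.
    Success,
    /// The process was killed by the given signal (15 is SIGTERM on Linux).
    TerminatedBySignal(i32),
    /// The process exited on its own with a non-zero code.
    Failed { code: Option<i32>, stderr: String },
}

/// Classify a finished command without relying on stderr being non-empty.
fn classify(output: &Output) -> CommandOutcome {
    if output.status.success() {
        return CommandOutcome::Success;
    }
    // `signal()` is Some(n) only when the process was terminated by signal n.
    if let Some(sig) = output.status.signal() {
        return CommandOutcome::TerminatedBySignal(sig);
    }
    CommandOutcome::Failed {
        code: output.status.code(),
        stderr: String::from_utf8_lossy(&output.stderr).into_owned(),
    }
}

/// Run a program to completion and classify how it ended.
fn run_and_classify(program: &str, args: &[&str]) -> std::io::Result<CommandOutcome> {
    let output = Command::new(program).args(args).output()?;
    Ok(classify(&output))
}
```

With something like this, a caller could match on TerminatedBySignal(15) explicitly instead of inferring a reboot from "failed with empty stderr". Whether that fits the agent's existing error handling is a separate question.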
You mentioned:
The status of reboot command was Exit Status(unix wait status(15)), stderr=''
Are we saying that the apiserver received signal 15 (SIGTERM) and still returned this result? Or did our brupop agent receive SIGTERM? If it's our own agent, we could possibly write a signal handler. If it's from the apiserver, can we extract it from the Exit Status you mentioned above somehow?
I'd be okay with explaining the situation in a comment if there's not a better way to make it more clear in the code.
Yeah, apiserver received signal 15 (SIGTERM).
On Unix, this will return None if the process was terminated by a signal (reboot sends signal 15 (SIGTERM)). Signal termination is not considered a success. After further investigation, I would say the current method is clearer. I'll add more explanation to make the code more readable.
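To make the quoted behavior concrete, the sketch below rebuilds the raw wait status 15 from the log above and inspects it with the standard library. It assumes a Unix target. describe and status_from_wait are hypothetical helper names used only for this illustration.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// Build an ExitStatus from a raw Unix wait status (hypothetical helper).
fn status_from_wait(raw: i32) -> ExitStatus {
    ExitStatus::from_raw(raw)
}

/// Summarize how a process ended (hypothetical helper).
fn describe(status: ExitStatus) -> String {
    match (status.code(), status.signal()) {
        (Some(0), _) => "exited successfully".to_string(),
        (Some(code), _) => format!("exited with code {code}"),
        // `code()` is None when the process was terminated by a signal.
        (None, Some(sig)) => format!("terminated by signal {sig}"),
        (None, None) => "stopped or otherwise unclassified".to_string(),
    }
}
```

For a raw wait status of 15, success() is false, code() is None, and signal() is Some(15), which matches the failed status with empty stderr described above.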
ce0cab2 to 9d3cb56
When the brupop agent reboots a node, the reboot sends a signal that terminates the agent process. Instead, the brupop agent should exit gracefully rather than being killed.
9d3cb56 to b7fa9f2
I added an explanation to the changed code.
When the brupop agent reboots a node, the reboot sends a signal that
terminates the agent process. Instead, the brupop agent should exit
gracefully rather than being killed.
Issue number:
#149
Description of changes:
alarm log
Testing done:
Launch host with new changes
The alarm log has been removed; current reboot log
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.