Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gracefully exit brupop agent when rebooting system #218

Merged
merged 1 commit into from
Aug 17, 2022

Conversation

gthao313
Copy link
Member

@gthao313 gthao313 commented Jun 28, 2022

When the brupop agent reboots a node, it allows the reboot to send
a signal to the process, terminating it. Instead of that, We should
have the brupop agent exit gracefully rather than allowing it to be killed.

Issue number:
#149

Description of changes:

Author: Tianhao Geng <tianhg@amazon.com>
Date:   Tue Jun 28 07:25:13 2022 +0000

    gracefully exit brupop agent when rebooting system

    When the brupop agent reboots a node, it allows the reboot to send
    a signal to the process, terminating it. Instead of that, We should
    have the brupop agent exit gracefully rather than allowing it to be killed.

alarm log

thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', agent/src/apiclient.rs:112:56

Testing done:
Launch host with new changes

  • check if agent gracefully exits
  • check if alarming-looking agent log message on reboot has been removed

alarm log had been removed and current reboot log

{"v":0,"name":"agent","msg":"[HANDLE_STATE_TRANSITION - EVENT] Bottlerocket node is terminated by reboot signal","level":30,"hostname":"brupop-agent-x9x5c","pid":1,"time":"2022-06-28T07:12:50.525208202+00:00","target":"agent::apiclient","line":158,"file":"agent/src/apiclient.rs","bottlerocket_shadow":"BottlerocketShadow { metadata: ObjectMeta { annotations: None, cluster_name: None, creation_timestamp: Some(Time(2022-06-28T07:05:04Z)), deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: None, generate_name: None, generation: Some(3), labels: None, managed_fields: Some([ManagedFieldsEntry { api_version: Some(\"brupop.bottlerocket.aws/v2\"), fields_type: Some(\"FieldsV1\"), fields_v1: Some(FieldsV1(Object({\"f:metadata\": Object({\"f:ownerReferences\": Object({\".\": Object({}), \"k:{\\\"uid\\\":\\\"e77d33d5-2f4f-491e-9755-cb9ee36545b7\\\"}\": Object({\".\": Object({}), \"f:apiVersion\": Object({}), \"f:kind\": Object({}), \"f:name\": Object({}), \"f:uid\": Object({})})})}), \"f:spec\": Object({\".\": Object({}), \"f:state\": Object({}), \"f:state_transition_timestamp\": Object({}), \"f:version\": Object({})}), \"f:status\": Object({\".\": Object({}), \"f:crash_count\": Object({}), \"f:current_state\": Object({}), \"f:current_version\": Object({}), \"f:target_version\": Object({})})}))), manager: Some(\"unknown\"), operation: Some(\"Update\"), time: Some(Time(2022-06-28T07:05:24Z)) }]), name: Some(\"brs-ip-192-168-141-69.us-west-2.compute.internal\"), namespace: Some(\"brupop-bottlerocket-aws\"), owner_references: Some([OwnerReference { api_version: \"v1\", block_owner_deletion: None, controller: None, kind: \"Node\", name: \"ip-192-168-141-69.us-west-2.compute.internal\", uid: \"e77d33d5-2f4f-491e-9755-cb9ee36545b7\" }]), resource_version: Some(\"14629549\"), self_link: None, uid: Some(\"391f21d1-b22a-4c88-a1d5-efa73293fad3\") }, spec: BottlerocketShadowSpec { state: RebootedIntoUpdate, state_transition_timestamp: Some(\"2022-06-28T07:12:21.899994323+00:00\"), version: Some(\"1.8.0\") }, status: Some(BottlerocketShadowStatus { current_version: \"1.6.1\", target_version: \"1.8.0\", current_state: StagedAndPerformUpdate, crash_count: 0, state_transition_failure_timestamp: None }) }"}

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@gthao313 gthao313 marked this pull request as ready for review June 28, 2022 17:01
@gthao313 gthao313 requested a review from rpkelly July 6, 2022 02:13
Comment on lines 156 to 163
// reboot sends signal to kill the agent; instead of that, we should gracefully exit brupop agent.
_ => {
event!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised that we can handle reboots this deep in the abstraction layer, and not higher up when we actually call the API to trigger a reboot.

I think more specifically, I'm not sure I understand how a reboot ultimately causes this codepath to be executed. Can you elaborate?

Copy link
Member Author

@gthao313 gthao313 Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbgbt Yeah! When I looked into the error thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1' and code, I found the root cause to make agent panic is here.

When apiclient successfully called reboot, I expected it to return output.status.successed() == true. However, it returned false and stderr was empty even the reboot worked as expected. The status of reboot command was Exit Status(unix wait status(15)), stderr='' which means a request to the program to terminate. Therefore, I think I should consider this situation as success when calls reboot. Add an extra match condition there to handle the signal which sends to kill the agent, and gracefully exit the agent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to only exit() if we receive that exist status? It's hard reading this code to understand the interaction you mentioned here.

Copy link
Member Author

@gthao313 gthao313 Jul 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean it's bette to do like this? Exist on specific Exit status?

match error_statuscode {
... =>....,
15 => std::process::exit(0)
}

The reason why I didn't do that was that the logic to retrieve the exit status here is to extract exit status from stderr. However, the status of calling reboot is fail but stderr is empty so that unable to extract the exist status. Therefore, I have the logic here if the command returns fail but with empty stderr, It should be reboot and we need exit agent.

Yeah, I agree that's not a straightforward way and inconvenient to read. I think might be better to add more description here?

Is that a good way or actually we have better way to handle this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned:

The status of reboot command was Exit Status(unix wait status(15)), stderr=''

Are we saying that the apiserver received signal 15 (SIGTERM) and still returned this result? Or did our brupop agent receive SIGTERM? If it's our own agent, we could possibly write a signal handler. If it's from the apiserver, can we extract it from the Exit Status you mentioned above somehow?

I'd be okay with explaining the situation in a comment if there's not a better way to make it more clear in the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, apiserver received signal 15 (SIGTERM).

On Unix, this will return None if the process was terminated by a signal (reboot sends signal 15 (SIGTERM)). Signal termination is not considered a success. After further investment, I would say current method is more clear. I'll add more explanation to make code more readable.

When the brupop agent reboots a node, it allows the reboot to send
a signal to the process, terminating it. Instead of that, We should
have the brupop agent exit gracefully rather than allowing it to be killed.
@gthao313
Copy link
Member Author

I added explanation on change part.

@gthao313 gthao313 requested a review from cbgbt August 16, 2022 01:18
@gthao313 gthao313 merged commit e253c45 into bottlerocket-os:develop Aug 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants