gracefully exit brupop agent when rebooting system #218

gthao313 · 2022-06-28T17:00:31Z

When the brupop agent reboots a node, it allows the reboot to send
a signal to the process, terminating it. Instead of that, We should
have the brupop agent exit gracefully rather than allowing it to be killed.

Issue number:
#149

Description of changes:

Author: Tianhao Geng <tianhg@amazon.com>
Date:   Tue Jun 28 07:25:13 2022 +0000

    gracefully exit brupop agent when rebooting system

    When the brupop agent reboots a node, it allows the reboot to send
    a signal to the process, terminating it. Instead of that, We should
    have the brupop agent exit gracefully rather than allowing it to be killed.

alarm log

thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', agent/src/apiclient.rs:112:56

Testing done:
Launch host with new changes

check if agent gracefully exits
check if alarming-looking agent log message on reboot has been removed

alarm log had been removed and current reboot log

{"v":0,"name":"agent","msg":"[HANDLE_STATE_TRANSITION - EVENT] Bottlerocket node is terminated by reboot signal","level":30,"hostname":"brupop-agent-x9x5c","pid":1,"time":"2022-06-28T07:12:50.525208202+00:00","target":"agent::apiclient","line":158,"file":"agent/src/apiclient.rs","bottlerocket_shadow":"BottlerocketShadow { metadata: ObjectMeta { annotations: None, cluster_name: None, creation_timestamp: Some(Time(2022-06-28T07:05:04Z)), deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: None, generate_name: None, generation: Some(3), labels: None, managed_fields: Some([ManagedFieldsEntry { api_version: Some(\"brupop.bottlerocket.aws/v2\"), fields_type: Some(\"FieldsV1\"), fields_v1: Some(FieldsV1(Object({\"f:metadata\": Object({\"f:ownerReferences\": Object({\".\": Object({}), \"k:{\\\"uid\\\":\\\"e77d33d5-2f4f-491e-9755-cb9ee36545b7\\\"}\": Object({\".\": Object({}), \"f:apiVersion\": Object({}), \"f:kind\": Object({}), \"f:name\": Object({}), \"f:uid\": Object({})})})}), \"f:spec\": Object({\".\": Object({}), \"f:state\": Object({}), \"f:state_transition_timestamp\": Object({}), \"f:version\": Object({})}), \"f:status\": Object({\".\": Object({}), \"f:crash_count\": Object({}), \"f:current_state\": Object({}), \"f:current_version\": Object({}), \"f:target_version\": Object({})})}))), manager: Some(\"unknown\"), operation: Some(\"Update\"), time: Some(Time(2022-06-28T07:05:24Z)) }]), name: Some(\"brs-ip-192-168-141-69.us-west-2.compute.internal\"), namespace: Some(\"brupop-bottlerocket-aws\"), owner_references: Some([OwnerReference { api_version: \"v1\", block_owner_deletion: None, controller: None, kind: \"Node\", name: \"ip-192-168-141-69.us-west-2.compute.internal\", uid: \"e77d33d5-2f4f-491e-9755-cb9ee36545b7\" }]), resource_version: Some(\"14629549\"), self_link: None, uid: Some(\"391f21d1-b22a-4c88-a1d5-efa73293fad3\") }, spec: BottlerocketShadowSpec { state: RebootedIntoUpdate, state_transition_timestamp: Some(\"2022-06-28T07:12:21.899994323+00:00\"), version: Some(\"1.8.0\") }, status: Some(BottlerocketShadowStatus { current_version: \"1.6.1\", target_version: \"1.8.0\", current_state: StagedAndPerformUpdate, crash_count: 0, state_transition_failure_timestamp: None }) }"}

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

cbgbt · 2022-07-13T04:51:24Z

agent/src/apiclient.rs

+                    // reboot sends signal to kill the agent; instead of that, we should gracefully exit brupop agent.
+                    _ => {
+                        event!(


I'm a bit surprised that we can handle reboots this deep in the abstraction layer, and not higher up when we actually call the API to trigger a reboot.

I think more specifically, I'm not sure I understand how a reboot ultimately causes this codepath to be executed. Can you elaborate?

@cbgbt Yeah! When I looked into the error thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1' and code, I found the root cause to make agent panic is here.

When apiclient successfully called reboot, I expected it to return output.status.successed() == true. However, it returned false and stderr was empty even the reboot worked as expected. The status of reboot command was Exit Status(unix wait status(15)), stderr='' which means a request to the program to terminate. Therefore, I think I should consider this situation as success when calls reboot. Add an extra match condition there to handle the signal which sends to kill the agent, and gracefully exit the agent.

Would it make sense to only exit() if we receive that exist status? It's hard reading this code to understand the interaction you mentioned here.

you mean it's bette to do like this? Exist on specific Exit status?

match error_statuscode { ... =>...., 15 => std::process::exit(0) }

The reason why I didn't do that was that the logic to retrieve the exit status here is to extract exit status from stderr. However, the status of calling reboot is fail but stderr is empty so that unable to extract the exist status. Therefore, I have the logic here if the command returns fail but with empty stderr, It should be reboot and we need exit agent.

Yeah, I agree that's not a straightforward way and inconvenient to read. I think might be better to add more description here?

Is that a good way or actually we have better way to handle this.

You mentioned:

The status of reboot command was Exit Status(unix wait status(15)), stderr=''

Are we saying that the apiserver received signal 15 (SIGTERM) and still returned this result? Or did our brupop agent receive SIGTERM? If it's our own agent, we could possibly write a signal handler. If it's from the apiserver, can we extract it from the Exit Status you mentioned above somehow?

I'd be okay with explaining the situation in a comment if there's not a better way to make it more clear in the code.

Yeah, apiserver received signal 15 (SIGTERM).

On Unix, this will return None if the process was terminated by a signal (reboot sends signal 15 (SIGTERM)). Signal termination is not considered a success. After further investment, I would say current method is more clear. I'll add more explanation to make code more readable.

When the brupop agent reboots a node, it allows the reboot to send a signal to the process, terminating it. Instead of that, We should have the brupop agent exit gracefully rather than allowing it to be killed.

gthao313 · 2022-08-16T01:06:45Z

I added explanation on change part.

gthao313 marked this pull request as ready for review June 28, 2022 17:01

gthao313 requested review from somnusfish and cbgbt June 28, 2022 17:01

gthao313 requested a review from rpkelly July 6, 2022 02:13

somnusfish approved these changes Jul 11, 2022

View reviewed changes

cbgbt reviewed Jul 13, 2022

View reviewed changes

gthao313 force-pushed the agent-exit-when-reboot branch from ce0cab2 to 9d3cb56 Compare August 16, 2022 00:55

gracefully exit brupop agent when rebooting system

b7fa9f2

When the brupop agent reboots a node, it allows the reboot to send a signal to the process, terminating it. Instead of that, We should have the brupop agent exit gracefully rather than allowing it to be killed.

gthao313 force-pushed the agent-exit-when-reboot branch from 9d3cb56 to b7fa9f2 Compare August 16, 2022 00:59

gthao313 requested a review from cbgbt August 16, 2022 01:18

cbgbt approved these changes Aug 16, 2022

View reviewed changes

gthao313 merged commit e253c45 into bottlerocket-os:develop Aug 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gracefully exit brupop agent when rebooting system #218

gracefully exit brupop agent when rebooting system #218

gthao313 commented Jun 28, 2022 •

edited

Loading

cbgbt Jul 13, 2022

gthao313 Jul 13, 2022 •

edited

Loading

cbgbt Jul 14, 2022

gthao313 Jul 14, 2022 •

edited

Loading

cbgbt Jul 21, 2022

gthao313 Aug 16, 2022

gthao313 commented Aug 16, 2022

gracefully exit brupop agent when rebooting system #218

gracefully exit brupop agent when rebooting system #218

Conversation

gthao313 commented Jun 28, 2022 • edited Loading

cbgbt Jul 13, 2022

Choose a reason for hiding this comment

gthao313 Jul 13, 2022 • edited Loading

Choose a reason for hiding this comment

cbgbt Jul 14, 2022

Choose a reason for hiding this comment

gthao313 Jul 14, 2022 • edited Loading

Choose a reason for hiding this comment

cbgbt Jul 21, 2022

Choose a reason for hiding this comment

gthao313 Aug 16, 2022

Choose a reason for hiding this comment

gthao313 commented Aug 16, 2022

gthao313 commented Jun 28, 2022 •

edited

Loading

gthao313 Jul 13, 2022 •

edited

Loading

gthao313 Jul 14, 2022 •

edited

Loading