
Unable to operate on BottlerocketShadow: 'Unable to get Bottlerocket node 'status' because of missing 'status' value' #568

Closed
perplexa opened this issue Oct 24, 2023 · 9 comments


@perplexa

Hi,

We are running the brupop Helm chart v1.3.0 and regularly see agent errors affecting newly spawned spot instances. The agent stabilizes after restarting a few times (usually 1-3) and then works as expected. I have attached logs for more details.

Helm Values

exclude_from_lb_wait_time_in_sec: "0"
logging:
  ansi_enabled: "false"
  formatter: json
max_concurrent_updates: "1"
namespace: brupop-bottlerocket-aws
placement:
  agent:
    tolerations:
    - operator: Exists
  apiserver:
    tolerations:
    - operator: Exists
  controller:
    tolerations:
    - operator: Exists
scheduler_cron_expression: 0 * * * * * *
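For reference, brupop accepts a seconds-resolution, seven-field cron expression (the layout used by the Rust `cron` crate), so `0 * * * * * *` fires at second 0 of every minute. A minimal sketch labeling the fields; the label order is an assumption based on that crate's `sec min hour day-of-month month day-of-week year` format:

```python
# Assumed field labels for brupop's seven-field cron format
# (seconds-first, trailing year), applied to the value above.
expression = "0 * * * * * *"
labels = ["second", "minute", "hour", "day-of-month",
          "month", "day-of-week", "year"]
schedule = dict(zip(labels, expression.split()))
print(schedule["second"])  # second-of-minute at which the check fires
```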

Example Logs

{"timestamp":"2023-10-24T08:41:07.298595Z","level":"INFO","fields":{"message":"Created K8s API Server client using service port","service_port":"443"},"target":"apiserver::client::webclient"}
{"timestamp":"2023-10-24T08:41:08.399114Z","level":"INFO","fields":{"message":"Brs has been created.","brs_name":"\"ip-xxx-xxx-xxx-xxx.xxxxx.compute.internal\""},"target":"agent::agentclient","span":{"name":"create_metadata_shadow"},"spans":[{"name":"run"},{"name":"create_shadow_if_not_exist"},{"name":"create_metadata_shadow"}]}
{"timestamp":"2023-10-24T08:41:08.901165Z","level":"INFO","fields":{"message":"Checking for Bottlerocket updates."},"target":"agent::agentclient","span":{"shadow_error_info":"ShadowErrorInfo { crash_count: 0, state_transition_failure_timestamp: None }","name":"shadow_status_with_refreshed_system_matadata"},"spans":[{"name":"run"},{"name":"initialize_shadow_if_not_initialized"},{"name":"initialize_metadata_shadow"},{"shadow_error_info":"ShadowErrorInfo { crash_count: 0, state_transition_failure_timestamp: None }","name":"shadow_status_with_refreshed_system_matadata"}]}
{"timestamp":"2023-10-24T08:41:09.398613Z","level":"INFO","fields":{"message":"The lock for the update API is held by another process ..."},"target":"agent::apiclient::api","span":{"args":"[\"raw\", \"-u\", \"/updates/status\"]","name":"invoke_apiclient"},"spans":[{"name":"run"},{"name":"initialize_shadow_if_not_initialized"},{"name":"initialize_metadata_shadow"},{"shadow_error_info":"ShadowErrorInfo { crash_count: 0, state_transition_failure_timestamp: None }","name":"shadow_status_with_refreshed_system_matadata"},{"name":"get_update_status"},{"args":"[\"raw\", \"-u\", \"/updates/status\"]","name":"invoke_apiclient"}]}
{"timestamp":"2023-10-24T08:41:19.324755Z","level":"INFO","fields":{"message":"Brs status has been updated.","brs_name":"\"ip-xxx-xxx-xxx-xxx.xxxxx.compute.internal\"","brs_status":"BottlerocketShadowStatus { current_version: \"1.15.1\", target_version: \"1.15.1\", current_state: Idle, crash_count: 0, state_transition_failure_timestamp: None }"},"target":"agent::agentclient","span":{"name":"update_metadata_shadow"},"spans":[{"name":"run"},{"name":"initialize_shadow_if_not_initialized"},{"name":"initialize_metadata_shadow"},{"name":"update_metadata_shadow"}]}
{"timestamp":"2023-10-24T08:41:19.398602Z","level":"ERROR","fields":{"error":"Unable to operate on BottlerocketShadow: 'Unable to get Bottlerocket node 'status' because of missing 'status' value'"},"target":"agent::agentclient","span":{"name":"run"},"spans":[{"name":"run"}]}
{"timestamp":"2023-10-24T08:41:19.398838Z","level":"ERROR","fields":{"message":"Agent runner exited"},"target":"agent"}
@yeazelm
Contributor

yeazelm commented Oct 24, 2023

Thanks @perplexa for opening this issue, we will take a look and see if we can figure out what is going on! I just want to clarify, have you seen this before on older versions, or just the one deployed via v1.3.0 helm charts?

@perplexa
Author

perplexa commented Oct 25, 2023

According to our monitoring we have had brupop agent restarts before, but they became noticeably more frequent after I migrated from 1.0.0 to 1.3.0.

I have not looked into it myself but I assume it might relate to the cron expression (which I have set fairly aggressively) and the agent trying to operate on a shadow before it is actually available on a new node.

The graph shows restarts per pod across all brupop agent pods in the past 60 days. I switched to 1.3.0 around 2023-10-12.

(screenshot: agent pod restarts over the past 60 days)

@cbgbt
Contributor

cbgbt commented Oct 25, 2023

I have a theory on what's going on here. As some background, the overall picture of how the Shadow is updated and the agent becomes aware of changes is something like this:

(diagram: how the Shadow is updated and how the agent observes changes)

Brupop's apiserver is responsible for authorizing update requests by checking that each Shadow update comes from the associated node. The agent uses the Kubernetes WATCH API as an efficient alternative to polling the shadow state; we do this because the controller can also change the BottlerocketShadow object.

Based on the logs, it looks like the agent successfully completes an UPDATE call to the brupop apiserver, but then assumes too quickly that the change will be reflected in its local watch cache. When you get this message:

'Unable to get Bottlerocket node 'status' because of missing 'status' value'

it means the agent's in-memory copy of the BottlerocketShadow is missing its status.

@cbgbt
Contributor

cbgbt commented Oct 25, 2023

If this is true, we should be able to resolve it by having the agent retry reading its local reflector state for some time when the status is missing.
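A sketch of that retry, in Python for illustration (brupop itself is written in Rust). `fetch_status` is a hypothetical callable standing in for a read of the agent's local reflector cache, which may return nothing while the watch cache lags behind the apiserver's acknowledged UPDATE:

```python
import time

def wait_for_status(fetch_status, timeout_s=10.0, interval_s=0.5):
    """Retry reading the shadow's status from a local cache.

    `fetch_status` is a hypothetical stand-in for a reflector-cache
    read; it returns None while the watch cache still lags behind
    the UPDATE the apiserver already acknowledged.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status is not None:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "BottlerocketShadow status still missing after retries")
        time.sleep(interval_s)

# Example: the cache is empty for two reads, then populated.
reads = iter([None, None, {"current_state": "Idle"}])
print(wait_for_status(lambda: next(reads), interval_s=0.01))
```

Instead of failing (and restarting the agent) on the first empty read, the loop gives the watch stream a bounded window to catch up.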

@cbgbt
Contributor

cbgbt commented Oct 25, 2023

I assume it might relate to the cron expression (which I have set fairly aggressively)

FWIW: The cron expressions cannot make brupop more aggressive than it is by default, so I wouldn't worry about this.

@perplexa
Author

Might be unrelated, but there is also quite a lot of the error below in EKS's apiserver logs (on a new cluster with a fresh brupop Helm install):

E1027 14:33:11.052419      11 cacher.go:470] cacher (bottlerocketshadows.brupop.bottlerocket.aws): unexpected ListAndWatch error: failed to list brupop.bottlerocket.aws/v1, Kind=BottlerocketShadow: conversion webhook for brupop.bottlerocket.aws/v2, Kind=BottlerocketShadow failed: Post "https://brupop-apiserver.brupop-bottlerocket-aws.svc:443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority; reinitializing...
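For anyone debugging a similar certificate failure, one way to inspect the CA bundle registered on the CRD's conversion webhook is sketched below; it assumes `kubectl`, `base64`, and `openssl` are available, and the CRD name is taken from the log line above:

```shell
# Dump the CA bundle the conversion webhook is registered with
# and show who signed it and when it expires.
kubectl get crd bottlerocketshadows.brupop.bottlerocket.aws \
  -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}' \
  | base64 -d | openssl x509 -noout -subject -issuer -dates
```

If this CA does not match the certificate the brupop apiserver actually serves, the Kubernetes apiserver will reject the webhook connection exactly as in the log above.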

@cbgbt
Contributor

cbgbt commented Oct 27, 2023

@perplexa I believe you are hitting #486. Sorry about that.

We are looking to include a fix for this in the next Brupop release.

@vyaghras
Contributor

Pull request #572 resolves this issue. The fix will be available as part of the next Brupop release.

@perplexa
Author

awesome, thanks!
