Brupop error when trying to upgrade nodes #179
Thanks for opening this issue. I believe this is an issue with the installation instructions: they point at the unversioned development manifest rather than the artifacts for a tagged release. To fix this, we need to lock the installation instructions to the latest released version. You can fix this in your cluster by clearing your resources in the brupop namespace and re-installing those resources from the versioned manifest. I'll have a stab at correcting the instructions.
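For anyone hitting this, a minimal sketch of that reset, assuming Brupop was installed into the brupop-bottlerocket-aws namespace from a single manifest (the manifest filename below is illustrative; use the artifact published with the release you want):

# Remove the existing Brupop resources (agent, apiserver, controller, etc.).
kubectl delete namespace brupop-bottlerocket-aws

# The BottlerocketShadow CRD is cluster-scoped, so it is not removed with
# the namespace; delete it too if its schema changed between versions.
kubectl delete crd bottlerocketshadows.brupop.bottlerocket.aws

# Re-install from the versioned manifest for the desired release.
kubectl apply -f bottlerocket-update-operator-v0.2.0.yaml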
#180 has fixed the install instructions to use the versioned artifacts. Would you mind checking to see if this resolves your issue? To fix this in a better way long-term, we have #126, as well as #161, which would make newer CRDs for the shadow objects backwards compatible. We're intending to use the solution in #161 to fix the current compatibility issue.
Looks much better:

$ kubectl get brs --namespace brupop-bottlerocket-aws

But there is still some problem with the upgrade. Logs from the agent on node 172-22-49-202:

{"v":0,"name":"agent","msg":"[RUN - EVENT] Detected drift between spec state and current state. Requesting node to take action","level":30,"hostname":"brupop-agent-fxq6d","pid":1,"time":"2022-04-05T07:57:55.244687734+00:00","target":"agent::agentclient","line":322,"file":"agent/src/agentclient.rs","action":"StagedUpdate","brs_name":"Some(\"brs-ip-172-22-49-202.aws-ec2\")"}
I misread your message, apologies for closing. So the error message is:
Which is being thrown here. Brupop attempts to "prepare" an update image, and then writes back to the k8s API that the preparation is ready. If, during preparation, it notices an image is already prepped, then it raises an error -- the rationale being that a user may be attempting to manually install a different update ("out of band"), and Brupop should not intervene. I'm noticing the IP of the affected host is the same -- so I think what has happened is:
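To see what the agent is acting on, one can inspect the shadow object for the affected node (standard kubectl; resource and namespace names taken from this thread):

# The spec/status of the shadow resource show the target state and version
# the agent is trying to reach, and the state it believes the node is in.
kubectl get brs brs-ip-172-22-49-202.aws-ec2 --namespace brupop-bottlerocket-aws -o yaml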
If you have admin access to this node, then manually triggering the update this once should "un-stick" Brupop. From a usability perspective, I think it might make sense for Brupop to attempt to perform its configured update regardless. Admins trying to perform an out-of-band update should probably disable Brupop on the node first. We can look at changing this behavior to prevent this error case.
To apply the update I'm mentioning on this node, you'd want to follow these steps. I would suggest using kubectl drain to remove the node from service first, then triggering the update on the host itself; a sketch of the sequence follows below.
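A minimal sketch of that sequence, assuming shell access to the Bottlerocket host (e.g. via the admin container) and that Bottlerocket's apiclient drives the update; node names are taken from this thread:

# Take the node out of service first.
kubectl drain ip-172-22-49-202.aws-ec2 --ignore-daemonsets --delete-emptydir-data

# On the Bottlerocket host:
apiclient update check   # refresh the list of available updates
apiclient update apply   # download and stage the chosen update
apiclient reboot         # reboot into the staged image

# Once the node is healthy again, return it to service.
kubectl uncordon ip-172-22-49-202.aws-ec2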
Thanks for the suggestion. I've manually updated the problematic host. Now there's another problem: Brupop tries to update the next host, the state is StagedUpdate, the host is cordoned, but it's not drained and has been stuck in this state for the last 24h.

$ kubectl get brs --namespace brupop-bottlerocket-aws

Agent logs from host 172-22-45-41:

{"v":0,"name":"agent","msg":"[FETCH_CUSTOM_RESOURCE - START]","level":30,"hostname":"brupop-agent-bqxz6","pid":1,"time":"2022-04-07T20:34:33.367421300+00:00","target":"agent::agentclient","line":140,"file":"agent/src/agentclient.rs"}

API server logs:

{"v":0,"name":"apiserver","msg":"[CHECK_REQUEST_AUTHORIZED - START]","level":30,"hostname":"brupop-apiserver-745c5cffd9-7q79q","pid":1,"time":"2022-04-07T19:43:21.322756382+00:00","target":"apiserver::auth::authorizor","line":49,"file":"apiserver/src/auth/authorizor.rs","http.target":"/bottlerocket-node-resource","node_selector":"BottlerocketShadowSelector { node_name: \"ip-172-22-49-202\", node_uid: \"1c491162-0d6f-4517-a594-e2ab72f4bc14\" }","http.host":"brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local","http.scheme":"http","http.flavor":"1.1","otel.kind":"server","node_name":"ip-172-22-49-202","http.method":"PUT","http.client_ip":"172.22.48.174:41666","request_id":"2a66662c-f3a5-4aea-b5f1-28026675fc6c","http.user_agent":"","http.route":"/bottlerocket-node-resource"}
Thanks for the report and logs. Are you using StatefulSet deployments? We recently merged a PR to fix #168, which sounds like the same problem. We're planning to release an updated Brupop with the fix.
Yep, I use almost all kinds of objects on the cluster, StatefulSet as well.
Hi, the release v0.2.1 includes the fix for StatefulSet. Could you try it and let us know if you still see the same issue?
Hello pmacieje@, haven't heard from you for a long time. I would assume our new release solved the issue you mentioned here. I am going to close this issue. Please feel free to open a new issue if you have any questions or concerns.
Hi there smonusfish, it works better now, no more hanging issue, but I saw that there was no drain during the update process of my Bottlerockets; there was only applying the new OS version and an API reboot. Is that correct?
The drain actually happens during the state change.

Since the Kubernetes API doesn't provide an implementation of drain, Brupop uses Pod deletion and eviction to remove all Pods from a given node. kubectl by default will not evict pods that meet certain criteria, and by default we ignore those too (see the sketch after this comment).

If the pods you are talking about belong to that set, they are not drained, as expected.
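For comparison, kubectl's drain has the same carve-outs; the flags below opt in to handling the pods it refuses by default (standard kubectl flags; node name taken from this thread):

# DaemonSet-managed pods, pods using emptyDir volumes, and pods not managed
# by a controller are the cases kubectl drain refuses by default; these flags
# opt in to handling them, roughly mirroring the set Brupop skips or evicts.
kubectl drain ip-172-22-45-41.aws-ec2 --ignore-daemonsets --delete-emptydir-data --force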
Image I'm using: bottlerocket/bottlerocket-update-operator:v0.2.0
Issue or Feature Request: I've implemented Brupop on an EKS cluster, but only one (of six) nodes was updated. The rest of the nodes got errors like the one below.
Agent logs:
{"v":0,"name":"agent","msg":"[UPDATE_BOTTLEROCKET_SHADOW - START]","level":30,"hostname":"brupop-agent-4z2dr","pid":1,"time":"2022-04-04T07:50:55.176421185+00:00","target":"apiserver::client::webclient","line":155,"file":"apiserver/src/client/webclient.rs","self":"K8SAPIServerClient { k8s_projected_token_path: "/var/run/secrets/tokens/bottlerocket-agent-service-account-token" }","req":"UpdateBottlerocketShadowRequest { node_selector: BottlerocketShadowSelector { node_name: "ip-172-22-49-202.aws-ec2", node_uid: "1c491162-0d6f-4517-a594-e2ab72f4bc14" }, node_status: BottlerocketShadowStatus { current_version: "1.6.2", target_version: "1.6.2", current_state: Idle } }"}
{"v":0,"name":"agent","msg":"[UPDATE_BOTTLEROCKET_SHADOW - END]","level":30,"hostname":"brupop-agent-4z2dr","pid":1,"time":"2022-04-04T07:51:12.631267091+00:00","target":"apiserver::client::webclient","line":155,"file":"apiserver/src/client/webclient.rs","elapsed_milliseconds":17454,"self":"K8SAPIServerClient { k8s_projected_token_path: "/var/run/secrets/tokens/bottlerocket-agent-service-account-token" }","req":"UpdateBottlerocketShadowRequest { node_selector: BottlerocketShadowSelector { node_name: "ip-172-22-49-202.aws-ec2", node_uid: "1c491162-0d6f-4517-a594-e2ab72f4bc14" }, node_status: BottlerocketShadowStatus { current_version: "1.6.2", target_version: "1.6.2", current_state: Idle } }"}
{"v":0,"name":"agent","msg":"[UPDATE_METADATA_CUSTOM_RESOURCE - EVENT] agent::agentclient","level":50,"hostname":"brupop-agent-4z2dr","pid":1,"time":"2022-04-04T07:51:12.631362459+00:00","target":"agent::agentclient","line":206,"file":"agent/src/agentclient.rs","error":"Unable to update the custom resource associated with this node: 'Unable to update BottlerocketShadow status (ip-172-22-49-202.aws-ec2, 1c491162-0d6f-4517-a594-e2ab72f4bc14): 'API server responded with an error status code 500 Internal Server Error: 'Error patching BottlerocketShadow: 'Unable to update BottlerocketShadow status (ip-172-22-49-202.aws-ec2, 1c491162-0d6f-4517-a594-e2ab72f4bc14): 'ApiError: BottlerocketShadow.brupop.bottlerocket.aws "brs-ip-172-22-49-202.aws-ec2" is invalid: status.crash_count: Required value: Invalid (ErrorResponse { status: "Failure", message: "BottlerocketShadow.brupop.bottlerocket.aws \"brs-ip-172-22-49-202.aws-ec2\" is invalid: status.crash_count: Required value", reason: "Invalid", code: 422 })'''''"}
Monitoring Custom Resources:
$ kubectl get brs --namespace brupop-bottlerocket-aws
NAME STATE VERSION TARGET STATE TARGET VERSION CRASH COUNT
brs-ip-172-22-40-11.aws-ec2 Idle
brs-ip-172-22-43-11.aws-ec2 Idle
brs-ip-172-22-45-41.aws-ec2 Idle
brs-ip-172-22-45-87.aws-ec2 Idle
brs-ip-172-22-49-202.aws-ec2 Idle
brs-ip-172-22-50-13.aws-ec2 Idle
Brupop stack deployed:
$ kubectl get pods -n brupop-bottlerocket-aws
NAME READY STATUS RESTARTS AGE
brupop-agent-4z2dr 1/1 Running 0 3d21h
brupop-agent-dfztv 1/1 Running 0 3d21h
brupop-agent-hzd7d 1/1 Running 0 3d21h
brupop-agent-lz9xq 1/1 Running 0 3d21h
brupop-agent-vb2w8 1/1 Running 0 3d21h
brupop-agent-zc6rt 1/1 Running 1 3d21h
brupop-apiserver-745c5cffd9-bdts7 1/1 Running 1 3d21h
brupop-apiserver-745c5cffd9-ssvbx 1/1 Running 0 3d21h
brupop-apiserver-745c5cffd9-wht6g 1/1 Running 0 3d21h
brupop-controller-deployment-8545559bc7-pqkrs 1/1 Running 1 3d21h
All nodes are labeled with bottlerocket.aws/updater-interface-version=2.0.0
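For context, that label is applied per node to opt it in to Brupop management; the label key and value below are taken from the line above, e.g.:

# Opt a node in to management by the update operator.
kubectl label node ip-172-22-40-11.aws-ec2 bottlerocket.aws/updater-interface-version=2.0.0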
Could anyone help with this issue?