Missing BottlerocketShadow CRD causes excessive logging #478

Closed
maxres-ch opened this issue Jul 6, 2023 · 9 comments

@maxres-ch

https://github.com/bottlerocket-os/bottlerocket-update-operator/issues?q=is%3Aissue+backoff

Image I'm using:
v1.0.0

Issue or Feature Request:

We've started using the helm chart on the develop branch and missed that the shadow chart had to be installed. When the operator started, it rightly wrote error messages because the API didn't exist, but we didn't notice them because the controller still reported healthy.

We saw around 53.6 million error messages in a 7-minute window, with a peak of 130K messages/sec, when the BottlerocketShadow CRD was not installed.

I would have expected the controller to report unhealthy, or for exponential backoff to kick in when these kinds of errors are hit (this could potentially DoS the kube-apiserver too). Controllers I've written in the past used backoff to prevent runaway logging and hammering of the API. I saw that at least one of the errors had a static backoff of 5 seconds, but some others seem to have no backoff at all.
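
To illustrate what I mean, here's a rough sketch (not brupop's actual code; the helper name and delay values are just examples) of a capped exponential backoff around a failing API call, using tokio:

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Hypothetical helper: retry `op` with capped exponential backoff instead of
/// looping immediately, so a persistent error can't flood the logs or the API.
async fn retry_with_backoff<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    let mut delay = Duration::from_millis(500);
    let max_delay = Duration::from_secs(60);
    let mut attempt = 1;

    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts => {
                // One log line per attempt, not one per tight-loop iteration.
                eprintln!("attempt {attempt} failed: {err}; retrying in {delay:?}");
                sleep(delay).await;
                delay = (delay * 2).min(max_delay);
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

With a cap like this, even a permanently missing CRD would produce a handful of log lines per minute per node instead of thousands per second.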

 Message
[wildcard]*[/wildcard]:[wildcard][00-59][/wildcard]:[wildcard][00.018549-59.992987][/wildcard]Z  WARN kube_client::client: Unsuccessful data error parse: 404 page not found
    at /src/.cargo/registry/src/git.luolix.top-1ecc6299db9ec823/kube-client-0.76.0/src/client/mod.rs:434
    in [wildcard]*[/wildcard]::[wildcard]*[/wildcard]::[wildcard]*[/wildcard]
    at [wildcard]*[/wildcard][wildcard]/src/*[/wildcard]
[wildcard]*[/wildcard]:[wildcard][02-59][/wildcard]:[wildcard][00.593393-58.285542][/wildcard]Z  INFO agent::apiclient: API server busy, retrying later ...
    in agent::agentclient::shadow_status_with_refreshed_system_matadata with shadow_error_info: ShadowErrorInfo { crash_count: 0, state_transition_failure_timestamp: None }
    in agent::agentclient::update_status_in_shadow with bottlerocket_shadow: BottlerocketShadow { metadata: ObjectMeta { annotations: None, cluster_name: None, creation_timestamp: Some(Time([wildcard]yyyy-MM-ddTHH:mm:ssXXX[/wildcard])), deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: None, generate_name: None, generation: Some(1), labels: None, managed_fields: Some([ManagedFieldsEntry { api_version: Some("brupop.bottlerocket.aws/v2"), fields_type: Some("FieldsV1"), fields_v1: Some(FieldsV1(Object {"f:metadata": Object {"f:ownerReferences": Object {".": Object {}, "k:{\"uid\":\"[wildcard]*[/wildcard]\"}": Object {}}}, "f:spec": Object {".": Object {}, "f:state": Object {}, "f:state_transition_timestamp": Object {}, "f:version": Object {}}})), manager: Some("unknown"), operation: Some("Update"), time: Some(Time([wildcard]yyyy-MM-ddTHH:mm:ssXXX[/wildcard])) }, ManagedFieldsEntry { api_version: Some("brupop.bottlerocket.aws/v2"), fields_type: Some("FieldsV1"), fields_v1: Some(FieldsV1(Object {"f:status": Object {".": Object {}, "f:crash_count": Object {}, "f:current_state": Object {}, "f:current_version": Object {}, "f:target_version": Object {}}})), manager: Some("unknown"), operation: Some("Update"), time: Some(Time([wildcard]yyyy-MM-ddTHH:mm:ssXXX[/wildcard])) }]), name: Some("[wildcard]*[/wildcard].redacted.compute.internal"), namespace: Some("brupop-bottlerocket-aws"), owner_references: Some([OwnerReference { api_version: "v1", block_owner_deletion: None, controller: None, kind: "Node", name: "[wildcard]*[/wildcard].us-west-2.compute.internal", uid: "[wildcard]*[/wildcard]" }]), resource_version: Some("[wildcard][54829321-278377934][/wildcard]"), self_link: None, uid: Some("[wildcard]*[/wildcard]") }, spec: BottlerocketShadowSpec { state: Idle, state_transition_timestamp: None, version: None }, status: Some(BottlerocketShadowStatus { current_version: "1.14.1", target_version: "1.14.1", current_state: Idle, crash_count: 0, state_transition_failure_timestamp: None }) }, state: Idle, shadow_error_info: ShadowErrorInfo { crash_count: 0, state_transition_failure_timestamp: None }
[wildcard]*[/wildcard]:[wildcard][03-59][/wildcard]:[wildcard][04.738003-58.311911][/wildcard]Z  INFO controller::controller: Found associated bottlerocketshadow name., associated_bottlerocketshadow_name: "[wildcard]*[/wildcard].redacted.compute.internal"
[wildcard]*[/wildcard]:[wildcard][00-59][/wildcard]:[wildcard][01.302051-52.595513][/wildcard]Z  INFO controller::controller: Calculating if current time is within update time window.
    in apiserver::telemetry::HTTP request with http.method: POST, http.route: /bottlerocket-node-resource, http.flavor: 1.1, http.scheme: https, http.host: brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local, http.client_ip: [wildcard]10.52.XXX.XXX[/wildcard]:[wildcard]XXXX[/wildcard], http.user_agent: , http.target: /bottlerocket-node-resource, otel.kind: "server", request_id: [wildcard]*[/wildcard], node_name: "[wildcard]*[/wildcard].redacted.compute.internal"
    in apiserver::telemetry::HTTP request with http.method: POST, http.route: /bottlerocket-node-resource, http.flavor: 1.1, http.scheme: https, http.host: brupop-apiserver.brupop-bottlerocket-aws.svc.cluster.local, http.client_ip: [wildcard]10.52.XXX.XXX[/wildcard]:[wildcard]XXXX[/wildcard], http.user_agent: , http.target: /bottlerocket-node-resource, otel.kind: "server", request_id: [wildcard]*[/wildcard], node_name: "[wildcard]*[/wildcard].redacted.compute.internal", exception.message: Error creating BottlerocketShadow: 'Unable to create BottlerocketShadow ([wildcard]*[/wildcard].us-west-2.compute.internal, [wildcard]*[/wildcard]): 'ApiError: "404 page not found ": Failed to parse error data (ErrorResponse { status: "404 Not Found", message: "\"404 page not found\\n\"", reason: "Failed to parse error data", code: 404 })'', exception.details: BottlerocketShadowCreate { source: CreateBottlerocketShadow { source: Api(ErrorResponse { status: "404 Not Found", message: "\"404 page not found\\n\"", reason: "Failed to parse error data", code: 404 }), selector: BottlerocketShadowSelector { node_name: "[wildcard]*[/wildcard].us-west-2.compute.internal", node_uid: "[wildcard]*[/wildcard]" } } }, http.status_code: 500, otel.status_code: "ERROR"
2023-07-05T21:38:39.981472Z  WARN agent::agentclient: An error occurred when try to create BottlerocketShadow. Restarting event loop.
    in models::node::client::create_node with selector: BottlerocketShadowSelector { node_name: "ip-10-52-96-90.redacted.compute.internal", node_uid: "340b1e0b-a34a-4557-99f9-4d2e55dfec7a" }

Slightly better view of logged message patterns:

[Screenshot: Screen Shot 2023-07-06 at 1 12 26 PM]

[Screenshot: Screen Shot 2023-07-06 at 1 37 11 PM]

@jackgill

The excessive logging and controller API calls can occur in other situations as well. I've seen this cause the kube-controller-manager to grind to a halt under heavy load, which destabilized the entire cluster. I'm currently trying to debug a TLS cert issue, but it's difficult when brupop keeps launching a denial-of-service attack against my logging stack.

@gthao313
Member

Thanks for opening this issue. We're looking into it now and will get back to you soon.

For now, we will investigate and work on reducing the logs. Meanwhile, for any version <= v1.1.0, we recommend using this installation method instead of the helm chart, since we haven't released a version that includes the helm chart yet (but we will release it soon).

@jackgill can you share more details about your situation? Thanks.

@jackgill

@gthao313 my situation is a bit different: as far as I know the CRD is installed fine (and I used the manifest file to install, not the helm chart). I believe the problem is that the API server has an invalid TLS cert for some reason; for instance, trying to list the BottlerocketShadows produces this error:

> kubectl -n brupop-bottlerocket-aws get bottlerocketshadows
Error from server: conversion webhook for brupop.bottlerocket.aws/v1, Kind=BottlerocketShadow failed: Post "https://brupop-apiserver.brupop-bottlerocket-aws.svc:443/crdconvert?timeout=30s": x509: certificate signed by unknown authority

I can open a separate issue for the TLS problem if you want; I'm still hoping I can figure it out somehow. The reason I commented here is that the brupop controller is logging this error thousands of times per minute:

  2023-07-12T22:04:52.909266Z ERROR kube_client::client::builder: failed with status 500 Internal Server Error
    at /src/.cargo/registry/src/git.luolix.top-1ecc6299db9ec823/kube-client-0.78.0/src/client/builder.rs:151

This makes it more difficult to troubleshoot because I have to remove the controller to avoid overwhelming my logging infrastructure. Regardless of what the root cause of my problem turns out to be, brupop should not have such a high volume of retries and error messages.
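
To sketch the kind of thing I mean (illustrative only, not brupop's code, and the type name is made up): a hot error path could rate-limit its own logging so repeated failures emit a periodic summary instead of one line per retry:

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: emit at most one error line per interval and count the
/// repeats that were suppressed in between.
struct RateLimitedLogger {
    interval: Duration,
    last_emit: Option<Instant>,
    suppressed: u64,
}

impl RateLimitedLogger {
    fn new(interval: Duration) -> Self {
        Self { interval, last_emit: None, suppressed: 0 }
    }

    fn error(&mut self, msg: &str) {
        let now = Instant::now();
        if matches!(self.last_emit, Some(last) if now.duration_since(last) < self.interval) {
            // Too soon since the last emit: swallow the repeat, but remember it.
            self.suppressed += 1;
            return;
        }
        if self.suppressed > 0 {
            eprintln!("(suppressed {} similar errors)", self.suppressed);
        }
        eprintln!("{msg}");
        self.last_emit = Some(now);
        self.suppressed = 0;
    }
}
```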

@gthao313
Member

@jackgill yeah, we will work on reducing the logs!

Can you open a new issue about the TLS problem? I think that may be a better way to track it :) Meanwhile, can you share which brupop container version you were using?

@jackgill

Thanks @gthao313, I have opened #486 for the TLS issue. I am running version 1.1.0 of the brupop container.

@maxres-ch
Author

@gthao313 thanks!

@maxres-ch
Author

maxres-ch commented Jul 27, 2023

@jackgill yeah, we will work on reducing the logs!

@gthao313 will you be adding backoff to some of these errors? IMHO the excessive retrying could DoS the API server.

@cbgbt
Contributor

cbgbt commented Aug 8, 2023

I've opened #505 and #506, which significantly reduce log emissions via rate limiting and backoff. Between these changes and the log formatting options surfaced in #503, which optionally reduce log verbosity and remove ANSI color codes, I believe the next brupop release should resolve this issue.
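
For reference, a rough sketch of the underlying tracing-subscriber knobs involved (the exact option names exposed by brupop's configuration may differ); this just shows what disabling ANSI colors and raising the log level looks like:

```rust
use tracing::Level;
use tracing_subscriber::fmt;

fn main() {
    // Plain formatter: no ANSI color escape codes, and DEBUG/TRACE filtered out.
    fmt()
        .with_ansi(false)
        .with_max_level(Level::INFO)
        .init();

    tracing::info!("logging initialized without ANSI colors");
}
```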

@cbgbt
Contributor

cbgbt commented Aug 8, 2023

The fixes for this will be released with Brupop 1.3.0. Please feel free to follow the tracking issue for that release: #508

Thanks again for the report and all of the supporting data!

@cbgbt cbgbt closed this as completed Aug 8, 2023
@cbgbt cbgbt self-assigned this Aug 15, 2023