A note for the community
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Context
We use Vector to deliver kubernetes_logs to our Kafka cluster, from which the logs are later processed and ingested into Humio. Vector is deployed as a DaemonSet in our Kubernetes clusters (each with >1000 nodes running).
We recently had an outage in one of our Kubernetes clusters (~1100 nodes running). A failure of the etcd leader node triggered a cascading failure in which pods made 1000x more API calls to our API server, eventually bringing the Kubernetes control plane down entirely.
During remediation, we identified Vector as one of the candidates hammering the API server. Shutting down Vector along with a few other DaemonSets eventually reduced the traffic on the control plane components, which allowed the etcd nodes to recover.
Investigation
We did some analysis and investigation after the outage was resolved, and there are two issues we want to bring to the Vector community.
resource_version not set when making API requests to Kube API server
Based on issue #7943, resource_version was set to 0 in Vector 0.18 - 0.20. PR #11714 adopted kube-rs and dropped the change made in #9974. Looking at the audit logs from the kube-apiserver, we do not see resource_version set in the request URL, which makes us wonder whether this is a regression.
Sample request from the audit logs:
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"xxx","stage":"ResponseComplete","requestURI":"/api/v1/nodes?\u0026fieldSelector=metadata.name%3Dip-10-x-x-x.ec2.internal","verb":"list","user":{"username":"system:serviceaccount:vector:vector-agent","uid":"xxx","groups":["system:serviceaccounts","system:serviceaccounts:vector","system:authenticated"]},"sourceIPs":["10.x.x.x"],"objectRef":{"resource":"nodes","name":"ip-10-x-x-x.ec2.internal","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2023-03-08T02:29:15.992746Z","stageTimestamp":"2023-03-08T02:29:16.534351Z","annotations":{"authentication.k8s.io/legacy-token":"system:serviceaccount:vector:vector-agent","authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"default-reader-role-binding\" of ClusterRole \"cluster-read-all\" to Group \"system:authenticated\""}}
Need a more aggressive backoff strategy?
The other issue we found is that Vector made far more requests when it received unsuccessful responses from the Kube API server. Making more requests is expected, since it needs to retry; however, in some cases we are seeing 1000x more requests.
Before 17:45 the traffic was fairly steady, at roughly 1 - 300 requests per minute. Once the etcd server started having issues, Vector began retrying very aggressively, making as many as 200,000 requests per minute. Is there a way to configure the backoff strategy in this case? Or should the default retry behavior be less aggressive?
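To make the ask concrete, here is a minimal sketch (not Vector's actual implementation, and the constants are made up) of the kind of capped exponential backoff we would like to be able to configure for failed watch/list requests:

```rust
use std::time::Duration;

// Hypothetical capped exponential backoff for retries after failed
// list/watch requests. Both constants are illustrative, not Vector defaults.
fn backoff_for(attempt: u32) -> Duration {
    let base_ms: u64 = 500; // hypothetical initial delay
    let max_ms: u64 = 300_000; // hypothetical 5-minute ceiling
    let delay_ms = base_ms.saturating_mul(2u64.saturating_pow(attempt));
    Duration::from_millis(delay_ms.min(max_ms))
}

fn main() {
    for attempt in 0..12 {
        println!("attempt {attempt}: wait {:?}", backoff_for(attempt));
    }
}
```

With a ceiling like this, each agent settles at roughly one retry per watcher every few minutes during a sustained outage; as a back-of-the-envelope estimate, ~1,100 nodes with a few watchers per agent works out to hundreds of requests per minute, rather than the ~200,000 per minute we observed.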
Also attached is the graph filtered on the 429 response code:
Configuration
No response
Version
vector 0.27.0 (x86_64-unknown-linux-gnu 5623d1e 2023-01-18)
Debug Output
No response
Example Data
No response
Additional Context
No response
References
#7943
It seems like this covers two separate issues: resource_version and backoff. Do you mind splitting this into two issues to make it easier for us to track and prioritize?