
Disable Statefulsets provisioning from CL2 Load Tests #16172

Merged
3 commits merged into kubernetes:master on Dec 27, 2023

Conversation

hakuna-matatah
Contributor

No description provided.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 11, 2023
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

1 similar comment
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

2 similar comments
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah changed the title from "{WIP} Add warm ip and min ip for better prefix management" to "{WIP} Add more cidr block to tackle insufficient CIDRs issue from EC2 APIs" on Dec 13, 2023
@dims
Member

dims commented Dec 13, 2023

/approve
/lgtm

/assign @hakman

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 13, 2023
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

3 similar comments
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

Some test failures are attributed to this issue - kubernetes/test-infra#31459

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 18, 2023
@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-small-scale-amazonvpc-using-cl2

@hakuna-matatah
Contributor Author

/test presubmit-kops-aws-scale-amazonvpc-using-cl2

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 27, 2023
@hakman
Member

hakman commented Dec 27, 2023

/lgtm
/approve
/hold in case @rifelpet has any final comments

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Dec 27, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, hakman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 27, 2023
@rifelpet
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 27, 2023
@k8s-ci-robot k8s-ci-robot merged commit ec584c9 into kubernetes:master Dec 27, 2023
22 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Dec 27, 2023
@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 2, 2024

@hakman could you take a look at this and approve it if you don't find any issues? This would disable StatefulSets and container restart checks in the CL2 tests - these are the two consistent failures I have seen across many recent tests, and all of them point to StatefulSets not coming up and EBS containers getting restarted.

As expected, disabling StatefulSets and restart_checks on EBS pods helped the recent tests that got through the kops validation step succeed.
@hakman Some tests are failing in the validation step itself, a known issue where EBS pods fail to come up, validation fails, and the test exits before kicking off a load test. To tackle those failures, is there a flag on the kOps side that we can use to skip checking EBS pod health during the validation phase? Once the EBS issue is root-caused we can turn the check back on, if such a flag exists.

@hakman
Member

hakman commented Jan 3, 2024

> @hakman could you take a look at this and approve it if you don't find any issues? This would disable StatefulSets and container restart checks in the CL2 tests - these are the two consistent failures I have seen across many recent tests, and all of them point to StatefulSets not coming up and EBS containers getting restarted.

> As expected, disabling StatefulSets and restart_checks on EBS pods helped the recent tests that got through the kops validation step succeed. @hakman Some tests are failing in the validation step itself, a known issue where EBS pods fail to come up, validation fails, and the test exits before kicking off a load test. To tackle those failures, is there a flag on the kOps side that we can use to skip checking EBS pod health during the validation phase? Once the EBS issue is root-caused we can turn the check back on, if such a flag exists.

@hakuna-matatah There is no flag to ignore EBS pod health during the validation phase. The check is generic, based on the existence of the system-cluster-critical and system-node-critical priority classes.
Maybe there's some env var that we can set in the EBS CSI driver pods to not look for something?
We are using IMDSv2 with a very small hop limit; maybe that is something that matters at larger scale?
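For reference, the hop limit mentioned above is configured per InstanceGroup in kops; a minimal sketch of raising it, assuming the standard instanceMetadata fields in the kops InstanceGroup API (the values here are illustrative, not what the test currently uses):

```yaml
# Illustrative kops InstanceGroup snippet: keep IMDSv2 required but allow an
# extra hop so containerized pods (e.g. the EBS CSI driver) can reach instance metadata.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
spec:
  instanceMetadata:
    httpTokens: required
    httpPutResponseHopLimit: 3
```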

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 3, 2024

> @hakuna-matatah There is no flag to ignore EBS pod health during the validation phase. The check is generic, based on the existence of the system-cluster-critical and system-node-critical priority classes. Maybe there's some env var that we can set in the EBS CSI driver pods to not look for something? We are using IMDSv2 with a very small hop limit; maybe that is something that matters at larger scale?

Thanks @hakman for you response, have you added these changes to see if it helps the EBS failures we are seeing, do you think it's actually bottleneck'd from APIServer interaction due to the scale ? Currently, I don't think we are snapshotting Prometheus metrics in CL2, if we had that, we could see if APIServer is throwing 429s to EBS drivers.
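For reference, even before wiring Prometheus snapshots into CL2, a quick spot check for 429s can be made against the apiserver's own metrics endpoint; a rough sketch, assuming admin access from wherever the test runs:

```sh
# Look for throttled (HTTP 429) responses among the apiserver request metrics
kubectl get --raw /metrics | grep '^apiserver_request_total' | grep 'code="429"'
```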

> Maybe there's some env var that we can set in the EBS CSI driver pods to not look for something?

@torredil is there any env var you can think of?

@torredil
Member

torredil commented Jan 4, 2024

The root cause of the kOps validation timeout failures is primarily Auto Scaling Group (ASG) throttling. As an example, let's take a look at i-0c5cf23ad74672693 from the latest run.

  • 2024-01-03T06:19:37.561Z: RunInstances is invoked.
  • 2024-01-03T06:19:42.520Z: kops validate cluster is invoked.
  • 2024-01-03T06:23:12.189Z: The instance has not joined the ASG; this error persists for ~24 minutes:
Retryable error (Throttling: Rate exceeded
	status code: 400, request id: e64b1857-28f4-4c08-8e89-19a0912ffdb0) from autoscaling/DescribeAutoScalingInstances - will retry after delay of 665.889374ms
  • 2024-01-03T06:47:04.298Z: The instance enters the ASG.
  • 2024-01-03T06:48:26.211Z: Both EBS CSI node and aws-node pods are finally scheduled on the node, nearly 30 minutes later:
Successfully bound pod to node" pod="kube-system/ebs-csi-node-hc87x" node="i-0c5cf23ad74672693" evaluatedNodes=1 feasibleNodes=1

Successfully bound pod to node" pod="kube-system/aws-node-mngfh" node="i-0c5cf23ad74672693" evaluatedNodes=1 feasibleNodes=1
  • 2024-01-03T06:48:41.364Z: AssignPrivateIpAddresses is invoked.
  • 2024-01-03T07:14:32.918Z: kOps cluster validation times out after being unable to detect 10 consecutive successful validations (this number needs to be reduced IMO).

Log Source
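For reference, an illustrative way to lower that consecutive-validation threshold when invoking validation directly (the --count/--wait flag names are my assumption about the kops CLI; the prow job may set this elsewhere in the harness):

```sh
# Require fewer consecutive successful validations before declaring the cluster ready
kops validate cluster --name "${CLUSTER_NAME}" --wait 30m --count 3
```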

TPS and ASG throttling from this account in us-east-2:
[Screenshot: TPS and ASG throttling graphs, 2024-01-03]

I've requested the quota increases; let's follow up internally. For context, this issue is related to kubernetes/k8s.io#6165.

cc: @hakuna-matatah @dims @hakman @wmesard

@rifelpet
Member

rifelpet commented Jan 4, 2024

The DescribeAutoScalingInstances call is used to determine whether an instance is in its ASG's warm pool and whether or not we should enable the kubelet service. I noticed this information is available in instance metadata, so I'm migrating to use that in #16213. This should eliminate the DescribeAutoScalingInstances call volume in the scale tests. We'll see how much it helps with the time it takes for nodes to join the cluster.
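For reference, a minimal sketch of reading the warm-pool state from instance metadata with IMDSv2 (this is not necessarily how #16213 implements it; relying on the autoscaling/target-lifecycle-state key is the assumption here):

```sh
# Fetch an IMDSv2 token, then read the Auto Scaling lifecycle state for this instance.
# Possible values include "InService", "Warmed:Stopped", "Warmed:Running".
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state"
```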

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 4, 2024

@hakman @rifelpet Thanks for optimizing the ASG calls ^^^. That will definitely help with the throttling.

> 2024-01-03T06:48:26.211Z: Both EBS CSI node and aws-node pods are finally scheduled on the node, nearly 30 minutes later:
> Successfully bound pod to node" pod="kube-system/ebs-csi-node-hc87x" node="i-0c5cf23ad74672693" evaluatedNodes=1 feasibleNodes=1
> Successfully bound pod to node" pod="kube-system/aws-node-mngfh" node="i-0c5cf23ad74672693" evaluatedNodes=1 feasibleNodes=1
> 2024-01-03T06:48:41.364Z: AssignPrivateIpAddresses is invoked.
> 2024-01-03T07:14:32.918Z: kOps cluster validation times out after being unable to detect 10 consecutive successful validations (this number needs to be reduced IMO).

@torredil In the example you posted above, the EBS pod and aws-node had already come up at 2024-01-03T06:48:26.211Z, IIUC. So how is this particular node/ebs-pod/aws-node related to the kOps cluster validation timeout, which happened 26 minutes after (2024-01-03T07:14:32.918Z) the aws-node/ebs-pod/node was ready?

Do you think it would be good to have a timeline of operations like this for an `aws-node`/`ebs-pod`/node that did not become ready before the kOps validation time ran out, to understand whether this is actually the underlying root of the issues? Am I missing something here?

@hakman
Member

hakman commented Jan 4, 2024

One more piece of the puzzle: #16216.
We should see soon how things look without ec2/DescribeInstances and ec2/DescribeInstanceTypes calls in nodeup.

@rifelpet
Member

rifelpet commented Jan 5, 2024

FYI kops has experimental OpenTelemetry support, though it only supports tracing in the kops CLI, not the k8s control plane or any node components. With this we can visualize tracing from prow jobs by downloading the otel files from the job artifacts and running a local jaeger-query server. See #16220 for more details.

I tried visualizing the scale test's otel files in it, but it crashed every browser I've tried :/ maybe someone else will have better luck.

Eventually we can add support for dumping traces from other components too.

@torredil
Member

torredil commented Jan 5, 2024

@hakuna-matatah I think the next error we need to look at here is the following, which is causing a significant delay in node registration:

providerID was not set for node

see https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2/1743153232743501824/artifacts/cluster-info/kube-system/kops-controller-7pdvn/kops-controller.log

As an example, let's look at i-0024e4829e39b7b4a from the latest run.

  • 2024-01-05T06:20:02.53Z - kops validate cluster is invoked.
  • 2024-01-05T06:26:44.83Z - error identifying node from kops-controller:
E0105 06:26:44.836264       1 controller.go:329] "msg"="Reconciler error" "error"="error identifying node \"i-0024e4829e39b7b4a\": providerID was not set for node i-0024e4829e39b7b4a" "Node"={"name":"i-0024e4829e39b7b4a"} "controller"="node" "controllerGroup"="" "controllerKind"="Node" "name"="i-0024e4829e39b7b4a" "namespace"="" "reconcileID"="f434feac-931e-4348-870e-f136135e62c8"
...
I0105 06:49:44.894481       1 node_controller.go:431] Initializing node i-0024e4829e39b7b4a with cloud provider
I0105 06:49:45.206776       1 node_controller.go:502] Successfully initialized node i-0024e4829e39b7b4a with cloud provider
  • 2024-01-05T06:49:45.19 - Node is added to NodeTree with corresponding zone
I0105 06:49:45.190956      11 node_tree.go:65] "Added node in listed group to NodeTree" node="i-0024e4829e39b7b4a" zone="us-east-2:\x00:us-east-2b"
  • 2024-01-05T06:54:09.80
 I0105 06:54:09.806668       1 node_controller.go:163] sending patch for node "i-0024e4829e39b7b4a": "{\"metadata\":{\"labels\":{\"node-role.kubernetes.io/node\":\"\"}}}"

Liveness probes succeed shortly after this process.

cc: @hakman

@hakman
Member

hakman commented Jan 5, 2024

I added the --provider-id=aws:///availability-zone/instance-id flag in #16223. This should address providerID was not set for node and makes the boot sequence really fast. See the 5k run after that:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kops/16204/presubmit-kops-aws-scale-amazonvpc-using-cl2/1743208014908755968

  • 09:56:11.943779 - validation starts
  • 10:04:01.875856 - all nodes joined the cluster
  • 10:05:32.108396 - all nodes ready, only ebs-csi-node pods not ready
  • 10:48:11.809966 - validation passes (40+ min later).

Observations:

  • CCM is fairly slow: node_controller.go:267] Update 5003 nodes status took 8m30.356517966s.
  • CSI node fails to start repeatedly and enters CrashLoopBackOff, but the previous container logs show only a few failed to establish connection to CSI driver: context deadline exceeded errors (see the sample commands below).
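A quick way to pull those previous-container logs for triage (the pod name is a placeholder, and ebs-plugin as the container name in the EBS CSI node DaemonSet is an assumption):

```sh
# Logs from the last crashed run of the CSI node container
kubectl -n kube-system logs ebs-csi-node-xxxxx -c ebs-plugin --previous
# Probe failures and restart events for the same pod
kubectl -n kube-system describe pod ebs-csi-node-xxxxx
```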

@torredil any ideas?

@torredil
Member

torredil commented Jan 5, 2024

@hakman Excellent, thanks for swiftly taking care of that! I think we're on the right track with the CCM observation. For every instance of CSI node container liveness probe failures I've observed, the checks start succeeding after CCM successfully initializes the node with the cloud provider and adds node labels, such that KCM is able to retrieve the zone information.

kube-controller-manager.log:

I0105 06:48:15.764482      11 topologycache.go:237] "Can't get CPU or zone information for node" node="i-00fc83f221ebae26e"
I0105 06:48:15.764618      11 topologycache.go:253] "Insufficient node info for topology hints" totalZones=0 totalCPU="8" sufficientNodeInfo=false

As an example, let's take a look at i-002f08e03b0d03709, where the CSI node pod comes up shortly after kops validate and the liveness probe check does not fail:

kube-scheduler.log

I0104 06:25:17.395453      11 node_tree.go:79] "Removed node in listed group from NodeTree" node="i-002f08e03b0d03709" zone=""
I0104 06:25:17.395484      11 node_tree.go:65] "Added node in listed group to NodeTree" node="i-002f08e03b0d03709" zone="us-east-2:\x00:us-east-2c"

kubelet.log

I0104 06:25:27.810194    2897 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: ebs.csi.aws.com endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock versions: 1.0.0
I0104 06:25:27.810386    2897 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: ebs.csi.aws.com at endpoint: /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock

Unfortunately, the aws-cloud-controller-manager.log starts at 06:52:20.454135, so we can't see the log line where CCM initializes the node and adds labels, but presumably that occurs before 06:25:17.395484.

kube-controller-manager.log:

I0104 06:30:42.230341      11 ttl_controller.go:281] "Changed ttl annotation" node="i-002f08e03b0d03709" TTL="5m0s"

Notice the absence of the Can't get CPU or zone information for node error.

@hakuna-matatah
Contributor Author

hakuna-matatah commented Jan 5, 2024

> CCM is fairly slow: node_controller.go:267] Update 5003 nodes status took 8m30.356517966s.

@hakman We can leverage this flag from CCM to improve the node syncs in the kops-managed CCM. This should make it faster since it spins up more workers instead of one; we could set it to 30? The default number of workers in CCM is 1, I think, and from what I see kOps doesn't allow customizing that flag, since it's not defined as a setting in its config struct, IIUC. So exposing this as a flag in kops and setting it would help (a rough sketch follows).
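For concreteness, the underlying cloud-controller-manager flag is --concurrent-node-syncs; how kops ends up exposing it is an assumption here, so the field name below is hypothetical:

```yaml
# Hypothetical kops cluster-spec snippet; the real field depends on how kops
# exposes the flag. The intent is to pass --concurrent-node-syncs=30 to CCM.
spec:
  cloudControllerManager:
    concurrentNodeSyncs: 30
```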

@hakman
Member

hakman commented Jan 6, 2024

> CCM is fairly slow: node_controller.go:267] Update 5003 nodes status took 8m30.356517966s.
>
> @hakman We can leverage this flag from CCM to improve the node syncs in the kops-managed CCM. This should make it faster since it spins up more workers instead of one; we could set it to 30? The default number of workers in CCM is 1, I think, and from what I see kOps doesn't allow customizing that flag, since it's not defined as a setting in its config struct, IIUC. So exposing this as a flag in kops and setting it would help.

Let's see if #16228 will help. 😀

@torredil
Member

torredil commented Jan 8, 2024

@hakman @hakuna-matatah

Looks like it helped, but we still have a ways to go:

aws-cloud-controller-manager.log

I0107 06:25:27.482746       1 node_controller.go:502] Successfully initialized node i-07814b493f17c05fe with cloud provider
...
I0107 06:47:51.234044       1 node_controller.go:502] Successfully initialized node i-07ab0fadf6b251bd2 with cloud provider
I0107 06:25:27.223436       1 node_controller.go:267] Update 3 nodes status took 69.915µs.
I0107 06:31:46.038439       1 node_controller.go:267] Update 5003 nodes status took 1m18.814078847s.
I0107 06:40:28.547820       1 node_controller.go:267] Update 5003 nodes status took 3m42.508428426s.
I0107 06:52:51.190852       1 node_controller.go:267] Update 5003 nodes status took 7m22.642821075s.
I0107 07:06:17.260322       1 node_controller.go:267] Update 5003 nodes status took 8m26.068627268s.
I0107 07:19:41.514106       1 node_controller.go:267] Update 5003 nodes status took 8m24.252855746s.
I0107 07:33:14.847352       1 node_controller.go:267] Update 5003 nodes status took 8m33.332452261s.
I0107 07:46:54.687156       1 node_controller.go:267] Update 5003 nodes status took 8m39.839100999s.
I0107 08:00:40.719528       1 node_controller.go:267] Update 5003 nodes status took 8m46.031692279s.
I0107 08:14:29.259301       1 node_controller.go:267] Update 5003 nodes status took 8m48.538857604s.
I0107 08:28:16.464665       1 node_controller.go:267] Update 5003 nodes status took 8m47.204680091s.
I0107 08:42:03.231571       1 node_controller.go:267] Update 5003 nodes status took 8m46.766106626s.
I0107 08:55:52.191038       1 node_controller.go:267] Update 5003 nodes status took 8m48.959225455s.
I0107 09:09:42.420701       1 node_controller.go:267] Update 5003 nodes status took 8m50.228761742s.
I0107 09:23:31.476705       1 node_controller.go:267] Update 5003 nodes status took 8m49.055156794s.
I0107 09:37:12.876390       1 node_controller.go:267] Update 5003 nodes status took 8m41.399313664s.

How about setting ConcurrentNodeSyncs to 100 for this test? At 5, we are not close to maxing out QPS.
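Rough back-of-the-envelope on that, assuming the status loop parallelizes cleanly and per-node latency stays roughly constant:

  8m48s ≈ 528 s per round at 5 workers → 528 s × 5 / 5003 nodes ≈ 0.53 s per node update
  at 100 workers: (5003 / 100) × 0.53 s ≈ 26 s per round

So with per-node latency around half a second, worker count rather than client-side QPS looks like the binding constraint at 5 workers.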
