Troubleshooting Guide

Troubleshooting Windows

Work through the steps below in sequence to debug issues with Windows Nodes and Pods.

Verify if your EKS Cluster is on the required Platform Version

To get the Platform Version of your EKS cluster

aws eks describe-cluster --name cluster-name --region us-west-2 | jq .cluster.platformVersion

Your Platform Version should be equal to or greater than Platform Version specified here.
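The comparison can be scripted. The following is a minimal sketch, assuming the platform version has the form eks.N so that it can be compared numerically. The helper name platform_version_ge is hypothetical, and the required version used in the usage comment is a placeholder, not the real minimum.

```shell
# Hypothetical helper: succeeds when platform version $1 (e.g. "eks.10")
# is greater than or equal to platform version $2.
# Assumes both versions have the form "eks.N".
platform_version_ge() (
  current="${1#eks.}"    # strip the "eks." prefix, leaving the number
  required="${2#eks.}"
  [ "$current" -ge "$required" ]
)

# Usage against a live cluster (requires aws and jq; REQUIRED is a
# placeholder, take the real value from the linked documentation):
#   REQUIRED="eks.3"
#   current=$(aws eks describe-cluster --name cluster-name --region us-west-2 | jq -r .cluster.platformVersion)
#   platform_version_ge "$current" "$REQUIRED" && echo "platform version OK"
```

The numeric comparison matters because a plain string comparison would rank eks.10 below eks.9.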

Resolution

If your Platform Version is lower, you can

  • Create a new EKS Cluster or
  • Update to the new K8s Version if possible or
  • Enable legacy controller support on your EKS Cluster using this guide.

Verify Windows IPAM is enabled in the ConfigMap.

To get the ConfigMap and the data field

kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data"

You should have the ConfigMap with the following data,

enable-windows-ipam:true

Resolution

If the ConfigMap is missing or doesn't have the above field, you can

  • Create or Update ConfigMap with the required fields by following this guide.
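The check above can be automated by searching the printed data string. This is a rough sketch, assuming the data prints as key:value pairs as shown above; the helper name configmap_flag_enabled is hypothetical.

```shell
# Hypothetical helper: reads the ConfigMap data string on stdin, for example
# "map[enable-windows-ipam:true]", and succeeds if key $1 is set to true.
configmap_flag_enabled() {
  grep -Fq -- "$1:true"
}

# Usage against a live cluster:
#   kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data" \
#     | configmap_flag_enabled enable-windows-ipam && echo "Windows IPAM enabled"
```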

Verify Node has the Resource Capacity

Describe the Windows Node,

kubectl describe node node-name

You should see a non-zero capacity for resource vpc.amazonaws.com/PrivateIPv4Address

Capacity:
  vpc.amazonaws.com/PrivateIPv4Address:  9
Allocatable:
  vpc.amazonaws.com/PrivateIPv4Address:  9

Resolution

If the node doesn't have the resource capacity, validate the following,

  • The Windows Node has the label kubernetes.io/os: windows or beta.kubernetes.io/os: windows.
  • Sufficient ENIs/IPs are available.
  • The Cluster Role has sufficient permissions.
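To read the capacity without scanning the full describe output by eye, the value can be extracted with awk. A minimal sketch, assuming the output layout shown above; resource_capacity is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl describe node` output on stdin and
# prints the first value listed for resource $1 (the Capacity entry, since
# Capacity is printed before Allocatable).
resource_capacity() {
  awk -v res="$1:" '$1 == res { print $2; exit }'
}

# Usage against a live cluster:
#   kubectl describe node node-name | resource_capacity vpc.amazonaws.com/PrivateIPv4Address
```

An empty result means the resource is not advertised on the node at all, which points to the checks listed above.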

Verify Pod has the resource limits

Describe the Windows Pod,

kubectl describe pod windows-pod

You should see a limit and a request of 1 for the resource vpc.amazonaws.com/PrivateIPv4Address

Limits:
  vpc.amazonaws.com/PrivateIPv4Address:  1
Requests:
  vpc.amazonaws.com/PrivateIPv4Address:  1

Resolution

If limit/request is missing,

  1. Validate Pod has nodeSelector.
    nodeSelector:
      kubernetes.io/os: windows
    
  2. Validate Mutating Webhook Configuration is not accidentally deleted.
    kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io vpc-resource-mutating-webhook
    NAME                            WEBHOOKS   AGE
    vpc-resource-mutating-webhook   1          59d
    

Verify Pod has the IPv4 Address Annotation.

Describe the Windows Pod,

kubectl describe pod windows-pod

The Pod should have an annotation similar to the following.

Annotations:    vpc.amazonaws.com/PrivateIPv4Address: 192.168.25.15/19
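The annotation value can be pulled out of the describe output with sed. This is a sketch under the assumption that the annotation is printed in the address/prefix form shown above; pod_ipv4_annotation is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl describe pod` output on stdin and prints
# the value of the PrivateIPv4Address annotation (address/prefix-length).
# Limit and request lines for the same key are skipped because their value
# ("1") contains no "/".
pod_ipv4_annotation() {
  sed -n 's|.*vpc\.amazonaws\.com/PrivateIPv4Address: *\([0-9.]*/[0-9]*\).*|\1|p' | head -n 1
}

# Usage against a live cluster:
#   kubectl describe pod windows-pod | pod_ipv4_annotation
```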

Resolution

If the Annotation is missing,

Look for Issues on the Windows Host

Resolution

If the Pod is still stuck in ContainerCreating you can,

  • Fetch more detailed logs on the Host using the EKS Log collector script
  • Check the CNI Logs from the collected logs.
  • Open an Issue in this repository if the logs do not reveal the cause.

Troubleshooting Security Group for Pods

Work through the steps below in sequence to debug issues with Security Group for Pods.

Verify ENI Trunking is Enabled

Describe the aws-node daemonset

kubectl get ds -n kube-system aws-node -o yaml

The following environment variable must be set.

containers:
  name: aws-node
  env:
  - name: ENABLE_POD_ENI
    value: "true" 

If the VPC CNI containers' env refers to ConfigMaps, the same key/value pair must be set in the referenced ConfigMap.
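Instead of reading the full daemonset YAML, the environment can be listed as NAME=value lines and filtered. A minimal sketch; env_value is a hypothetical helper, and it only sees values set directly on the container, not values pulled from a ConfigMap reference.

```shell
# Hypothetical helper: reads NAME=value lines on stdin and prints the value
# of variable $1, or nothing if the variable is not set directly.
env_value() {
  sed -n "s/^$1=//p" | head -n 1
}

# Usage against a live cluster (kubectl set env --list prints the
# container environment as NAME=value lines):
#   kubectl set env daemonset/aws-node -n kube-system --list | env_value ENABLE_POD_ENI
```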

Resolution

If the environment variable is not set,

Verify Trunk ENI is created

Get the EKS managed CRD CNINode

kubectl get cninode <NODE_NAME>

The CNINode's FEATURE column should have

[{"name":"SecurityGroupsForPods"}]

Alternatively, describe the Node for further confirmation.

kubectl describe node <NODE_NAME>

The following resource is added to the node's Capacity and Allocatable if the Trunk ENI was created successfully

vpc.amazonaws.com/pod-eni:  9 (could be other values depending on your instance type)

Your node should also receive an event like the following:

Normal  NodeTrunkInitiated       5m12s  vpc-resource-controller  The node has trunk interface initialized successfully

Resolution

If the CNINode feature or the node resource is missing, check the following,

  • The instance type supports ENI Trunking. Only Nitro-based instances support this feature, so check that the instance type is in the supported list.

On nodes created before the feature was enabled,

  • Check if there's capacity to create one more ENI.
    aws ec2 describe-network-interfaces --filters Name=attachment.instance-id,Values=instance-id
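Whether one more ENI fits comes down to comparing the number of attached ENIs with the per-instance maximum. A rough sketch, assuming both numbers are obtained as shown in the usage comment; has_eni_headroom is a hypothetical helper, and m5.large is only an example instance type.

```shell
# Hypothetical helper: succeeds when an instance with $1 attached ENIs and a
# per-instance maximum of $2 ENIs can still attach one more ENI.
has_eni_headroom() {
  [ "$1" -lt "$2" ]
}

# Usage (requires the aws CLI; replace instance-id and the instance type):
#   attached=$(aws ec2 describe-network-interfaces \
#     --filters Name=attachment.instance-id,Values=instance-id \
#     --query 'length(NetworkInterfaces)' --output text)
#   max=$(aws ec2 describe-instance-types --instance-types m5.large \
#     --query 'InstanceTypes[0].NetworkInfo.MaximumNetworkInterfaces' --output text)
#   has_eni_headroom "$attached" "$max" || echo "no room for a trunk ENI"
```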
    

On nodes created after feature was enabled,

Verify Pod has the resource limit

Describe the SGP Pod

kubectl describe pod sgp-pod

You should see a limit and a request of 1 for the resource vpc.amazonaws.com/pod-eni

Limits:
  vpc.amazonaws.com/pod-eni:  1
Requests:
  vpc.amazonaws.com/pod-eni:  1

Resolution

If limit/request is missing,

  1. Validate you have Security Group Policy that matches labels/service account with the Pod.
  2. Validate the RBAC Role and RoleBindings are not accidentally deleted.
    kubectl get rolebindings.rbac.authorization.k8s.io -n kube-system  eks-vpc-resource-controller-rolebinding
    kubectl get roles.rbac.authorization.k8s.io -n kube-system eks-vpc-resource-controller-role
    
    NAME                                      ROLE                                    AGE
    eks-vpc-resource-controller-rolebinding   Role/eks-vpc-resource-controller-role   59d
    NAME                               CREATED AT
    eks-vpc-resource-controller-role   2021-11-08T07:40:41Z
    
  3. Validate Mutating Webhook Configuration is not accidentally deleted.
    kubectl get mutatingwebhookconfigurations.admissionregistration.k8s.io vpc-resource-mutating-webhook
    NAME                            WEBHOOKS   AGE
    vpc-resource-mutating-webhook   1          59d
    

Verify Pod has the pod-eni annotation

Describe the SGP Pod,

kubectl describe pod sgp-pod

The Pod should have the following annotation.

Annotations:    vpc.amazonaws.com/pod-eni: [Branch ENI Details]

Resolution

If the Annotation is missing,

Check Issues with VPC CNI

Resolution

If the Pod is still stuck in ContainerCreating you can,

  • Fetch more detailed logs on the Host using the EKS Log collector script
  • Check the CNI Logs from the collected logs.
  • Open an Issue in this repository if the problem still persists.

Connection Timeouts

If you observe connection failures such as intermittent DNS timeouts on pods using security groups, you might need to update the branch ENI cooldown period or the kernel ARP cache timeout so that the two values are equal. Otherwise, a new pod can reuse the IP address of a recently terminated pod before the kernel's ARP cache is updated, which causes DNS failures or general packet drops.

The branch ENI cooldown period is the period of time to wait before deleting the branch ENI for propagation of iptables rules for the deleted pod. This can be set on the amazon-vpc-cni configmap. See more details here.

To update the kernel ARP cache timeout, set the following parameters for each existing interface on the node. If the branch ENI cooldown period is 30s, set:

sudo sysctl -w net.ipv4.neigh.eth0.gc_stale_time=30
sudo sysctl -w net.ipv4.neigh.eth0.base_reachable_time_ms=15000

Also set the default so all new interfaces created are configured with these values:

sudo sysctl -w net.ipv4.neigh.default.gc_stale_time=30
sudo sysctl -w net.ipv4.neigh.default.base_reachable_time_ms=15000
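To apply the same settings to several interfaces, the commands can be generated in a loop. This is a sketch only: arp_sysctl_commands is a hypothetical helper, and setting base_reachable_time_ms to half the cooldown (in milliseconds) is an assumption generalized from the 30s / 15000 pair above, not a documented rule.

```shell
# Hypothetical helper: prints the sysctl commands for a cooldown of $1 seconds
# for every interface name given after it. Nothing is applied; the caller
# reviews the output and runs it with sudo.
# Assumption: base_reachable_time_ms is half the cooldown, in milliseconds.
arp_sysctl_commands() (
  cooldown="$1"
  shift
  for iface in "$@"; do
    echo "sysctl -w net.ipv4.neigh.$iface.gc_stale_time=$cooldown"
    echo "sysctl -w net.ipv4.neigh.$iface.base_reachable_time_ms=$((cooldown * 500))"
  done
)

# Usage: print the commands for eth0 and for the default used by new interfaces.
#   arp_sysctl_commands 30 eth0 default
```

Printing rather than executing keeps the helper safe to run on a node before committing to the change.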

IP starvation issue

If pods are not Running because IP addresses are unavailable, even though only a few pods are running and you expect IP addresses to be free, tune the branch ENI cooldown period accordingly. The branch ENI cooldown period is the period of time to wait before deleting the branch ENI for propagation of iptables rules for the deleted pod. The default value is 60s, so IP addresses are not released for at least 60s. This can be configured via the amazon-vpc-cni configmap as described here. Note that the minimum cooldown period is 30s.

Be sure to also update the kernel ARP cache timeouts if you notice DNS issues as outlined in the above section.

Troubleshooting Prefix Delegation for Windows

Please follow the troubleshooting steps here for issues with Windows Node and Pods when using prefix delegation mode.

Check the following steps in sequence to find any issues with the workflow.

Verify Windows prefix delegation is enabled in the ConfigMap

To get the ConfigMap and the data field

kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data"

You should have the ConfigMap with the following data in the string,

enable-windows-ipam:true enable-windows-prefix-delegation:true

Resolution

If the ConfigMap is missing or doesn't have the above fields, create or update the amazon-vpc-cni ConfigMap with the required fields:

enable-windows-ipam: "true"
enable-windows-prefix-delegation: "true"

Note: Windows IPAM must be enabled in order to use the Windows prefix delegation feature.

Check both pod events and node events for any specific error

If the controller encounters an error during its prefix delegation workflow that the customer needs to act on, it emits the error as pod events and/or node events. Checking these events is therefore a good starting point for finding the root cause.

You can obtain the pod events using the following command.

kubectl get events --all-namespaces

If there is an explicit error, investigate it.

For example, if the error states that there is insufficient space in the subnet to carve a /28 prefix, check the subnet to ensure that it has free /28 ranges that can be allocated as prefixes.
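On a busy cluster the event list is long, so it can help to keep only the warnings. A minimal sketch, assuming the default column layout of kubectl get events --all-namespaces (NAMESPACE, LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); warning_events is a hypothetical helper name.

```shell
# Hypothetical helper: reads the events table on stdin and keeps the header
# row plus every row whose TYPE column is "Warning".
# Assumes TYPE is the third whitespace-separated field of each data row.
warning_events() {
  awk 'NR == 1 || $3 == "Warning"'
}

# Usage against a live cluster:
#   kubectl get events --all-namespaces | warning_events
# kubectl can also filter on the server side:
#   kubectl get events --all-namespaces --field-selector type=Warning
```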

Verify Node has the required Resource Capacity

Same as Verify Node has the Resource Capacity

Verify Pod has the required resource limits

Same as Verify Pod has the resource limits

Verify Pod has the required IPv4 Address Annotation

Same as Verify Pod has the IPv4 Address Annotation

Verify the configuration options set for windows prefix delegation

Configuration options can be used to fine-tune the behaviour of prefix delegation on Windows. The details about the options are available here.

To get the ConfigMap and the data field

kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data"

If you see any of the following keys in the data,

minimum-ip-target
warm-ip-target
warm-prefix-target

Then the configuration options have been set.

Resolution

Verify if the configuration is correct as mentioned in the documentation.

Alternatively, to isolate the issue, try removing the above keys from the config map.
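To see at a glance which of these options are set, the keys can be extracted from the printed data string. A sketch, assuming the data prints as space-separated key:value pairs inside map[...]; prefix_delegation_options is a hypothetical helper name.

```shell
# Hypothetical helper: reads the ConfigMap data string on stdin and prints
# each prefix delegation tuning option that is set, one key:value per line.
# Prints nothing (and returns non-zero) when none of the keys are present.
prefix_delegation_options() {
  grep -Eo '(minimum-ip-target|warm-ip-target|warm-prefix-target):[^] ]*'
}

# Usage against a live cluster:
#   kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data" \
#     | prefix_delegation_options
```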

Look for networking issues on the Windows Host

Same as Look for Issues on the Windows Host

List of Common Issues

PSP Blocking Controller Annotations

If you have a PSP that blocks annotations on Pods, you must allow annotations from the user eks:vpc-resource-controller.

subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: system:authenticated
  - kind: User
    name: eks:vpc-resource-controller
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount
    name: eks-vpc-resource-controller

Missing IAM Permissions on the Cluster Role

To get cluster role for your EKS Cluster

aws eks describe-cluster --name cluster-name --region us-west-2 | jq .cluster.roleArn

To find the policies attached to the cluster role

aws iam list-attached-role-policies --role-name role-name-from-above

The Policy arn:aws:iam::aws:policy/AmazonEKSVPCResourceController must be present for the Windows/SGP feature to work. If it's missing, please add the policy.
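The first command returns a role ARN, while the second expects a role name. A minimal sketch of the conversion, assuming the name is the last path segment of the ARN; role_name_from_arn is a hypothetical helper, and the account ID and role name in the example are made up.

```shell
# Hypothetical helper: prints the role name for role ARN $1 by keeping
# everything after the last "/".
role_name_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Usage (requires aws and jq):
#   arn=$(aws eks describe-cluster --name cluster-name --region us-west-2 | jq -r .cluster.roleArn)
#   aws iam list-attached-role-policies --role-name "$(role_name_from_arn "$arn")" \
#     --query 'AttachedPolicies[].PolicyArn' --output text
# Example with a made-up ARN:
#   role_name_from_arn arn:aws:iam::111122223333:role/eksClusterRole
```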

ENI/IP Exhaustion

Creating a new ENI or assigning a secondary IPv4 address can fail if your Subnet doesn't have enough free IPv4 addresses.

To find the number of IPv4 addresses available

aws ec2 describe-subnets --subnet-ids subnet-id-here

In the response, the AvailableIpAddressCount field shows how many IPv4 addresses are available in the Subnet.

Disable prefix delegation feature for Windows

You should check if the feature is enabled via ConfigMap. To get the ConfigMap and the data field

kubectl get configmaps -n kube-system amazon-vpc-cni -o custom-columns=":data"

If the ConfigMap has the following data in the string,

enable-windows-prefix-delegation:true

then the feature is enabled.

Resolution

You can disable the feature by editing your config map and setting enable-windows-prefix-delegation to "false".
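As an alternative to editing the config map by hand, the change can be applied as a merge patch. A rough sketch; prefix_delegation_patch is a hypothetical helper that only builds the patch body, and the kubectl call is left in a comment for the operator to run.

```shell
# Hypothetical helper: prints a JSON merge patch that sets
# enable-windows-prefix-delegation to $1 ("true" or "false").
prefix_delegation_patch() {
  printf '{"data":{"enable-windows-prefix-delegation":"%s"}}\n' "$1"
}

# Usage against a live cluster:
#   kubectl patch configmap amazon-vpc-cni -n kube-system --type merge \
#     -p "$(prefix_delegation_patch false)"
```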