Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When linkerd-network-validator catches missing iptables config, pod is left in a failure state #11073

Closed
alpeb opened this issue Jun 29, 2023 · 12 comments · Fixed by #11699, linkerd/linkerd2-proxy-init#306 or #11922

Comments

@alpeb
Copy link
Member

alpeb commented Jun 29, 2023

linkerd/linkerd2-proxy-init#242 fixed the CNI boot ordering issue, where linkerd-cni installed its config before any other CNI plugin had a chance to install its own config for setting up the cluster's networking.

However, there remains another scenario to address. Right after the other CNI plugins deploy their configs, the scheduler starts triggering pods, and sometimes might not leave time for linkerd-cni to drop its config. The injected linkerd-network-validor will catch that and fail the pod. But that doesn't signal the scheduler to restart the pod, and manual restarting is required. We have reproduced this issue in GKE with its default CNI plugin, when the node pool is scaled.

@rootik
Copy link

rootik commented Jul 6, 2023

The bug is reproducible on AKS. Especially in situation when the node pool reached it's capacity and there are queue of unscheduled pods. When we scale out the node pool, most of the pods scheduled on a freshly added nodes were stuck with the linkerd-network-validator containers in endless Init:CrashLoopBackOff . When we kill those pods in some time they are starting normally.
A temporary solution we use is to kill those pods with a shell-operator script.

@rootik
Copy link

rootik commented Jul 6, 2023

A question, could linkerd CNI label the nodes it runs on as schedulable after full initialization so we can use nodeSelector as an indicator that pods are safe to be sheduled?
e.g. linkerd.io/cni-ready=true

@alpeb
Copy link
Member Author

alpeb commented Jul 10, 2023

Thanks for letting us know this is an issue in AKS as well, and for the pointer to shell-operator; I wasn't aware of that project :-)
As for having linkerd-cni label its nodes, that sounds like a nice workaround. Could possibly be implemented as a postStart hook on the linkerd-cni Daemonset. We'll keep on mind this when we prioritize this issue :-)

@rootik
Copy link

rootik commented Jul 12, 2023

Thanks for looking at this.
shell-operator is not actually the right tool for it as it doesn't watch for Warning type events.
We've ended up with a little powershell script which runs in a pod along with kubelet proxy container, watches events filtered by fieldSelector == type=Warning,reason=BackOff,regarding.kind=Pod and drops pods with linkerd-network-validator container stuck in Init:CrashLoopBackOff state.

@risingspiral risingspiral added this to the stable-2.13.x-patches milestone Jul 14, 2023
@rootik
Copy link

rootik commented Aug 4, 2023

Whoever is stuck with the issue, this will unblock them

Chart.yaml

---
apiVersion: v2
name: event-processor
version: 1.0.1
description: "A Helm chart to install event-processor"

templates/config-map.yaml

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-processor
  namespace: devops
data:
  {{- (.Files.Glob "scripts/*").AsConfig | nindent 2 }}

templates/deployments.yaml

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-processor
  namespace: devops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: event-processor
  template:
    metadata:
      labels:
        app: event-processor
      annotations:
        configHash: {{ .Files.Get "scripts/event-processor.ps1" | sha256sum }}
    spec:
      serviceAccountName: event-processor
      containers:
        - name: event-processor
          image: mcr.microsoft.com/powershell:latest
          imagePullPolicy: IfNotPresent
          command:
          - pwsh
          args:
          - /scripts/event-processor.ps1
          env:
          - name: DEBUG
            value: 'true'
          resources:
            limits:
              memory: 150Mi
            requests:
              memory: 100Mi
          volumeMounts:
          - name: scripts
            mountPath: /scripts
        - name: kube-proxy
          image: bitnami/kubectl:latest
          command:
          - kubectl
          args:
          - proxy
          - --port=8080
          - --keepalive=10s
          resources:
            limits:
              memory: 32Mi
            requests:
              memory: 20Mi
      volumes:
        - name: scripts
          configMap:
            name: event-processor
            defaultMode: 0755

templates/rbac.yaml

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: event-processor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: event-processor
rules:
- apiGroups: 
  - "*"
  - events.k8s.io
  resources:
  - events
  verbs:
  - get
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-processor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: event-processor
subjects:
  - kind: ServiceAccount
    name: event-processor
    namespace: devops

scripts/event-processor.ps1

$ErrorActionPreference = 'Continue'
function kubeApi() {
  param(
    [string]$Api,
    [string]$Kind,
    [string]$Name,
    [string]$Namespace,
    [string]$Action,
    [string]$Body
  )
  if ($Kind -eq 'pods') { $apiSegment = 'api' }
  else { $apiSegment = 'apis' }
  $uriRequest = [System.UriBuilder]"http://127.0.0.1:8080/$apiSegment/$Api/namespaces/$namespace/$kind/$name"
  if ($Action -eq 'patch') {
    $contentType = @{
      'Content-Type' = 'application/strategic-merge-patch+json'
    }
  }

  $statusCode = 200
  $res = try { Invoke-RestMethod -Method $action -Headers $contentType -Body $Body -Uri $uriRequest.Uri } catch { $statusCode = $_.Exception.Response.StatusCode.value__ }
  if ($env:DEBUG -eq 'true') {
    Write-Host $Action.ToUpper() $uriRequest.Uri '-' $statusCode -ForegroundColor Blue
  }
  Write-Output @{ res = $res; statusCode = $statusCode }
}

Start-Sleep -Seconds 10

Add-Type -AssemblyName System.Web

$queryString= [System.Web.HttpUtility]::ParseQueryString([String]::Empty)
$queryString.Add('watch', 'true')
$queryString.Add('sendInitialEvents', 'false')
$queryString.Add('fieldSelector', 'type=Warning,reason=BackOff,regarding.kind=Pod')

$uriRequest = [System.UriBuilder]'http://127.0.0.1:8080/apis/events.k8s.io/v1/events'

$uriRequest.Query = $queryString.ToString()

while ($true) {
  try {
    $request = [System.Net.WebRequest]::Create($uriRequest.Uri.OriginalString)
    $request.KeepAlive = $true

    $resp = $request.GetResponse()

    $reqstream = $resp.GetResponseStream()

    $sr = New-Object System.IO.StreamReader $reqstream

    while (!$sr.EndOfStream) {

      $line = $sr.ReadLine();

      $k8sEvent = (ConvertFrom-Json $line -Depth 10).object

      if ($k8sEvent) {
        $timestamp = $k8sEvent.metadata.creationTimestamp
        $pod = $k8sEvent.regarding.name
        $namespace = $k8sEvent.regarding.namespace
        $reason = $k8sEvent.reason
        $note = $k8sEvent.note
        $container = $k8sEvent.regarding.fieldPath -replace '.+\{(.+)\}', '$1'
        $action = 'skip'
        $result = 0

        if ($container -eq 'linkerd-network-validator') {
          # pod with linkerd CNI ordering issue
          # just kill it with fire
          $resource = kubeApi -Kind pods -Name $pod -Namespace $namespace -Action delete -Api v1
          if ($resource.statusCode -eq 200) {
            $action = 'delete'
            $resource = $resource.res.name
            $result = 200
          }
          else {
            $resource = "pod/$pod"
            $result = $resource.statusCode
          }
          $kind = 'Pod'

        }
        [ordered]@{
          timestamp    = $timestamp;
          pod          = $pod;
          namespace    = $namespace;
          reason       = $reason;
          note         = $note;
          container    = $container;
          action       = $action;
          result       = $result
        } | ConvertTo-Json -Compress
      }
    }
    Write-Host '*** Stream closed ***'
  }
  finally {
    $sr.Close()
  }
}

This can be deployed with helm upgrade event-processor -n devops .

@steve-gray
Copy link
Contributor

We're seeing this issue frequently when using spot/preemptible nodes on Oracle Cloud too. Its a bit of a painful one for us at the moment.

@mateiidavid
Copy link
Member

I had an interesting conversation with someone on Slack. There may be an alternative to creating a controller that restarts failed pods.

Cilium (and perhaps other CNI distributions) rely on Node taints to block pods from executing. Nodes get registered with a taint, when the plugin has finished executing, an operator removes the taint. Our DaemonSet would be capable of similar behaviour. Provided this doesn't result in multiple distributions writing to the same node at the same time it might be an easier and cleaner solutions than directly descheduling pods. Added some reference literature at the bottom.

https://docs.cilium.io/en/stable/installation/taints/

@michalschott
Copy link

michalschott commented Oct 11, 2023

⬆️ EKS with spot instances and Cilium CNI instead default aws-vpc-cni one.

@michalschott
Copy link

michalschott commented Oct 12, 2023

Workaround which worked for me instead is:

  • update cluster role with
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["patch"]
  • add sidecar container:
      - name: kubectl
        image: alpine/k8s:1.25.14
        command: ["/bin/sh", "-c"]
        args:
          - echo "Removing taint";
            /usr/bin/kubectl taint node $MY_NODE linkerd-cni=NotReady:NoSchedule-;
            echo "Sleeping";
            while true; do sleep 30; done;
        env:
        - name: MY_NODE
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - kill -15 1; sleep 15s
        resources:
          limits:
            memory: "32Mi"
          requests:
            cpu: "1m"
            memory: "32Mi"
        securityContext:
          readOnlyRootFilesystem: true
          privileged: false

Remember to conifgure your nodes/nodegroups so nodes are registered with linkerd-cni=NotReady:NoSchedule taint

@rootik
Copy link

rootik commented Oct 12, 2023

@michalschott would it be better to make install-cni container an init container? Which would guarantee that your container starts after linkerd CNI installed and the container won't need to sleep for 30 seconds.

@michalschott
Copy link

@rootik yes thats also an option but I was looking for as small kustomize patch as possible there.

@alpeb alpeb added the bug label Oct 12, 2023
@steve-gray
Copy link
Contributor

Thats very interesting - What cluster role did you edit @michalschott, and is that change/patch being made to a particular daemonset? Just want to make sure we can correctly mirror this, as we're seeing the issue with spot instances and its quite painful.

alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Dec 5, 2023
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network
config because `linkerd-cni` and/or the cluster's network CNI haven't
fully started. They are left in a permanent crash loop and once CNI is
ready, they need to be restarted externally, which is what this
controller does.

This controller "`linkerd-reinitialize-pods`" watches over events on
pods in the current node, which have been injected but are in a
terminated state and whose `linkerd-network-validator` container exited
with code 95, and proceeds to evict them so they can restart with a
proper network config.

The controller is to be deployed as an additional container in the
`linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure why `/metrics` is returning a 404 (should show process metrics)
- Integration test
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Dec 5, 2023
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network
config because `linkerd-cni` and/or the cluster's network CNI haven't
fully started. They are left in a permanent crash loop and once CNI is
ready, they need to be restarted externally, which is what this
controller does.

This controller "`linkerd-reinitialize-pods`" watches over events on
pods in the current node, which have been injected but are in a
terminated state and whose `linkerd-network-validator` container exited
with code 95, and proceeds to evict them so they can restart with a
proper network config.

The controller is to be deployed as an additional container in the
`linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure why `/metrics` is returning a 404 (should show process metrics)
- Integration test
alpeb added a commit that referenced this issue Dec 5, 2023
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also the `linkerd-cni`'s version is bumped, to contain the new binary
for this controller.

## TO-DOs

- Integration test
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Dec 5, 2023
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network
config because `linkerd-cni` and/or the cluster's network CNI haven't
fully started. They are left in a permanent crash loop and once CNI is
ready, they need to be restarted externally, which is what this
controller does.

This controller "`linkerd-reinitialize-pods`" watches over events on
pods in the current node, which have been injected but are in a
terminated state and whose `linkerd-network-validator` container exited
with code 95, and proceeds to evict them so they can restart with a
proper network config.

The controller is to be deployed as an additional container in the
`linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure why `/metrics` is returning a 404 (should show process metrics)
- Integration test
alpeb added a commit that referenced this issue Dec 5, 2023
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also the `linkerd-cni`'s version is bumped, to contain the new binary
for this controller.

## TO-DOs

- Integration test
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Dec 6, 2023
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network
config because `linkerd-cni` and/or the cluster's network CNI haven't
fully started. They are left in a permanent crash loop and once CNI is
ready, they need to be restarted externally, which is what this
controller does.

This controller "`linkerd-reinitialize-pods`" watches over events on
pods in the current node, which have been injected but are in a
terminated state and whose `linkerd-network-validator` container exited
with code 95, and proceeds to evict them so they can restart with a
proper network config.

The controller is to be deployed as an additional container in the
`linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure why `/metrics` is returning a 404 (should show process metrics)
- Integration test
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Jan 2, 2024
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network config because `linkerd-cni` and/or the cluster's network CNI haven't fully started. They are left in a permanent crash loop and once CNI is ready, they need to be restarted externally, which is what this controller does.

This controller "`linkerd-cni-repair-controller`" watches over events on pods in the current node, which have been injected but are in a terminated state and whose `linkerd-network-validator` container exited with code 95, and proceeds to delete them so they can restart with a proper network config.

The controller is to be deployed as an additional container in the `linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#11699).

This exposes two custom counter metrics: `linkerd_cni_repair_controller_queue_overflow` (in the spirit of the destination controller's `endpoint_updates_queue_overflow`) and `linkerd_cni_repair_controller_deleted`
olix0r pushed a commit that referenced this issue Jan 5, 2024
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also the `linkerd-cni`'s version is bumped, to contain the new binary
for this controller.
mateiidavid added a commit that referenced this issue Jan 12, 2024
This edge release introduces a number of different fixes and improvements. More
notably, it introduces a new `cni-repair-controller` binary to the CNI plugin
image. The controller will automatically restart pods that have not received
their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing
  Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion
  feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI
  identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to
  automatically restart misconfigured pods that are missing iptables rules
  ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror
  component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json
  output ([#11889])
* Changed how `Server` updates are handled in the destination service. The
  change will ensure that during a cluster resync, consumers won't be
  overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes
  client cannot be successfully initialised

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <matei@buoyant.io>
mateiidavid added a commit that referenced this issue Jan 12, 2024
This edge release introduces a number of different fixes and improvements. More
notably, it introduces a new `cni-repair-controller` binary to the CNI plugin
image. The controller will automatically restart pods that have not received
their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing
  Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion
  feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI
  identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to
  automatically restart misconfigured pods that are missing iptables rules
  ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror
  component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json
  output ([#11889])
* Changed how `Server` updates are handled in the destination service. The
  change will ensure that during a cluster resync, consumers won't be
  overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes
  client cannot be successfully initialised ([#11917])

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <matei@buoyant.io>
mateiidavid added a commit that referenced this issue Jan 12, 2024
This edge release introduces a number of different fixes and improvements. More
notably, it introduces a new `cni-repair-controller` binary to the CNI plugin
image. The controller will automatically restart pods that have not received
their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing
  Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion
  feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI
  identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to
  automatically restart misconfigured pods that are missing iptables rules
  ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror
  component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json
  output ([#11889])
* Changed how `Server` updates are handled in the destination service. The
  change will ensure that during a cluster resync, consumers won't be
  overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes
  client cannot be successfully initialised ([#11917])

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <matei@buoyant.io>
adleong pushed a commit that referenced this issue Jan 18, 2024
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also the `linkerd-cni`'s version is bumped, to contain the new binary
for this controller.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
6 participants