
k0s and swap: Pods got swapped but memory-pressure taint triggers. #3830

Closed
leleobhz opened this issue Dec 18, 2023 · 7 comments
Labels: question, Stale

Comments


leleobhz commented Dec 18, 2023

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

Linux 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr  3 17:24:16 BST 2023 aarch64 GNU/Linux
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Version

v1.28.4+k0s.0

Sysinfo

`k0s sysinfo`
Machine ID: "6536b48ed857d0ad5fccafe052e5f68d963247b6c8e32116fa91e4669709c136" (from machine) (pass)
Total memory: 959.5 MiB (warning: 1.0 GiB recommended)
Disk space available for /var/lib/k0s: 12.4 GiB (pass)
Name resolution: localhost: [::1 127.0.0.1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 6.1.21-v8+ (pass)
  Max. file descriptors per process: current: 524288 / max: 524288 (pass)
  AppArmor: unavailable (pass)
  Executable in PATH: modprobe: /usr/sbin/modprobe (pass)
  Executable in PATH: mount: /usr/bin/mount (pass)
  Executable in PATH: umount: /usr/bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (pass)
    cgroup controller "memory": available (pass)
    cgroup controller "devices": available (assumed) (pass)
    cgroup controller "freezer": available (assumed) (pass)
    cgroup controller "pids": available (pass)
    cgroup controller "hugetlb": unavailable (warning)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: no kernel config found (warning)
  CONFIG_NAMESPACES: Namespaces support: no kernel config found (warning)
  CONFIG_NET: Networking support: no kernel config found (warning)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: no kernel config found (warning)
  CONFIG_PROC_FS: /proc file system support: no kernel config found (warning)

What happened?

Hello!

I'm running k0s on a cluster of 3 Raspberry Pi 3 boards (1 controller with --enable-worker + 2 workers), and I'm seeing strange behavior around swapping.

Swap enablement was discussed in #1524 (and, typos aside, it works): pods running on the workers are allowed to use swap, but the controller node - with or without --enable-worker - still gets tainted:

root@pi0:~# k0s kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME                                    TAINTS
pi0.fqdn   [map[effect:NoSchedule key:node.kubernetes.io/memory-pressure timeAdded:2023-12-18T07:00:03Z]]
pi1.fqdn   <none>
pi2.fqdn   <none>

And this is also crashing pods:

root@pi0:~# k0s kubectl get pods -n metallb-system controller-786f9df989-tb58j -o json
{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "annotations": {
            "prometheus.io/port": "7472",
            "prometheus.io/scrape": "true"
        },
        "creationTimestamp": "2023-12-16T01:59:15Z",
        "generateName": "controller-786f9df989-",
        "labels": {
            "app": "metallb",
            "component": "controller",
            "pod-template-hash": "786f9df989"
        },
        "name": "controller-786f9df989-tb58j",
        "namespace": "metallb-system",
        "ownerReferences": [
            {
                "apiVersion": "apps/v1",
                "blockOwnerDeletion": true,
                "controller": true,
                "kind": "ReplicaSet",
                "name": "controller-786f9df989",
                "uid": "c953d74d-bb37-4e85-8e9d-8509cb3adb11"
            }
        ],
        "resourceVersion": "226392",
        "uid": "0d44da90-44d0-430d-a2dc-52982a8c8e3a"
    },
    "spec": {
        "containers": [
            {
                "args": [
                    "--port=7472",
                    "--log-level=info"
                ],
                "env": [
                    {
                        "name": "METALLB_ML_SECRET_NAME",
                        "value": "memberlist"
                    },
                    {
                        "name": "METALLB_DEPLOYMENT",
                        "value": "controller"
                    }
                ],
                "image": "quay.io/metallb/controller:v0.13.12",
                "imagePullPolicy": "IfNotPresent",
                "livenessProbe": {
                    "failureThreshold": 3,
                    "httpGet": {
                        "path": "/metrics",
                        "port": "monitoring",
                        "scheme": "HTTP"
                    },
                    "initialDelaySeconds": 10,
                    "periodSeconds": 10,
                    "successThreshold": 1,
                    "timeoutSeconds": 1
                },
                "name": "controller",
                "ports": [
                    {
                        "containerPort": 7472,
                        "name": "monitoring",
                        "protocol": "TCP"
                    },
                    {
                        "containerPort": 9443,
                        "name": "webhook-server",
                        "protocol": "TCP"
                    }
                ],
                "readinessProbe": {
                    "failureThreshold": 3,
                    "httpGet": {
                        "path": "/metrics",
                        "port": "monitoring",
                        "scheme": "HTTP"
                    },
                    "initialDelaySeconds": 10,
                    "periodSeconds": 10,
                    "successThreshold": 1,
                    "timeoutSeconds": 1
                },
                "resources": {},
                "securityContext": {
                    "allowPrivilegeEscalation": false,
                    "capabilities": {
                        "drop": [
                            "all"
                        ]
                    },
                    "readOnlyRootFilesystem": true
                },
                "terminationMessagePath": "/dev/termination-log",
                "terminationMessagePolicy": "File",
                "volumeMounts": [
                    {
                        "mountPath": "/tmp/k8s-webhook-server/serving-certs",
                        "name": "cert",
                        "readOnly": true
                    },
                    {
                        "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
                        "name": "kube-api-access-4gwxm",
                        "readOnly": true
                    }
                ]
            }
        ],
        "dnsPolicy": "ClusterFirst",
        "enableServiceLinks": true,
        "nodeName": "pi0.fqdn",
        "nodeSelector": {
            "kubernetes.io/os": "linux"
        },
        "preemptionPolicy": "PreemptLowerPriority",
        "priority": 0,
        "restartPolicy": "Always",
        "schedulerName": "default-scheduler",
        "securityContext": {
            "fsGroup": 65534,
            "runAsNonRoot": true,
            "runAsUser": 65534
        },
        "serviceAccount": "controller",
        "serviceAccountName": "controller",
        "terminationGracePeriodSeconds": 0,
        "tolerations": [
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/not-ready",
                "operator": "Exists",
                "tolerationSeconds": 300
            },
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/unreachable",
                "operator": "Exists",
                "tolerationSeconds": 300
            }
        ],
        "volumes": [
            {
                "name": "cert",
                "secret": {
                    "defaultMode": 420,
                    "secretName": "webhook-server-cert"
                }
            },
            {
                "name": "kube-api-access-4gwxm",
                "projected": {
                    "defaultMode": 420,
                    "sources": [
                        {
                            "serviceAccountToken": {
                                "expirationSeconds": 3607,
                                "path": "token"
                            }
                        },
                        {
                            "configMap": {
                                "items": [
                                    {
                                        "key": "ca.crt",
                                        "path": "ca.crt"
                                    }
                                ],
                                "name": "kube-root-ca.crt"
                            }
                        },
                        {
                            "downwardAPI": {
                                "items": [
                                    {
                                        "fieldRef": {
                                            "apiVersion": "v1",
                                            "fieldPath": "metadata.namespace"
                                        },
                                        "path": "namespace"
                                    }
                                ]
                            }
                        }
                    ]
                }
            }
        ]
    },
    "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2023-12-16T02:29:40Z",
                "message": "The node was low on resource: memory. Threshold quantity: 100Mi, available: 76700Ki. Container controller was using 21576Ki, request is 0, has larger consumption of memory. ",
                "reason": "TerminationByKubelet",
                "status": "True",
                "type": "DisruptionTarget"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2023-12-16T01:59:19Z",
                "status": "True",
                "type": "Initialized"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2023-12-16T02:29:39Z",
                "reason": "PodFailed",
                "status": "False",
                "type": "Ready"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2023-12-16T02:29:39Z",
                "reason": "PodFailed",
                "status": "False",
                "type": "ContainersReady"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2023-12-16T01:59:17Z",
                "status": "True",
                "type": "PodScheduled"
            }
        ],
        "containerStatuses": [
            {
                "image": "quay.io/metallb/controller:v0.13.12",
                "imageID": "",
                "lastState": {
                    "terminated": {
                        "exitCode": 137,
                        "finishedAt": null,
                        "message": "The container could not be located when the pod was deleted.  The container used to be Running",
                        "reason": "ContainerStatusUnknown",
                        "startedAt": null
                    }
                },
                "name": "controller",
                "ready": false,
                "restartCount": 1,
                "started": false,
                "state": {
                    "terminated": {
                        "exitCode": 137,
                        "finishedAt": null,
                        "message": "The container could not be located when the pod was terminated",
                        "reason": "ContainerStatusUnknown",
                        "startedAt": null
                    }
                }
            }
        ],
        "hostIP": "192.168.42.240",
        "message": "The node was low on resource: memory. Threshold quantity: 100Mi, available: 76700Ki. Container controller was using 21576Ki, request is 0, has larger consumption of memory. ",
        "phase": "Failed",
        "podIP": "10.244.2.30",
        "podIPs": [
            {
                "ip": "10.244.2.30"
            }
        ],
        "qosClass": "BestEffort",
        "reason": "Evicted",
        "startTime": "2023-12-16T01:59:19Z"
    }
}

Trying to remove this taint by hand is not effective either, because it gets re-applied instantly after removal (the attempt is shown right after the node description below), and I can't figure out how to allow pods to be scheduled on the master+worker node, since k0s recognizes the entire RAM (physical + swap) but the container and taint engines cannot:

root@pi0:~# k0s kubectl describe nodes/pi0.fqdn
Name:               pi0.fqdn
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=arm64
                    kubernetes.io/hostname=pi0.fqdn
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node.k0sproject.io/role=control-plane
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 14 Dec 2023 22:14:21 -0300
Taints:             node.kubernetes.io/memory-pressure:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  pi0.fqdn
  AcquireTime:     <unset>
  RenewTime:       Mon, 18 Dec 2023 11:37:26 -0300
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                         Message
  ----             ------  -----------------                 ------------------                ------                         -------
  MemoryPressure   True    Mon, 18 Dec 2023 11:35:43 -0300   Mon, 18 Dec 2023 04:00:03 -0300   KubeletHasInsufficientMemory   kubelet has insufficient memory available
  DiskPressure     False   Mon, 18 Dec 2023 11:35:43 -0300   Fri, 15 Dec 2023 22:53:48 -0300   KubeletHasNoDiskPressure       kubelet has no disk pressure
  PIDPressure      False   Mon, 18 Dec 2023 11:35:43 -0300   Fri, 15 Dec 2023 22:53:48 -0300   KubeletHasSufficientPID        kubelet has sufficient PID available
  Ready            True    Mon, 18 Dec 2023 11:35:43 -0300   Fri, 15 Dec 2023 22:53:48 -0300   KubeletReady                   kubelet is posting ready status
Addresses:
  InternalIP:  192.168.42.240
  Hostname:    pi0.fqdn
Capacity:
  cpu:                4
  ephemeral-storage:  30526252Ki
  memory:             982532Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  28132993797
  memory:             880132Ki
  pods:               110
System Info:
  Machine ID:                 0877b7be72cd45feb4cb3b88e2eb858f
  System UUID:                0877b7be72cd45feb4cb3b88e2eb858f
  Boot ID:                    27234e8f-5348-4e47-830a-5ae1a01ff49b
  Kernel Version:             6.1.21-v8+
  OS Image:                   Debian GNU/Linux 12 (bookworm)
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  containerd://1.7.8
  Kubelet Version:            v1.28.4+k0s
  Kube-Proxy Version:         v1.28.4+k0s
PodCIDR:                      10.244.2.0/24
PodCIDRs:                     10.244.2.0/24
Non-terminated Pods:          (3 in total)
  Namespace                   Name                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                        ------------  ----------  ---------------  -------------  ---
  kube-system                 konnectivity-agent-cklpb    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d13h
  kube-system                 kube-proxy-t58cv            0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d13h
  kube-system                 kube-router-hbdt2           250m (6%)     0 (0%)      16Mi (1%)        0 (0%)         3d13h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                250m (6%)  0 (0%)
  memory             16Mi (1%)  0 (0%)
  ephemeral-storage  0 (0%)     0 (0%)
Events:
  Type     Reason                Age                        From     Message
  ----     ------                ----                       ----     -------
  Warning  EvictionThresholdMet  3m41s (x13114 over 2d12h)  kubelet  Attempting to reclaim memory
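
For reference, this is the (futile) removal attempt mentioned above. It's plain kubectl syntax (the trailing dash removes the taint); the kubelet re-adds the taint within moments as long as it still reports memory pressure:

root@pi0:~# k0s kubectl taint nodes pi0.fqdn node.kubernetes.io/memory-pressure:NoSchedule-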

Also, the controller node is using the following systemd config:

root@pi0:~# systemctl cat k0scontroller.service
# /etc/systemd/system/k0scontroller.service
[Unit]
Description=k0s - Zero Friction Kubernetes
Documentation=https://docs.k0sproject.io
ConditionFileIsExecutable=/usr/local/bin/k0s

After=network-online.target
Wants=network-online.target

[Service]
StartLimitInterval=5
StartLimitBurst=10
ExecStart=/usr/local/bin/k0s controller --no-taints --enable-worker --config=/root/k0s.yaml --kubelet-extra-args=--feature-gates=NodeSwap=true\x20--fail-swap-on=false

RestartSec=120
Delegate=yes
KillMode=process
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
LimitNOFILE=999999
Restart=always

[Install]
WantedBy=multi-user.target

How can I tell Kubernetes to take swap into account on the controller node too? Is it a good idea to run, on the same host, a controller unit separate from the worker unit, or can I keep using --enable-worker?

leleobhz added the bug label Dec 18, 2023
twz123 (Member) commented Dec 18, 2023

Trying to remove this taint by hand is not effective either, because it gets re-applied instantly after removal

Yes, the memory-pressure taint is managed by kubelet. It's futile to try to remove it. Moreover, when kubelet deems the node is under pressure, it will start to evict pods, no matter if the taint is on the node or not, or if the pods tolerate it.

I can't figure out how to allow pods to be scheduled on the master+worker node, since k0s recognizes the entire RAM (physical + swap) but the container and taint engines cannot

Not sure what you mean by this. Memory/RAM is not the same as swap space; simply adding more swap doesn't mean there's more RAM overall. In what regard would k0s recognize this?

How can I tell Kubernetes to take swap into account on the controller node too?

I've never tried to enable swapping for a Kubernetes node, but I can instantly imagine that this will be extremely tricky to configure. First of all, after a quick glance at the Kubernetes docs, I think the fundamental things are the ones that you already did: Enabling the NodeSwap feature gate and setting the failSwapOn kubelet config flag to false. But in order to be useful, I reckon there's lots of other stuff that needs to be fine-tuned:

  • Linux kernel settings that control how swap is being used
  • Kubelet's eviction settings
  • The container memory requests/limits, qos classes and priority classes
  • Kernel and userland OOM killers

The culprit here is that 1 GiB is not much memory in the first place to run a Kubernetes control plane plus workloads. For one, I assume you don't plan to run an HA control plane, so you might want to use kine instead of etcd. This will be lighter on resources. Then you probably need to tell the kernel to swap more aggressively, and the kubelet that it should be even more conservative about its eviction thresholds. Otherwise swapping will kick in too late to save the pods from being evicted. Also, you might want to limit the workloads that are run on the controller by re-enabling the master taints (remove --no-taints), or alternatively use a node anti-affinity on the control-plane role label for certain workloads. You'll probably see memory pressure from time to time on the controller anyway. So you might want to tell Kubernetes more about the importance of your workloads (QoS/priority classes).
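
For illustration only (an untested sketch, not something I've verified on a Pi): switching the storage backend to kine is a single setting in the k0s config, and keeping a given workload off the controller can be done with a node affinity rule on the control-plane role label:

# k0s.yaml snippet: use kine (SQLite-backed by default) instead of etcd
apiVersion: k0s.k0sproject.io/v1beta1
kind: ClusterConfig
spec:
  storage:
    type: kine

# pod template snippet: keep this workload off control-plane nodes
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: DoesNotExist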

Also have a look at kubernetes/kubernetes#120800. There's some discussions around this, including some examples in how to test this.

Is it a good idea to run, on the same host, a controller unit separate from the worker unit, or can I keep using --enable-worker?

You definitely want to run a single k0s process. The --enable-worker flag is exactly for that purpose. Having two processes will only add a lot of annoying extra configuration trouble without any benefit. Besides that, it will need more memory, too.

twz123 added the question label and removed the bug label Dec 18, 2023
leleobhz (Author) commented

Hello @twz123

Yes, the memory-pressure taint is managed by kubelet. It's futile to try to remove it. Moreover, when kubelet deems the node is under pressure, it will start to evict pods, no matter if the taint is on the node or not, or if the pods tolerate it.

Right. So it's by design of Kubernetes.

Not sure what you mean by this. Memory/RAM is not the same as swap space; simply adding more swap doesn't mean there's more RAM overall. In what regard would k0s recognize this?

Physically they aren't the same, for sure, but for the kernel both are allocatable in the same way. Kubernetes apparently only reads physical memory. Since newer Kubernetes versions allow swap usage, their readings should take swap into account too when NodeSwap is enabled, for example. But on second thought, I agree it's not a k0s issue, but an upstream one.
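
The mismatch is easy to see by putting the kernel's view next to what the node reports (just the obvious checks, nothing k0s-specific):

root@pi0:~# free -h                                                                 # kernel view: physical RAM plus swap
root@pi0:~# k0s kubectl get node pi0.fqdn -o jsonpath='{.status.capacity.memory}'   # kubelet view: physical RAM only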

Linux kernel settings that control how swap is being used

Done

Kubelet's eviction settings

This one I hadn't found before. I've changed my parameters to:

/usr/local/bin/k0s controller --kubelet-extra-args=--feature-gates=NodeSwap=true\x20--fail-swap-on=false\x20--eviction-hard=memory.available<100Mi\x20--system-reserved=memory=200Mi
/usr/local/bin/k0s worker --token-file=/etc/k0s/worker-token --kubelet-extra-args=--feature-gates=NodeSwap=true\x20--fail-swap-on=false\x20--eviction-hard=memory.available<100Mi\x20--system-reserved=memory=200Mi

And in fact swap started to be used more. I think the eviction settings are more related to low physical RAM (since Kubernetes does not look at physical + swap) than to the swap enablement itself.
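
A quick (rough) way to confirm that pod processes actually end up in swap is to read VmSwap from /proc; the metallb controller is just an example selector here:

root@pi1:~# for pid in $(pgrep -f metallb); do printf '%s: ' "$pid"; grep VmSwap /proc/$pid/status; done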

The container memory requests/limits, qos classes and priority classes

After adjusting the eviction thresholds, services started to come up.

Kernel and userland OOM killers

Already checked

The culprit here is that 1 GiB is not much memory in the first place to run a Kubernetes control plane plus workloads.

I agree. I've removed --enable-worker from that Pi.

I assume you don't plan to run an HA control plane, so you might want to use kine instead of etcd.

This is a two-sided coin: this cluster specifically exists to get something running on Rpi3 boards (running + something reasonable, just to get the scale right) and to experiment with what works and what doesn't, so I tend to keep etcd to test HA at some point. But I also agree that using kine plus some lightweight backend is the best choice.

For now, a setup with MetalLB + Jiva behaves like this:

(screenshots attached for Master, Worker 1, and Worker 2)

So I kept this record and I agree with you, but for science, and for now, I'll keep etcd.

Then you probably need to tell the kernel to swap more aggressively

I've applied the following to the entire cluster:

root@pi2:~# cat /etc/sysctl.d/99-vm-zram-parameters.conf
vm.swappiness = 180
vm.watermark_boost_factor = 0
vm.watermark_scale_factor = 125
vm.page-cluster = 0
root@pi2:~# cat /etc/systemd/zram-generator.conf | grep -v ^#
[zram0]
host-memory-limit = none
zram-size = ram * 0.8
root@pi2:~#
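
To double-check that the zram device actually comes up with the expected size and priority, the usual util-linux tools are enough:

root@pi2:~# zramctl
root@pi2:~# swapon --show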

and the kubelet that it should be even more conservative about its eviction thresholds.

Do you consider the hard memory eviction configuration I did to be enough?

Also, you might want to limit the workloads that are run on the controller by re-enabling the master taints (remove --no-taints)

Did it, and in fact things got better. Also, keeping the master alone is a more realistic scenario, and if more capacity is needed, Rpi3 boards are cheap today :)

Also have a look at kubernetes/kubernetes#120800. There's some discussions around this, including some examples in how to test this.

I agree with @iholder101, and I'll try to replicate it at some point.

Summing up: this is in fact an upstream question. Since there is a way to configure it in k0s, would it be possible to document it at https://docs.k0sproject.io/? If you don't mind, I can also draft this documentation (as a dedicated page for swap configuration, as notes for rpi3 and low-RAM setups in the Raspberry Pi section, or on another page you think fits better).

twz123 (Member) commented Jan 4, 2024

Do you consider the hard memory eviction configuration I did to be enough?

I'm not an expert on this. I think it's up to you to experiment with these settings until you find something that works reasonably well.
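
One direction to experiment with (an untested sketch; the thresholds below are made up): soft eviction thresholds with grace periods give the kernel some time to swap things out before the kubelet starts evicting, e.g. as additional kubelet args:

--eviction-soft=memory.available<150Mi
--eviction-soft-grace-period=memory.available=2m
--eviction-hard=memory.available<50Mi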

Summing up: this is in fact an upstream question. Since there is a way to configure it in k0s, would it be possible to document it at https://docs.k0sproject.io/? If you don't mind, I can also draft this documentation (as a dedicated page for swap configuration, as notes for rpi3 and low-RAM setups in the Raspberry Pi section, or on another page you think fits better).

That would be awesome, of course. It's always a great help to write down things like that.

kannon92 commented Jan 9, 2024

Trying to follow this, so once you edited your evictionHard settings, you were no longer seeing any problems?

github-actions bot commented Feb 8, 2024

The issue is marked as stale since no activity has been recorded in 30 days

github-actions bot added the Stale label Feb 8, 2024
twz123 (Member) commented Feb 9, 2024

@leleobhz I'm closing this for now. If you ever find the time and inclination to write up a little tutorial with your findings, that would be splendid, of course.

twz123 closed this as completed Feb 9, 2024
leleobhz (Author) commented Feb 9, 2024

@leleobhz I'm closing this for now. If you ever find the time and inclination to write up a little tutorial with your findings, that would be splendid, of course.

I'm still testing all the configurations (and I even switched away from DietPi because of some networking issues). I'll keep in touch and post here if I make any progress. Thanks!
