
KEP-2400: Node system swap support

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Kubernetes currently does not support the use of swap memory on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes’ earlier design, swap support was considered out of scope.

However, there are a number of use cases that would benefit from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.

Motivation

There are two distinct types of users for swap, and they may overlap:

  • node administrators, who may want swap available for node-level performance tuning, stability, and reducing noisy-neighbour issues
  • application developers, who have written applications that would benefit from using swap memory

There are hence a number of possible ways that one could envision swap use on a node.

Scenarios

  1. Swap is enabled on a node's host system, but the kubelet does not permit Kubernetes workloads to use swap. (This scenario is a prerequisite for the following use cases.)
  2. Swap is enabled at the node level. The kubelet can permit Kubernetes workloads scheduled on the node to use some quantity of swap, depending on the configuration.
  3. Swap is set on a per-workload basis. The kubelet sets swap limits for each individual workload.

This KEP will be limited in scope to the first two scenarios. The third can be addressed in a follow-up KEP. The enablement work that is in scope for this KEP will be necessary to implement the third scenario.

Goals

  • On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
  • Configuration is available for kubelet to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
  • Cluster administrators can enable and configure kubelet swap utilization on a per-node basis.
  • Use of swap memory for cgroupsv2.

Non-Goals

  • Addressing non-Linux operating systems. Swap support will only be available for Linux.
  • Provisioning swap. Swap must already be available on the system.
  • Setting swappiness. This can already be set on a system-wide level outside of Kubernetes.
  • Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence, swap will be an overcommitted resource in the context of this KEP.
  • Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.
  • Use of swap for cgroupsv1.

Proposal

We propose that, when swap is provisioned and available on a node, cluster administrators can configure the kubelet such that:

  • It can start with swap on.
  • It will direct the CRI to allocate Kubernetes workloads 0 swap by default.
  • It will have configuration options to configure swap utilization for the entire node.

This proposal enables scenarios 1 and 2 above, but not 3.

Enable Swap Support only for Burstable QoS Pods

Before enabling swap support through the pod API, it is crucial to build confidence in this feature by carefully assessing its impact on workloads and Kubernetes. As an initial step, we propose enabling swap support for Burstable QoS Pods by automatically calculating the appropriate swap values, rather than allowing users to input these values manually.

Swap access is granted only to pods of Burstable QoS. Guaranteed QoS pods are usually higher-priority pods, so we want to spare them swap's performance penalty. Best-Effort pods, by contrast, are low-priority pods that are the first to be killed under node pressure. In addition, they're unpredictable, so it's hard to assess what amount of swap memory is reasonable to allocate to them.

By doing so, we can ensure a thorough understanding of the feature's performance and stability before considering the manual input of swap values in a subsequent beta release. This cautious approach will ensure the efficient allocation of resources and the smooth integration of swap support into Kubernetes.

Each container's swap limit is allocated in proportion to its requested memory, scaled by the total swap memory available on the node (see Steps to Calculate Swap Limit below).

Set Aside Swap for System Critical Daemons

Note: In Beta 2, we found that allowing system critical daemons to swap memory could cause degradation of services.

System critical daemons (such as Kubelet) are essential for node health. Usually, an appropriate portion of system resources (e.g., memory, CPU) is reserved as system reserved. However, swap doesn't inherently support reserving a portion out of the total available. For instance, in the case of memory, we set memory.min on the node-level cgroup to ensure an adequate amount of memory is set aside, away from the pods, and for system critical daemons. But there is no equivalent for swap; i.e., no memory.swap.min is supported in the kernel.

Since this proposal advocates enabling swap only for the Burstable QoS pods, this can be done by setting memory.swap.max on the cgroups used by the Burstable QoS pods. The value of this memory.swap.max can be calculated by:

memory.swap.max = total swap memory available on the system - system reserve (memory)

This is the total amount of swap available to all the Burstable QoS pods; let's call it TotalPodsSwapAvailable. This ensures that system critical daemons have access to swap at least equal to the system reserved memory, which indirectly acts as swap support in system reserved.
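
As a minimal sketch of this arithmetic (the function and parameter names are illustrative, not actual kubelet symbols):

// totalPodsSwapAvailable returns the value to write to memory.swap.max on
// the Burstable QoS cgroup, following the formula above. Illustrative only.
func totalPodsSwapAvailable(totalSwapBytes, systemReservedMemoryBytes int64) int64 {
	available := totalSwapBytes - systemReservedMemoryBytes
	if available < 0 {
		return 0 // the reserve exceeds total swap; leave no swap for pods
	}
	return available
}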

Best Practices

This section is a recommendation for how to set up your nodes with swap if using this feature.

Disable swap for system critical daemons

While testing this feature, we found degradation of services when system critical daemons were allowed to swap. This can manifest as the kubelet performing slower than normal, so if you experience this, we recommend configuring the cgroup for the system slice to avoid swap (i.e., setting memory.swap.max to 0). Even with this in place, we found that workloads can still impact critical daemons.

Protect system critical daemons for iolatency

Even after disabling swap for the system slice, we saw cases where system.slice was still impacted by workloads swapping. Workloads need lower I/O priority than the system slice; we found that setting io.latency for system.slice fixes these issues.

See io-control for more details.
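
A minimal sketch of both recommendations follows. It assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup; the device major:minor pair and the latency target in io.latency are placeholder values you would replace with those of the disk backing swap.

package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	slice := "/sys/fs/cgroup/system.slice"

	// Keep system critical daemons out of swap entirely.
	if err := os.WriteFile(filepath.Join(slice, "memory.swap.max"), []byte("0"), 0o644); err != nil {
		log.Fatal(err)
	}

	// Give system.slice I/O priority over swapping workloads on the device
	// backing swap ("8:0" and the 50ms target are example values).
	if err := os.WriteFile(filepath.Join(slice, "io.latency"), []byte("8:0 target=50"), 0o644); err != nil {
		log.Fatal(err)
	}
}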

Control Plane Swap

We recommend enabling swap only on worker nodes. The control plane runs mostly Guaranteed QoS Pods, so swap would largely go unused there. The main concern is that critical control plane services could be swapped out, causing a performance impact.

Use of a dedicated disk for swap

We recommend using a separate, encrypted disk for your swap partition. If swap is on a shared partition or the root filesystem, workloads can interfere with system processes that need to write to disk. If they occupy the same disk, processes can overwhelm swap and disrupt the I/O of the kubelet, container runtime, and systemd, which would affect other workloads. See Protect system critical daemons for iolatency above for more details. Since swap space lives on disk, it is imperative to make sure your disk is fast enough for your use cases.

Swap as the default

We will turn the feature on in Beta 2, but the default setting will be NoSwap.

Enabling swap on nodes is an advanced feature that requires tuning and knowledge of the kernel. We do not recommend swap on all nodes, so we still suggest the default --fail-swap-on=true for most Kubernetes deployments.

If there is interest in trying out this feature, we suggest provisioning swap space on the worker node, setting --fail-swap-on=false, and restarting the kubelet.

Steps to Calculate Swap Limit

  1. Calculate the container's memory proportion relative to the node's memory:
  • Divide the container's memory request by the node's total physical memory. Let's call this value ContainerMemoryProportion.
  • If a container is defined with memory requests == memory limits, its ContainerMemoryProportion is defined as 0. Therefore, as can be seen below, its overall swap limit is also 0.
  2. Multiply the container's memory proportion by the swap memory available for Pods:
  • Meaning: ContainerMemoryProportion * TotalPodsSwapAvailable.

Example

Suppose we have a Burstable QoS pod with two containers:

  • Container A: Memory request 20 GB
  • Container B: Memory request 10 GB

Let's assume the total physical memory is 40 GB and the total swap memory available is also 40 GB. Also assume that system reserved memory is configured at 2 GB, so TotalPodsSwapAvailable = 40 GB - 2 GB = 38 GB.

Step 1: Determine each container's memory proportion:

  • Container A: 20G/40G = 0.5.
  • Container B: 10G/40G = 0.25.

Step 2: Determine the swap limit for each container:

  • Container A: ContainerMemoryProportion * TotalPodsSwapAvailable = 0.5 * 38G = 19G.
  • Container B: ContainerMemoryProportion * TotalPodsSwapAvailable = 0.25 * 38G = 9.5G.

In this example, Container A would have a swap limit of 19 GB, and Container B would have a swap limit of 9.5 GB.

This approach allocates swap limits based on each container's memory request and adjusts the proportion based on the total swap memory available in the system. It ensures that each container gets a fair share of the swap space and helps maintain resource allocation efficiency.
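
The calculation can be sketched in Go as follows; the identifiers are illustrative, not the kubelet's actual symbols, and the printed values reproduce the worked example above.

package main

import "fmt"

// containerSwapLimit computes the per-container swap limit for Burstable
// QoS pods as described above. Illustrative sketch only.
func containerSwapLimit(memoryRequest, memoryLimit, nodeMemory, totalPodsSwapAvailable int64) int64 {
	// Containers with memory request == memory limit get no swap.
	if memoryLimit > 0 && memoryRequest == memoryLimit {
		return 0
	}
	// ContainerMemoryProportion * TotalPodsSwapAvailable.
	return int64(float64(memoryRequest) / float64(nodeMemory) * float64(totalPodsSwapAvailable))
}

func main() {
	const gi = int64(1) << 30
	nodeMemory := 40 * gi
	totalPodsSwapAvailable := 40*gi - 2*gi // total swap minus system reserve = 38 GiB

	fmt.Println(containerSwapLimit(20*gi, 0, nodeMemory, totalPodsSwapAvailable)) // container A: 19 GiB
	fmt.Println(containerSwapLimit(10*gi, 0, nodeMemory, totalPodsSwapAvailable)) // container B: 9.5 GiB
}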

User Stories

Improved Node Stability

Improved memory management tooling built on cgroups v2, such as oomd, strongly recommends the use of swap. Hence, having a small amount of swap available on nodes could improve resource pressure handling and recovery.

This user story is addressed by scenario 1 and 2, and could benefit from 3.

Long-running applications that swap out startup memory

This user story is addressed by scenario 2, and could benefit from 3.

Memory Flexibility

This user story addresses cases in which the cost of additional memory is prohibitive, or elastic scaling is impossible (e.g., on-premise/bare-metal deployments).

This user story is addressed by scenario 2, and could benefit from 3.

Local development and systems with fast storage

Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters).

This user story is addressed by scenarios 1 and 2, and could benefit from 3.

Low footprint systems

For example, edge devices with limited memory.

This user story is addressed by scenario 2, and could benefit from 3.

Virtualization management overhead

This would apply to virtualized Kubernetes workloads such as VMs launched by kubevirt.

Every VM comes with management-related overhead which can sporadically be quite significant (memory streaming, SR-IOV attachment, GPU attachment, virtio-fs, …). Swap means such workloads need not request much more memory than they typically use just to cover short-term worst-case scenarios.

With virtualization, clusters are typically provisioned based on the workloads’ memory consumption, and any infrastructure container overhead is overcommitted. This overhead could be safely swapped out.

This user story is addressed by scenario 2, and could benefit from 3.

Notes/Constraints/Caveats (Optional)

In updating the CRI, we must ensure that container runtime downstreams are able to support the new configurations.

We considered adding parameters for both per-workload memory-swap and swappiness. These are documented as part of the Open Containers runtime specification for Linux memory configuration. Since memory-swap is a per-workload parameter, and swappiness is optional and can be set globally, we choose to expose only memory-swap, which adjusts the swap available to workloads.

Since we do not currently set memory-swap in the CRI, the present default behaviour when --fail-swap-on=false is set is to allocate a workload the same amount of swap as the memory it requests. We will update the default to not permit the use of swap, by setting memory-swap equal to the memory limit.

Risks and Mitigations

Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.

This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.

Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios.

Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.

Existing use cases of Swap

While beta 2 was being worked on, we discovered use cases where --fail-swap-on=false is used but Kubernetes is not utilizing swap. Kind e2e tests run the kubelet with --fail-swap-on=false, and the default developer configuration for hack/local-up-cluster allows running developer clusters with swap enabled.

We need to support --fail-swap-on=false on both cgroup v1 and cgroup v2, but KEP-2400 itself will not be supported on cgroup v1. So before this feature can go GA, we need a way to prevent workloads from using swap while keeping the feature gate on. To address this, we propose a new MemorySwap option called NoSwap, which disables swap usage on the node while keeping the feature active.

This addresses existing uses of --fail-swap-on=false on cgroup v1 while still allowing us to turn this feature on.

Exhausting swap resource

Previous releases of this feature offered an UnlimitedSwap option for workloads. This can cause problems: workloads can use up all the swap on a node, which can make the node go unhealthy. To avoid exhausting swap on a node, UnlimitedSwap was dropped from the API in beta 2.

Security risk

Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the potential security implications and can take appropriate steps to safeguard their system.

To guarantee that system daemons are not swapped, the kubelet must configure the memory.swap.max setting to 0 within the system reserved cgroup. Moreover, to make sure that Burstable pods are able to utilize swap space, the kubelet should verify that the cgroup associated with Burstable pods is not nested under the cgroup designated for system reserved.

Additionally, in beta 1 end users may disable swap completely for a Pod or a container by making the Pod Guaranteed, or by setting request == limit for a container. This way, swap is not enabled for the corresponding containers and there is no information exposure risk.

Cgroupv1 support

In the early releases of this feature, there was a goal to support cgroup v1. As the feature progressed, SIG Node realized that supporting swap with cgroup v1 would be very difficult, so this feature is limited to cgroup v2 only. Since the main goal is to eventually deprecate cgroup v1, this should not be a major inconvenience.

Design Details

We summarize the implementation plan as follows:

  1. Add a feature gate NodeSwap to enable swap support.
  2. Leave the default value of the kubelet flag --fail-swap-on as true, to avoid changing default behaviour.
  3. Introduce a new kubelet config parameter, MemorySwap, which configures how much swap Kubernetes workloads can use on the node.
  4. Introduce a new CRI parameter, memory_swap_limit_in_bytes.
  5. Ensure container runtimes are updated so they can make use of the new CRI configuration.
  6. Based on the behaviour set in the kubelet config, the kubelet will instruct the CRI on the amount of swap to allocate to each container. The container runtime will then write the swap settings to the container level cgroup.
  7. Add node stats to report swap usage.

Enabling swap as an end user

Swap can be enabled as follows:

  1. Provision swap on the target worker nodes,
  2. Enable the NodeSwap feature flag on the kubelet,
  3. Set the --fail-swap-on flag to false, and
  4. (Optional) Allow Kubernetes workloads to use swap by setting MemorySwap.SwapBehavior to LimitedSwap in the kubelet config.
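
For reference, the resulting kubelet configuration fragment might look like the following, serialized with the kubelet.config.k8s.io/v1beta1 KubeletConfiguration API (a sketch for illustration, not a recommendation to enable swap):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: LimitedSwap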

API Changes

KubeConfig addition

We will add an optional MemorySwap value to the KubeletConfig struct in pkg/kubelet/apis/config/types.go as follows:

// KubeletConfiguration contains the configuration for the Kubelet
type KubeletConfiguration struct {
	metav1.TypeMeta
...
	// Configure swap memory available to container workloads.
	// +featureGate=NodeSwap
	// +optional
	MemorySwap MemorySwapConfiguration
}

type MemorySwapConfiguration struct {
	// Configure swap memory available to container workloads. May be one of:
	// "", "NoSwap": workloads will not use swap
	// "LimitedSwap": workload combined memory and swap usage cannot exceed pod memory limit
	SwapBehavior string
}

We want to expose common swap configurations based on Docker's --memory-swap flag and the Open Containers runtime specification. Thus, the MemorySwapConfiguration.SwapBehavior setting will have the following effects:

  • If SwapBehavior is set to "LimitedSwap", containers do not have access to swap beyond their memory limit. This value prevents a container from using swap in excess of their memory limit, even if it is enabled on a system.
    • With cgroups v2, swap is configured independently from memory. Thus, the container runtimes can set memory.swap.max to 0 in this case, and no swap usage will be permitted.
  • If SwapBehavior is set to "" or "NoSwap", no workloads will utilize swap.

CRI Changes

The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes. We will introduce a parameter memory_swap_limit_in_bytes to the CRI API (found in k8s.io/cri-api/pkg/apis/runtime/v1/api.proto):

// LinuxContainerResources specifies Linux specific configuration for
// resources.
message LinuxContainerResources {
...
    // Memory + swap limit in bytes. Default: 0 (not specified).
    int64 memory_swap_limit_in_bytes = 9;
...
}

Swap Metrics

We added metrics to the summary stats for the Node to report SwapAvailableBytes and SwapUsageBytes.

type NodeStats struct {
	...
	// Stats pertaining to swap resources. This is reported to non-Windows systems only.
	// +optional
	Swap *SwapStats `json:"swap,omitempty"`
}

// SwapStats contains data about swap usage.
type SwapStats struct {
	// The time at which these stats were updated.
	Time metav1.Time `json:"time"`
	// Available swap memory for use. This is defined as <swap-limit> - <current-swap-usage>.
	// If the swap limit is undefined, this value is omitted.
	// +optional
	SwapAvailableBytes *uint64 `json:"swapAvailableBytes,omitempty"`
	// Total swap memory in use.
	// +optional
	SwapUsageBytes *uint64 `json:"swapUsageBytes,omitempty"`
}
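
As an illustrative sketch of where these numbers come from on Linux (the kubelet actually sources swap stats from its stats providers, not from code like this), SwapTotal and SwapFree in /proc/meminfo can be combined into the two fields above:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readSwapKiB scans /proc/meminfo for a field such as "SwapTotal" or
// "SwapFree" and returns its value in KiB.
func readSwapKiB(field string) (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		parts := strings.Fields(s.Text())
		if len(parts) >= 2 && parts[0] == field+":" {
			return strconv.ParseUint(parts[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("%s not found in /proc/meminfo", field)
}

func main() {
	total, _ := readSwapKiB("SwapTotal") // errors elided for brevity
	free, _ := readSwapKiB("SwapFree")
	fmt.Printf("swapUsageBytes=%d swapAvailableBytes=%d\n", (total-free)*1024, free*1024)
}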

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

All existing tests need to pass with and without swap enabled.

Unit tests

This KEP introduces minor additions of memory swap controlling configuration parameters.

  • Kubelet configuration parameters are tested in the package k8s.io/kubernetes/pkg/kubelet/apis/config/validation
  • Passing parameters to runtime is tested in k8s.io/kubernetes/pkg/kubelet/kuberuntime

Both packages have near-100% coverage, and the new functionality is covered.

In alpha2, tests will be extended in these packages to support kube-reserved swap settings.

Integration tests

N/A.

These scenarios require e2e test setup, so we did not add any integration tests for this feature.

e2e tests

For alpha:

  • Swap scenarios are enabled in test-infra for at least two Linux distributions. e2e suites will be run against them.
    • Container runtimes must be bumped in CI to use the new CRI.
  • Data should be gathered from a number of use cases to guide beta graduation and further development efforts.
    • Focus should be on supported user stories as listed above.

Test grid tabs enabled:

No new e2e tests introduced.

For alpha2:

  • Add e2e tests that exercise all available swap configurations via the CRI.
  • Verify MemoryPressure behavior with swap enabled and document any changes for configuring eviction.
  • Verify new system-reserved settings for swap memory.

For beta 1:

  • Add e2e tests that verify pod-level control of swap utilization.
  • Add e2e tests that verify swap performance with pods using a tmpfs.

Graduation Criteria

Alpha

  • Kubelet can be started with swap enabled and will support two configurations for Kubernetes workloads: LimitedSwap and NoSwap.
  • Kubelet can configure CRI to allocate swap to Kubernetes workloads. By default, workloads will not be allocated any swap.
  • e2e test jobs are configured for Linux systems with swap enabled.

Alpha2

In alpha2 the focus will be on making sure that the feature can be used in a subset of production scenarios to collect more feedback before entering beta. Specifically, security and test coverage will be increased, and a new setting that splits swap between the kubelet and workloads will be introduced.

Once the functionality is settled during alpha, beta will focus on performance and feedback across a wider range of scenarios.

This will allow feedback to be collected reasonably safely from the following scenarios:

  • On cgroup v2: allow host system processes to use swap to increase system reliability under memory pressure.
  • Enable swap for the workload in "single large pod per node" scenarios.

Here are specific improvements to be made:

  • Address swap impact on memory-backed volumes: kubernetes/kubernetes#105978.
  • Investigate swap security when enabling on system processes on the node.
  • Improve coverage for appropriate scenarios in testgrid.
  • Add the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
  • Consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.
  • Investigate eviction behavior with swap enabled.

Beta 1

  • Enable Swap Support using Burstable QoS Pods only.
  • Enable Swap Support for Cgroup v2 Only.
  • Add swap memory to the Kubelet stats api.
  • Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled.
  • Make sure node e2e jobs that use swap are healthy.
  • Improve coverage for appropriate scenarios in testgrid.

Beta 2

  • Publish a Kubernetes doc page encouraging users to use encrypted swap if they wish to enable this feature.
  • Add swap-specific tests, such as handling swap usage across container restart boundaries for writes to tmpfs (which may require pod-level cgroup changes beyond what the container runtime does at the container cgroup boundary).
  • Fix flaking/failing swap node e2e jobs.
  • Address eviction related issue in swap implementation.
  • Add NoSwap as the default setting.
  • Remove UnlimitedSwap as a supported option.
  • Add an e2e test confirming that NoSwap does not actually swap.
  • Add an e2e test confirming that swap is used with LimitedSwap.
  • Document best practices for setting up Kubernetes with swap.

GA

(Tentative.)

  • Test a wide variety of scenarios that may be affected by swap support.
  • Remove feature flag.
  • Remove the restriction of swap support to Burstable QoS Pods only, deprecated in Beta 2.

Upgrade / Downgrade Strategy

No changes are required on upgrade to maintain previous behaviour.

It is possible to downgrade a kubelet on a node that was using swap, but this would require disabling the use of swap and running swapoff on the node.

Version Skew Strategy

Feature flag will apply to kubelet only, so version skew strategy is N/A.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: NodeSwap
    • Components depending on the feature gate: API Server, Kubelet
  • Other
    • Describe the mechanism: --fail-swap-on=false flag for kubelet must also be set at kubelet start
    • Will enabling / disabling the feature require downtime of the control plane? Yes. The flag must be set when the kubelet starts; to disable the feature, the kubelet must be restarted. Hence, there would be a brief downtime of the kubelet on a given node.
    • Will enabling / disabling the feature require downtime or reprovisioning of a node? Yes. See above; disabling would require brief node downtime.

Does enabling the feature change any default behavior?

No. If the feature flag is enabled, the user must still set --fail-swap-on=false to adjust the default behaviour.

A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

To turn this off, the kubelet would need to be restarted. If a cluster admin wants to disable swap on the node without repartitioning the node, they could stop the kubelet, set swapoff on the node, and restart the kubelet with --fail-swap-on=true. The setting of the feature flag will be ignored in this case.

In Beta 2, we realized that we cannot rely on --fail-swap-on=false as the switch for this feature: the flag predates this feature and has long been in use. We therefore propose a MemorySwap configuration called NoSwap, which lets users disable swap for all workloads without having to disable swap on the node itself.

In Beta releases of this feature, one can turn off the NodeSwap feature gate; once this feature is GA, users can instead use the NoSwap option to disable swap for workloads.

What happens if we reenable the feature if it was previously rolled back?

N/A

Are there any tests for feature enablement/disablement?

N/A. This should be tested separately for scenarios with the flag enabled and disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout fail? Can it impact already running workloads?

If a new node with swap memory fails to come online, it will not impact any running components.

It is possible that if a cluster administrator adds swap memory to an already running node, and then performs an in-place upgrade, the new kubelet could fail to start unless the configuration was modified to tolerate swap. However, we would expect that if a cluster admin is adding swap to the node, they will also update the kubelet's configuration to not fail with swap present.

Generally, it is considered best practice to add a swap memory partition at node image/boot time and not provision it dynamically after a kubelet is already running and reporting Ready on a node.

What specific metrics should inform a rollback?

Workload churn or performance degradations on nodes. The metrics will be application/use-case specific, but we can provide some suggestions, based on the stability metrics identified earlier.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A because swap support lacks a runtime upgrade/downgrade path; kubelet must be restarted with or without swap support.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can someone using this feature know that it is working for their instance?

See #swap-metrics

  1. Kubelet stats API will be extended to show swap usage details.

How can an operator determine if the feature is in use by workloads?

The KubeletConfiguration will have failOnSwap: false set.

The Prometheus node_exporter will also export stats on swap memory utilization.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

TBD. We will determine a set of metrics as a requirement for beta graduation. We will need more production data; there is not a single metric or set of metrics that can be used to generally quantify node performance.

This section to be updated before the feature can be marked as graduated, and to be worked on during 1.23 development.

We will also add swap memory utilization to the Kubelet stats API, to provide a means of monitoring this beyond cadvisor Prometheus stats.

What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

N/A

Are there any missing metrics that would be useful to have to improve observability of this feature?

We added metrics to the node stats to report how much swap is used and the capacity of swap.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

The KubeletConfig API object may slightly increase in size due to new config fields.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes, enabling swap can affect the performance of other critical daemons on the system. Swap memory gets utilized when the system runs out of physical RAM. Hence, to maintain the SLIs/SLOs of critical daemons on the node, we highly recommend disabling swap for the system.slice and reserving adequate system reserved memory.

The SLI that could potentially be impacted is pod startup latency. If the container runtime or kubelet are performing slower than expected, pod startup latency would be impacted. In addition to this SLI, general areas around pod lifecycle (image pulls, sandbox creation, storage) could become slow.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This is expected, as this enhancement is enabling cluster administrators to access this resource.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change. Feature is specific to individual nodes.

What are other known failure modes?

Individual nodes with swap memory enabled may experience performance degradations under load. This could potentially cause a cascading failure on nodes without swap: if nodes with swap fail Ready checks, workloads may be rescheduled en masse.

Thus, cluster administrators should be careful when enabling swap. To minimize disruption, you may want to taint swap-enabled nodes to protect against this problem. Taints ensure that only workloads which explicitly tolerate swap are scheduled onto swap-enabled nodes.
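
For example, a swap-enabled node could be tainted as follows (the swap taint key is illustrative, not a key defined by Kubernetes):

kubectl taint nodes <node-name> swap=enabled:NoSchedule

Workloads intended for swap-enabled nodes would then carry a matching toleration.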

What steps should be taken if SLOs are not being met to determine the problem?

It is suggested that if nodes with swap memory enabled cause performance or stability degradations, those nodes are cordoned, drained, and replaced with nodes that do not use swap memory.

Implementation History

  • 2015-04-24: Discussed in #7294.
  • 2017-10-06: Discussed in #53533.
  • 2021-01-05: Initial design discussion document for swap support and use cases.
  • 2021-04-05: Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400).
  • 2021-08-09: New in Kubernetes v1.22: alpha support for using swap memory: https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/.
  • 2023-04-17: KEP update for beta1 #3957.
  • 2023-08-15: Beta 1 released in Kubernetes 1.28.
  • 2024-01-12: Updates to Beta2 KEP.

Drawbacks

When swap is enabled, particularly for workloads, the kubelet’s resource accounting may become much less accurate. This may make cluster administration more difficult and less predictable.

Currently, there exists an unsupported workaround, which is setting the kubelet flag --fail-swap-on to false.

Alternatives

Just set --fail-swap-on=false

This is insufficient for most use cases because control over how swap is used is inconsistent across container runtimes: dockershim currently sets the swap available to workloads to 0, whereas the CRI does not restrict it at all. This inconsistency makes it difficult or impossible to use swap in production, particularly for users who want to restrict workloads from using swap when using the CRI rather than dockershim.

This is also a breaking change: users have relied on --fail-swap-on=false to allow Kubernetes to run on swap-enabled nodes.

Restrict swap usage at the cgroup level

Setting a swap limit at the cgroup level would allow us to restrict the usage of swap on a pod-level, rather than container-level basis.

For alpha, we are opting for the container-level basis to simplify the implementation (as the container runtimes already support configuration of swap with the memory-swap-limit parameter). This will also provide the necessary plumbing for container-level accounting of swap, if that is proposed in the future.

In beta, we may want to revisit this.

See the Pod Resource Management design proposal for more background on the cgroup limits the kubelet currently sets based on each QoS class.

Infrastructure Needed (Optional)

We may need Linux VM images built with swap partitions for e2e testing in CI.