docs: add troubleshooting note on VM_MEMORY_OVERHEAD_PERCENT

aws · Aug 20, 2024 · 3523315
1 parent e284774
commit 3523315
Showing 1 changed file with 44 additions and 0 deletions.
diff --git a/website/content/en/docs/troubleshooting.md b/website/content/en/docs/troubleshooting.md
@@ -325,6 +325,50 @@ then the following solution(s) may resolve your issue.
 ...
 ```
 
+### Karpenter incorrectly computes available resources for a node
+
+When creating nodes, the allocatable resources that Karpenter computes (as seen in logs and `nodeClaim.status.allocatable`) do not always match the allocatable resources on the created node (`node.status.allocatable`).
+Karpenter uses the results from `ec2:DescribeInstanceTypes` to determine the resources available on a node launched with a given instance type.
+The following computation is used to determine allocatable CPU, memory, and ephemeral storage based on the results returned from `ec2:DescribeInstanceTypes`.
+
+```
+nodeClaim.allocatable.cpu = instance.cpu - kubeReserved.cpu - systemReserved.cpu
+nodeClaim.allocatable.memory = (instance.memory * (1.0 - VM_MEMORY_OVERHEAD_PERCENT)) - kubeReserved.memory - systemReserved.memory - max(evictionSoft.memory.available, evictionHard.memory.available)
+nodeClaim.allocatable.ephemeralStorage = instance.storage - kubeReserved.ephemeralStorage - systemReserved.ephemeralStorage - max(evictionSoft.nodefs.available, evictionHard.nodefs.available)
+```
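+
+As a rough worked example of the memory formula, assume purely hypothetical inputs (not taken from any real instance type or AMI): `instance.memory` of 16384Mi, the default `VM_MEMORY_OVERHEAD_PERCENT` of `7.5%` (`0.075`), `kubeReserved.memory` of 512Mi, `systemReserved.memory` of 128Mi, `evictionHard.memory.available` of 100Mi, and no soft eviction threshold.
+
+```
+# all input values below are hypothetical, in Mi
+nodeClaim.allocatable.memory = (16384 * (1.0 - 0.075)) - 512 - 128 - max(0, 100)
+                             = 15155.2 - 512 - 128 - 100
+                             = 14415.2
+```
+
+Under these assumptions, a created node that reports less than this amount in `node.status.allocatable` means Karpenter overestimated, and a node that reports more means Karpenter underestimated.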
+
+Most of these factors directly model user configuration (i.e., the `KubeletConfiguration` options).
+On the other hand, `VM_MEMORY_OVERHEAD_PERCENT` models an implicit reduction of available memory that varies by instance type and AMI.
+Karpenter can't compute the exact value being modeled, so `VM_MEMORY_OVERHEAD_PERCENT` is a [global setting]({{< ref="../reference/settings.md" >}}) used across all instance type and AMI combinations.
+The default value (`7.5%`) has been tuned to closely match reality for the majority of instance types while not overestimating.
+As a result, Karpenter will typically underestimate the memory available on a node for a given instance type.
+If you know the real `VM_MEMORY_OVERHEAD_PERCENT` for the specific instances you're provisioning in your cluster, you can tune this value to tighten the bound.
+However, this should be done with caution.
+A `VM_MEMORY_OVERHEAD_PERCENT` value that is too low causes Karpenter to overestimate the memory available on a node, which can result in Karpenter launching nodes that are too small for your workload.
+In the worst case, this can result in an instance launch loop and your workload remaining unschedulable indefinitely.
+
+To detect cases where Karpenter overestimates resource availability, monitor the following status condition:
+
+```bash
+$ kubectl get nodeclaim default-chcj2 -o jsonpath='{.status.conditions[?(@.type=="ConsistentStateFound")]}'
+```
+
+```json
+{
+    "type": "ConsistentStateFound",
+    "status": "False",
+    "reason": "ConsistencyCheckFailed",
+    "message": "Consistency Check Failed",
+    "lastTransitionTime": "2024-08-19T20:02:16Z"
+}
+```
+
+This can be spot-checked as shown above, or monitored via the following metric:
+
+```
+operator_status_condition_count{type="ConsistentStateFound",kind="NodeClaim",status="False"}
+```
+
 ## Deprovisioning
 
 ### Nodes not deprovisioned