panic: runtime error: invalid memory address or nil pointer dereference #7184
Comments
Hi, thanks for reporting this issue. Just to clarify: not all Karpenter deployments with node pools using 'WhenEmptyOrUnderutilized' were crashlooping, just the ones with a high number of nodes in the cluster? Are there any logs from just before the panic?
@rschalo attaching the logs. As per the events and logs, this issue happens when Karpenter tries to disrupt nodes and bring up new ones.
The restarts stopped after we added a base limit (nodes: "1", alongside the existing configuration) to the disruption budget. However, we noticed that errors start from this function: karpenter github. We assume this happens when we have 8 nodes, the budget is 20%, and a node that is a candidate for disruption is not ready for some reason. Any help is highly appreciated. We also tried patching Karpenter to 1.0.4.
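For reference, a minimal sketch of the budgets block this comment seems to describe, assuming the existing budget was the 20% mentioned above (the NodePool posted later in this issue lists only the static entry):

```yaml
# Hypothetical disruption budgets reflecting the workaround described above.
# The "20%" entry is assumed from the comment; the static "1" entry is the
# added base limit that reportedly stopped the restarts.
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
      - nodes: "20%"  # existing percentage-based budget (assumed)
      - nodes: "1"    # added static floor
```

When multiple budgets are active, Karpenter applies the most restrictive one, so the static entry caps voluntary disruptions at one node at a time.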
I observe the same issue in our EKS cluster. We have a lot of cronjobs that demand new nodes quite often; the average node count in the nodepool is about 15-20. The nodepool is configured with both a 20% budget and a static budget of "5". Running controller 1.0.4 with EKS 1.30.
EDIT 18.10.24: Additionally, not every container restart resolves the memory issue. Sometimes the container starts and immediately runs into the memory issue again; it takes several restarts until the container finally gets back to work.
Sometimes, before the reconciler error occurs, there is also a TLS handshake error, which then leads into the crash loop.
Hey, I saw the fix in kubernetes-sigs/karpenter#1763 and I'm looking forward to trying it out in our clusters. Any idea when this will be available?
Hi @christianfeurer, yes, you should be able to try the fix with this snapshot. It isn't recommended for production but should be sufficient to test whether it fully addresses your issue.
Hey @rschalo,
Sounds good - we don't have a date for 1.1 just yet. Thanks for your patience! Given the fix, marking this as closed. If this issue persists, please reopen or open a new issue.
Description
Observed Behavior:
``{"level":"INFO","time":"2024-10-02T13:07:51.585Z","logger":"controller","message":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","commit":"5bdf9c3","controller":"disruption","namespace":"","name":"","reconcileID":"d9e09bca-0703-4b38-a2c7-1cedadcf58a4"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x208 pc=0x230a017]
goroutine 504 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile.func1()
sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:111 +0x1e5
panic({0x277f360?, 0x4c9f9d0?})
runtime/panic.go:770 +0x132
sigs.k8s.io/karpenter/pkg/controllers/disruption.filterOutSameType(0xc016e72f08, {0xc0063d2fd8, 0x2, 0xc016b7f180?})
sigs.k8s.io/karpenter@v1.0.0/pkg/controllers/disruption/multinodeconsolidation.go:213 +0x5b7
sigs.k8s.io/karpenter/pkg/controllers/disruption.(MultiNodeConsolidation).firstNConsolidationOption(0xc000452150, {0x3476a98, 0xc006e8fdd0}, {0xc0063d2fd8, 0x3, 0x3}, 0x3)
sigs.k8s.io/karpenter@v1.0.0/pkg/controllers/disruption/multinodeconsolidation.go:147 +0x56f
sigs.k8s.io/karpenter/pkg/controllers/disruption.(MultiNodeConsolidation).ComputeCommand(0xc000452150, {0x3476a98, 0xc006e8fdd0}, 0xc0162b61b0, {0xc0041f0af0, 0x3, 0x9})
sigs.k8s.io/karpenter@v1.0.0/pkg/controllers/disruption/multinodeconsolidation.go:83 +0x430
sigs.k8s.io/karpenter/pkg/controllers/disruption.(Controller).disrupt(0xc000690100, {0x3476a98, 0xc006e8fdd0}, {0x3479920, 0xc000452150})
sigs.k8s.io/karpenter@v1.0.0/pkg/controllers/disruption/controller.go:167 +0x5e7
sigs.k8s.io/karpenter/pkg/controllers/disruption.(Controller).Reconcile(0xc000690100, {0x3476a98, 0xc006e8fda0})
sigs.k8s.io/karpenter@v1.0.0/pkg/controllers/disruption/controller.go:132 +0x405
sigs.k8s.io/karpenter/pkg/controllers/disruption.(Controller).Register.AsReconciler.func1({0x3476a98?, 0xc006e8fda0?}, {{{0x0?, 0x0?}, {0x2ca1bc8?, 0x5?}}})
github.com/awslabs/operatorpkg@v0.0.0-20240805231134-67d0acfb6306/singleton/controller.go:26 +0x2f
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile(0xc006ea0280?, {0x3476a98?, 0xc006e8fda0?}, {{{0x0?, 0x5?}, {0x0?, 0xc003fe8d10?}}})
sigs.k8s.io/controller-runtime@v0.18.4/pkg/reconcile/reconcile.go:113 +0x3d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile(0x347c588?, {0x3476a98?, 0xc006e8fda0?}, {{{0x0?, 0xb?}, {0x0?, 0x0?}}})
sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:114 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler(0xc000b10840, {0x3476ad0, 0xc0007565a0}, {0x29367a0, 0xc003f47360})
sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:311 +0x3bc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem(0xc000b10840, {0x3476ad0, 0xc0007565a0})
sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261 +0x1be
sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Start.func2.2()
sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:222 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 479
sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:218 +0x486``
Expected Behavior:
Karpenter pod not restarting
Reproduction Steps (Please include YAML):
Versions:
Kubernetes Version (kubectl version): 1.30

We recently migrated to Karpenter from Cluster Autoscaler for a few of our accounts and started to see the above behaviour, where the Karpenter pod is constantly restarting (getting into CrashLoopBackOff status but recovering by itself). We tried restarting the pod and rebooting the Karpenter controller node, but it didn't help. The issue was, however, fixed when we updated consolidationPolicy from WhenEmptyOrUnderutilized to WhenEmpty, though this would increase the underlying cost. Also, this does not happen in all clusters that use WhenEmptyOrUnderutilized, only in a few (where we have a larger number of nodes).
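For illustration, a minimal sketch of the consolidation-policy change described above; only the relevant disruption fields are shown, with values taken from the NodePool below:

```yaml
# Workaround sketch: switching the policy from WhenEmptyOrUnderutilized to
# WhenEmpty avoided the crash loop in our clusters, at the cost of leaving
# underutilized (but non-empty) nodes running.
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # previously WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```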
NodePool configuration:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
  name: XXXX
spec:
  disruption:
    budgets:
      - nodes: "1"
    consolidateAfter: 5m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: 150
    memory: 150Gi
  template:
    spec:
      expireAfter: Never
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: main-XXXX-XXXXX
      requirements:
        - key: karpenter.k8s.aws/instance-category
          minValues: 2
          operator: In
          values:
            - r
            - m
            - c
        - key: karpenter.k8s.aws/instance-family
          minValues: 5
          operator: Exists
        - key: node.kubernetes.io/instance-type
          minValues: 10
          operator: Exists
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - "4"
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-central-1a
            - eu-central-1b
            - eu-central-1c
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
      terminationGracePeriod: 15m
  weight: 10
```