karmada-controller-manager: panic: runtime error: index out of range #3854
Do you have steps to reproduce it, and details of the Kubernetes cluster and Karmada environment?
The searchLastLessElement might return -1: karmada/pkg/modeling/modeling.go Lines 116 to 138 in cf90ffc
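The helper in question walks the sorted model boundaries and returns -1 when no element is less than or equal to the requested quantity, which is the value the caller then uses as a slice index. A minimal sketch of that behavior (hypothetical names and plain ints, not the exact karmada implementation):

```go
package main

import "fmt"

// searchLastLessElement returns the index of the last element in nums
// that is less than or equal to target, or -1 when every element is
// greater than target -- the case the caller must guard against.
func searchLastLessElement(nums []int, target int) int {
	low, high := 0, len(nums)-1
	for low <= high {
		mid := low + (high-low)/2
		if nums[mid] <= target {
			// candidate found; stop if the next element is too large
			if mid == len(nums)-1 || nums[mid+1] > target {
				return mid
			}
			low = mid + 1
		} else {
			high = mid - 1
		}
	}
	return -1 // no element <= target
}

func main() {
	fmt.Println(searchLastLessElement([]int{0, 1, 2, 4, 8}, 3)) // 2
	fmt.Println(searchLastLessElement([]int{1, 2, 4}, 0))       // -1: indexing with this panics
}
```

Using the -1 return value directly as an index is exactly the `index out of range` reported in this issue.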
@tedli Do you know how to reproduce it? |
Hi @RainbowMango @liangyuanpeng , my case, as I mentioned, was caused by karmada/pkg/modeling/modeling.go Lines 162 to 168 in 2526de8
I didn't dive deep into how to reproduce it. I just had a quick fix that adds an index check to avoid the panic.
/help Let's call for help to figure out the root cause. I'm not sure if the author(@halfrost) has time to take a look :) |
@RainbowMango: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I can fix this issue. An index of -1 means no matching resource model can be found. Let me write some test code to verify it.
@tedli Hi, could you provide some parameters to initialize resource modeling? I want to reproduce this issue on my local machine. |
Hi @halfrost , sorry for the late reply. Here is my Cluster object:
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
creationTimestamp: "2023-07-19T07:33:03Z"
finalizers:
- karmada.io/cluster-controller
generation: 3
labels:
k8s.qihoo.net/cluster-type: public
karmada-cluster.k8s.qihoo.net/managed: "true"
name: pub-shbt
resourceVersion: "4113183"
uid: 49cbf86d-8d65-4f03-bbad-bf78d53b6aa3
spec:
apiEndpoint: https://pub-shbt.k8s.qihoo.net
id: 447a3118-39a1-4bbd-bd41-539c3071a2e2
impersonatorSecretRef:
name: pub-shbt-impersonator
namespace: karmada-cluster
resourceModels:
- grade: 0
ranges:
- max: "1"
min: "0"
name: cpu
- max: 4Gi
min: "0"
name: memory
- grade: 1
ranges:
- max: "2"
min: "1"
name: cpu
- max: 16Gi
min: 4Gi
name: memory
- grade: 2
ranges:
- max: "4"
min: "2"
name: cpu
- max: 32Gi
min: 16Gi
name: memory
- grade: 3
ranges:
- max: "8"
min: "4"
name: cpu
- max: 64Gi
min: 32Gi
name: memory
- grade: 4
ranges:
- max: "16"
min: "8"
name: cpu
- max: 128Gi
min: 64Gi
name: memory
- grade: 5
ranges:
- max: "32"
min: "16"
name: cpu
- max: 256Gi
min: 128Gi
name: memory
- grade: 6
ranges:
- max: "64"
min: "32"
name: cpu
- max: 512Gi
min: 256Gi
name: memory
- grade: 7
ranges:
- max: "128"
min: "64"
name: cpu
- max: 1Ti
min: 512Gi
name: memory
- grade: 8
ranges:
- max: "9223372036854775807"
min: "128"
name: cpu
- max: "9223372036854775807"
min: 1Ti
name: memory
secretRef:
name: pub-shbt
namespace: karmada-cluster
syncMode: Push
status:
apiEnablements:
- groupVersion: acme.cert-manager.io/v1
resources:
- kind: Challenge
name: challenges
- kind: Order
name: orders
- groupVersion: admissionregistration.k8s.io/v1
resources:
- kind: MutatingWebhookConfiguration
name: mutatingwebhookconfigurations
- kind: ValidatingWebhookConfiguration
name: validatingwebhookconfigurations
- groupVersion: apiextensions.k8s.io/v1
resources:
- kind: CustomResourceDefinition
name: customresourcedefinitions
- groupVersion: apiregistration.k8s.io/v1
resources:
- kind: APIService
name: apiservices
- groupVersion: apps/v1
resources:
- kind: ControllerRevision
name: controllerrevisions
- kind: DaemonSet
name: daemonsets
- kind: Deployment
name: deployments
- kind: ReplicaSet
name: replicasets
- kind: StatefulSet
name: statefulsets
- groupVersion: authentication.k8s.io/v1
resources:
- kind: TokenReview
name: tokenreviews
- groupVersion: authorization.k8s.io/v1
resources:
- kind: LocalSubjectAccessReview
name: localsubjectaccessreviews
- kind: SelfSubjectAccessReview
name: selfsubjectaccessreviews
- kind: SelfSubjectRulesReview
name: selfsubjectrulesreviews
- kind: SubjectAccessReview
name: subjectaccessreviews
- groupVersion: autoscaling/v1
resources:
- kind: HorizontalPodAutoscaler
name: horizontalpodautoscalers
- groupVersion: autoscaling/v2
resources:
- kind: HorizontalPodAutoscaler
name: horizontalpodautoscalers
- groupVersion: autoscaling/v2beta1
resources:
- kind: HorizontalPodAutoscaler
name: horizontalpodautoscalers
- groupVersion: autoscaling/v2beta2
resources:
- kind: HorizontalPodAutoscaler
name: horizontalpodautoscalers
- groupVersion: batch/v1
resources:
- kind: CronJob
name: cronjobs
- kind: Job
name: jobs
- groupVersion: batch/v1beta1
resources:
- kind: CronJob
name: cronjobs
- groupVersion: catalog.cattle.io/v1
resources:
- kind: App
name: apps
- kind: ClusterRepo
name: clusterrepos
- kind: Operation
name: operations
- groupVersion: cert-manager.io/v1
resources:
- kind: CertificateRequest
name: certificaterequests
- kind: Certificate
name: certificates
- kind: ClusterIssuer
name: clusterissuers
- kind: Issuer
name: issuers
- groupVersion: certificates.k8s.io/v1
resources:
- kind: CertificateSigningRequest
name: certificatesigningrequests
- groupVersion: cilium.io/v2
resources:
- kind: CiliumClusterwideNetworkPolicy
name: ciliumclusterwidenetworkpolicies
- kind: CiliumEndpoint
name: ciliumendpoints
- kind: CiliumExternalWorkload
name: ciliumexternalworkloads
- kind: CiliumIdentity
name: ciliumidentities
- kind: CiliumNetworkPolicy
name: ciliumnetworkpolicies
- kind: CiliumNode
name: ciliumnodes
- groupVersion: cluster.cattle.io/v3
resources:
- kind: ClusterAuthToken
name: clusterauthtokens
- kind: ClusterUserAttribute
name: clusteruserattributes
- groupVersion: config.gatekeeper.sh/v1alpha1
resources:
- kind: Config
name: configs
- groupVersion: config.koordinator.sh/v1alpha1
resources:
- kind: ClusterColocationProfile
name: clustercolocationprofiles
- groupVersion: constraints.gatekeeper.sh/v1alpha1
resources:
- kind: K8sAllowedRegistries
name: k8sallowedregistries
- kind: K8sDisallowedCaps
name: k8sdisallowedcaps
- kind: K8sDisallowedHostIpc
name: k8sdisallowedhostipc
- kind: K8sDisallowedHostNetwork
name: k8sdisallowedhostnetwork
- kind: K8sDisallowedHostpathVolumes
name: k8sdisallowedhostpathvolumes
- kind: K8sDisallowedHostPid
name: k8sdisallowedhostpid
- kind: K8sDisallowedPrivilegedContainer
name: k8sdisallowedprivilegedcontainer
- kind: K8sDisallowedPrivilegeEscalation
name: k8sdisallowedprivilegeescalation
- groupVersion: constraints.gatekeeper.sh/v1beta1
resources:
- kind: K8sAllowedRegistries
name: k8sallowedregistries
- kind: K8sDisallowedCaps
name: k8sdisallowedcaps
- kind: K8sDisallowedHostIpc
name: k8sdisallowedhostipc
- kind: K8sDisallowedHostNetwork
name: k8sdisallowedhostnetwork
- kind: K8sDisallowedHostpathVolumes
name: k8sdisallowedhostpathvolumes
- kind: K8sDisallowedHostPid
name: k8sdisallowedhostpid
- kind: K8sDisallowedPrivilegedContainer
name: k8sdisallowedprivilegedcontainer
- kind: K8sDisallowedPrivilegeEscalation
name: k8sdisallowedprivilegeescalation
- groupVersion: coordination.k8s.io/v1
resources:
- kind: Lease
name: leases
- groupVersion: csi.aliyun.com/v1alpha1
resources:
- kind: NodeLocalStorageInitConfig
name: nodelocalstorageinitconfigs
- kind: NodeLocalStorage
name: nodelocalstorages
- groupVersion: discovery.k8s.io/v1
resources:
- kind: EndpointSlice
name: endpointslices
- groupVersion: discovery.k8s.io/v1beta1
resources:
- kind: EndpointSlice
name: endpointslices
- groupVersion: events.k8s.io/v1
resources:
- kind: Event
name: events
- groupVersion: events.k8s.io/v1beta1
resources:
- kind: Event
name: events
- groupVersion: expansion.gatekeeper.sh/v1alpha1
resources:
- kind: ExpansionTemplate
name: expansiontemplate
- groupVersion: externaldata.gatekeeper.sh/v1alpha1
resources:
- kind: Provider
name: providers
- groupVersion: externaldata.gatekeeper.sh/v1beta1
resources:
- kind: Provider
name: providers
- groupVersion: flowcontrol.apiserver.k8s.io/v1beta1
resources:
- kind: FlowSchema
name: flowschemas
- kind: PriorityLevelConfiguration
name: prioritylevelconfigurations
- groupVersion: flowcontrol.apiserver.k8s.io/v1beta2
resources:
- kind: FlowSchema
name: flowschemas
- kind: PriorityLevelConfiguration
name: prioritylevelconfigurations
- groupVersion: kubeadm.io.qihoo.com/v1
resources:
- kind: Kubecluster
name: kubeclusters
- groupVersion: management.cattle.io/v3
resources:
- kind: APIService
name: apiservices
- kind: AuthConfig
name: authconfigs
- kind: ClusterRegistrationToken
name: clusterregistrationtokens
- kind: Cluster
name: clusters
- kind: Feature
name: features
- kind: GroupMember
name: groupmembers
- kind: Group
name: groups
- kind: Preference
name: preferences
- kind: Setting
name: settings
- kind: Token
name: tokens
- kind: UserAttribute
name: userattributes
- kind: User
name: users
- groupVersion: metrics.k8s.io/v1beta1
resources:
- kind: NodeMetrics
name: nodes
- kind: PodMetrics
name: pods
- groupVersion: monitoring.coreos.com/v1
resources:
- kind: Alertmanager
name: alertmanagers
- kind: PodMonitor
name: podmonitors
- kind: Prometheus
name: prometheuses
- kind: PrometheusRule
name: prometheusrules
- kind: ServiceMonitor
name: servicemonitors
- kind: ThanosRuler
name: thanosrulers
- groupVersion: mutations.gatekeeper.sh/v1
resources:
- kind: Assign
name: assign
- kind: AssignMetadata
name: assignmetadata
- kind: ModifySet
name: modifyset
- groupVersion: mutations.gatekeeper.sh/v1alpha1
resources:
- kind: Assign
name: assign
- kind: AssignMetadata
name: assignmetadata
- kind: ModifySet
name: modifyset
- groupVersion: mutations.gatekeeper.sh/v1beta1
resources:
- kind: Assign
name: assign
- kind: AssignMetadata
name: assignmetadata
- kind: ModifySet
name: modifyset
- groupVersion: network.hulk.qihoo.net/v1alpha1
resources:
- kind: LoadBalancer
name: loadbalancers
- groupVersion: networking.k8s.io/v1
resources:
- kind: IngressClass
name: ingressclasses
- kind: Ingress
name: ingresses
- kind: NetworkPolicy
name: networkpolicies
- groupVersion: node.k8s.io/v1
resources:
- kind: RuntimeClass
name: runtimeclasses
- groupVersion: node.k8s.io/v1beta1
resources:
- kind: RuntimeClass
name: runtimeclasses
- groupVersion: policy/v1
resources:
- kind: PodDisruptionBudget
name: poddisruptionbudgets
- groupVersion: policy/v1beta1
resources:
- kind: PodDisruptionBudget
name: poddisruptionbudgets
- kind: PodSecurityPolicy
name: podsecuritypolicies
- groupVersion: rbac.authorization.k8s.io/v1
resources:
- kind: ClusterRoleBinding
name: clusterrolebindings
- kind: ClusterRole
name: clusterroles
- kind: RoleBinding
name: rolebindings
- kind: Role
name: roles
- groupVersion: scheduling.k8s.io/v1
resources:
- kind: PriorityClass
name: priorityclasses
- groupVersion: scheduling.koordinator.sh/v1alpha1
resources:
- kind: Device
name: devices
- kind: PodMigrationJob
name: podmigrationjobs
- kind: Reservation
name: reservations
- groupVersion: scheduling.sigs.k8s.io/v1alpha1
resources:
- kind: ElasticQuota
name: elasticquotas
- kind: PodGroup
name: podgroups
- groupVersion: slo.koordinator.sh/v1alpha1
resources:
- kind: NodeMetric
name: nodemetrics
- kind: NodeSLO
name: nodeslos
- groupVersion: snapshot.storage.k8s.io/v1
resources:
- kind: VolumeSnapshotClass
name: volumesnapshotclasses
- kind: VolumeSnapshotContent
name: volumesnapshotcontents
- kind: VolumeSnapshot
name: volumesnapshots
- groupVersion: snapshot.storage.k8s.io/v1beta1
resources:
- kind: VolumeSnapshotClass
name: volumesnapshotclasses
- kind: VolumeSnapshotContent
name: volumesnapshotcontents
- kind: VolumeSnapshot
name: volumesnapshots
- groupVersion: status.gatekeeper.sh/v1beta1
resources:
- kind: ConstraintPodStatus
name: constraintpodstatuses
- kind: ConstraintTemplatePodStatus
name: constrainttemplatepodstatuses
- kind: MutatorPodStatus
name: mutatorpodstatuses
- groupVersion: storage.k8s.io/v1
resources:
- kind: CSIDriver
name: csidrivers
- kind: CSINode
name: csinodes
- kind: CSIStorageCapacity
name: csistoragecapacities
- kind: StorageClass
name: storageclasses
- kind: VolumeAttachment
name: volumeattachments
- groupVersion: storage.k8s.io/v1beta1
resources:
- kind: CSIStorageCapacity
name: csistoragecapacities
- groupVersion: templates.gatekeeper.sh/v1
resources:
- kind: ConstraintTemplate
name: constrainttemplates
- groupVersion: templates.gatekeeper.sh/v1alpha1
resources:
- kind: ConstraintTemplate
name: constrainttemplates
- groupVersion: templates.gatekeeper.sh/v1beta1
resources:
- kind: ConstraintTemplate
name: constrainttemplates
- groupVersion: topology.node.k8s.io/v1alpha1
resources:
- kind: NodeResourceTopology
name: noderesourcetopologies
- groupVersion: ui.cattle.io/v1
resources:
- kind: NavLink
name: navlinks
- groupVersion: v1
resources:
- kind: Binding
name: bindings
- kind: ComponentStatus
name: componentstatuses
- kind: ConfigMap
name: configmaps
- kind: Endpoints
name: endpoints
- kind: Event
name: events
- kind: LimitRange
name: limitranges
- kind: Namespace
name: namespaces
- kind: Node
name: nodes
- kind: PersistentVolumeClaim
name: persistentvolumeclaims
- kind: PersistentVolume
name: persistentvolumes
- kind: Pod
name: pods
- kind: PodTemplate
name: podtemplates
- kind: ReplicationController
name: replicationcontrollers
- kind: ResourceQuota
name: resourcequotas
- kind: Secret
name: secrets
- kind: ServiceAccount
name: serviceaccounts
- kind: Service
name: services
conditions:
- lastTransitionTime: "2023-07-19T07:38:49Z"
message: cluster is healthy and ready to accept workloads
reason: ClusterReady
status: "True"
type: Ready
kubernetesVersion: v1.24.13
nodeSummary:
readyNum: 211
totalNum: 212
resourceSummary:
allocatable:
cpu: "7560"
ephemeral-storage: "502308137393688"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
koordinator.sh/gpu: "12800"
koordinator.sh/gpu-core: "12800"
koordinator.sh/gpu-memory: 1891602Mi
koordinator.sh/gpu-memory-ratio: "12800"
kubernetes.io/batch-cpu: "470529"
kubernetes.io/batch-memory: "3540083593912"
memory: 44085786916Ki
netEgress: 218375Mi
netIngress: 218375Mi
nvidia.com/gpu: "136"
pods: "12720"
allocatableModelings:
- count: 5
grade: 0
- count: 46
grade: 1
- count: 56
grade: 2
- count: 17
grade: 3
- count: 7
grade: 4
- count: 16
grade: 5
- count: 12
grade: 6
- count: 0
grade: 7
- count: 0
grade: 8
allocated:
cpu: 5567273m
kubernetes.io/batch-cpu: 529k
kubernetes.io/batch-memory: "3439597715456"
memory: "28402599938109"
nvidia.com/gpu: "113"
pods: "11054"
allocating:
cpu: 10m
kubernetes.io/batch-cpu: "126720"
kubernetes.io/batch-memory: "824633720832"
pods: "25"
I have received your @tedli case. I'll fix this tonight. I'm sooooo sorry I've been busy with work these two days. |
No worries, take your time @halfrost. I appreciate that you can help with it. |
@RainbowMango I am definitely responsible for my code to the end, this is my responsibility. I am also very sorry for the bug in my code. I kind of forgot the conclusion of our discussion of requirements. For the scenario above, if there is a request with cpu = 1, memory = 1024, which model should I return? Is it model 7? Both cpu and memory requests need to be satisfied, right? |
@tedli Hi, Is it possible, can you tell me what is the input parameter of AddToResourceSummary(crn ClusterResourceNode) function when panic occurs? |
Hi @halfrost , I don't know what the parameter is. I can add some debug log, to dump this parameter for debugging, but I'm not sure when it will reproduce again. |
@tedli Returning -1 is a special case. Since model 8 goes to positive infinity, an index should always be found. If you want to fix your panic quickly, the solution is to add a check at line 165 of modeling.go:

if index == -1 {
    klog.Info("ClusterResource can not add to resource summary: index is invalid.")
    return
}

I will definitely add this code, but I am more curious about how to reproduce your panic; I want to find the root cause. Please help me add some debug logs to capture this parameter.
resourceModels:
- grade: 0
ranges:
- max: "1"
min: "0"
name: cpu
- max: 4Gi
min: "0"
name: memory
- grade: 1
ranges:
- max: "2"
min: "1"
name: cpu
- max: 16Gi
min: 4Gi
name: memory
- grade: 2
ranges:
- max: "4"
min: "2"
name: cpu
- max: 32Gi
min: 16Gi
name: memory

If a node has CPU=1, then even with memory larger than 32Gi it still belongs to the lower grade decided by its CPU.
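Given these models, the question above can be sketched: if the grade is taken as the minimum matching index across resources (consistent with `getIndex` starting from `math.MaxInt`), then the scarcest resource decides the grade. A hedged sketch with plain integers instead of `resource.Quantity` (not the exact karmada code):

```go
package main

import "fmt"

// Grade minimums per resource, mirroring grades 0-2 of the YAML above
// (memory in Gi, simplified).
var cpuMins = []int64{0, 1, 2}  // cpu minimums for grade 0, 1, 2
var memMins = []int64{0, 4, 16} // memory minimums (Gi) for grade 0, 1, 2

// lastLessEqual returns the last index i with mins[i] <= v, or -1.
func lastLessEqual(mins []int64, v int64) int {
	idx := -1
	for i, m := range mins {
		if m <= v {
			idx = i
		}
	}
	return idx
}

// gradeOf takes the minimum index across resources, so the scarcest
// resource pins the node's grade.
func gradeOf(cpu, memGi int64) int {
	ci := lastLessEqual(cpuMins, cpu)
	mi := lastLessEqual(memMins, memGi)
	if ci < mi {
		return ci
	}
	return mi
}

func main() {
	// CPU=1 with 64Gi of memory: memory alone would match grade 2,
	// but CPU=1 keeps the node at grade 1.
	fmt.Println(gradeOf(1, 64)) // 1
}
```

Under this reading, a node is never placed in a higher grade than any single resource supports, which matches the discussion of the cpu=1, large-memory case.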
@tedli Hi, any updates? |
Hi @halfrost , |
@tedli Hi, any updates? Can you reproduce this panic again? |
Hi @halfrost , thanks for keeping an eye on this issue. Sadly, it seems it hasn't reproduced again. I will add a Prometheus metric; when it reproduces, the metric will alert me and I will let you know. Hi @RainbowMango ,
It is the #3472, we fixed it in release-1.6. |
@RainbowMango So sorry about that. It's my fault. I'm trying to reproduce this problem again. |
I always appreciate your help, definitely not blaming you. Just merged the defensive solution(#3950). So, no rush. |
I remember you fixed it with a defensive code, so, this panic is unlikely to reproduce on your side, isn't it? |
Hi all,
Yes. What's more, I added logic to dump the args so the panic can be reproduced. I will post the dumped args if it appears again.
UPDATE: This is another place that goes index out of range, and it's easier to debug than the previous one because it only relies on the code below. Another place that caused index out of range: karmada/pkg/modeling/modeling.go Lines 88 to 93 in 90ccbcd
The capture shows the line for dumping args to reproduce the previous panic (but it didn't reproduce again).
@tedli Got it, let me dig deeper. |
One of my talented colleagues solved this by adding a mutex before reading or writing:

+ mu.Lock()
+ defer mu.Unlock()
  // generate a sorted array by first priority of ResourceName
  modelSortings = make([][]resource.Quantity, len(rsName))
  for index := 0; index < len(rsList); index++ {
      for i, name := range rsName {
          modelSortings[i] = append(modelSortings[i], rsList[index][name])
      }
  }

func (rs *ResourceSummary) getIndex(crn ClusterResourceNode) int {
    index := math.MaxInt
+   mu.Lock()
+   defer mu.Unlock()

In our panic cases, this function can never return -1 under normal conditions, because the first element of
karmada/pkg/modeling/modeling.go Line 116 in 0b3e0d9
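The colleague's fix amounts to serializing all access to the shared package-level state behind one mutex. A minimal sketch of the pattern, assuming a package-level `mu` guarding `modelSortings` (names mirror the snippet above; plain ints instead of `resource.Quantity`, not the exact karmada code):

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu            sync.Mutex
	modelSortings [][]int // shared, sorted quantities per resource
)

// initSortings (re)builds the shared slice under the lock, so a
// concurrent getIndex never observes a half-built modelSortings.
func initSortings(data [][]int) {
	mu.Lock()
	defer mu.Unlock()
	modelSortings = data
}

// getIndex reads the shared slice under the same lock.
func getIndex(resource, value int) int {
	mu.Lock()
	defer mu.Unlock()
	idx := -1
	for i, v := range modelSortings[resource] {
		if v <= value {
			idx = i
		}
	}
	return idx
}

func main() {
	initSortings([][]int{{0, 1, 2, 4}})
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ { // mimic the controller's default concurrency of 5
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = getIndex(0, 3)
		}()
	}
	wg.Wait()
	fmt.Println(getIndex(0, 3)) // 2
}
```

A single mutex is the simplest correct fix; since `getIndex` is read-heavy, an `sync.RWMutex` would reduce contention at the cost of a slightly wider diff.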
Your modification certainly makes sense. First, regarding the initialization of modelSortings: I believe this function only runs once during initialization, and scenarios where it is called concurrently don't exist, hence I didn't add a lock(). The getIndex function looks like it needs a lock. I've been so focused on ensuring there are no bugs in my algorithm logic that I overlooked the fact that this code could be called concurrently in a multi-threaded environment. I tested in a local environment with only 2 machines and didn't encounter any concurrency issues. Please extend my gratitude to your colleague. I'm truly grateful for his fix.
Hope this helps, though I failed to find the root cause of the concurrent access.
I reviewed the code and found where the function is called. The default concurrency for handling cluster status is 5:
Based on #3854 (comment), the root cause is a concurrent read and write of a global variable. And I also checked the code here:
@jwcesign Hi, I'm back, and so sorry for the late reply. I had Covid for the last two weeks and felt terrible, but I have recovered. I have two questions about your comment:
Regarding the concurrency investigation of the function: I haven't submitted the PR yet because I'm afraid the change to add lock() will be too broad and affect performance. In all update operations, resources are only added, never deleted.
Hi, @halfrost
Do I understand correctly for |
Hi, @halfrost I generally agree with what @jwcesign said:
Actually, I think
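One way to avoid the global lock discussed above is to make each ResourceSummary own its sortings instead of sharing package-level variables, so concurrent cluster-status workers never touch each other's data. A hedged sketch of that idea (hypothetical types and fields; the actual change landed in PR #4145):

```go
package main

import "fmt"

// ResourceSummary owns its sorted model boundaries, so two clusters'
// status controllers operate on independent state and need no lock.
type ResourceSummary struct {
	sortings [][]int // per-resource sorted quantities, simplified to ints
}

func NewResourceSummary(data [][]int) *ResourceSummary {
	return &ResourceSummary{sortings: data}
}

// getIndex reads only receiver-local state, making it safe to call
// from multiple goroutines as long as each goroutine has its own summary.
func (rs *ResourceSummary) getIndex(resource, value int) int {
	idx := -1
	for i, v := range rs.sortings[resource] {
		if v <= value {
			idx = i
		}
	}
	return idx
}

func main() {
	a := NewResourceSummary([][]int{{0, 1, 2}})
	b := NewResourceSummary([][]int{{0, 8, 16}})
	fmt.Println(a.getIndex(0, 1), b.getIndex(0, 1)) // 1 0
}
```

Owning the state per instance removes the data race at its source rather than serializing around it, which also addresses the performance concern about adding lock() everywhere.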
@jwcesign Thanks for the advice, let me test it this weekend. I will find the final solution. @chaosi-zju OK, makes sense. Let me revise this part of the code. Thanks.
I have already converted it. 🙏 Please help me review PR #4145.
What happened:
karmada-controller-manager crashed, caused by panic: runtime error: index out of range
What you expected to happen:
The index is checked, so no panic and no crash.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
The index may be -1.
karmada/pkg/modeling/modeling.go Lines 162 to 168 in 2526de8
karmada/pkg/modeling/modeling.go Lines 105 to 114 in 2526de8
Environment:
(kubectl-karmada version or karmadactl version):