karmada-controller-manager: panic: runtime error: index out of range #3854

tedli · 2023-07-28T03:28:46Z

What happened:

karmada-controller-manager crashed, caused by panic: runtime error: index out of range

What you expected to happen:

index checked, not to panic, not to crash.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

The index may be -1.

karmada/pkg/modeling/modeling.go

Lines 162 to 168 in 2526de8

    
           // AddToResourceSummary add resource node into modeling summary 
        
           func (rs *ResourceSummary) AddToResourceSummary(crn ClusterResourceNode) { 
        
           	index := rs.getIndex(crn) 
        
           	modeling := &(*rs)[index] 
        
           	if rs.GetNodeNumFromModel(modeling) <= 5 { 
        
           		root := modeling.linkedlist 
        
           		if root == nil {

karmada/pkg/modeling/modeling.go

Lines 105 to 114 in 2526de8

    
           func (rs *ResourceSummary) getIndex(crn ClusterResourceNode) int { 
        
           	index := math.MaxInt 
        
           	for i, m := range defaultModelSorting { 
        
           		tmpIndex := searchLastLessElement(modelSortings[i], crn.resourceList[m]) 
        
           		if tmpIndex < index { 
        
           			index = tmpIndex 
        
           		} 
        
           	} 
        
           	return index 
        
           }

Environment:

Karmada version: v1.6.1
kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
Others:

The text was updated successfully, but these errors were encountered:

liangyuanpeng · 2023-07-28T03:31:19Z

Any message for reproduce it? and environment of kubernetes cluster and karmada cluster .

RainbowMango · 2023-07-28T06:28:20Z

The searchLastLessElement might return -1:

karmada/pkg/modeling/modeling.go

Lines 116 to 138 in cf90ffc

    
           func searchLastLessElement(nums []resource.Quantity, target resource.Quantity) int { 
        
           	low, high := 0, len(nums)-1 
        
           	for low <= high { 
        
           		mid := low + ((high - low) >> 1) 
        
           		diff1 := nums[mid].Cmp(target) 
        
           		var diff2 int 
        
           		if mid != len(nums)-1 { 
        
           			diff2 = nums[mid+1].Cmp(target) 
        
           		} 
        
           		// diff < 1 means nums[mid] <= target 
        
           		// diff == 1 means nums[mid+1] > target 
        
           		if diff1 < 1 { 
        
           			if (mid == len(nums)-1) || (diff2 == 1) { 
        
           				// find the last less element that equal to target element 
        
           				return mid 
        
           			} 
        
           			low = mid + 1 
        
           		} else { 
        
           			high = mid - 1 
        
           		} 
        
           	} 
        
           	return -1 
        
           }

@tedli Do you know how to reproduce it?

tedli · 2023-07-29T02:11:46Z

Hi @RainbowMango @liangyuanpeng ，

My case as I mententioned, was caused by line 164, 165

karmada/pkg/modeling/modeling.go

Lines 162 to 168 in 2526de8

    
           // AddToResourceSummary add resource node into modeling summary 
        
           func (rs *ResourceSummary) AddToResourceSummary(crn ClusterResourceNode) { 
        
           	index := rs.getIndex(crn) 
        
           	modeling := &(*rs)[index] 
        
           	if rs.GetNodeNumFromModel(modeling) <= 5 { 
        
           		root := modeling.linkedlist 
        
           		if root == nil {

And I didn't dive deep to find how to reproduce. I just havd a quick fix, that adding index check, to avoid panic.

RainbowMango · 2023-07-29T02:40:39Z

/help

Let's call for help to figure out the root cause.

I'm not sure if the author(@halfrost) has time to take a look :)

karmada-bot · 2023-07-29T02:40:41Z

@RainbowMango:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Let's call for help to figure out the root cause.

I'm not sure if the author(@halfrost) has time to take a look :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

halfrost · 2023-07-29T02:51:14Z

I can fix this issue. index returns negative one, which means that no matching resource model can be found. Let me write some test code to test it again.

halfrost · 2023-07-29T03:01:24Z

@tedli Hi, could you provide some parameters to initialize resource modeling? I want to reproduce this issue on my local machine.

tedli · 2023-07-31T03:42:48Z

Hi @halfrost ,

Sorry for late reply, here is my Cluster cr.

Click to view

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  creationTimestamp: "2023-07-19T07:33:03Z"
  finalizers:
  - karmada.io/cluster-controller
  generation: 3
  labels:
    k8s.qihoo.net/cluster-type: public
    karmada-cluster.k8s.qihoo.net/managed: "true"
  name: pub-shbt
  resourceVersion: "4113183"
  uid: 49cbf86d-8d65-4f03-bbad-bf78d53b6aa3
spec:
  apiEndpoint: https://pub-shbt.k8s.qihoo.net
  id: 447a3118-39a1-4bbd-bd41-539c3071a2e2
  impersonatorSecretRef:
    name: pub-shbt-impersonator
    namespace: karmada-cluster
  resourceModels:
  - grade: 0
    ranges:
    - max: "1"
      min: "0"
      name: cpu
    - max: 4Gi
      min: "0"
      name: memory
  - grade: 1
    ranges:
    - max: "2"
      min: "1"
      name: cpu
    - max: 16Gi
      min: 4Gi
      name: memory
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory
  - grade: 3
    ranges:
    - max: "8"
      min: "4"
      name: cpu
    - max: 64Gi
      min: 32Gi
      name: memory
  - grade: 4
    ranges:
    - max: "16"
      min: "8"
      name: cpu
    - max: 128Gi
      min: 64Gi
      name: memory
  - grade: 5
    ranges:
    - max: "32"
      min: "16"
      name: cpu
    - max: 256Gi
      min: 128Gi
      name: memory
  - grade: 6
    ranges:
    - max: "64"
      min: "32"
      name: cpu
    - max: 512Gi
      min: 256Gi
      name: memory
  - grade: 7
    ranges:
    - max: "128"
      min: "64"
      name: cpu
    - max: 1Ti
      min: 512Gi
      name: memory
  - grade: 8
    ranges:
    - max: "9223372036854775807"
      min: "128"
      name: cpu
    - max: "9223372036854775807"
      min: 1Ti
      name: memory
  secretRef:
    name: pub-shbt
    namespace: karmada-cluster
  syncMode: Push
status:
  apiEnablements:
  - groupVersion: acme.cert-manager.io/v1
    resources:
    - kind: Challenge
      name: challenges
    - kind: Order
      name: orders
  - groupVersion: admissionregistration.k8s.io/v1
    resources:
    - kind: MutatingWebhookConfiguration
      name: mutatingwebhookconfigurations
    - kind: ValidatingWebhookConfiguration
      name: validatingwebhookconfigurations
  - groupVersion: apiextensions.k8s.io/v1
    resources:
    - kind: CustomResourceDefinition
      name: customresourcedefinitions
  - groupVersion: apiregistration.k8s.io/v1
    resources:
    - kind: APIService
      name: apiservices
  - groupVersion: apps/v1
    resources:
    - kind: ControllerRevision
      name: controllerrevisions
    - kind: DaemonSet
      name: daemonsets
    - kind: Deployment
      name: deployments
    - kind: ReplicaSet
      name: replicasets
    - kind: StatefulSet
      name: statefulsets
  - groupVersion: authentication.k8s.io/v1
    resources:
    - kind: TokenReview
      name: tokenreviews
  - groupVersion: authorization.k8s.io/v1
    resources:
    - kind: LocalSubjectAccessReview
      name: localsubjectaccessreviews
    - kind: SelfSubjectAccessReview
      name: selfsubjectaccessreviews
    - kind: SelfSubjectRulesReview
      name: selfsubjectrulesreviews
    - kind: SubjectAccessReview
      name: subjectaccessreviews
  - groupVersion: autoscaling/v1
    resources:
    - kind: HorizontalPodAutoscaler
      name: horizontalpodautoscalers
  - groupVersion: autoscaling/v2
    resources:
    - kind: HorizontalPodAutoscaler
      name: horizontalpodautoscalers
  - groupVersion: autoscaling/v2beta1
    resources:
    - kind: HorizontalPodAutoscaler
      name: horizontalpodautoscalers
  - groupVersion: autoscaling/v2beta2
    resources:
    - kind: HorizontalPodAutoscaler
      name: horizontalpodautoscalers
  - groupVersion: batch/v1
    resources:
    - kind: CronJob
      name: cronjobs
    - kind: Job
      name: jobs
  - groupVersion: batch/v1beta1
    resources:
    - kind: CronJob
      name: cronjobs
  - groupVersion: catalog.cattle.io/v1
    resources:
    - kind: App
      name: apps
    - kind: ClusterRepo
      name: clusterrepos
    - kind: Operation
      name: operations
  - groupVersion: cert-manager.io/v1
    resources:
    - kind: CertificateRequest
      name: certificaterequests
    - kind: Certificate
      name: certificates
    - kind: ClusterIssuer
      name: clusterissuers
    - kind: Issuer
      name: issuers
  - groupVersion: certificates.k8s.io/v1
    resources:
    - kind: CertificateSigningRequest
      name: certificatesigningrequests
  - groupVersion: cilium.io/v2
    resources:
    - kind: CiliumClusterwideNetworkPolicy
      name: ciliumclusterwidenetworkpolicies
    - kind: CiliumEndpoint
      name: ciliumendpoints
    - kind: CiliumExternalWorkload
      name: ciliumexternalworkloads
    - kind: CiliumIdentity
      name: ciliumidentities
    - kind: CiliumNetworkPolicy
      name: ciliumnetworkpolicies
    - kind: CiliumNode
      name: ciliumnodes
  - groupVersion: cluster.cattle.io/v3
    resources:
    - kind: ClusterAuthToken
      name: clusterauthtokens
    - kind: ClusterUserAttribute
      name: clusteruserattributes
  - groupVersion: config.gatekeeper.sh/v1alpha1
    resources:
    - kind: Config
      name: configs
  - groupVersion: config.koordinator.sh/v1alpha1
    resources:
    - kind: ClusterColocationProfile
      name: clustercolocationprofiles
  - groupVersion: constraints.gatekeeper.sh/v1alpha1
    resources:
    - kind: K8sAllowedRegistries
      name: k8sallowedregistries
    - kind: K8sDisallowedCaps
      name: k8sdisallowedcaps
    - kind: K8sDisallowedHostIpc
      name: k8sdisallowedhostipc
    - kind: K8sDisallowedHostNetwork
      name: k8sdisallowedhostnetwork
    - kind: K8sDisallowedHostpathVolumes
      name: k8sdisallowedhostpathvolumes
    - kind: K8sDisallowedHostPid
      name: k8sdisallowedhostpid
    - kind: K8sDisallowedPrivilegedContainer
      name: k8sdisallowedprivilegedcontainer
    - kind: K8sDisallowedPrivilegeEscalation
      name: k8sdisallowedprivilegeescalation
  - groupVersion: constraints.gatekeeper.sh/v1beta1
    resources:
    - kind: K8sAllowedRegistries
      name: k8sallowedregistries
    - kind: K8sDisallowedCaps
      name: k8sdisallowedcaps
    - kind: K8sDisallowedHostIpc
      name: k8sdisallowedhostipc
    - kind: K8sDisallowedHostNetwork
      name: k8sdisallowedhostnetwork
    - kind: K8sDisallowedHostpathVolumes
      name: k8sdisallowedhostpathvolumes
    - kind: K8sDisallowedHostPid
      name: k8sdisallowedhostpid
    - kind: K8sDisallowedPrivilegedContainer
      name: k8sdisallowedprivilegedcontainer
    - kind: K8sDisallowedPrivilegeEscalation
      name: k8sdisallowedprivilegeescalation
  - groupVersion: coordination.k8s.io/v1
    resources:
    - kind: Lease
      name: leases
  - groupVersion: csi.aliyun.com/v1alpha1
    resources:
    - kind: NodeLocalStorageInitConfig
      name: nodelocalstorageinitconfigs
    - kind: NodeLocalStorage
      name: nodelocalstorages
  - groupVersion: discovery.k8s.io/v1
    resources:
    - kind: EndpointSlice
      name: endpointslices
  - groupVersion: discovery.k8s.io/v1beta1
    resources:
    - kind: EndpointSlice
      name: endpointslices
  - groupVersion: events.k8s.io/v1
    resources:
    - kind: Event
      name: events
  - groupVersion: events.k8s.io/v1beta1
    resources:
    - kind: Event
      name: events
  - groupVersion: expansion.gatekeeper.sh/v1alpha1
    resources:
    - kind: ExpansionTemplate
      name: expansiontemplate
  - groupVersion: externaldata.gatekeeper.sh/v1alpha1
    resources:
    - kind: Provider
      name: providers
  - groupVersion: externaldata.gatekeeper.sh/v1beta1
    resources:
    - kind: Provider
      name: providers
  - groupVersion: flowcontrol.apiserver.k8s.io/v1beta1
    resources:
    - kind: FlowSchema
      name: flowschemas
    - kind: PriorityLevelConfiguration
      name: prioritylevelconfigurations
  - groupVersion: flowcontrol.apiserver.k8s.io/v1beta2
    resources:
    - kind: FlowSchema
      name: flowschemas
    - kind: PriorityLevelConfiguration
      name: prioritylevelconfigurations
  - groupVersion: kubeadm.io.qihoo.com/v1
    resources:
    - kind: Kubecluster
      name: kubeclusters
  - groupVersion: management.cattle.io/v3
    resources:
    - kind: APIService
      name: apiservices
    - kind: AuthConfig
      name: authconfigs
    - kind: ClusterRegistrationToken
      name: clusterregistrationtokens
    - kind: Cluster
      name: clusters
    - kind: Feature
      name: features
    - kind: GroupMember
      name: groupmembers
    - kind: Group
      name: groups
    - kind: Preference
      name: preferences
    - kind: Setting
      name: settings
    - kind: Token
      name: tokens
    - kind: UserAttribute
      name: userattributes
    - kind: User
      name: users
  - groupVersion: metrics.k8s.io/v1beta1
    resources:
    - kind: NodeMetrics
      name: nodes
    - kind: PodMetrics
      name: pods
  - groupVersion: monitoring.coreos.com/v1
    resources:
    - kind: Alertmanager
      name: alertmanagers
    - kind: PodMonitor
      name: podmonitors
    - kind: Prometheus
      name: prometheuses
    - kind: PrometheusRule
      name: prometheusrules
    - kind: ServiceMonitor
      name: servicemonitors
    - kind: ThanosRuler
      name: thanosrulers
  - groupVersion: mutations.gatekeeper.sh/v1
    resources:
    - kind: Assign
      name: assign
    - kind: AssignMetadata
      name: assignmetadata
    - kind: ModifySet
      name: modifyset
  - groupVersion: mutations.gatekeeper.sh/v1alpha1
    resources:
    - kind: Assign
      name: assign
    - kind: AssignMetadata
      name: assignmetadata
    - kind: ModifySet
      name: modifyset
  - groupVersion: mutations.gatekeeper.sh/v1beta1
    resources:
    - kind: Assign
      name: assign
    - kind: AssignMetadata
      name: assignmetadata
    - kind: ModifySet
      name: modifyset
  - groupVersion: network.hulk.qihoo.net/v1alpha1
    resources:
    - kind: LoadBalancer
      name: loadbalancers
  - groupVersion: networking.k8s.io/v1
    resources:
    - kind: IngressClass
      name: ingressclasses
    - kind: Ingress
      name: ingresses
    - kind: NetworkPolicy
      name: networkpolicies
  - groupVersion: node.k8s.io/v1
    resources:
    - kind: RuntimeClass
      name: runtimeclasses
  - groupVersion: node.k8s.io/v1beta1
    resources:
    - kind: RuntimeClass
      name: runtimeclasses
  - groupVersion: policy/v1
    resources:
    - kind: PodDisruptionBudget
      name: poddisruptionbudgets
  - groupVersion: policy/v1beta1
    resources:
    - kind: PodDisruptionBudget
      name: poddisruptionbudgets
    - kind: PodSecurityPolicy
      name: podsecuritypolicies
  - groupVersion: rbac.authorization.k8s.io/v1
    resources:
    - kind: ClusterRoleBinding
      name: clusterrolebindings
    - kind: ClusterRole
      name: clusterroles
    - kind: RoleBinding
      name: rolebindings
    - kind: Role
      name: roles
  - groupVersion: scheduling.k8s.io/v1
    resources:
    - kind: PriorityClass
      name: priorityclasses
  - groupVersion: scheduling.koordinator.sh/v1alpha1
    resources:
    - kind: Device
      name: devices
    - kind: PodMigrationJob
      name: podmigrationjobs
    - kind: Reservation
      name: reservations
  - groupVersion: scheduling.sigs.k8s.io/v1alpha1
    resources:
    - kind: ElasticQuota
      name: elasticquotas
    - kind: PodGroup
      name: podgroups
  - groupVersion: slo.koordinator.sh/v1alpha1
    resources:
    - kind: NodeMetric
      name: nodemetrics
    - kind: NodeSLO
      name: nodeslos
  - groupVersion: snapshot.storage.k8s.io/v1
    resources:
    - kind: VolumeSnapshotClass
      name: volumesnapshotclasses
    - kind: VolumeSnapshotContent
      name: volumesnapshotcontents
    - kind: VolumeSnapshot
      name: volumesnapshots
  - groupVersion: snapshot.storage.k8s.io/v1beta1
    resources:
    - kind: VolumeSnapshotClass
      name: volumesnapshotclasses
    - kind: VolumeSnapshotContent
      name: volumesnapshotcontents
    - kind: VolumeSnapshot
      name: volumesnapshots
  - groupVersion: status.gatekeeper.sh/v1beta1
    resources:
    - kind: ConstraintPodStatus
      name: constraintpodstatuses
    - kind: ConstraintTemplatePodStatus
      name: constrainttemplatepodstatuses
    - kind: MutatorPodStatus
      name: mutatorpodstatuses
  - groupVersion: storage.k8s.io/v1
    resources:
    - kind: CSIDriver
      name: csidrivers
    - kind: CSINode
      name: csinodes
    - kind: CSIStorageCapacity
      name: csistoragecapacities
    - kind: StorageClass
      name: storageclasses
    - kind: VolumeAttachment
      name: volumeattachments
  - groupVersion: storage.k8s.io/v1beta1
    resources:
    - kind: CSIStorageCapacity
      name: csistoragecapacities
  - groupVersion: templates.gatekeeper.sh/v1
    resources:
    - kind: ConstraintTemplate
      name: constrainttemplates
  - groupVersion: templates.gatekeeper.sh/v1alpha1
    resources:
    - kind: ConstraintTemplate
      name: constrainttemplates
  - groupVersion: templates.gatekeeper.sh/v1beta1
    resources:
    - kind: ConstraintTemplate
      name: constrainttemplates
  - groupVersion: topology.node.k8s.io/v1alpha1
    resources:
    - kind: NodeResourceTopology
      name: noderesourcetopologies
  - groupVersion: ui.cattle.io/v1
    resources:
    - kind: NavLink
      name: navlinks
  - groupVersion: v1
    resources:
    - kind: Binding
      name: bindings
    - kind: ComponentStatus
      name: componentstatuses
    - kind: ConfigMap
      name: configmaps
    - kind: Endpoints
      name: endpoints
    - kind: Event
      name: events
    - kind: LimitRange
      name: limitranges
    - kind: Namespace
      name: namespaces
    - kind: Node
      name: nodes
    - kind: PersistentVolumeClaim
      name: persistentvolumeclaims
    - kind: PersistentVolume
      name: persistentvolumes
    - kind: Pod
      name: pods
    - kind: PodTemplate
      name: podtemplates
    - kind: ReplicationController
      name: replicationcontrollers
    - kind: ResourceQuota
      name: resourcequotas
    - kind: Secret
      name: secrets
    - kind: ServiceAccount
      name: serviceaccounts
    - kind: Service
      name: services
  conditions:
  - lastTransitionTime: "2023-07-19T07:38:49Z"
    message: cluster is healthy and ready to accept workloads
    reason: ClusterReady
    status: "True"
    type: Ready
  kubernetesVersion: v1.24.13
  nodeSummary:
    readyNum: 211
    totalNum: 212
  resourceSummary:
    allocatable:
      cpu: "7560"
      ephemeral-storage: "502308137393688"
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      koordinator.sh/gpu: "12800"
      koordinator.sh/gpu-core: "12800"
      koordinator.sh/gpu-memory: 1891602Mi
      koordinator.sh/gpu-memory-ratio: "12800"
      kubernetes.io/batch-cpu: "470529"
      kubernetes.io/batch-memory: "3540083593912"
      memory: 44085786916Ki
      netEgress: 218375Mi
      netIngress: 218375Mi
      nvidia.com/gpu: "136"
      pods: "12720"
    allocatableModelings:
    - count: 5
      grade: 0
    - count: 46
      grade: 1
    - count: 56
      grade: 2
    - count: 17
      grade: 3
    - count: 7
      grade: 4
    - count: 16
      grade: 5
    - count: 12
      grade: 6
    - count: 0
      grade: 7
    - count: 0
      grade: 8
    allocated:
      cpu: 5567273m
      kubernetes.io/batch-cpu: 529k
      kubernetes.io/batch-memory: "3439597715456"
      memory: "28402599938109"
      nvidia.com/gpu: "113"
      pods: "11054"
    allocating:
      cpu: 10m
      kubernetes.io/batch-cpu: "126720"
      kubernetes.io/batch-memory: "824633720832"
      pods: "25"

halfrost · 2023-08-02T01:26:04Z

I have received your @tedli case. I'll fix this tonight. I'm sooooo sorry I've been busy with work these two days.

RainbowMango · 2023-08-02T03:43:46Z

No worries, take your time @halfrost. I appreciate that you can help with it.

halfrost · 2023-08-03T04:34:02Z

@RainbowMango I am definitely responsible for my code to the end, this is my responsibility. I am also very sorry for the bug in my code.

I kind of forgot the conclusion of our discussion of requirements. For the scenario above, if there is a request with cpu = 1, memory = 1024, which model should I return? Is it model 7? Both cpu and memory requests need to be satisfied, right?

halfrost · 2023-08-03T04:56:00Z

@tedli Hi, Is it possible, can you tell me what is the input parameter of AddToResourceSummary(crn ClusterResourceNode) function when panic occurs?

tedli · 2023-08-03T06:07:36Z

Hi @halfrost ,

I don't know what the parameter is.

I can add some debug log, to dump this parameter for debugging, but I'm not sure when it will reproduce again.

halfrost · 2023-08-03T06:20:04Z

@tedli return -1 is a special case. Since model 8 goes to positive infinity, an index should be found. If you want to fix your panic quickly, the solution is to add a piece of code to line 165 of modeling.go:

	if index == -1 {
		klog.Info("ClusterResource can not add to resource summary: index is invalid.")
		return
	}

I will definitely need to add this code, but I am more curious about how to reproduce your panic, I want to find the root cause. Please help me to add some debug logs to get this parameter.

RainbowMango · 2023-08-08T06:52:02Z

I kind of forgot the conclusion of our discussion of requirements. For the scenario above, if there is a request with cpu = 1, memory = 1024, which model should I return? Is it model 7? Both cpu and memory requests need to be satisfied, right?

resourceModels:
  - grade: 0
    ranges:
    - max: "1"
      min: "0"
      name: cpu
    - max: 4Gi
      min: "0"
      name: memory
  - grade: 1
    ranges:
    - max: "2"
      min: "1"
      name: cpu
    - max: 16Gi
      min: 4Gi
      name: memory
  - grade: 2
    ranges:
    - max: "4"
      min: "2"
      name: cpu
    - max: 32Gi
      min: 16Gi
      name: memory

If a node has CPU=1 and even with a larger memory than 32Gi, it still belongs grade 2.

halfrost · 2023-08-09T00:28:25Z

@tedli Hi, any updates?

tedli · 2023-08-09T05:44:12Z

Hi @halfrost ,

I did add the debug log, but untill now nothing produced. I'm not sure how to reproduce this.

halfrost · 2023-08-18T02:02:07Z

@tedli Hi, any updates? Can you reproduce this panic again?

RainbowMango · 2023-08-18T02:32:14Z

Perhaps we can't rely on @tedli to reproduce this issue, it's not feasible to ask a user to spend time trying to reproduce it.

I remember we fixed a panic issue before(I'll try to find the PR), @tedli which version of Karmada is being used?

tedli · 2023-08-18T02:37:07Z

Hi @halfrost ,

Thanks for keep watching this issue. Saddly, it seemd not reproduced again. I will add a prometheus metric, when this reproduced again, the metric will alert me, then I will let you know.

Hi @RainbowMango ,
I'm using 1.6.1 .

RainbowMango · 2023-08-18T02:37:09Z

It is the #3472, we fixed it in release-1.6.

halfrost · 2023-08-18T03:11:19Z

Perhaps we can't rely on @tedli to reproduce this issue, it's not feasible to ask a user to spend time trying to reproduce it.

I remember we fixed a panic issue before(I'll try to find the PR), @tedli which version of Karmada is being used?

@RainbowMango So sorry about that. It's my fault. I'm trying to reproduce this problem again.

RainbowMango · 2023-08-18T03:30:46Z

I always appreciate your help, definitely not blaming you. Just merged the defensive solution(#3950). So, no rush.

RainbowMango · 2023-08-18T03:33:16Z

Thanks for keep watching this issue. Saddly, it seemd not reproduced again. I will add a prometheus metric, when this reproduced again, the metric will alert me, then I will let you know.

I remember you fixed it with a defensive code, so, this panic is unlikely to reproduce on your side, isn't it?

tedli · 2023-08-18T03:38:36Z

Hi all,

I remember you fixed it with a defensive code

Yes, what's more, I add dump args logic, to log args to reproduce.
#3854 (comment)

I will post these dumped args, if it appears again.

tedli · 2023-08-21T06:04:53Z

UPDATE: This is another place, which goes index out of range, and it's easier than the previous one to debug, because it's only relys on Cluster cr, I unmarshal the cluster, to reproduce it locally, didn't reproduce.

Another place caused index out of range.

karmada/pkg/modeling/modeling.go

Lines 88 to 93 in 90ccbcd

    
           modelSortings = make([][]resource.Quantity, len(rsName)) 
        
           for index := 0; index < len(rsList); index++ { 
        
           	for i, name := range rsName { 
        
           		modelSortings[i] = append(modelSortings[i], rsList[index][name]) 
        
           	} 
        
           }

line 91, ~~index~~.

The capture shows line 106, is because I added this:

for dumpping args for reproducing the previous panic. (but it didn't reproduced again)

halfrost · 2023-08-22T00:22:42Z

@tedli Got it, let me dig deeper.

NickYadance · 2023-09-13T11:02:38Z

One of my talented colleagues solved this by adding mutex before reading or writing modelSortings (and of course unlock after the reading or writing). The issue never happened again since the fix.

+	mu.Lock()
+	defer mu.Unlock()
	// generate a sorted array by first priority of ResourceName
	modelSortings = make([][]resource.Quantity, len(rsName))
	for index := 0; index < len(rsList); index++ {
		for i, name := range rsName {
			modelSortings[i] = append(modelSortings[i], rsList[index][name])
		}
	}

func (rs *ResourceSummary) getIndex(crn ClusterResourceNode) int {
	index := math.MaxInt
+	mu.Lock()
+	defer mu.Unlock()

In our panic cases, this function can never return -1 in normal conditions because the the first element of nums is 0, but it did. That's why we tried mutex.

karmada/pkg/modeling/modeling.go

Line 116 in 0b3e0d9

    
           func searchLastLessElement(nums []resource.Quantity, target resource.Quantity) int {

halfrost · 2023-09-13T19:55:34Z

@NickYadance

Your modification certainly makes sense. First, regarding the initialization of modelSortings, I believe this function only runs once during initialization, and scenarios where this function is called concurrently don't exist. Hence, I didn't add a lock().

The getIndex function looks like needs a lock. I've been focused on ensuring there are no bugs in my algorithm logic that I overlooked the fact that this piece of code could be called concurrently in a multi-threaded environment. I tested in a local environment with only 2 machines and didn't encounter any concurrency issue.

Please extend my gratitude to your colleague. I'm truly grateful for his fix.

NickYadance · 2023-09-14T02:28:21Z

Hope this helps, though i failed to find the root cause that can cause the concurrent situation.
It just works ...

jwcesign · 2023-09-14T12:03:35Z

Hi, @halfrost @NickYadance

I reviewed the code and found that the function AddToResourceSummary will be called when there is a change in the status of the clusters.

The default concurrency for handling cluster status is 5:

karmada/cmd/controller-manager/app/options/options.go

Line 206 in 5c77f45

    
           flags.IntVar(&o.ConcurrentClusterSyncs, "concurrent-cluster-syncs", 5, "The number of Clusters that are allowed to sync concurrently.")

Based on #3854 (comment), the root reason is: Concurrent read and write global variable modelSortings.

And I also checked the code here:
https://github.com/karmada-io/karmada/blob/5c77f4516eb9c27c0ce38ed25a389ab9e56d8122/pkg/controllers/status/cluster_status_controller.go#L595C1-L620C1

The modelSortings looks like can be changed to a local variable, then this problem will not exist, what do you think? @halfrost

halfrost · 2023-10-03T22:15:15Z

@jwcesign Hi, I'm back and so sorry for the late reply. I got Covid last two weeks. I felt so bad. I have recovered. I have 2 questions about your comment:

The number 5, how did you come up with this number? By lots of concurrent testing?
For the modelSortings, modelSortings cannot be a local variable because this state needs to be kept up to date in real time. In the binary search below, we need to search for the latest status. I prefer the above approach of adding a lock(). After I locked it locally, there was indeed no panic.

Regarding the concurrency investigation of function AddToResourceSummary, what you actually mean is that the getAllocatableModelings() function will be called concurrently in the controller, right? AddToResourceSummary will be called in the function getAllocatableModelings(). I just tested it locally and I can indeed see a lot of calls to the AddToResourceSummary function. I think this function caused the panic. I plan to add lock() every time I update the resource. Do you agree with me about this solution?

I haven't submitted the PR yet because I'm afraid that the change to add lock() will be too general and affect performance. In all update operations, resources are only added and not deleted.

jwcesign · 2023-10-07T02:06:05Z

Hi, @halfrost

By reviewing the code, getAllocatableModelings will be called

karmada/pkg/controllers/status/cluster_status_controller.go

Line 108 in 5c77f45

func (c *ClusterStatusController) Reconcile(ctx context.Context, req controllerruntime.Request) (controllerruntime.Result, error) {

, and the concurrency of calling this function is 5 by default.
function AddToResourceSummary will initialize defaultModelSorting and modelSortings every time it's called, and function AddToResourceSummary will use them, and these two functions are only used in getAllocatableModelings. So I think defaultModelSorting and modelSortings can be local variables in the function getAllocatableModelings. And if they are local variables in the function getAllocatableModelings, the value is real-time.

Do I understand correctly for 2? @halfrost

chaosi-zju · 2023-10-07T02:54:39Z

Hi, @halfrost

I generally agree to what @jwcesign said:

defaultModelSorting and modelSortings can be local variables in the function getAllocatableModelings

Actually, I think defaultModelSorting and modelSortings can be the member variables of ResourceSummary, in this way, it can avoid concurrency issues caused by global variables and avoid the use of locks.

halfrost · 2023-10-11T19:45:51Z

@jwcesign Thanks for the advice, let me test it this weekend. I will find out the final solution.

@chaosi-zju OK, make sense. Let me revise this part of the code. Thanks.

halfrost · 2023-10-18T07:24:25Z

@jwcesign @chaosi-zju

I have already converted modelSortings into a member variable of ResourceSummary. Modifying defaultModelSorting is more challenging because it's used as a global variable within the clusterResourceNodeComparator comparator. Additionally, this comparator is used in other comparison functions.

🙏Please help me review the pr #4145

tedli added the kind/bug Categorizes issue or PR as related to a bug. label Jul 28, 2023

RainbowMango added this to the v1.7 milestone Jul 29, 2023

RainbowMango added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 29, 2023

karmada-bot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 29, 2023

liangyuanpeng mentioned this issue Aug 17, 2023

ResourceSummary add check index for -1 #3950

Merged

RainbowMango modified the milestones: v1.7, v1.8 Aug 25, 2023

halfrost mentioned this issue Oct 18, 2023

Fix: karmada-controller-manager: panic: runtime error: index out of range #4145

Merged

karmada-bot closed this as completed in #4145 Oct 24, 2023

jwcesign mentioned this issue Oct 25, 2023

fix: change defaultModelSorting as local variable #4176

Merged

karmada-controller-manager: panic: runtime error: index out of range #3854

karmada-controller-manager: panic: runtime error: index out of range #3854

Comments

tedli commented Jul 28, 2023 • edited by RainbowMango Loading

liangyuanpeng commented Jul 28, 2023 • edited Loading

RainbowMango commented Jul 28, 2023

tedli commented Jul 29, 2023

RainbowMango commented Jul 29, 2023

karmada-bot commented Jul 29, 2023

halfrost commented Jul 29, 2023 • edited Loading

halfrost commented Jul 29, 2023

tedli commented Jul 31, 2023

halfrost commented Aug 2, 2023

RainbowMango commented Aug 2, 2023

halfrost commented Aug 3, 2023

halfrost commented Aug 3, 2023

tedli commented Aug 3, 2023

halfrost commented Aug 3, 2023 • edited Loading

RainbowMango commented Aug 8, 2023

halfrost commented Aug 9, 2023

tedli commented Aug 9, 2023

halfrost commented Aug 18, 2023

RainbowMango commented Aug 18, 2023

tedli commented Aug 18, 2023

RainbowMango commented Aug 18, 2023

halfrost commented Aug 18, 2023

RainbowMango commented Aug 18, 2023

RainbowMango commented Aug 18, 2023

tedli commented Aug 18, 2023

tedli commented Aug 21, 2023 • edited Loading

halfrost commented Aug 22, 2023

NickYadance commented Sep 13, 2023

halfrost commented Sep 13, 2023

NickYadance commented Sep 14, 2023

jwcesign commented Sep 14, 2023

halfrost commented Oct 3, 2023 • edited Loading

jwcesign commented Oct 7, 2023 • edited Loading

chaosi-zju commented Oct 7, 2023 • edited Loading

halfrost commented Oct 11, 2023

halfrost commented Oct 18, 2023

tedli commented Jul 28, 2023 •

edited by RainbowMango

Loading

liangyuanpeng commented Jul 28, 2023 •

edited

Loading

halfrost commented Jul 29, 2023 •

edited

Loading

halfrost commented Aug 3, 2023 •

edited

Loading

tedli commented Aug 21, 2023 •

edited

Loading

halfrost commented Oct 3, 2023 •

edited

Loading

jwcesign commented Oct 7, 2023 •

edited

Loading

chaosi-zju commented Oct 7, 2023 •

edited

Loading