
[BUG]: Update resources limits for controller-manager to fix OOMKilled error #982

Closed
cassanellicarlo opened this issue Sep 13, 2023 · 14 comments
Labels: area/csm-operator, type/bug

@cassanellicarlo

How can the Team help you today?

I'm using the dell-csm-operator-certified.v1.2.0 operator on OpenShift 4.12. I installed it successfully, but the controller-manager is getting OOM-killed because it consumes more memory than the configured limit.

The default memory limit for the container is 256Mi. How can one increase it in the ContainerStorageModule resource?

@cassanellicarlo added the type/question label Sep 13, 2023
@jooseppi-luna self-assigned this Sep 13, 2023
@jooseppi-luna (Contributor)

Hi Carlo, thanks for the question. Are you getting OOM-killed before you do anything with the operator, or only while performing operations with it? Do you have any relevant logs?

@jooseppi-luna (Contributor) commented Sep 13, 2023

It looks like I am able to increase the memory limit by editing line 921 of the deploy/operator.yaml file. After changing that line to 512Mi and reinstalling, I get the following when I describe the controller-manager pod (snipped for readability):

[root@master-1-095zyzFtPRfV5 csm-operator]# k describe pod -n dell-csm-operator   dell-csm-operator-controller-manager-6bd6569b56-bqbs5
...
Containers:
  manager:
    Container ID:  containerd://17f0b8031735e468fdb066ae31e119b174a5ab567a4d7d69aa386714b4701f62
    Image:         docker.io/dellemc/dell-csm-operator:v1.2.0
    Image ID:      docker.io/dellemc/dell-csm-operator@sha256:814895bdff2f49c0f9a7789490e6316688f85e3cab2c0a6215fa0f68034c5f32
    Port:          <none>
    Host Port:     <none>
    Command:
      /manager
    Args:
      --leader-elect
    State:          Running
      Started:      Wed, 13 Sep 2023 09:40:25 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  512Mi
    Requests:
      cpu:        100m
      memory:     192Mi
...
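
(For reference, a minimal sketch of the edited stanza in deploy/operator.yaml, mirroring the values in the describe output above; the exact line number may differ between releases, and the comments are editorial:)

    # deploy/operator.yaml -- manager container resources (sketch)
    resources:
      limits:
        cpu: 200m
        memory: 512Mi   # raised from the 256Mi default to avoid the OOM kill
      requests:
        cpu: 100m
        memory: 192Mi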

@jooseppi-luna (Contributor) commented Sep 13, 2023

If you could provide us with details of anything else you might have installed on the system, as well as everything the operator did leading up to the OOM kill, that would be super helpful! Thanks.

@cassanellicarlo (Author) commented Sep 13, 2023

Startup logs of the controller-manager:

2023-09-13T12:35:32.398Z DEBUG workspace/main.go:79 Operator Version {"TraceId": "main", "Version": "1.2.0", "Commit ID": "081702a4c6969af8038a31eaf072b13554323f51", "Commit SHA": "Fri, 23 Jun 2023 07:46:51 UTC"}
2023-09-13T12:35:32.398Z DEBUG workspace/main.go:80 Go Version: go1.20.5 {"TraceId": "main"}
2023-09-13T12:35:32.398Z DEBUG workspace/main.go:81 Go OS/Arch: linux/amd64 {"TraceId": "main"}
I0913 12:35:33.500640 1 request.go:665] Waited for 1.01097461s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/logging.openshift.io/v1
2023-09-13T12:35:39.751Z INFO workspace/main.go:93 Openshift environment {"TraceId": "main"}
2023-09-13T12:35:39.753Z INFO workspace/main.go:132 Current kubernetes version is 1.25 which is a supported version {"TraceId": "main"}
2023-09-13T12:35:39.754Z INFO workspace/main.go:143 Use ConfigDirectory /etc/config/dell-csm-operator {"TraceId": "main"}
I0913 12:35:43.505285 1 request.go:665] Waited for 3.743544237s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/apps.gitlab.com/v1beta2?timeout=32s
1.6946085471103349e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
1.6946085471115081e+09 INFO setup starting manager
1.694608547111709e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
1.694608547111713e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}

Last logs from the previous (restarted) container:

[BRUH] toleration t: {Key:node-role.kubernetes.io/infra Operator:Exists Value: Effect:NoSchedule TolerationSeconds:<nil>}
2023-09-13T12:35:05.760Z DEBUG drivers/commonconfig.go:40 GetController {"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/controller.yaml"}
2023-09-13T12:35:05.760Z ERROR zap@v1.21.0/sugar.go:173 Ignored key without a value. {"TraceId": "<omitted>-unity-1", "ignored": {"driver":{"csiDriverType":"unity","csiDriverSpec":{"fSGroupPolicy":"ReadWriteOnceWithFSType"},"configVersion":"v2.7.0","replicas":2,"dnsPolicy":"ClusterFirstWithHostNet","common":{"image":"dellemc/csi-unity:v2.7.0","imagePullPolicy":"IfNotPresent","envs":[{"name":"X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS","value":"false"},{"name":"X_CSI_EPHEMERAL_STAGING_PATH","value":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/"},{"name":"X_CSI_ISCSI_CHROOT","value":"/noderoot"},{"name":"X_CSI_UNITY_SYNC_NODEINFO_INTERVAL","value":"15"},{"name":"KUBELET_CONFIG_DIR","value":"/var/lib/kubelet"},{"name":"CSI_LOG_LEVEL","value":"info"},{"name":"CERT_SECRET_COUNT","value":"1"},{"name":"X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION","value":"true"}]},"controller":{"envs":[{"name":"X_CSI_HEALTH_MONITOR_ENABLED","value":"true"}],"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSc...
2023-09-13T12:35:05.760Z DEBUG drivers/commonconfig.go:51 DriverSpec {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:05.764Z DEBUG drivers/commonconfig.go:72 Adding toleration {"TraceId": "<omitted>-unity-1", "t": {"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}}
2023-09-13T12:35:05.764Z INFO drivers/commonconfig.go:111 Container to be removed {"TraceId": "<omitted>-unity-1", "name": "external-health-monitor"}
2023-09-13T12:35:05.764Z INFO controllers/csm_controller.go:530 Checking if standalone modules need clean up {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:05.775Z INFO controllers/csm_controller.go:723 Starting SYNC for default-source-cluster cluster {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:05.976Z INFO serviceaccount/serviceaccount.go:45 ServiceAccount already exists {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-node"}
2023-09-13T12:35:05.976Z INFO serviceaccount/serviceaccount.go:45 ServiceAccount already exists {"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-controller"}
2023-09-13T12:35:06.077Z INFO rbac/clusterrole.go:45 Updating ClusterRoleName:<omitted>-unity-node {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.111Z INFO rbac/clusterrole.go:45 Updating ClusterRoleName:<omitted>-unity-controller {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.242Z INFO rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-node {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.275Z INFO rbac/rolebindings.go:40 Updating ClusterRoleBindingName:<omitted>-unity-controller {"TraceId": "<omitted>-unity-1"}
2023-09-13T12:35:06.407Z INFO csidriver/csidriver.go:41 CSIDriver Object exist {"TraceId": "<omitted>-unity-1", "Name:": "csi-unity.dellemc.com"}

The only error I'm seeing is "ERROR zap@v1.21.0/sugar.go:173 Ignored key without a value.", but I don't know whether it is related.

I'm installing the operator via an OLM Subscription; I'm not using operator.yaml.

Metrics of the controller-manager: [screenshot of memory usage omitted]

@cassanellicarlo (Author)

I manually changed the limits in the operator Deployment YAML from the OpenShift console:

              - resources:
                  limits:
                    cpu: 200m
                    memory: 500Mi
                  requests:
                    cpu: 100m
                    memory: 200Mi

and now the controller-manager seems to work fine, without restarts.
But that's not a good way to set it.
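
(One possible way to make such a change persist under OLM, sketched here as an assumption rather than a documented procedure: since OLM reconciles the operator Deployment from the ClusterServiceVersion, patch the CSV instead of the Deployment. The namespace placeholder and the deployment/container indexes below are assumptions; verify them against your own CSV first.)

    # Sketch: raise the manager memory limit via the CSV so OLM does not revert it.
    # <operator-namespace> is a placeholder; the 0/0 indexes are assumptions --
    # inspect `kubectl get csv dell-csm-operator-certified.v1.2.0 -o yaml` first.
    kubectl patch csv dell-csm-operator-certified.v1.2.0 -n <operator-namespace> \
      --type=json \
      -p='[{"op":"replace","path":"/spec/install/spec/deployments/0/spec/template/spec/containers/0/resources/limits/memory","value":"512Mi"}]'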

@jooseppi-luna (Contributor)

OK, that's good; I'm glad it's at least not getting killed right now. I agree that's not a good long-term solution; we will work on a better fix and keep this issue updated.

@bharathsreekanth (Contributor)

@jooseppi-luna can you confirm whether this is the same as #184?

@jooseppi-luna (Contributor)

@bharathsreekanth it's related but not the same: #184 is about adding resource limits to the Helm charts. The operator already sets these resource limits; they are what we are adjusting here to make the deployment work. See here for where we set them in the operator.

@jooseppi-luna (Contributor) commented Sep 13, 2023

@cassanellicarlo I spoke with @rensyct, and it would help us to have these three things from you to figure this out:

  1. Details on everything you installed or attempted to install with the operator before it got killed.
  2. The sample files you used to install any drivers/modules (e.g., I can see you are installing csi-unity; please attach your edited sample file so we can review and test it on our end).
  3. Complete operator logs from your controller-manager; you can get them like this: kubectl logs dell-csm-operator-controller-manager-xxxxxxxxxx-xxxxx -n dell-csm-operator > operator-logs.txt (fill in your pod name and namespace).

@cassanellicarlo (Author) commented Sep 14, 2023

Operator: dell-csm-operator-certified.v1.2.0

ContainerStorageModule

apiVersion: storage.dell.com/v1
kind: ContainerStorageModule
metadata:
  name: 
  namespace: {{ .Values.namespace }}
spec:
  driver:
    csiDriverType: "unity"
    csiDriverSpec:
      # fsGroupPolicy: Defines if the underlying volume supports changing ownership and permission of the volume before being mounted.
      # Allowed values: ReadWriteOnceWithFSType, File , None
      # Default value: ReadWriteOnceWithFSType
      fSGroupPolicy: "ReadWriteOnceWithFSType"
    # Config version for CSI Unity v2.7.0 driver
    configVersion: {{ .Values.driver.release }}
    # Controller count
    replicas: 2
    dnsPolicy: ClusterFirstWithHostNet
    forceUpdate: false
    forceRemoveDriver: true
    common:
      # Image for CSI Unity driver v2.7.0
      image: "dellemc/csi-unity:{{ .Values.driver.release }}"
      imagePullPolicy: IfNotPresent
      envs:
          # X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS - Flag to enable sharing of volumes across multiple pods within the same node in RWO access mode.
          # Allowed values: boolean
          # Default value: "false"
          # Examples : "true" , "false"
        - name: X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS
          value: "false"
        - name: X_CSI_EPHEMERAL_STAGING_PATH
          value: "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/"
        # X_CSI_ISCSI_CHROOT is the path to which the driver will chroot before
        # running any iscsi commands. This value should only be set when instructed
        # by technical support
        - name: X_CSI_ISCSI_CHROOT
          value: "/noderoot"
        # X_CSI_UNITY_SYNC_NODEINFO_INTERVAL - Time interval to add node info to array. Default 15 minutes. Minimum value should be 1.
        # Allowed values: integer
        # Default value: 15
        # Examples : 0 , 2
        - name: X_CSI_UNITY_SYNC_NODEINFO_INTERVAL
          value: "15"
        # Specify kubelet config dir path.
        # Ensure that the config.yaml file is present at this path.
        # Default value: None
        - name: KUBELET_CONFIG_DIR
          value: /var/lib/kubelet
        # CSI_LOG_LEVEL is used to set the logging level of the driver.
        # Allowed values: "error", "warn"/"warning", "info", "debug"
        # Default value: "info"
        - name: CSI_LOG_LEVEL
          value: {{ .Values.logLevel }}
        # TENANT_NAME - Tenant name that needs to be added while adding a host entry to the array.
        # Allowed values: string
        # Default value: ""
        # Examples : "tenant2" , "tenant3"
        - name: TENANT_NAME
          value: ""
        # CERT_SECRET_COUNT: Represents number of certificate secrets, which user is going to create for
        # ssl authentication. (unity-cert-0..unity-cert-n)
        # This field is only verified if X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION is set to false
        # Allowed values: n, where n > 0
        # Default value: None          
        - name: CERT_SECRET_COUNT
          value: "1"
        # X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION: Specifies if the driver is going to validate unisphere certs while connecting to the Unisphere REST API interface.
        # If it is set to false, then a secret unity-certs has to be created with an X.509 certificate of CA which signed the Unisphere certificate
        # Allowed values:
        #   true: skip Unisphere API server's certificate verification
        #   false: verify Unisphere API server's certificates 
        # Default value: true	
        - name: X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION
          value: "true"

    sideCars:
      # health monitor is disabled by default, refer to driver documentation before enabling it
      - name: external-health-monitor
        enabled: false
        args: ["--monitor-interval=60s"]
    controller:
      envs:
        # X_CSI_HEALTH_MONITOR_ENABLED: Enable/Disable health monitor of CSI volumes from Controller plugin - volume condition.
        # Install the 'external-health-monitor' sidecar accordingly.
        # Allowed values:
        #   true: enable checking of health condition of CSI volumes
        #   false: disable checking of health condition of CSI volumes
        # Default value: false
        - name: X_CSI_HEALTH_MONITOR_ENABLED
          value: "true"
      #nodeSelector:
      # Uncomment if nodes you wish to use have the node-role.kubernetes.io/control-plane taint
      #  node-role.kubernetes.io/control-plane: ""

      # tolerations: Define tolerations for the controllers, if required.
      # Leave as blank to install controller on worker nodes
      # Default value: None
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists
    node:
      envs:
        # X_CSI_HEALTH_MONITOR_ENABLED: Enable/Disable health monitor of CSI volumes from node plugin - volume usage
        # Allowed values:
        #   true: enable checking of health condition of CSI volumes
        #   false: disable checking of health condition of CSI volumes
        # Default value: false
        - name: X_CSI_HEALTH_MONITOR_ENABLED
          value: "true"

      # nodeSelector: Define node selection constraints for node pods.
      # For the pod to be eligible to run on a node, the node must have each
      # of the indicated key-value pairs as labels.
      # Leave as blank to consider all nodes
      # Allowed values: map of key-value pairs
      # Default value: None
      #nodeSelector:
      # Uncomment if nodes you wish to use have the node-role.kubernetes.io/control-plane taint
      #  node-role.kubernetes.io/control-plane: ""

      # tolerations: Define tolerations for the controllers, if required.
      # Leave as blank to install controller on worker nodes
      # Default value: None
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra
          operator: Exists

Logs

2023-09-13T12:24:00.532Z	DEBUG	workspace/main.go:79	Operator Version	{"TraceId": "main", "Version": "1.2.0", "Commit ID": "081702a4c6969af8038a31eaf072b13554323f51", "Commit SHA": "Fri, 23 Jun 2023 07:46:51 UTC"}
2023-09-13T12:24:00.532Z	DEBUG	workspace/main.go:80	Go Version: go1.20.5	{"TraceId": "main"}
2023-09-13T12:24:00.532Z	DEBUG	workspace/main.go:81	Go OS/Arch: linux/amd64	{"TraceId": "main"}
I0913 12:24:01.656484       1 request.go:665] Waited for 1.042073415s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/ingress.operator.openshift.io/v1
2023-09-13T12:24:07.908Z	INFO	workspace/main.go:93	Openshift environment	{"TraceId": "main"}
2023-09-13T12:24:07.911Z	INFO	workspace/main.go:132	Current kubernetes version is 1.25 which is a supported version 	{"TraceId": "main"}
2023-09-13T12:24:07.911Z	INFO	workspace/main.go:143	Use ConfigDirectory /etc/config/dell-csm-operator	{"TraceId": "main"}
I0913 12:24:11.663191       1 request.go:665] Waited for 3.742098704s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v1alpha1?timeout=32s
1.6946078552691364e+09	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
1.6946078552714586e+09	INFO	setup	starting manager
1.6946078552724323e+09	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0913 12:24:15.272451       1 leaderelection.go:248] attempting to acquire leader lease dell-csm/090cae6a.dell.com...
1.6946078552724354e+09	INFO	Starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0913 12:24:34.016487       1 leaderelection.go:258] successfully acquired lease dell-csm/090cae6a.dell.com
1.6946078740166562e+09	INFO	controller.containerstoragemodule	Starting EventSource	{"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "source": "kind source: *v1.ContainerStorageModule"}
1.694607874016701e+09	INFO	controller.containerstoragemodule	Starting Controller	{"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule"}
1.6946078740167027e+09	DEBUG	events	Normal	{"object": {"kind":"ConfigMap","namespace":"dell-csm","name":"090cae6a.dell.com","uid":"80857b55-a5bd-405f-91f6-9e50580ecc85","apiVersion":"v1","resourceVersion":"1282335978"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-7b8dc694fd-9vh5n_de44f76c-dbdc-4a1a-8624-e1420fff6861 became leader"}
1.6946078740168204e+09	DEBUG	events	Normal	{"object": {"kind":"Lease","namespace":"dell-csm","name":"090cae6a.dell.com","uid":"3ae38669-e585-44cf-9ae1-7a7cc849e250","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1282335979"}, "reason": "LeaderElection", "message": "dell-csm-operator-controller-manager-7b8dc694fd-9vh5n_de44f76c-dbdc-4a1a-8624-e1420fff6861 became leader"}
1.6946078741176162e+09	INFO	controller.containerstoragemodule	Starting workers	{"reconciler group": "storage.dell.com", "reconciler kind": "ContainerStorageModule", "worker count": 1}
2023-09-13T12:24:34.117Z	INFO	controllers/csm_controller.go:203	################Starting Reconcile##############	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:34.117Z	INFO	controllers/csm_controller.go:206	reconcile for	{"TraceId": "<omitted>-unity-1", "Namespace": "dell-csm", "Name": "<omitted>-unity", "Attempt": 1}
2023-09-13T12:24:34.117Z	DEBUG	drivers/unity.go:88	preCheck	{"TraceId": "<omitted>-unity-1", "secrets": 1, "certCount": 1, "Namespace": "dell-csm"}
2023-09-13T12:24:35.918Z	INFO	controllers/csm_controller.go:1202	proceeding with modification of driver install	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:35.923Z	INFO	controllers/csm_controller.go:1130	Owner reference is found and matches	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:35.923Z	INFO	utils/status.go:156	
daemonset status for cluster: default-source-cluster	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.226Z	INFO	utils/status.go:181	daemonset pod <omitted>-unity-node-dzj26 : Running	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.226Z	INFO	utils/status.go:181	daemonset pod <omitted>-unity-node-7lr7z : Running	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:181	daemonset pod <omitted>-unity-node-rz2xj : Running	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:181	daemonset pod <omitted>-unity-node-p2gr5 : Running	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:181	daemonset pod <omitted>-unity-node-wxk4t : Running	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:181	daemonset pod <omitted>-unity-node-gzs7r : Running	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:181	daemonset pod <omitted>-unity-node-srh72 : Running	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:204	daemonset status available pods 7	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:205	daemonset status failedCount pods 0	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:206	daemonset status desired pods 7	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:239	deployment controllerReplicas [2]	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:240	deployment controllerStatus.Available [2]	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:242	daemonset expected [7]	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:243	daemonset nodeStatus.Available [7]	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:249	calculate overall state [Succeeded]	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:277	Driver State	{"TraceId": "<omitted>-unity-1", "Controller": {"available":"2","desired":"2","failed":"0"}, "Node": {"available":"7","desired":"7","failed":"0"}}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:365	HandleSuccess Driver state 	{"TraceId": "<omitted>-unity-1", "newStatus.State": "Running"}
2023-09-13T12:24:36.227Z	INFO	utils/status.go:369	HandleSuccess Driver state didn't change from Running	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	INFO	controllers/csm_controller.go:887	Getting unity CSI Driver for Dell Technologies	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.227Z	DEBUG	drivers/commonconfig.go:333	GetConfigMap	{"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/driver-config-params.yaml"}
2023-09-13T12:24:36.227Z	DEBUG	drivers/commonconfig.go:368	GetCSIDriver	{"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/csidriver.yaml"}
2023-09-13T12:24:36.228Z	DEBUG	drivers/commonconfig.go:390	GetCSIDriver	{"TraceId": "<omitted>-unity-1", "fsGroupPolicy": "ReadWriteOnceWithFSType"}
2023-09-13T12:24:36.228Z	DEBUG	drivers/commonconfig.go:176	GetNode	{"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/node.yaml"}
[BRUH] toleration t: {Key:node-role.kubernetes.io/infra Operator:Exists Value: Effect:NoSchedule TolerationSeconds:<nil>}
2023-09-13T12:24:36.232Z	DEBUG	drivers/commonconfig.go:40	GetController	{"TraceId": "<omitted>-unity-1", "configMapPath": "/etc/config/dell-csm-operator/driverconfig/unity/v2.7.0/controller.yaml"}
2023-09-13T12:24:36.302Z	ERROR	zap@v1.21.0/sugar.go:173	Ignored key without a value.	{"TraceId": "<omitted>-unity-1", "ignored": {"driver":{"csiDriverType":"unity","csiDriverSpec":{"fSGroupPolicy":"ReadWriteOnceWithFSType"},"configVersion":"v2.7.0","replicas":2,"dnsPolicy":"ClusterFirstWithHostNet","common":{"image":"dellemc/csi-unity:v2.7.0","imagePullPolicy":"IfNotPresent","envs":[{"name":"X_CSI_UNITY_ALLOW_MULTI_POD_ACCESS","value":"false"},{"name":"X_CSI_EPHEMERAL_STAGING_PATH","value":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/"},{"name":"X_CSI_ISCSI_CHROOT","value":"/noderoot"},{"name":"X_CSI_UNITY_SYNC_NODEINFO_INTERVAL","value":"15"},{"name":"KUBELET_CONFIG_DIR","value":"/var/lib/kubelet"},{"name":"CSI_LOG_LEVEL","value":"info"},{"name":"TENANT_NAME"},{"name":"CERT_SECRET_COUNT","value":"1"},{"name":"X_CSI_UNITY_SKIP_CERTIFICATE_VALIDATION","value":"true"}]},"controller":{"envs":[{"name":"X_CSI_HEALTH_MONITOR_ENABLED","value":"true"}],"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}]},"node":{"envs":[{"name":"X_CSI_HEALTH_MONITOR_ENABLED","value":"true"}],"tolerations":[{"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}]},"sideCars":[{"name":"external-health-monitor","enabled":false,"args":["--monitor-interval=60s"]}],"forceRemoveDriver":true}}}
2023-09-13T12:24:36.302Z	DEBUG	drivers/commonconfig.go:51	DriverSpec 	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.306Z	DEBUG	drivers/commonconfig.go:72	Adding toleration	{"TraceId": "<omitted>-unity-1", "t": {"key":"node-role.kubernetes.io/infra","operator":"Exists","effect":"NoSchedule"}}
2023-09-13T12:24:36.306Z	INFO	drivers/commonconfig.go:111	Container to be removed	{"TraceId": "<omitted>-unity-1", "name": "external-health-monitor"}
2023-09-13T12:24:36.306Z	INFO	controllers/csm_controller.go:530	Checking if standalone modules need clean up	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.316Z	INFO	controllers/csm_controller.go:723	Starting SYNC for default-source-cluster cluster	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.518Z	INFO	serviceaccount/serviceaccount.go:45	ServiceAccount already exists	{"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-node"}
2023-09-13T12:24:36.518Z	INFO	serviceaccount/serviceaccount.go:45	ServiceAccount already exists	{"TraceId": "<omitted>-unity-1", "Name:": "<omitted>-unity-controller"}
2023-09-13T12:24:36.619Z	INFO	rbac/clusterrole.go:45	Updating ClusterRoleName:<omitted>-unity-node	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.658Z	INFO	rbac/clusterrole.go:45	Updating ClusterRoleName:<omitted>-unity-controller	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.794Z	INFO	rbac/rolebindings.go:40	Updating ClusterRoleBindingName:<omitted>-unity-node	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.835Z	INFO	rbac/rolebindings.go:40	Updating ClusterRoleBindingName:<omitted>-unity-controller	{"TraceId": "<omitted>-unity-1"}
2023-09-13T12:24:36.973Z	INFO	csidriver/csidriver.go:41	CSIDriver Object exist	{"TraceId": "<omitted>-unity-1", "Name:": "csi-unity.dellemc.com"}

@jooseppi-luna (Contributor)

Thanks for the logs! We will investigate to see if we can replicate the issue and decide whether we should bump up the limits in an upcoming release. One thing I noticed: the health monitor sidecar is disabled, but the health monitor env var is enabled for both controller and node. Is that intentional, and what use case is it for?
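
(For reference, a sketch of a consistent configuration, assuming health monitoring is actually wanted: per the comments in the CR above, the sidecar should be enabled together with the env vars.)

    sideCars:
      # enable the sidecar alongside the X_CSI_HEALTH_MONITOR_ENABLED env vars
      - name: external-health-monitor
        enabled: true
        args: ["--monitor-interval=60s"]
    controller:
      envs:
        - name: X_CSI_HEALTH_MONITOR_ENABLED
          value: "true"
    node:
      envs:
        - name: X_CSI_HEALTH_MONITOR_ENABLED
          value: "true"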

@bharathsreekanth (Contributor) commented Oct 13, 2023

@chimanjain @jooseppi-luna do we have an internal ticket to track this? If so, we need to move this from a question to the appropriate bucket in GH.

@chimanjain changed the title from "[QUESTION]: How can i set resources limits for controller-manager?" to "[BUG]: Update resources limits for controller-manager to fix OOMKilled error" Oct 19, 2023
@chimanjain added the type/bug label and removed the type/question and needs-triage labels Oct 19, 2023
@cassanellicarlo (Author)

@jooseppi-luna any news on this?

@jooseppi-luna (Contributor)

@cassanellicarlo sorry for the late follow-up! We have increased the limits in the upcoming CSM 1.9 release (csm-operator v1.4.0). If you have any further questions or issues, please file them here and we will get to them ASAP.
