
OOM kill of the pelorus operator controller manager #777

Closed
mpryc opened this issue Jan 13, 2023 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mpryc
Collaborator

mpryc commented Jan 13, 2023

Another OOM kill of the pelorus-operator-controller-manager was spotted.

One approach is to update the documentation so that users know how to adjust the operator's resource limits dynamically.

(screenshot: oom_kill, memory usage diagram of the controller manager pod)

It's unclear why this OOM kill happened: from the diagram the pod did not reach its 512MiB limit, but perhaps the node did not have enough memory to allocate for this pod?
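
As a first sanity check, the pod's live memory consumption can be compared against its configured requests and limits with something along these lines (the pod name suffix is illustrative; oc adm top requires the cluster metrics API to be available):

$ oc adm top pod -n pelorus --containers
$ oc get pod pelorus-operator-controller-manager-<hash> -n pelorus \
    -o jsonpath='{.spec.containers[*].resources}'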

@mpryc
Collaborator Author

mpryc commented Jan 13, 2023

To check the node resource usage/limits:

$ oc describe pod pelorus-operator-controller -n pelorus | grep Node\:
Node:         ip-10-0-164-133.eu-central-1.compute.internal/10.0.164.133

Then describe the node, using the hostname from the Node: output (without the /IP suffix):

$ oc describe node ip-10-0-164-133.eu-central-1.compute.internal
[...]
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         803m (22%)    1400m (40%)
  memory                      2076Mi (13%)  2036Mi (13%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
[...]
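
If the cluster metrics API is available, the node's actual memory consumption (as opposed to the requests/limits summary above) can also be checked; a quick sketch:

$ oc adm top node ip-10-0-164-133.eu-central-1.compute.internal
$ oc describe node ip-10-0-164-133.eu-central-1.compute.internal | grep -A 8 Allocatable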

@weshayutin weshayutin added the kind/bug Categorizes issue or PR as related to a bug. label Jan 13, 2023
@weshayutin weshayutin pinned this issue Jan 13, 2023
@mpryc
Collaborator Author

mpryc commented Jan 16, 2023

After a short debug session:

  1. OpenShift cluster version: 4.9.51
  2. The node has no problem with resources. There is plenty of memory available to accommodate the pod, so the node is definitely not over-committed.
  3. The Pelorus helm operator manager is OOMKilled almost every minute.
  4. There were some spikes where pelorus-operator-manager was taking more than 512MiB, around 530MiB.
  5. The pelorus-operator-manager mostly uses around 200MiB, well below the 512MiB limit. It has a request of 512MiB and the limit set to the same value, which means it won't over-commit.
  6. On a shell inside the pelorus-operator-manager pod:
ls /proc/sys/vm/overcommit_*; cat /proc/sys/vm/overcommit_*
0
1
50
  7. Wasn't able to adjust oom_score_adj, permission denied (see the read-only checks below):
echo -1000 > /proc/1/oom_score_adj
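
The read-only side can still be inspected, both from inside the pod and via the API; a sketch (the pod name suffix is illustrative):

# Inside the pod: current OOM score of PID 1 (reads work even though writes are denied)
cat /proc/1/oom_score /proc/1/oom_score_adj

# From outside: confirm the last container termination reason was OOMKilled
$ oc get pod pelorus-operator-controller-manager-<hash> -n pelorus \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'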

@mpryc
Collaborator Author

mpryc commented Jan 16, 2023

After discussing with @dmesser, it may be necessary to adjust the Maximum Concurrent Reconciles setting:

https://sdk.operatorframework.io/docs/building-operators/helm/reference/advanced_features/max_concurrent_reconciles/

@mpryc
Collaborator Author

mpryc commented Jan 16, 2023

@milles9393
Please try the following:

oc edit csv -n pelorus pelorus-operator.v0.0.1

Find the containers: section and add the - --max-concurrent-reconciles= argument (by default it's set to 1):

[...]
                        values:
                        - linux
              containers:
              - args:
                - --health-probe-bind-address=:8081
                - --metrics-bind-address=127.0.0.1:8080
                - --max-concurrent-reconciles=5
                - --leader-elect
                - --leader-election-id=pelorus-operator
                image: quay.io/pelorus/pelorus-operator:0.0.1
                livenessProbe:
[...]

Then save it, and we will see whether there is any improvement.
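
To confirm the change propagated from the CSV to the running operator and to watch whether the OOM kills stop, something along these lines should work:

$ oc get deployment -n pelorus -o yaml | grep max-concurrent-reconciles
$ oc get pods -n pelorus -w
$ oc adm top pod -n pelorus --containers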

@weshayutin
Contributor

mpryc added a commit to mpryc/pelorus that referenced this issue Jan 18, 2023
Pelorus Operator that addresses dora-metrics#777

Signed-off-by: Michal Pryc <mpryc@redhat.com>
mpryc added a commit to mpryc/pelorus that referenced this issue Jan 18, 2023
Pelorus Operator that addresses dora-metrics#777

Signed-off-by: Michal Pryc <mpryc@redhat.com>
mpryc added a commit to mpryc/pelorus that referenced this issue Jan 18, 2023
Pelorus Operator that addresses dora-metrics#777

Signed-off-by: Michal Pryc <mpryc@redhat.com>
weshayutin pushed a commit that referenced this issue Jan 18, 2023
Pelorus Operator that addresses #777

Signed-off-by: Michal Pryc <mpryc@redhat.com>
@mpryc
Collaborator Author

mpryc commented Jan 24, 2023

The bug should be closed now.

Note that our documentation at https://pelorus.readthedocs.io/ is currently up to date and reviewed for the Pelorus Operator scenario.

@mpryc mpryc closed this as completed Jan 24, 2023
@mateusoliveira43 mateusoliveira43 unpinned this issue Jan 24, 2023