
OOM kill of the pelorus operator controller manager #777

Closed
mpryc opened this issue Jan 13, 2023 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mpryc
Collaborator

mpryc commented Jan 13, 2023

Another OOM kill of the pelorus-operator-controller-manager was spotted.

One approach is to update the documentation so that users know how to adjust the operator's resource limits dynamically.

(screenshot: oom_kill, memory usage diagram of the controller manager pod)

It's unclear why this OOM kill happened: from the diagram the pod did not reach its 512MiB limit, but perhaps the node did not have enough memory to allocate for this pod?
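
As a first sanity check, the pod's live memory consumption can be compared against its configured requests and limits with something along these lines (the pod name suffix is illustrative; oc adm top requires the cluster metrics API to be available):

$ oc adm top pod -n pelorus --containers
$ oc get pod pelorus-operator-controller-manager-<hash> -n pelorus \
    -o jsonpath='{.spec.containers[*].resources}'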

@mpryc
Collaborator Author

mpryc commented Jan 13, 2023

To check the node resource usage/limits:

$ oc describe pod pelorus-operator-controller -n pelorus | grep Node\:
Node:         ip-10-0-164-133.eu-central-1.compute.internal/10.0.164.133

Then describe the node, using the hostname from the Node: output (without the /IP suffix):

$ oc describe node ip-10-0-164-133.eu-central-1.compute.internal
[...]
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         803m (22%)    1400m (40%)
  memory                      2076Mi (13%)  2036Mi (13%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
[...]
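
If the cluster metrics API is available, the node's actual memory consumption (as opposed to the requests/limits summary above) can also be checked; a quick sketch:

$ oc adm top node ip-10-0-164-133.eu-central-1.compute.internal
$ oc describe node ip-10-0-164-133.eu-central-1.compute.internal | grep -A 8 Allocatable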

@weshayutin weshayutin added the kind/bug Categorizes issue or PR as related to a bug. label Jan 13, 2023
@weshayutin weshayutin pinned this issue Jan 13, 2023
@mpryc
Collaborator Author

mpryc commented Jan 16, 2023

After a short debug session:

  1. OpenShift cluster version: 4.9.51
  2. The node has no problem with resources. There is plenty of memory available to accommodate the pod, so the node is definitely not over-committed.
  3. The Pelorus helm operator manager is OOMKilled almost every minute.
  4. There were some spikes where pelorus-operator-manager was taking more than 512MiB, around 530MiB.
  5. The pelorus-operator-manager mostly uses around 200MiB, well below the 512MiB limit. It has a request of 512MiB and the limit set to the same value, which means it won't over-commit.
  6. On a shell inside the pelorus-operator-manager pod:
ls /proc/sys/vm/overcommit_*; cat /proc/sys/vm/overcommit_*
0
1
50
  7. Wasn't able to adjust oom_score_adj, permission denied (see the read-only checks below):
echo -1000 > /proc/1/oom_score_adj
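
The read-only side can still be inspected, both from inside the pod and via the API; a sketch (the pod name suffix is illustrative):

# Inside the pod: current OOM score of PID 1 (reads work even though writes are denied)
cat /proc/1/oom_score /proc/1/oom_score_adj

# From outside: confirm the last container termination reason was OOMKilled
$ oc get pod pelorus-operator-controller-manager-<hash> -n pelorus \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'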

@mpryc
Collaborator Author

mpryc commented Jan 16, 2023

After discussing with @dmesser, it may be necessary to adjust the Maximum Concurrent Reconciles setting:

https://sdk.operatorframework.io/docs/building-operators/helm/reference/advanced_features/max_concurrent_reconciles/

@mpryc
Collaborator Author

mpryc commented Jan 16, 2023

@milles9393
Please try the following:

oc edit csv -n pelorus pelorus-operator.v0.0.1

Find the containers: section and add the - --max-concurrent-reconciles= argument (by default it's set to 1):

[...]
                        values:
                        - linux
              containers:
              - args:
                - --health-probe-bind-address=:8081
                - --metrics-bind-address=127.0.0.1:8080
                - --max-concurrent-reconciles=5
                - --leader-elect
                - --leader-election-id=pelorus-operator
                image: quay.io/pelorus/pelorus-operator:0.0.1
                livenessProbe:
[...]

Then save it, and we will see whether there is any improvement.
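
To confirm the change propagated from the CSV to the running operator and to watch whether the OOM kills stop, something along these lines should work:

$ oc get deployment -n pelorus -o yaml | grep max-concurrent-reconciles
$ oc get pods -n pelorus -w
$ oc adm top pod -n pelorus --containers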

@weshayutin
Contributor

mpryc added a commit to mpryc/pelorus that referenced this issue Jan 18, 2023
Pelorus Operator that addresses dora-metrics#777

Signed-off-by: Michal Pryc <mpryc@redhat.com>
mpryc added a commit to mpryc/pelorus that referenced this issue Jan 18, 2023
Pelorus Operator that addresses dora-metrics#777

Signed-off-by: Michal Pryc <mpryc@redhat.com>
mpryc added a commit to mpryc/pelorus that referenced this issue Jan 18, 2023
Pelorus Operator that addresses dora-metrics#777

Signed-off-by: Michal Pryc <mpryc@redhat.com>
weshayutin pushed a commit that referenced this issue Jan 18, 2023
Pelorus Operator that addresses #777

Signed-off-by: Michal Pryc <mpryc@redhat.com>
@mpryc
Collaborator Author

mpryc commented Jan 24, 2023

The bug should be closed now.

Note that our documentation at https://pelorus.readthedocs.io/ is currently up to date and reviewed for the Pelorus Operator scenario.

@mpryc mpryc closed this as completed Jan 24, 2023
@mateusoliveira43 mateusoliveira43 unpinned this issue Jan 24, 2023