
Operator crashes with OOM error because of small limit #1468

Closed
dhrp opened this issue Aug 2, 2019 · 18 comments

@dhrp

dhrp commented Aug 2, 2019

Bug Report

What did you do?

I simply followed the instructions here:
https://www.elastic.co/elasticsearch-kubernetes

What did you expect to see?
To have an elasticsearch node up and running.

What did you see instead? Under which circumstances?

I noticed that the elastic-operator pod was being OOMKilled:

NAME                     READY   STATUS      RESTARTS   AGE
pod/elastic-operator-0   0/1     OOMKilled   6          7m56s

I noticed that the memory limit set for the operator is also quite small (only 100Mb):
https://github.com/elastic/cloud-on-k8s/blob/master/operators/config/operator/all-in-one/operator.template.yaml#L39

Environment

  • Script version: https://download.elastic.co/downloads/eck/0.9.0/all-in-one.yaml

  • Kubernetes 1.13.5

  • Version information:

https://download.elastic.co/downloads/eck/0.9.0/all-in-one.yaml

  • Kubernetes information:

EC2 on AWS, (not EKS), using Rancher 2.2.2.
Kubernetes 1.13.5

$ kubectl version
1.15

Other notes:
Interestingly enough, I used the same operator on a different cluster on DigitalOcean, and there it didn't need more than the 100Mb limit.

I have now increased the limit on my machine to 500M and it works well (I probably could have used less).
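
For anyone hitting the same thing, the fix boils down to raising the memory limit on the operator container in all-in-one.yaml. A rough sketch of the relevant section after the change (the container name and request values are taken from the manifests quoted later in this thread and may differ in the 0.9.0 template; all numbers are illustrative):

    containers:
    - name: manager
      image: docker.elastic.co/eck/eck-operator:0.9.0   # tag illustrative
      resources:
        requests:
          cpu: 100m
          memory: 50Mi
        limits:
          cpu: 1
          memory: 500Mi   # raised from the shipped default so the operator stops getting OOMKilled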

@barkbay
Contributor

barkbay commented Aug 2, 2019

Thanks for the report.
Out of curiosity:

  • How many clusters/resources are managed by the operator?
  • How long does it take for the operator to be killed by the OOMKiller?

@philrhinehart

I've been seeing the same issue over the past few days.

In my case, the increased memory usage seems to be correlated to an increase in the number of nodes being managed by the operator (currently 10).

On average, it takes about 30-40 minutes for the pod to be killed due to OOM errors.

[screenshot omitted]

Increasing the limit seems to have resolved things for me as well.

@charith-elastic charith-elastic self-assigned this Sep 27, 2019
@charith-elastic
Contributor

@philrhinehart Can you provide more information about the workload? Is it just one Elasticsearch cluster of 10 nodes managed by the operator? What is the Kubernetes cluster utilization? If it is possible to do so without revealing any sensitive information, can you provide the manifest for Elasticsearch as well?

@charith-elastic
Contributor

I am closing this issue for now since our internal testing couldn't reproduce the problem. If anybody else experiences the same problem, please re-open this issue and provide details about the environment.

@masterkain

My elastic operator kept getting OOMKilled too. After removing the resource requirements it is sitting at 188MB of memory; one cluster, one node, following https://www.elastic.co/guide/en/cloud-on-k8s/1.0/k8s-quickstart.html

@charith-elastic
Contributor

@masterkain can you provide more details about your environment? Kubernetes version, self-hosted or cloud, other workloads in the cluster, whether this is a fresh install of ECK etc.

@mcfearsome

Happening to me too.

ECK Version: 1.0.0-beta1

 15:29:32 ❯  kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-aae39f", GitCommit:"aae39f4697508697bf16c0de4a5687d464f4da81", GitTreeState:"clean", BuildDate:"2019-12-23T08:19:12Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

AWS EKS
Operator running a single 3-node cluster, a 1-node APM server, and Kibana.

Containers:
  manager:
    Container ID:  docker://f2b005e92e2423b1ee0b9bb829bf20a60500233007409f114e39e4f9b0744823
    Image:         docker.elastic.co/eck/eck-operator:1.0.0-beta1
    Image ID:      docker-pullable://docker.elastic.co/eck/eck-operator@sha256:1b612a5ae47fb93144d0ab1dea658c94e87e9eedc9449552fabad2205eee3ed8
    Port:          9876/TCP
    Host Port:     0/TCP
    Args:
      manager
      --operator-roles
      all
      --enable-debug-logs=false
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 11 Feb 2020 10:00:58 -0500
      Finished:     Tue, 11 Feb 2020 10:01:00 -0500
    Ready:          False
    Restart Count:  8
    Limits:
      cpu:     1
      memory:  150Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      OPERATOR_NAMESPACE:  elastic-system (v1:metadata.namespace)
      WEBHOOK_SECRET:      webhook-server-secret
      WEBHOOK_PODS_LABEL:  elastic-operator
      OPERATOR_IMAGE:      docker.elastic.co/eck/eck-operator:1.0.0-beta1
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from elastic-operator-token-gdnj4 (ro)

Anything else I can provide, let me know.

@anyasabo
Contributor

@mcfearsome multiple changes that reduce memory usage landed in v1.0.0 (especially https://www.elastic.co/guide/en/cloud-on-k8s/1.0-beta/release-highlights-1.0.0-beta1.html#k8s_memory_leak_in_the_eck_process), so I would recommend upgrading and/or increasing the memory limit.

@Docteur-RS

Same problem here. I couldn't deploy the CRD because the resulting pod was continually killed by OOM.

Which is strange, because on a 2-node Kubernetes cluster the operator had no issues, while on this new cluster with 7 Kubernetes nodes and more CPU/RAM it gets killed every 30 seconds...

I removed the limits/requests section for now and everything seems to be back to normal.

@sebgl
Contributor

sebgl commented Mar 26, 2020

@Docteur-RS can you provide more details about your Kubernetes environment? Which version of ECK are you using?

@Docteur-RS

Docteur-RS commented Mar 26, 2020

@sebgl
Using ECK 1.0 on-premise.
Kubernetes: 1.16.7
Each Kubernetes node has about 16 GB of RAM.

I updated the default resources to the following and it did not work either:

resources:
  limits:
    cpu: 1
    memory: 350Mi
  requests:
    cpu: 500m
    memory: 300Mi

I pretty much doubled everything...

In the end I just commented out the whole "resources" section and it fixed the OOM.
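
To spell out what "commenting out the resources section" amounts to: the operator container ends up with no requests or limits at all, so no cgroup memory limit is applied to it (and, with the operator pod's single container, the pod drops to the BestEffort QoS class). A sketch, reusing the values from above (container name and image tag illustrative):

    containers:
    - name: manager
      image: docker.elastic.co/eck/eck-operator:1.0.0   # tag illustrative
      # resources:        # whole block commented out -> no memory cgroup limit
      #   limits:
      #     cpu: 1
      #     memory: 350Mi
      #   requests:
      #     cpu: 500m
      #     memory: 300Mi

Without a limit the container is only bounded by the node's memory, which is why other comments here lean towards raising the limit instead.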

@barkbay
Contributor

barkbay commented Mar 26, 2020

Could you give us more information about the OS you are using:

  • lsb_release -a or cat /etc/os-release
  • uname -a
  • What distribution of K8S
  • Also, if you can, copy/paste the OOMKiller logs; you can usually get them using dmesg, and they look like this:
[3458703.013937] Task in ... killed as a result of limit of ....
[3458703.039061] memory: usage x, limit x, failcnt x
[3458703.044979] memory+swap: usage x, limit x, failcnt x
[3458703.051495] kmem: usage x, limit x, failcnt 0
[3458703.058078] Memory cgroup stats for x ... active_file:0KB unevictable:0KB
....
[3458703.135532] oom_reaper: reaped process x (x), now anon-rss:x, file-rss:x, shmem-rss:0kB

@barkbay barkbay reopened this Mar 26, 2020
@botelastic botelastic bot added the triage label Mar 26, 2020
@Docteur-RS

@barkbay
Centos 7
Linux p4vm107 5.5.9-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 19:01:01 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

I applied the limits/requests that failed and these are the logs I got:

kgp -n elastic-system -w
NAME                      READY   STATUS      RESTARTS   AGE
elastic-operator-0        0/1     OOMKilled   2          55s

Two stack traces that appear multiple times in dmesg:

[767251.824704] Tasks state (memory values in pages):
[767251.825073] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[767251.825763] [  26660]     0 26660      255        1    32768        0          -998 pause
[767251.826394] [  27554]   101 27554   169888    94675   966656        0           982 elastic-operato
[767251.827344] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=d4f18256793a4c968094ffdac57fe51a09c439a2e9afe04aefec738b513d8005,mems_allowed=0,oom_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b,task_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b/d4f18256793a4c968094ffdac57fe51a09c439a2e9afe04aefec738b513d8005,task=elastic-operato,pid=27554,uid=101
[767251.830085] Memory cgroup out of memory: Killed process 27554 (elastic-operato) total-vm:679552kB, anon-rss:354896kB, file-rss:23804kB, shmem-rss:0kB, UID:101 pgtables:944kB oom_score_adj:982
[767251.841491] oom_reaper: reaped process 27554 (elastic-operato), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[767291.775788] elastic-operato invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=982
[767291.776576] CPU: 2 PID: 28225 Comm: elastic-operato Not tainted 5.5.9-1.el7.elrepo.x86_64 #1
[767291.777262] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[767291.777908] Call Trace:
[767291.778202]  dump_stack+0x6d/0x98
[767291.778548]  dump_header+0x51/0x210
[767291.778852]  oom_kill_process+0x102/0x130
[767291.779194]  out_of_memory+0x105/0x510
[767291.779517]  mem_cgroup_out_of_memory+0xb9/0xd0
[767291.779888]  try_charge+0x756/0x7c0
[767291.780211]  ? __alloc_pages_nodemask+0x16c/0x320
[767291.780727]  mem_cgroup_try_charge+0x72/0x1e0
[767291.781221]  mem_cgroup_try_charge_delay+0x22/0x50
[767291.781750]  do_anonymous_page+0x11a/0x650
[767291.782265]  handle_pte_fault+0x2a8/0xad0
[767291.782754]  __handle_mm_fault+0x4a8/0x680
[767291.783223]  ? __switch_to_asm+0x40/0x70
[767291.783639]  handle_mm_fault+0xea/0x200
[767291.784010]  __do_page_fault+0x225/0x490
[767291.784458]  do_page_fault+0x36/0x120
[767291.784845]  page_fault+0x3e/0x50
[767291.785220] RIP: 0033:0x46055f
[767291.785572] Code: 00 00 c5 fe 6f 06 c5 fe 6f 4e 20 c5 fe 6f 56 40 c5 fe 6f 5e 60 48 81 c6 80 00 00 00 c5 fd e7 07 c5 fd e7 4f 20 c5 fd e7 57 40 <c5> fd e7 5f 60 48 81 c7 80 00 00 00 48 81 eb 80 00 00 00 77 b5 0f
[767291.787044] RSP: 002b:000000c000847118 EFLAGS: 00010202
[767291.787501] RAX: 0000000007fffe00 RBX: 0000000000bafde0 RCX: 000000c018e2be00
[767291.788085] RDX: 000000000ffffe00 RSI: 000000c01027c020 RDI: 000000c01827bfa0
[767291.788710] RBP: 000000c000847160 R08: 000000c010e2c000 R09: 0000000000000000
[767291.789327] R10: 0000000000000020 R11: 0000000000000202 R12: 0000000000000002
[767291.789937] R13: 00000000025731c0 R14: 000000000045eea0 R15: 0000000000000000
[767291.790745] memory: usage 358400kB, limit 358400kB, failcnt 1512
[767291.791234] memory+swap: usage 358400kB, limit 9007199254740988kB, failcnt 0
[767291.791823] kmem: usage 2744kB, limit 9007199254740988kB, failcnt 0
[767291.792299] Memory cgroup stats for /kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b:
[767291.794894] anon 363929600
[767291.809817] Tasks state (memory values in pages):
[767291.810587] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[767291.811981] [  26660]     0 26660      255        1    32768        0          -998 pause
[767291.813170] [  28205]   101 28205   169888    94762   958464        0           982 elastic-operato
[767291.814479] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=fe06b305e97236ed3bceebfcf354d2aed79b729003caade99fd193d605c79407,mems_allowed=0,oom_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b,task_memcg=/kubepods/burstable/pod7b717e63-8036-4ba9-af17-930bf5bab43b/fe06b305e97236ed3bceebfcf354d2aed79b729003caade99fd193d605c79407,task=elastic-operato,pid=28205,uid=101
[767291.818732] Memory cgroup out of memory: Killed process 28205 (elastic-operato) total-vm:679552kB, anon-rss:354792kB, file-rss:24256kB, shmem-rss:0kB, UID:101 pgtables:936kB oom_score_adj:982
[767291.827011] oom_reaper: reaped process 28205 (elastic-operato), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@barkbay
Contributor

barkbay commented Mar 26, 2020

Centos 7
Linux p4vm107 5.5.9-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 19:01:01 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

The default kernel for CentOS 7 is 3.10 (CentOS 8 is bundled with 4.18).
Kernel 5.5 was only released in January. Any reason not to use the default one?

Kubernetes and container runtimes rely on low-level kernel features (like cgroups).
I would not advise using anything other than the kernel provided by default for your distribution.

@Docteur-RS

Hmm... if I remember correctly, we updated the kernel version because Cilium (our Kubernetes CNI) needed BPF, which was not available in the default kernel version we had.

Though I checked the kernel version on the cluster where it was working correctly:
Linux p5vm7 5.4.12-1.el7.elrepo.x86_64 #1 SMP Tue Jan 14 16:02:20 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
And on the one where it's not:
Linux p4vm107 5.5.9-1.el7.elrepo.x86_64 #1 SMP Wed Mar 11 19:01:01 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

It's a small difference, but maybe that's all it takes.

@barkbay
Contributor

barkbay commented Mar 31, 2020

I think you can use CentOS 8 if you want to use Cilium on CentOS.
I'm closing this issue because I'm not sure we will be able to help with this kind of configuration (old distro + very recent kernel).

@pidren

pidren commented May 13, 2020

Just wanted to add some metric points here:

ECK 1.0 on GKE: the operator kept getting OOMKilled. On average it tries to use ~140Mi of memory, and I'm now trying to stabilize it with a 200Mi memory limit and guaranteed QoS.
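
Guaranteed QoS here just means setting requests equal to limits for every container in the pod. A sketch of what that resources block could look like with the 200Mi figure above (CPU values are illustrative):

    resources:
      limits:
        cpu: "1"
        memory: 200Mi
      requests:
        cpu: "1"        # requests must equal limits for both CPU and memory
        memory: 200Mi   # for the pod to be classed as Guaranteed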

@anyasabo
Contributor

We increased the default memory limits in #3046 which should be included in the next release, so it can work out of the box in more environments.
