OOM Detection #117

Open
cmelone opened this issue Oct 7, 2024 · 1 comment · May be fixed by spack/spack#46447
cmelone commented Oct 7, 2024

Problem/Opportunity Statement

We will eventually enable memory limits for CI jobs, but there is currently no way to detect OOM kills in k8s/prometheus in our environment.

For example, I set KUBERNETES_MEMORY_LIMIT=1500M for this job, which was killed shortly after starting. There is no error reason in the log or in the exit code. See this opensearch query.

The kube_pod_container_status_last_terminated_exitcode metric is supposed to indicate an OOM kill for a job, but this isn't working.

Relevant issues:

I came across a blog post that describes the same issue and I've been corresponding with the author (@jimmy-outschool)

According to his info, k8s only detects an OOM kill when the primary (pid 1) process is the one terminated, not the non-pid-1 build process launched by the gitlab runner.

What would success / a fix look like?

His solution involves a small patch to gitlab runner, which looks for OOM events in the kernel message buffer and outputs the correct exit code to the log. He has attempted to upstream this to no avail.

While we may face headwinds when pushing to deploy a custom version of gitlab runners, the alternative solutions are not great:

  1. Using memory usage, we could check whether the last reported number is within 90% of the limit to infer a kill (a sketch of this heuristic follows the list). However, memory spikes are so large that I've seen the last sample as low as 70% of the limit before the job was OOM killed.
  2. Recent kernel versions support cgroups v2, which detects OOM kills of non-main processes and reports those statuses. However, many of our runner containers use OS versions outside the support matrix for this feature.
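
To make option 1 concrete, here is a minimal sketch of the threshold heuristic (hypothetical values and function name, not existing gantry code), mostly to illustrate why it misfires:

```python
# Last-sample threshold check. The problem described above: memory spikes
# between Prometheus scrapes, so a job can be OOM killed while its final
# sample sits well below the limit.

def looks_like_oom(last_usage_bytes: float, limit_bytes: float, threshold: float = 0.9) -> bool:
    """Guess an OOM kill if the final memory sample is within `threshold` of the limit."""
    return last_usage_bytes / limit_bytes >= threshold

# A 1500M limit with a last sample around 70% of it: the heuristic misses the kill.
print(looks_like_oom(0.7 * 1500e6, 1500e6))  # False
```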
@cmelone cmelone self-assigned this Oct 7, 2024

cmelone commented Oct 7, 2024

Idea from Alec:

> Instead of integrating the code into the GitLab runner, what if we wrapped the execution of Spack? Thus the subprocess would be killed and the parent could then look up the kernel message.

Questions:

  • Do we have permission to access kernel messages?
  • Does the main process have enough info to find the subprocess's messages?
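
A rough sketch of the wrapper idea, assuming the parent is allowed to run `dmesg` and assuming a particular kernel message format (neither is guaranteed, and this is not existing spack/gantry code):

```python
import re
import subprocess
import sys


def child_was_oom_killed(pid: int) -> bool:
    """Scan the kernel ring buffer for an OOM kill of the given pid.

    Needs permission to run `dmesg`; the message format varies by kernel
    version and cgroup setup, so this pattern is only a guess.
    """
    try:
        messages = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return False  # dmesg unavailable or not permitted
    pattern = re.compile(rf"Killed process {pid} |oom-kill.*\bpid={pid}\b")
    return bool(pattern.search(messages))


def main() -> int:
    # Wrap the real build command, e.g. `spack ci rebuild`.
    child = subprocess.Popen(["spack", "ci", "rebuild"])
    returncode = child.wait()
    # A child killed by SIGKILL is reported here as -9, which is a hint
    # (not proof) that the kernel OOM killer was involved.
    if returncode == -9 and child_was_oom_killed(child.pid):
        print("build appears to have been OOM killed", file=sys.stderr)
    # Caveat for the second question: the OOM killer may pick a grandchild
    # process, in which case the parent would need to scan for any OOM
    # message rather than one tied to its direct child's pid.
    return returncode


if __name__ == "__main__":
    sys.exit(main())
```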

@cmelone cmelone linked a pull request Oct 24, 2024 that will close this issue
cmelone added a commit to cmelone/spack that referenced this issue Oct 24, 2024
Closes spack/spack-gantry#117

This PR is motivated by the fact that we will be implementing memory limits in CI at some point, and we want a robust and stable way of detecting whether we are killing jobs due to memory constraints.

There is currently no way to detect this in k8s/prometheus in our environment.

For example, this job was [OOM killed](https://gitlab.spack.io/spack/spack/-/jobs/12730664), yet the information reported to prometheus/opensearch/etc does not suggest a reason.

I came across a [blog post](https://engineering.outschool.com/posts/gitlab-runner-on-kubernetes/#out-of-memory-detection) that describes the same issue, which boils down to the fact that k8s can only detect OOM kills for pid 1. In the build containers, the gitlab runner itself is pid 1, while the script steps are spawned as separate processes.

This is something that has changed with cgroups v2, [which checks for OOM kills in all processes](https://itnext.io/kubernetes-silent-pod-killer-104e7c8054d9). However, many of our [runner containers](https://github.com/spack/gitlab-runners/tree/main/Dockerfiles) are using OS versions outside the [support matrix](https://kubernetes.io/docs/concepts/architecture/cgroups/#requirements) for this feature.
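
For context, on hosts that do meet those requirements, cgroups v2 exposes an `oom_kill` counter covering every process in the cgroup via its `memory.events` file; a minimal sketch of reading it from inside a container (the cgroup path is an assumption about the container layout):

```python
from pathlib import Path


def cgroup_v2_oom_kills(cgroup_dir: str = "/sys/fs/cgroup") -> int:
    """Return the oom_kill counter from cgroup v2 memory.events, or 0 if unavailable."""
    events = Path(cgroup_dir) / "memory.events"
    if not events.exists():  # e.g. the host is still on cgroups v1
        return 0
    for line in events.read_text().splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill":
            return int(value)
    return 0


# Any value above zero means some process in the cgroup was OOM killed,
# not just pid 1, which is exactly what our older runner hosts lack.
print(cgroup_v2_oom_kills())
```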

The author of the blog post I mentioned pushed [a feature](https://gitlab.com/outschool-eng/gitlab-runner/-/commit/65d5c4d468ffdbde0ceeafd9168d1326bae8e708) to his fork of gitlab runner that checks for OOM using kernel messages after job failure.

I adapted this to a call in `after_script`, which relies upon permission to run `dmesg`.
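
The shape of that check is roughly the following (an illustrative sketch, not the actual PR code; the OOM message pattern and output path are assumptions):

```python
import pathlib
import re
import subprocess

# Hypothetical helper invoked from the job's after_script; it relies on
# permission to run `dmesg` inside the build container.
OOM_PATTERN = re.compile(r"Out of memory|Memory cgroup out of memory|oom-kill", re.I)


def record_oom(scratch_dir: str = "jobs_scratch_dir") -> None:
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True)
    hits = [line for line in dmesg.stdout.splitlines() if OOM_PATTERN.search(line)]
    if not hits:
        return
    # Surface the event in the job trace and as a downloadable artifact.
    print("OOM kill detected during this job:")
    print("\n".join(hits))
    out = pathlib.Path(scratch_dir) / "user_data" / "oom-info"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(hits) + "\n")


if __name__ == "__main__":
    record_oom()
```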

The benefit of `after_script` is that it's executed regardless of exit reason, unless the runner dies or times out.

If an OOM is detected, it's output to the trace and a file is written to `jobs_scratch_dir/user_data/oom-info`, which can be accessed by a client like:

```
GET https://gitlab.spack.io/api/v4/projects/:id/jobs/:job_id/artifacts/jobs_scratch_dir/user_data/oom-info
```
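
For example, a client could check for the artifact along these lines (hypothetical IDs and token; a missing file presumably comes back as a 404, meaning no OOM was recorded):

```python
import requests

# Hypothetical client-side check against the GitLab job artifacts API.
resp = requests.get(
    "https://gitlab.spack.io/api/v4/projects/<project_id>/jobs/<job_id>/artifacts/"
    "jobs_scratch_dir/user_data/oom-info",
    headers={"PRIVATE-TOKEN": "<token>"},
    timeout=30,
)
if resp.status_code == 200:
    print("job was OOM killed:", resp.text.strip())
```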

I attempted to propagate this as a pod annotation/label, to no avail, and other methods of sending it to prometheus would be far too complex.

I've tested it in the staging cluster by setting artificially low limits; check out [this pipeline](https://gitlab.staging.spack.io/spack/spack/-/pipelines/1256).