Memory leak when repeating silences #2659
Thanks for your report. Would you be able to take a heap profile from http://alertmanager/debug/pprof/heap when the memory is high and upload it to us?
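For example, something along these lines (a sketch, assuming Alertmanager listens on the default port 9093):
```sh
# Grab a heap profile while memory usage is high; the endpoint returns a gzipped pprof protobuf
curl -s -o heap.pb.gz http://localhost:9093/debug/pprof/heap
```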
Hello, I'm not alone on this issue, however. You should have the profile from someone else soon (give us a few hours to reproduce the bug again). Regards
I'm in @ngc104's team. During his holidays I will follow this issue. Thanks for your help. Regards,
Hi again @roidelapluie, I was able to reproduce the issue this morning. Please find the profile here and the heap here. Let me know if you need more information. Regards,
Hi guys, any update on this case? Tell me if you need more information. Regards,
Can we have an update on this issue? Do you need more information? Regards
Hi everyone, will it be possible to have an update on this case? It could become critical on our side. Regards,
Hi everyone, @roidelapluie, I sent you the elements requested at the beginning of August. Could you help us on this case please? Regards
Hello, Somehow the heap dump does not seem to help. I see only a few megabytes of memory used here. I will try to reproduce this on my side ASAP.
Have you succeeded in reproducing this issue on your side? Do you have any information that could help us, please? Regards,
I did not have the chance yet. Maybe you can come up with a pprof that shows the issue? The one that was sent previously was not really conclusive.
Edited on 13/06/2023 (typo and switch to 0.25.0)
Procedure to reproduce the bug
Get alertmanager
(yes, sorry, so complex for just curling a tar.gz and uncompressing it)
Launch alertmanager
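For example, something like this (a sketch; adjust the config file path and version to your setup):
```sh
# Start Alertmanager with a minimal configuration file
./alertmanager --config.file=alertmanager.yml
```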
Alertmanager is working: http://localhost:9093
Harass Alertmanager with alerts and silences
The first loop repeatedly fires the alert; the second one repeatedly creates a silence for it (see the sketch below). Difference with Prometheus/Kthxbye: we harass Alertmanager every 1 minute.
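A minimal sketch of two such loops, assuming the default port 9093 and the amtool binary shipped in the same tarball:
```sh
#!/bin/sh
# Loop 1: re-fire the same alert every minute through the v2 API
while true; do
  curl -s -XPOST http://localhost:9093/api/v2/alerts \
    -H 'Content-Type: application/json' \
    -d '[{"labels":{"alertname":"PrometheusNotIngestingSamples","severity":"warning"}}]'
  sleep 60
done &

# Loop 2: re-silence that alert every minute
while true; do
  ./amtool silence add alertname=PrometheusNotIngestingSamples \
    --alertmanager.url=http://localhost:9093 \
    --duration=2m --comment="issue 2659 repro" --author=test
  sleep 60
done
```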
go tool pprof
We can also play with go tool pprof (since 29/08/2023).
Then run "./trace.sh" from times to times |
pprof profiles.pdf
I'm back on this issue. I harassed Alertmanager for a workday. Here are some PDFs (I don't know if it is the preferred format, sorry):
09h19 profile-20220113_091951-allocs.pdf
09h19 4Mb (
Because I can now reproduce the bug easily, please tell me what kind of pprof traces you need to investigate.
Hello @roidelapluie, I'm back on this issue (replacing @Winael). If you can come back to it too (or anybody else), please tell me how to take pprof traces if you need more info. Note: I took heap traces too, all day long. On the PDF, I can read
I was going through the profiles and noticed that you are using alloc_space, which shows how much memory has been allocated cumulatively. Can you use inuse_space and check if the memory consumption is going up?
@prashbnair, not sure I understand... You want me to run the profiles with inuse_space? I will try that ASAP. Are PDF files the best output for you?
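Presumably something along these lines (a sketch; rendering a PDF requires graphviz):
```sh
# Same heap profile, but counted by in-use space rather than total allocations
go tool pprof -pdf -sample_index=inuse_space http://localhost:9093/debug/pprof/heap > heap-inuse.pdf
```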
Started taking the traces. They will be posted later today. Here is my script to take traces:
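A minimal sketch of such a trace script (not the original), assuming the default port 9093:
```sh
#!/bin/sh
# Dump heap and allocs profiles with a timestamp in the filename
STAMP=$(date +%Y%m%d_%H%M%S)
for PROFILE in heap allocs; do
  curl -s -o "profile-${STAMP}-${PROFILE}.pb.gz" \
    "http://localhost:9093/debug/pprof/${PROFILE}"
done
```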
pprof more profiles.pdf
I restarted my test (as described in a previous comment above). I took some traces right after I started Alertmanager and my loop for the alert, at 09:19. I took other traces after lunch at 14:02. And I ended my test at 17:58.
09:19
14:02
17:58
I tried the 4 sample indexes I found for allocations. I hope that this is what you wanted, @prashbnair.
@ngc104 By looking at the inuse_space I notice that it does not go up over time. Are you still able to reproduce the error?
@prashbnair Yes. In fact, the main problem is Kubernetes killing the Alertmanager container when it goes OOM because the "memory" goes higher than the resource limit. What is the "memory" measured by Kubernetes to kill the container? If it is not alloc_space, I don't know how to measure it with go pprof.
Restarted my test at 9h22 UTC. Here is an interesting result at 10h03 UTC...
Could the "memory leak" be something else than a memory leak ? |
I was wrong on this one... With amtool, I notice that I have a lot of silences, but none older than 5 days. When I restart Alertmanager, I expect the memory consumption to grow continuously for 5 days and then become flat. But I observe that it continues to grow after 5 days. So my guess was wrong: this is not the sum of expired silences waiting to be flushed. I'm thinking again about a memory leak, but I have no idea how to debug it.
Hello and happy new year! Alertmanager 0.25.0 is released. Let's give it a try...
12h37 profile-20230106_155103-allocs.pdf
12h37 14Mb (
Nothing seems to have changed in 0.25.0 related to this issue.
Hello, I had not noticed that I was running a standalone Alertmanager in cluster mode. In order to remove any noise from the investigation, I retried with this command line:
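Presumably something along these lines, with clustering disabled via an empty --cluster.listen-address (a sketch, not the exact command):
```sh
# Run a truly standalone Alertmanager (no cluster gossip layer)
./alertmanager --config.file=alertmanager.yml --cluster.listen-address=""
```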
Then I took some traces as usual:
10h56: profile-20230113_105632-allocs.pdf
10h56: 15Mb (
I hope these traces are easier to use without the noise of the data structures needed for the absent cluster.
👋 it would help if you could share the raw pprof files.
Hello, sorry for the delay... Too much work besides this incident. And I'm not sure that the "raw" pprof files are obtained just with curl. And I forgot to start Alertmanager with clustering disabled this time. Here are the logs: profiles.zip. Here is my script to generate them:
Edit: I also uploaded binary files (I have no idea how they work, but I guess you like them too):
I missed the fact that you're using version 0.21.0. Can you try with v0.25.0?
All tests were done with the latest version. In 2021, when I created the issue, it was tested with 0.21.0. Every time a new version was released, I tested again with the hope that it was solved. But with 0.25 it is not solved yet.
Hello, Alertmanager 0.26 is released 🥳 The bug is still there 💣 I modified my comment above with the procedure to reproduce the bug (I changed the way to run Alertmanager). And here are new profiles.zip files. Below is an extract of the file listing.
As you can see, something is growing... (dates are in the format YYYYMMDD_HHMMSS)
To see if it was related to data retention not deleting expired silences, I ran Alertmanager with a different --data.retention.
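For example (a sketch; the exact retention value used here is an assumption):
```sh
# Short data retention so expired silences are garbage-collected quickly
./alertmanager --config.file=alertmanager.yml --cluster.listen-address="" --data.retention=2h
```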
How to reproduce
I was fed up with rebuilding my test environment each time I wanted to check whether the bug is still there. I also thought it would be good to make it easier for everyone to build it. So I wrote some scripts: https://github.com/ngc104/alertmanager-issue-2659 TL;DR:
But the README.md file is quite short. You can read it.
How to measure the leak
I can see the bug in a Kubernetes environment with container_memory_working_set_bytes. In short, this metric is close to the RSS of a process and comes from Linux cgroups. This is the metric used by Kubernetes to detect OOMs and kill containers. For that specific reason, it is important to use exactly that metric and not other similar ones. After a lot of reading, I decided to measure it directly from the cgroup filesystem (see the sketch at the end of this comment).
My script is 25 lines long and should work on most Linux systems (cgroup v1 and v2)... Not so hard to understand and fix if it does not work for you.
How to show the leak
Thanks to your graphs @grobinson-grafana, I thought I could provide graphs too, and let anyone have their own graphs easily. So my test env is built from:
Now it's easy to run this stack and provide graphs.
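The actual 25-line script is in the repository linked above; below is a minimal sketch of the idea (working set = cgroup memory usage minus inactive file cache, which is what cadvisor reports as container_memory_working_set_bytes):
```sh
#!/bin/sh
# Sketch: compute the working set of the current cgroup (v2 layout, falling back to v1)
if [ -f /sys/fs/cgroup/memory.current ]; then
  USAGE=$(cat /sys/fs/cgroup/memory.current)
  INACTIVE=$(awk '$1=="inactive_file"{print $2}' /sys/fs/cgroup/memory.stat)
else
  USAGE=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
  INACTIVE=$(awk '$1=="total_inactive_file"{print $2}' /sys/fs/cgroup/memory/memory.stat)
fi
echo "workingset_bytes $((USAGE - INACTIVE))"
```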
Some graphs with the stack above
The stack was launched for hours with different data retention. Alertmanager version: 0.26.0
Data retention 2h
Start: 10h35
EDIT 02/02/2024: we had more metrics than we thought. Here are new and larger graphs (retention 2h). Start: 10h35
Data retention 3h
Start: 10h50
Data retention 5m
Start: 11h40
Note: this is the retention that @grobinson-grafana used for the test two and a half months ago.
Analysing the graphs above
Note: the metrics have been aggregated.
Nb of expired silences
The number of expired silences is mostly the same (except at the beginning, of course). The exception is dataretention=5m: why does it increase so much 2 hours after the beginning and then stabilize 1h30 after the increase? Well, that is not the object of this issue. Let's just notice it and then forget it for a while.
memstat_heap_objects
This metric comes from Alertmanager itself. The number of heap objects looks stable (after the growing phase at the beginning). Notice that with dataretention=5m the growing phase looks strange, not growing so much. This growing phase is 2h with dataretention=2h and is also 2h with dataretention=3h. With dataretention=5m, the strange start also lasts something like 3h. I see no link between this metric and the memory leak.
Workingset size
This is the main metric. This is why the issue was opened. With dataretention=5m, the issue does not appear. It did not appear in @grobinson-grafana's test. It does not appear here either. I guess that using dataretention=5m hides the bug. With dataretention=2h we can see the Workingset size growing slowly until the "cliff". After the cliff, we do not have enough data to say whether it is a flat infinite step or whether it will keep growing. Maybe we should run this test again. EDIT 02/02/2024: with newer graphs on a larger time window, we see that the Workingset size continues growing. Why are there steps? No idea. With dataretention=3h we can clearly see the Workingset size growing slowly but continuously. The bug is there. But we need a data retention >= 3h and a lot of time to see it. Remember that we run Alertmanager on Kubernetes for months and, in the end, the container is killed because of an OOM. In this test with dataretention=3h, if the memory limit had been set to 2.25M the container would have been killed after 20h of running, not before.
cc @simonpasquier (I don't know if you are still working on it or if you can do anything). If the 3 comments above were too long to read, please check my little lab to reproduce the bug: https://github.com/ngc104/alertmanager-issue-2659/tree/main and pay attention to the workingset size metric: this is the metric that causes Kubernetes (in reality, Linux) to OOM-kill Alertmanager.
Alertmanager 0.27
The memory leak is still there. I ran https://github.com/ngc104/alertmanager-issue-2659/tree/main over a day and a half. I set the parameter
During the whole test, including after 9h13, the same alert is renewed every minute. The memory goes on growing and growing (slowly). When I stop harassing Alertmanager with silences, the memory goes down significantly. The strange things:
Note: the alert is set every minute. The history of this issue is long, but the issue has not changed since the beginning.
I'm sorry but I still don't see where the memory leak is in your screenshots. This looks perfect to me? In the screenshot I can see that garbage is being created and then reclaimed by the garbage collector, and the graph itself is a flat line somewhere between 23MB and 24MB.
I don't see that in the screenshot above? This also wouldn't happen if there was a memory leak, because if there was a memory leak then the memory wouldn't go down? Can you help me understand?
Let's zoom in on the significant period (when Alertmanager is being harassed, excluding the first few minutes). I need to use an average over a few hours to make the trend visible. I cannot run my lab for months, but I have Alertmanagers in production at work that run for months. They sometimes explode with OOM.
I think the problem is the second graph. It's taking the average over 4 hours, which is skewed by the peaks before a garbage collection happens. The first graph, which has a much smaller window for the average, does not have this problem.
That looks good. There is no memory leak in this graph. If you want you can show it for the last 48 hours or 7 days and it should look the same.
These also look good. There is no memory leak in these graphs either!
So hard to reproduce and so hard to interpret this little window of time, when the issue needs about one month in our production environment to generate an OOM... I guess that next time I give it a try, I will generate 100 alerts and 100 silences instead of one. We will see if that changes anything. Thanks for pointing out that graphs are only the reflection of what we want to see. And in fact, when I add
It definitely needs a new try with more alerts and silences. Next time I have time... Stay tuned :)
I think it would be good to see the graphs for go_memstats_heap_objects.
Spikes of go_memstats_heap_objects are getting higher and higher...
In this graph, I presume the two lines are separate Alertmanagers. It's interesting that the yellow line is flat while the green line keeps growing. Is the green line the Alertmanager being tested?
Both are running in the same namespace: they are in a cluster.
Silences are gossiped between the two Alertmanagers when clustered, so both will have the same silences.
OK. So we have another mystery... (I won't open an issue for that: I don't even know how to reproduce it and we have had no problem with it...) Let's focus on the green line...
I can still see some increase in the first graphs. Today, my problem is that the environment where our bug appears has changed since I opened this issue and we get fewer OOM alerts. It has become hard to reproduce in our production environment. As a result, I will not be able to say whether the bug was already fixed in 0.27, or whether 0.28 and PR #3930 (thanks!) solved it, or whether the bug still exists. I am sorry to say it, but I will no longer follow this issue. Let's leave it open for a while, let it get stale, then auto-close.
What did you do?
I'm using Kthxbye. When an alert fires and I add a silence with Kthxbye, the memory usage of Alertmanager increases.
You can reproduce this without Kthxbye:
1/ Generate an alert (or use any alert sent by Prometheus), for example PrometheusNotIngestingSamples.
2/ With Alertmanager, generate silences like this:
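A sketch of the kind of silence meant here, using amtool (the matcher, duration, and URL below are assumptions for illustration):
```sh
# Add a short silence for the example alert; repeating this every minute reproduces the issue
amtool silence add alertname=PrometheusNotIngestingSamples \
  --alertmanager.url=http://localhost:9093 \
  --duration=1m --comment="kthxbye-like silence" --author=me
```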
Note: The behaviour of Kthxbye is similar, but its default interval is 15 min instead of 1 min. However, with amtool you can see that Kthxbye has nothing to do with this bug.
What did you expect to see?
Nothing interesting (no abnormal memory increase)
What did you see instead? Under which circumstances?
Follow the metric container_memory_working_set_bytes for Alertmanager. After some hours you can see it slowly grow.
Here is a screenshot of the above test, for a little more than 12 hours: the test started at 12h20 and finished at 9h the day after.
My Alertmanager is running with the default --data.retention=120h. I guessed that after 5 days it would stop increasing. Wrong guess: it stops increasing only at OOM and automatic kill.
The above graph was made with Kthxbye running. The pod restarts after an OOM (left side) or after a kubectl delete pod (right side).
Environment
System information:
Kubernetes (deployed with https://github.com/prometheus-community/helm-charts/tree/main/charts/alertmanager)
Alertmanager version: 0.21.0 (at the time this issue was opened)