
Memory leak when repeating silences #2659

Open
ngc104 opened this issue Jul 30, 2021 · 54 comments

@ngc104

ngc104 commented Jul 30, 2021

What did you do?

I'm using Kthxbye. When an alert fires and I add a silence with Kthxbye, the memory usage of Alertmanager increases.

You can reproduce this without Kthxbye:

1/ Generate an alert (or use any alert sent by Prometheus), for example PrometheusNotIngestingSamples.

2/ With Alertmanager, generate silences like this:

while true; do
  date # useless; just for tracing...
  amtool --alertmanager.url http://localhost:9093 silence add alertname=PrometheusNotIngestingSamples -a "MemoryLeakLover" -c "Test memory leak in Alertmanager" -d "1m"
  sleep 50
done

Note: the behaviour of Kthxbye is similar, but its default interval is 15 min instead of 1 min. Reproducing with amtool shows that Kthxbye itself has nothing to do with this bug.

What did you expect to see?

Nothing interesting (no abnormal memory increase)

What did you see instead? Under which circumstances?

Follow the metric container_memory_working_set_bytes for Alertmanager. After some hours you can see it slowly grow.

Here is a screenshot of the above test, covering a little more than 12 hours: the test started at 12h20 and finished at 9h the next day.

image

My Alertmanager is running with the default --data.retention=120h. I guessed that after 5 days it would stop increasing. Wrong guess: it only stops increasing at OOM, when the container is killed.

image
The above graph was made with Kthxbye running. The pod restarts after an OOM (left side) or after a kubectl delete pod (right side).

Environment

/alertmanager $ alertmanager --version
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user:       root@dee35927357f
  build date:       20200617-08:54:02
  go version:       go1.14.4
  • Alertmanager configuration file:
/alertmanager $ cat /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
receivers:
- name: rocketchat
  webhook_configs:
  - send_resolved: true
    url: https://xxxx.rocketchat.xxxx/hooks/xxxxxx/xxxxxxxxx
route:
  group_by:
  - xxxxxxx
  - yyyyyyy
  - alertname
  group_interval: 5m
  group_wait: 30s
  receiver: rocketchat
  repeat_interval: 5m
  routes:
  - continue: true
    receiver: rocketchat
templates:
- /etc/alertmanager/*.tmpl

  • Logs:
➜ k -n monitoring logs caascad-alertmanager-0 
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.174Z caller=main.go:485 msg=Listening address=:9093
level=warn ts=2021-07-30T12:29:49.530Z caller=notify.go:674 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://xxxx.rocketchat.xxx/hooks/xxxxxx/xxxxxxxxx\": dial tcp x.x.x.x: connect: connection refused"
level=info ts=2021-07-30T12:32:17.213Z caller=notify.go:685 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify success" attempts=13
@roidelapluie
Member

Thanks for your report.

Would you be able to take a profile from http://alertmanager/debug/pprof/heap when the memory is high and provide it to us?
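
For reference, here is a sketch of how such a profile can be captured (assuming Alertmanager answers on localhost:9093, e.g. via a port-forward):

# Sketch: save the heap profile to a file that can be attached to the issue.
# localhost:9093 is an assumption; adjust to however you reach Alertmanager.
curl -s http://localhost:9093/debug/pprof/heap > heap-$(date +%Y%m%d_%H%M%S).pb.gz

# Or inspect it directly with the Go toolchain:
go tool pprof -top http://localhost:9093/debug/pprof/heap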

@ngc104
Author

ngc104 commented Aug 4, 2021

Hello,
I started my holidays today... Bad timing...

However, I'm not alone on this issue. You should get the profile from someone else soon (allow a few hours to reproduce the bug again).

Regards

@Winael

Winael commented Aug 5, 2021

Hi @roidelapluie

I'm on @ngc104's team. During his holidays I will follow this issue.
I tried to reproduce the bug yesterday but unfortunately the pod was restarted before I was able to extract the profile and the heap. I'll try to provide you that information as soon as I can :)

Thanks for your help

Regards,

@Winael

Winael commented Aug 5, 2021

Hi again @roidelapluie

I was able to reproduce the issue this morning

Capture d’écran de 2021-08-05 11-23-32

Please find the profile here and heap here

Let me know if you need more information

Regards,

@Winael

Winael commented Aug 17, 2021

Hi guys

Any update on this case? Let me know if you need more information.

Regards,

@Winael

Winael commented Aug 25, 2021

Hi @roidelapluie

Can we have an update on this issue? Do you need more information?

Regards

@Winael

Winael commented Sep 7, 2021

Hi everyone

Would it be possible to have an update on this case? It could become critical on our side.

Regards,

@Winael

Winael commented Sep 29, 2021

Hi everyone, @roidelapluie

> Thanks for your report.
>
> Would you be able to take a profile from http://alertmanager/debug/pprof/heap when the memory is high and provide it to us?

I sent you the requested elements at the beginning of August. Could you help us on this case, please?

Regards
Winael

@roidelapluie
Member

Hello,

Somehow the heap dump does not seem to help. I see only a few megabytes of memory used here. I will try to reproduce this on my side asap.

@Winael

Winael commented Nov 2, 2021

Hi @roidelapluie

Have you succeeded in reproducing this issue on your side? Do you have any information that could help us, please?

Regards,
Winael

@roidelapluie
Member

I did not have the chance yet. Maybe you can come up with a pprof profile that shows the issue? The one that was sent previously was not really conclusive.

@ngc104
Author

ngc104 commented Jan 13, 2022

Edited on 13/06/2023 (typo and switch to 0.25.0)
Edited on 29/08/2023 (switch to 0.26.0 + run alertmanager --version to show the version + added trace.sh)

Procedure to reproduce the bug

Get alertmanager

(yes, sorry, this is a lot of ceremony just to curl a tar.gz and uncompress it)

ALERTMANAGER_VERSION="0.26.0"
ALERTMANAGER_ARCHIVE="alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
ALERTMANAGER_URL="https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/${ALERTMANAGER_ARCHIVE}"

# Download
[ -e "${ALERTMANAGER_ARCHIVE}" ] || curl -sL "${ALERTMANAGER_URL}" -o "${ALERTMANAGER_ARCHIVE}"
[ -d "alertmanager" ] || (tar xzf "${ALERTMANAGER_ARCHIVE}" ; mv "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64" "alertmanager")

Launch alertmanager

➜ cat <<EOF > alertmanager.yml 
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'mytest'
receivers:
  - name: 'mytest'
EOF
./alertmanager/alertmanager --version > version.txt
./alertmanager/alertmanager

Alertmanager is now running: http://localhost:9093

Harass alertmanager with alerts and silences

while true; do 
  date; 
  ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ alert add alertname=foo node=bar
  ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ silence add -a "awesome user" -d 1m -c "awesome comment" alertname=foo
  sleep 60
done

The first amtool call sends an alert to Alertmanager just like Prometheus would.

The second amtool call sets a silence on that alert just like kthxbye would.

Difference with Prometheus/Kthxbye: we harass Alertmanager every minute.

go tool pprof

DATE=$(date +'%Y%m%d_%H%M%S')
FILEPREFIX="profile-${DATE}"

go tool pprof -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs.pdf"

We can also play with go tool pprof -http localhost:12345 http://localhost:9093/debug/pprof/allocs

go tool pprof (since 29/08/2023)

➜ cat <<'EOF' > trace.sh
#! /bin/bash
DATE=$(date +'%Y%m%d_%H%M%S')
REPORTDIR="profiles"
mkdir -p "${REPORTDIR}"
FILEPREFIX="${REPORTDIR}/profile-${DATE}"
export PPROF_TMPDIR="$(mktemp -d /tmp/pprof-XXXXXX)"
echo "Traces in '${PPROF_TMPDIR}'"

echo "=======================================" >> "${REPORTDIR}/report.txt"
echo "${DATE}" >> "${REPORTDIR}/report.txt"
./alertmanager/alertmanager --version >> "${REPORTDIR}/report.txt"
go tool pprof -raw http://localhost:9093/debug/pprof/allocs > /dev/null
mv "${PPROF_TMPDIR}/pprof.alertmanager.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz" "${FILEPREFIX}.pb.gz"
go tool pprof -sample_index=alloc_objects -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=alloc_space -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=inuse_objects -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=inuse_space -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"

rmdir "${PPROF_TMPDIR}"
EOF

Then run ./trace.sh from time to time (make it executable first with chmod +x trace.sh).

@ngc104
Author

ngc104 commented Jan 13, 2022

pprof profiles.pdf

I'm back on this issue. I harassed Alertmanager for a workday.

Here are some PDFs (I don't know if it is the preferred format, sorry):

09h19 profile-20220113_091951-allocs.pdf
10h59 profile-20220113_105909-allocs.pdf
16h55 profile-20220113_165542-allocs.pdf

09h19 42Mb (nodes accounting for 33245.19kB, 78.29% of 42466.72kB total)
10h59 75Mb (nodes accounting for 56.61MB, 75.30% of 75.18MB total)
16h55 173Mb (nodes accounting for 134.34MB, 77.30% of 173.80MB total)

Because I can now reproduce the bug easily, please tell me what kind of pprof traces you need to investigate.

@ngc104
Author

ngc104 commented Jan 13, 2022

Hello @roidelapluie, I'm back on this issue (replacing @Winael). If you can come back to it too (or anybody else can), please tell me what kind of pprof traces you need for the investigation.

Note: I took heap traces too, all day long. On the PDFs, I read "Showing nodes accounting for 3609.71kB, 100% of 3609.71kB total" both at the beginning and at the end of the day. The heap seems to be stable.

@prashbnair
Contributor

I was going through the profiles and noticed that you are using alloc_space, which shows how much memory was allocated in total. Can you use inuse_space and check whether the memory consumption is going up?

@ngc104
Author

ngc104 commented Feb 16, 2022

@prashbnair I'm not sure I understand... You want me to run go tool pprof -sample_index=inuse_space instead of just go tool pprof?

I will try that ASAP.

Are PDF files the best output for you?

@ngc104
Author

ngc104 commented Feb 18, 2022

Started taking the traces. They will be posted later today.

Here is my script to take traces:

#! /bin/bash

DATE=$(date +'%Y%m%d_%H%M%S')
FILEPREFIX="profile-${DATE}"

go tool pprof -sample_index=inuse_space -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-inuse_space.pdf"

go tool pprof -sample_index=inuse_objects -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-inuse_objects.pdf"

go tool pprof -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-alloc_space.pdf"

go tool pprof -sample_index=alloc_objects -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-alloc_objects.pdf"

@ngc104
Author

ngc104 commented Feb 18, 2022

@prashbnair
Contributor

@ngc104 By looking at the inuse_space I notice that it does not go up over time. Are you still able to reproduce the error?

@ngc104
Author

ngc104 commented Feb 21, 2022

@prashbnair Yes.

In fact, the main problem is that Kubernetes kills the Alertmanager container when it goes OOM, i.e. when its "memory" exceeds the resource limit.

What is the "memory" that Kubernetes measures when it decides to kill the container? If it is not alloc_space, I don't know how to measure it with go tool pprof.
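
(Aside: the working set that cAdvisor exports as container_memory_working_set_bytes is derived from the cgroup accounting rather than from the Go runtime: roughly the total usage minus the inactive file cache. A minimal sketch for cgroup v1, assuming a shell inside the container:)

# Sketch (cgroup v1): the working set watched by Kubernetes is roughly the
# total cgroup usage minus the inactive file cache; cAdvisor exports this
# as container_memory_working_set_bytes.
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
inactive_file=$(awk '$1 == "total_inactive_file" {print $2}' /sys/fs/cgroup/memory/memory.stat)
echo $(( usage - inactive_file ))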

@ngc104
Author

ngc104 commented May 2, 2022

I restarted my test at 9h22 UTC. Here are some interesting results at 10h03 UTC...

➜ ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ silence query --expired
ID                                    Matchers         Ends At                  Created By    Comment          
2fca38e1-9dc3-4fbf-8f35-291dcac02a34  alertname="foo"  2022-05-02 09:22:11 UTC  awesome user  awesome comment  
8ab24f44-c779-4fb7-8ba5-da9061b635d0  alertname="foo"  2022-05-02 09:23:11 UTC  awesome user  awesome comment  
7347ceff-527c-4106-8dc9-861d8205b374  alertname="foo"  2022-05-02 09:24:11 UTC  awesome user  awesome comment  
26299122-2d9c-44dc-bc46-fa945379f4d7  alertname="foo"  2022-05-02 09:25:11 UTC  awesome user  awesome comment  
7d27178d-59ad-443f-81dd-e9351b6a9701  alertname="foo"  2022-05-02 09:26:11 UTC  awesome user  awesome comment  
0219f67a-507a-4b4a-99e4-bf8020fe6615  alertname="foo"  2022-05-02 09:27:11 UTC  awesome user  awesome comment  
73173814-25d6-4181-8de7-0677b8011fca  alertname="foo"  2022-05-02 09:28:11 UTC  awesome user  awesome comment  
82cf631a-07c3-4f90-8131-e7d4ade49d35  alertname="foo"  2022-05-02 09:29:11 UTC  awesome user  awesome comment  
37978ad4-9c1f-4ad7-a6af-d75e33baf75b  alertname="foo"  2022-05-02 09:30:11 UTC  awesome user  awesome comment  
dc17d617-ac31-460a-949f-4021fd945374  alertname="foo"  2022-05-02 09:31:11 UTC  awesome user  awesome comment  
41f5a1b8-ce3a-4a85-a118-672b0dce2267  alertname="foo"  2022-05-02 09:32:11 UTC  awesome user  awesome comment  
d0e8b085-fe67-49dc-9ee2-96e6161a7d0d  alertname="foo"  2022-05-02 09:33:12 UTC  awesome user  awesome comment  
222bae9c-7b07-4a80-997b-06867411d6c8  alertname="foo"  2022-05-02 09:34:12 UTC  awesome user  awesome comment  
025cd910-b437-4455-b925-2744f7a5e027  alertname="foo"  2022-05-02 09:35:12 UTC  awesome user  awesome comment  
0f6dcfc7-8320-4bfb-aab5-fc79fe661546  alertname="foo"  2022-05-02 09:36:12 UTC  awesome user  awesome comment  
3ca66616-f985-4e7c-b60f-7526fa0e7ae6  alertname="foo"  2022-05-02 09:37:12 UTC  awesome user  awesome comment  
0c2ef5b5-5e82-4823-8453-e223f18289a7  alertname="foo"  2022-05-02 09:38:12 UTC  awesome user  awesome comment  
7abe6854-ab5e-4fc8-bd6e-a7495ef8b366  alertname="foo"  2022-05-02 09:39:12 UTC  awesome user  awesome comment  
41a55f9a-b330-494f-a842-8e97e740fb27  alertname="foo"  2022-05-02 09:40:12 UTC  awesome user  awesome comment  
8c361742-e1c0-44c3-8d7b-530771335458  alertname="foo"  2022-05-02 09:41:12 UTC  awesome user  awesome comment  
e95a41c6-91f8-45e2-9bcb-aa21d32fb60e  alertname="foo"  2022-05-02 09:42:12 UTC  awesome user  awesome comment  
6016bdfa-2638-47eb-81b6-502f2317a29f  alertname="foo"  2022-05-02 09:43:12 UTC  awesome user  awesome comment  
11bb8940-1865-4512-8dd2-248535f02997  alertname="foo"  2022-05-02 09:44:12 UTC  awesome user  awesome comment  
51853d4f-a9cc-450c-bc4c-84d40975be83  alertname="foo"  2022-05-02 09:45:12 UTC  awesome user  awesome comment  
200dcc2e-5b5d-4ed8-80cd-3b87c93d35f6  alertname="foo"  2022-05-02 09:46:12 UTC  awesome user  awesome comment  
fec736cf-ff1d-4a03-95e9-9615b834eba6  alertname="foo"  2022-05-02 09:47:12 UTC  awesome user  awesome comment  
1aac8602-7c30-4eda-a37b-a8e54bcf6252  alertname="foo"  2022-05-02 09:48:12 UTC  awesome user  awesome comment  
4b3abc68-5621-400f-9c56-65b05980c568  alertname="foo"  2022-05-02 09:49:12 UTC  awesome user  awesome comment  
1a3276b4-1f61-4df4-8baf-a98906b972f5  alertname="foo"  2022-05-02 09:50:12 UTC  awesome user  awesome comment  
b06f383d-a4e5-45ae-bf09-0b7cb32a2665  alertname="foo"  2022-05-02 09:51:12 UTC  awesome user  awesome comment  
9e6a8329-e395-4578-909e-111f22c9cb8f  alertname="foo"  2022-05-02 09:52:13 UTC  awesome user  awesome comment  
ebcd1791-cb78-48d6-bd46-db768cb42476  alertname="foo"  2022-05-02 09:53:13 UTC  awesome user  awesome comment  
c0f29585-db7a-4560-90d2-825fc99b819a  alertname="foo"  2022-05-02 09:54:13 UTC  awesome user  awesome comment  
0efb0aca-4471-41d7-8daf-a9be279bac23  alertname="foo"  2022-05-02 09:55:13 UTC  awesome user  awesome comment  
a6d2db55-c5c8-4e6e-9eed-e1ec87da51e5  alertname="foo"  2022-05-02 09:56:13 UTC  awesome user  awesome comment  
90bc593e-0518-479a-8c02-4072bc0c4421  alertname="foo"  2022-05-02 09:57:13 UTC  awesome user  awesome comment  
49c4bc3b-57a3-4d6e-8bd8-8166addb18fa  alertname="foo"  2022-05-02 09:58:13 UTC  awesome user  awesome comment  
971734d2-c8ad-40bd-a27c-95c592d65efc  alertname="foo"  2022-05-02 09:59:13 UTC  awesome user  awesome comment  
c773ba25-af00-4045-9a70-b63e75ecd5a8  alertname="foo"  2022-05-02 10:00:13 UTC  awesome user  awesome comment  
fb8f23e9-32ad-4ab4-b449-cf86e34c5917  alertname="foo"  2022-05-02 10:01:13 UTC  awesome user  awesome comment  
4e95d124-d052-403d-b5e6-86d0acc22c17  alertname="foo"  2022-05-02 10:02:13 UTC  awesome user  awesome comment  

Could the "memory leak" be something other than a memory leak?
Could it be the accumulation of expired silences that are never flushed?
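
(One quick way to check this hypothesis without pprof is Alertmanager's own silences gauge on /metrics; a sketch, assuming the default port and that the label values have not changed across versions:)

# Sketch: count silences by state from Alertmanager's metrics endpoint.
curl -s http://localhost:9093/metrics | grep '^alertmanager_silences{'
# Typical output (values are illustrative):
#   alertmanager_silences{state="active"} 1
#   alertmanager_silences{state="expired"} 40
#   alertmanager_silences{state="pending"} 0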

@ngc104
Author

ngc104 commented May 13, 2022

I was wrong on this one...
I was exploring data.retention in the production environment.
We have data.retention unset (i.e. using the default value of 120h / 5 days).

With amtool, I notice that I have a lot of silences, but none older than 5 days.
There are no alerts except the 2 or 3 active and suppressed ones (3 alerts won't create a memory leak, right?).

When I restart Alertmanager, I expect memory consumption to grow continuously for 5 days and then flatten out. But I observe that it keeps growing after 5 days. So my guess was wrong: this is not just the accumulation of expired silences waiting to be flushed.

I'm back to thinking it is a memory leak, but I have no idea how to debug it.

@ngc104
Author

ngc104 commented Jan 6, 2023

Hello and happy new year!

Alertmanager 0.25.0 is released. Let's give it a try...
Same procedure as above (in a previous comment).

12h37 profile-20230106_155103-allocs.pdf
14h06 profile-20230106_140612-allocs.pdf
16h55 profile-20230106_123718-allocs.pdf

12h37 14Mb (nodes accounting for 14467.52kB, 100% of 14467.52kB total)
14h06 43Mb (nodes accounting for 27635.94kB, 63.32% of 43644.80kB total)
16h55 66Mb (nodes accounting for 43.69MB, 65.99% of 66.21MB total)

Nothing seems to have changed in 0.25.0 related to this issue.

@ngc104
Author

ngc104 commented Jan 13, 2023

Hello,

I had not noticed that I was running a standalone Alertmanager in cluster mode. In order to remove any noise from the investigation, I retried with this command line:

./alertmanager/alertmanager --cluster.listen-address=""

Then I took some traces as usual:

10h56 : profile-20230113_105632-allocs.pdf
11h34 : profile-20230113_113413-allocs.pdf
14h01 : profile-20230113_140152-allocs.pdf
14h44 : profile-20230113_144421-allocs.pdf
18h58 : profile-20230113_185807-allocs.pdf

10h56 : 15Mb (nodes accounting for 15216.73kB, 100% of 15216.73kB total)
11h34 : 23Mb (nodes accounting for 17373.59kB, 73.87% of 23519.01kB total)
14h01 : 47Mb (nodes accounting for 34402.62kB, 71.93% of 47825.89kB total)
14h44 : 56Mb (nodes accounting for 40.78MB, 72.40% of 56.33MB total)
18h58 : 99Mb (Showing nodes accounting for 72MB, 72.58% of 99.20MB total)

I hope these traces are easier to use without the noise of the data structures needed for clustering.

@simonpasquier
Member

👋 it would help if you could share the raw pprof files.

@ngc104
Author

ngc104 commented Jun 13, 2023

Hello,

Sorry for the delay... Too much work besides this incident. I'm also not sure that the "raw" pprof files are obtained just by using the -raw option instead of -pdf; please give me more info if the following does not fit.

And I forgot to start Alertmanager with --cluster.listen-address="", so this one was a single-node cluster...

Here are the traces: profiles.zip

Here is my script to generate them:

#! /bin/bash

DATE=$(date +'%Y%m%d_%H%M%S')
mkdir -p profiles
FILEPREFIX="profiles/profile-${DATE}"

go tool pprof -sample_index=inuse_space -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-inuse_space.raw"
go tool pprof -sample_index=inuse_objects -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-inuse_objects.raw"
go tool pprof -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-alloc_space.raw"
go tool pprof -sample_index=alloc_objects -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-alloc_objects.raw"

Edit: I also uploaded the binary files (I have no idea how they work, but I guess you like them too):
pprof.zip

cc @simonpasquier

@simonpasquier
Member

I missed the fact that you're using version 0.21.0. Can you try with v0.25.0?

@ngc104
Author

ngc104 commented Aug 4, 2023

All tests were done with the latest version available at the time.

In 2021, when I created the issue, it was tested with 0.21.0.
The latest test in 2023 (June 13th) was with 0.25.0.

Every time a new version was released, I tested again in the hope that it was solved. With 0.25 it is not solved yet.

@ngc104
Author

ngc104 commented Aug 29, 2023

Hello,

Alertmanager 0.26 is released 🥳

The bug is still there 💣

I modified my comment above with the procedure to reproduce the bug (I changed the way to run go tool pprof).

And here are new profiles.zip files.

Below is an extract of the report.txt file that I added in the zip file:

alertmanager, version 0.26.0 (branch: HEAD, revision: d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d)
  build user:       root@df8d7debeef4
  build date:       20230824-11:11:58
  go version:       go1.20.7
  platform:         linux/amd64
  tags:             netgo
...
=======================================
20230829_114408
Showing nodes accounting for 155789, 99.48% of 156600 total
Showing nodes accounting for 34175.86kB, 100% of 34175.86kB total
=======================================
20230829_124408
Showing nodes accounting for 293671, 98.51% of 298100 total
Showing nodes accounting for 52.35MB, 100% of 52.35MB total
=======================================
20230829_134408
Showing nodes accounting for 376318, 96.80% of 388746 total
Showing nodes accounting for 63.89MB, 100% of 63.89MB total
=======================================
20230829_144408
Showing nodes accounting for 432682, 95.95% of 450947 total
Showing nodes accounting for 81.63MB, 100% of 81.63MB total
=======================================
20230829_154408
Showing nodes accounting for 492675, 94.88% of 519250 total
Showing nodes accounting for 94.59MB, 100% of 94.59MB total
=======================================
20230829_164408
Showing nodes accounting for 666111, 94.82% of 702485 total
Showing nodes accounting for 98.54MB, 86.75% of 113.58MB total
=======================================
20230829_174409
Showing nodes accounting for 734087, 95.01% of 772671 total
Showing nodes accounting for 119.25MB, 89.80% of 132.79MB total
=======================================
20230829_184409
Showing nodes accounting for 819591, 94.48% of 867496 total
Showing nodes accounting for 136.83MB, 91.30% of 149.87MB total

As you can see, something is growing... (dates are in YYYYMMDD_HHMMSS format)

cc @simonpasquier

@grobinson-grafana
Contributor

Hi! 👋 I attempted to reproduce this on Alertmanager 0.26.0 using the same script but so far I have not been able to. I ran the script for 1h:

#!/bin/sh

while true; do
  ./amtool --alertmanager.url=http://127.0.0.1:9093 alert add foo=bar
  ./amtool --alertmanager.url=http://127.0.0.1:9093 silence add foo=bar -c="" -d 1m
  sleep 5
done

Here are go_memstats_heap_objects and go_memstats_heap_inuse_bytes:

Screenshot 2023-11-19 at 1 05 30 pm Screenshot 2023-11-19 at 1 06 03 pm

@grobinson-grafana
Contributor

To see if it was related to data retention not deleting expired silences, I ran Alertmanager with --data.retention=5m.

@ngc104
Author

ngc104 commented Feb 1, 2024

How to reproduce

I was fed up with rebuilding my test environment each time I wanted to check whether the bug was still there. I also thought it would be good to make it easier for everyone to build it. So I wrote some scripts: https://github.com/ngc104/alertmanager-issue-2659

TL;DR:

./buid_env.sh
./run.sh

But the README.md file is quite short. You can read it.

How to measure the leak

I can see the bug in a Kubernetes environment with the container_memory_working_set_bytes metric. But what is that metric?

In short, this metric is close to the RSS of a process and comes from Linux cgroups.

This is the metric used by Kubernetes to detect OOMs and kill containers. For that specific reason, it is important to use exactly that metric and not other similar ones.

After a lot of reading, I decided to write my own small script that reads this working set directly from the cgroup filesystem.

My script is 25 lines long and should work on most Linux systems (cgroup v1 and v2)... It is not hard to understand and fix if it does not work for you. A sketch of the idea is shown below.
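
For the record, here is a sketch of the idea (this is not the exact script from the repo; the cgroup directory passed as an argument is an assumption and depends on the container runtime):

#! /bin/bash
# Sketch of a working-set collector (not the exact script from the linked repo).
# Pass the container's cgroup directory as $1.
set -eu
CGROUP="${1:-/sys/fs/cgroup}"

if [ -f "${CGROUP}/memory.current" ]; then
  # cgroup v2
  usage=$(cat "${CGROUP}/memory.current")
  inactive_file=$(awk '$1 == "inactive_file" {print $2}' "${CGROUP}/memory.stat")
elif [ -f "${CGROUP}/memory.usage_in_bytes" ]; then
  # cgroup v1 (memory controller directory)
  usage=$(cat "${CGROUP}/memory.usage_in_bytes")
  inactive_file=$(awk '$1 == "total_inactive_file" {print $2}' "${CGROUP}/memory.stat")
else
  echo "unsupported cgroup layout in ${CGROUP}" >&2
  exit 1
fi

# cAdvisor defines the working set as usage minus inactive file cache, floored at 0.
working_set=$(( usage - inactive_file ))
[ "${working_set}" -lt 0 ] && working_set=0

# Influx line protocol so telegraf's exec input can ingest it (the format is an assumption).
echo "cgroup_memory working_set_bytes=${working_set}i"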

How to show the leak

Thanks to your graphs @grobinson-grafana, I thought I could provide graphs too, and let anyone generate their own graphs easily.

So my test environment consists of:

  • alertmanager, in a docker container (i.e. in a dedicated cgroup)
  • telegraf, which runs the script above and retrieves the metrics (yes, because it was the first tool that came to mind for the job)
  • prometheus, which scrapes the alertmanager and telegraf metrics

Now it's easy to run this stack and provide graphs.

@ngc104
Author

ngc104 commented Feb 1, 2024

Some graphs with the stack above

The stack was run for hours with different data retention settings.

Alertmanager version: 0.26.0

Data retention 2h

Start: 10h35
End: 21h10
Duration: 10h35

Capture d’écran de 2024-01-29 22-36-00
Capture d’écran de 2024-01-29 22-35-43
Capture d’écran de 2024-01-29 22-35-23

EDIT 02/02/2024: we had more metrics than we thought. Here are new and larger graphs (retention 2h).

Start: 10h35
End: 10h27
Duration: 23h08

image
image
image

Data retention 3h

Start: 10h50
End: 10h20
Duration: 23h30

Capture d’écran de 2024-01-31 10-17-15
Capture d’écran de 2024-01-31 10-19-16
Capture d’écran de 2024-01-31 10-19-52

Data retention 5m

Start: 11h40
End: 21h00
Duration: 9h20

Note: this is the retention that @grobinson-grafana used for his test two and a half months ago.

Capture d’écran de 2024-01-31 21-01-14
Capture d’écran de 2024-01-31 21-01-00
Capture d’écran de 2024-01-31 21-01-43

@ngc104
Author

ngc104 commented Feb 1, 2024

Analysing the graphs above

Note: the metrics have been aggregated with avg_over_time(...[15m]) to remove spikes and see the trends better.
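
For reference, these are the kinds of queries behind the graphs, run against the Prometheus HTTP API (a sketch; the Prometheus address, metric names and label selectors are assumptions based on this thread and may differ in the lab setup):

# Sketch: fetch the smoothed series used for the graphs.
PROM="http://localhost:9090"
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=avg_over_time(container_memory_working_set_bytes[15m])'
curl -sG "${PROM}/api/v1/query" \
  --data-urlencode 'query=avg_over_time(go_memstats_heap_objects{job="alertmanager"}[15m])'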

Nb of expired silences

The number of expired silences stays mostly the same (except at the beginning, of course).

The exception is data.retention=5m: why does it increase so much 2 hours after the beginning and then stabilize 1h30 later? Well, that is not the subject of this issue. Let's just note it and then forget about it for a while.

memstat_heap_objects

This metric comes from alertmanager itself.

The number of heap objects looks stable (after the growing phase at the beginning).

Notice that with data.retention=5m the growing phase looks strange: it does not grow very much.

This growing phase lasts 2h with data.retention=2h and also 2h with data.retention=3h. With data.retention=5m, the strange start also lasts something like 3h.

I see no link between this metric and the memory leak.

Workingset size

This is the main metric. This is why the issue was opened.

With data.retention=5m, the issue does not appear. It did not appear in @grobinson-grafana's test and it does not appear here either. I guess that using data.retention=5m hides the bug.

With data.retention=2h we can see the working set size growing slowly until the "cliff". After the cliff, we do not have enough data to say whether it is a flat plateau or whether it will keep growing. Maybe we should run this test again.

EDIT 02/02/2024: with newer graphs over a larger time window, we see that the working set size keeps growing. Why are there steps? No idea.

With data.retention=3h we can clearly see the working set size growing slowly but continuously.

The bug is there, but we need data.retention >= 3h and a lot of time to see it.

But remember that we run Alertmanager on Kubernetes for months and at the end, the container is killed because of an OOM.

In this test with data.retention=3h, if the memory limit had been set to 2.25M, the container would have been killed after 20h of running. Not before.

@ngc104
Author

ngc104 commented Feb 1, 2024

cc @simonpasquier (I don't know if you are still working on it or if you can do anything).

If the 3 comments above were too long to read, please check my little lab to reproduce the bug: https://github.com/ngc104/alertmanager-issue-2659/tree/main and pay attention to the working set size metric: this is the metric that causes Kubernetes (in reality, Linux) to OOM-kill Alertmanager.

@ngc104
Author

ngc104 commented Apr 19, 2024

Alertmanager-0.27

The memory leak is still there

I ran https://github.com/ngc104/alertmanager-issue-2659/tree/main for a day and a half.

image

image

image

I set the parameter --data.retention=3h. All other parameters are the defaults (except --cluster.listen-address="", but that does not matter).

  • 2024-04-18 08h27: beginning of the test. Set and renew an alert every minute, and add a silence (duration 1 minute) every minute.
  • 2024-04-19 10h13: stopped adding silences
  • 2024-04-19 15h16: stopped the test (end of the working day in my timezone)

During the whole test, including after I stopped adding silences, the same alert was renewed every minute with amtool --alertmanager.url=http://localhost:9093/ alert add alertname=foo node=bar.

The memory keeps growing and growing (slowly). When I stop harassing Alertmanager with silences, the memory goes down significantly.

The strange things:

  1. After 3 hours I should reach some maximum value, but I don't. This is the issue, and it is still there.
  2. When I stop harassing Alertmanager, the memory goes down (which is normal: old silences are dropped and there are no newer silences to keep in memory). But after 3 hours the memory is still going down. I expected to reach a new (lower) maximum after 3 hours, but it keeps going down!

Note: the alert is still set every minute with amtool alert add, even after I stopped adding silences.

The history of this issue is long, but the issue has not changed since the beginning.
It's still there in 2024 with Alertmanager 0.27.0.
And the use case with Kthxbye is still there, with an OOM after one or two months of running with this small memory leak.

@grobinson-grafana
Contributor

I'm sorry but I still don't see where the memory leak is in your screenshots. This looks perfect to me? In the screenshot I can see that garbage is being created and then reclaimed by the garbage collector, and the graph itself is a flat line somewhere between 23MB and 24MB.

image

> The memory keeps growing and growing (slowly). When I stop harassing Alertmanager with silences, the memory goes down significantly.

I don't see that in the screenshot above? This also wouldn't happen if there was a memory leak because if there was a memory leak then the memory wouldn't go down?

Can you help me understand?

@ngc104
Author

ngc104 commented Apr 19, 2024

Let's zoom in on the significant period (while Alertmanager is being harassed, excluding the first few minutes):

image

image

I need to use avg_over_time to see the trend. The second graph clearly shows that it grows.

I cannot run my lab for months, but at work I have Alertmanagers in production that run for months. They sometimes explode with an OOM.

@grobinson-grafana
Contributor

I think the problem is the second graph. It's taking the average over 4 hours, which is skewed by the peaks before a garbage collection happens. The first graph, which has a much smaller window for avg_over_time, is what I expect to see. I would recommend showing a screenshot of the metric without avg_over_time, and then we can see for sure whether there is a leak or not.

@ngc104
Author

ngc104 commented Apr 19, 2024

image

@grobinson-grafana
Contributor

That looks good. There is no memory leak in this graph. If you want you can show it for the last 48 hours or 7 days and it should look the same.

@ngc104
Author

ngc104 commented Apr 19, 2024

image

image

@grobinson-grafana
Contributor

These also look good. There is no memory leak in these graphs either!

@ngc104
Author

ngc104 commented Apr 19, 2024

So hard to reproduce, and so hard to interpret this little window of time, when the issue needs about one month in our production environment to generate an OOM...

I guess that next time I give it a try, I will generate 100 alerts and 100 silences instead of one. We will see if that changes anything.

Thanks for pointing out that graphs are only a reflection of what we want to see. In fact, when I add xxx_over_time I see what I want to show, and without it I consider that the spikes hide the global trend.

It definitely needs a new try with more alerts and silences. Next time I have time... Stay tuned :)

@grobinson-grafana
Contributor

I think it would be good to see the graphs for go_memstats_alloc_bytes and go_memstats_heap_objects. If there is a memory leak then both go_memstats_alloc_bytes and go_memstats_heap_objects should keep increasing over time.
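
(Both are exported on Alertmanager's own /metrics endpoint, so a quick way to sample them outside of Prometheus is, for example:)

# Sketch: sample the two Go runtime gauges directly from the /metrics endpoint.
curl -s http://localhost:9093/metrics | grep -E '^go_memstats_(alloc_bytes|heap_objects) '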

@ngc104
Author

ngc104 commented Apr 19, 2024

image

This is our production environment, with Alertmanager 0.25.0 being harassed by kthxbye. Both instances are in cluster mode.

@ngc104
Author

ngc104 commented Apr 19, 2024

image
image

@ngc104
Author

ngc104 commented Apr 19, 2024

The spikes of go_memstats_heap_objects get higher and higher...

@grobinson-grafana
Contributor

> image
>
> This is our production environment, with Alertmanager 0.25.0 being harassed by kthxbye. Both instances are in cluster mode.

In this graph, I presume the two lines are separate Alertmanagers. It's interesting that the yellow line is flat while the green line keeps growing. Is the green line the Alertmanager being tested?

@ngc104
Copy link
Author

ngc104 commented Apr 19, 2024

Both are running in the same namespace: they are clustered.
And yes, I was surprised that 2 Alertmanagers running exactly the same way, and clustered together, do not have the same memory behaviour.
I don't know how the Alertmanager internals work, but I guess that silences may live on one Alertmanager and not the other, with both communicating to know whether an alert is silenced. If I'm right, that would explain why one Alertmanager stays "flat" while the other one, holding many silences, keeps "growing".

@grobinson-grafana
Contributor

> I don't know how the Alertmanager internals work, but I guess that silences may live on one Alertmanager and not the other, with both communicating to know whether an alert is silenced.

Silences are gossiped between the two Alertmanagers when clustered, so both will have the same silences.

@ngc104
Author

ngc104 commented Apr 19, 2024

OK. So we have another mystery... (I won't open an issue for that: I don't even know how to reproduce it, and we have had no problem with it...) Let's focus on the green line...

@ngc104
Author

ngc104 commented Jan 17, 2025

Version 0.28.0

16/01 ~10h: start of the test
17/01 ~14h: stop harassing Alertmanager with silences
17/01 ~17h30: end of the test

Here are the same graphs as before.

Image

Image

Image

Image

Image

Image

Image

@ngc104
Author

ngc104 commented Jan 17, 2025

I can still see some increase in the first graphs.
However, as already discussed: aggregations can change the meaning of a graph.

Today, my problem is that the environment where our bug appears has changed since I opened this issue, and we get fewer OOM alerts. It has become hard to reproduce in our production environment.

As a result, I will not be able to say whether the bug was already fixed in 0.27, whether 0.28 and PR #3930 (thanks!) solved it, or whether the bug still exists.

I am sorry to say it, but I will no longer follow this issue. Let's leave it open for a while, let it go stale, and then auto-close.
