This repository has been archived by the owner on Mar 6, 2023. It is now read-only.

systemd: Failed to start Alertmanager (when Prometheus is not running yet) #74

Closed
dmke opened this issue Jul 10, 2019 · 9 comments

Comments

@dmke

dmke commented Jul 10, 2019

The Alertmanager service fails to start when Prometheus has not started yet. We observe this mainly after a machine reboot:

# journalctl -u alertmanager.service --boot
-- Logs begin at Thu 2019-07-04 01:20:09 UTC, end at Wed 2019-07-10 08:03:29 UTC. --
Jul 10 00:00:19 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:19 prometheus.example.com alertmanager[2859]: level=info ts=2019-07-10T00:00:19.771Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:19 prometheus.example.com alertmanager[2859]: level=info ts=2019-07-10T00:00:19.773Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:19 prometheus.example.com alertmanager[2859]: level=warn ts=2019-07-10T00:00:19.799Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:19 prometheus.example.com alertmanager[2859]: level=error ts=2019-07-10T00:00:19.815Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:19 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:19 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 1.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com alertmanager[3231]: level=info ts=2019-07-10T00:00:20.213Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3231]: level=info ts=2019-07-10T00:00:20.213Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3231]: level=warn ts=2019-07-10T00:00:20.221Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:20 prometheus.example.com alertmanager[3231]: level=error ts=2019-07-10T00:00:20.227Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 2.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com alertmanager[3355]: level=info ts=2019-07-10T00:00:20.468Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3355]: level=info ts=2019-07-10T00:00:20.468Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3355]: level=warn ts=2019-07-10T00:00:20.472Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:20 prometheus.example.com alertmanager[3355]: level=error ts=2019-07-10T00:00:20.476Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 3.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com alertmanager[3790]: level=info ts=2019-07-10T00:00:20.874Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3790]: level=info ts=2019-07-10T00:00:20.877Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3790]: level=warn ts=2019-07-10T00:00:20.882Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:20 prometheus.example.com alertmanager[3790]: level=error ts=2019-07-10T00:00:20.885Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 4.
Jul 10 00:00:21 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:21 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:21 prometheus.example.com alertmanager[3918]: level=info ts=2019-07-10T00:00:21.109Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:21 prometheus.example.com alertmanager[3918]: level=info ts=2019-07-10T00:00:21.110Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:21 prometheus.example.com alertmanager[3918]: level=warn ts=2019-07-10T00:00:21.115Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:21 prometheus.example.com alertmanager[3918]: level=error ts=2019-07-10T00:00:21.118Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 5.
Jul 10 00:00:21 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Start request repeated too quickly.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:21 prometheus.example.com systemd[1]: Failed to start Prometheus Alertmanager.

We're running Prometheus and Alertmanager on the same host (deployed using your Ansible roles 👍), so waiting for Prometheus seems like a good measure:

diff templates/alertmanager.service.j2
 [Unit]
-After=network.target
+After=network.target prometheus.service

I realise this might need a new variable and a conditional for general usage (i.e. when both services run on different hosts). Alternatively (or additionally), it might also be useful to add a delay between retries, to give Prometheus a fair chance to start (as you can see in the log above, all restart attempts happened within 2 seconds):

diff templates/alertmanager.service.j2
 Restart=always
+RestartSec=5s
@paulfantom
Member

Sorry for my lack of reply.

We try not to have any dependencies between roles, and we cannot ensure that a prometheus.service exists on the same host as alertmanager. Even when there is a prometheus.service, there is no guarantee that alertmanager is configured to receive alerts from that Prometheus instance. I would recommend doing such a check at the playbook level.
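A playbook-level check could look like this: a short sketch that deploys a systemd drop-in enforcing the ordering (the host group, paths, and task names here are illustrative, not part of these roles):

```yaml
# Sketch: enforce "alertmanager after prometheus" at the playbook level,
# outside the roles themselves. Only apply where both services co-reside.
- name: Order alertmanager after prometheus via a systemd drop-in
  hosts: monitoring
  become: true
  tasks:
    - name: Create drop-in directory
      ansible.builtin.file:
        path: /etc/systemd/system/alertmanager.service.d
        state: directory
        mode: "0755"

    - name: Install ordering drop-in
      ansible.builtin.copy:
        dest: /etc/systemd/system/alertmanager.service.d/after-prometheus.conf
        content: |
          [Unit]
          After=prometheus.service
      notify: Reload systemd

  handlers:
    - name: Reload systemd
      ansible.builtin.systemd:
        daemon_reload: true
```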

As for the second part (RestartSec=5s), I am currently securing all service files and will include it shortly.

@dmke
Author

dmke commented Aug 20, 2019

Fair point. We're actually already using drop-ins for that purpose:

# systemctl status alertmanager.service
● alertmanager.service - Prometheus Alertmanager
   Loaded: loaded (/etc/systemd/system/alertmanager.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/alertmanager.service.d
           └─after-prometheus.conf
           └─slowdown-restarts.conf
[...]

with /etc/systemd/system/alertmanager.service.d/after-prometheus.conf containing just

[Unit]
After=prometheus.service
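For completeness, the second drop-in listed in the status output above, slowdown-restarts.conf, would just carry the retry delay from the original suggestion (contents assumed here, matching the RestartSec=5s diff earlier in the thread):

```ini
[Service]
RestartSec=5s
```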

Maybe this could be added to the README.md, e.g. in the Example section?

As for the second part (RestartSec=5s), I am currently securing all service files and will include it shortly.

Great!

@paulfantom
Member

I also looked at your error messages, and it seems quite strange that your Alertmanager requires the Prometheus server in order to start. This is not the usual case: Alertmanager should be able to operate without any Prometheus server (communication is unidirectional, and Alertmanager is on the receiving end).

From what I can see, you are actually having a networking problem, as shown in the logs:

level=warn ts=2019-07-10T00:00:21.115Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"

Are you using Alertmanager in HA mode with a gossip network? It looks like Alertmanager cannot start because the gossip network address specified with --cluster.advertise-address is not available. Could you provide the Ansible variables that were used to deploy Alertmanager?


Maybe this could be added to the README.md, e.g. in the Example section?

No, there is no requirement to have Prometheus running in order to use Alertmanager. Of course running without one doesn't make much sense, but it is not a hard requirement.

@paulfantom
Member

paulfantom commented Aug 20, 2019

This error:

level=warn ts=2019-07-10T00:00:21.115Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"

is something you get when one of Alertmanager's requirements is not met. As the Alertmanager docs say:

The cluster.advertise-address flag is required if the instance doesn't have an IP address that is part of RFC 6890 with a default route.

Source: https://github.com/prometheus/alertmanager#high-availability
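In other words, the warning means Alertmanager scanned the host's interfaces and found no private (RFC 1918) address it could advertise. A minimal shell sketch of that classification (a hypothetical helper for illustration, not Alertmanager's real implementation):

```shell
# Hedged sketch: classify an IPv4 address as RFC 1918 private,
# mirroring the deduction Alertmanager attempts at startup.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;                        # 10.0.0.0/8, 192.168.0.0/16
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;; # 172.16.0.0/12
    *) return 1 ;;
  esac
}

is_private_ipv4 "192.168.1.5" && echo "private" || echo "public"  # → private
```

If no interface carries such an address, Alertmanager cannot pick an advertise address on its own, and --cluster.advertise-address must be set explicitly.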

@dmke
Author

dmke commented Aug 20, 2019

I'm not aware we're using alertmanager in HA mode. From what I can gather from our Gitlab instance (I don't have the repo checked out right now), these are the only alertmanager_* variables set:

alertmanager_version:        0.18.0
alertmanager_receivers:      [] # multiple configured
alertmanager_route:          {} # configured
alertmanager_child_routes:   [] # routes omitted
alertmanager_listen_address: "127.0.0.1:9093"
alertmanager_external_url:   "https://alertmanager.example.com"

Maybe of note: we've submoduled 50d90b5.

I'll have a more detailed look tomorrow when I'm back in the office.

@paulfantom
Member

paulfantom commented Aug 20, 2019

I am not sure if this would cause problems, but you have a misconfiguration in two places:
- alertmanager_listen_address should be alertmanager_web_listen_address
- alertmanager_external_url should be alertmanager_web_external_url

Those options were changed in 0.11.0 to make them match the alertmanager configuration.

I just saw that we have a backwards-compatibility layer in place, so those variables are translated to the newer names. Nevertheless, you should consider updating them.

On a side note: wow, you have been using it for quite a long time (0.11.0 was released more than a year ago)!

I just discovered we didn't update that part of our docs for over a year 🤦‍♂️

@dmke
Author

dmke commented Aug 21, 2019

I've checked and we don't run alertmanager in HA mode. However, the alertmanager docs state (emphasis mine):

--cluster.listen-address string: cluster listen address (default "0.0.0.0:9094"; empty string disables HA mode)

Combined with alertmanager_cluster configured as {} (i.e. the default value), this results in the following ExecStart directive:

ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=127.0.0.1:9093 \
  --web.external-url=https://alertmanager.example.com

Note the absence of a --cluster.listen-address="", which would disable HA mode.

This is just a bad default on alertmanager's part. I'll check whether

alertmanager_cluster:
  listen-address: ""

shuts that off.
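Assuming the role renders each alertmanager_cluster key as a --cluster.<key> flag (an assumption about the template, consistent with the behavior described above), the unit would then contain roughly:

```
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --cluster.listen-address= \
  --web.listen-address=127.0.0.1:9093 \
  --web.external-url=https://alertmanager.example.com
```

The empty --cluster.listen-address= is what disables HA mode.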


I just discovered we didn't update that part of our docs for over a year 🤦‍♂️

😀

@dmke
Author

dmke commented Aug 21, 2019

Yeah, alertmanager_cluster["listen-address"] = "" does the trick. I'll close this for now.

@lock

lock bot commented Sep 20, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 20, 2019