Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keepalived can't call notify_master script timely #2407

Closed
yyang13 opened this issue Apr 28, 2024 · 4 comments
Closed

keepalived can't call notify_master script timely #2407

yyang13 opened this issue Apr 28, 2024 · 4 comments

Comments

@yyang13
Copy link

yyang13 commented Apr 28, 2024

Describe the bug
we configured 'notify_master "/usr/bin/refresh-and-enable-arp MASTER"', but "/usr/bin/refresh-and-enable-arp" can't be called timely when VRRP instance transitions to MASTER. Here are two long delay cases:

case 1

Apr 15 08:54:36 node2 Keepalived_vrrp[7989]: (VI_1) Entering MASTER STATE

but log of "/usr/bin/refresh-and-enable-arp" indicates it is called after about 15 seconds, what happened here?

2024/04/15 08:54:51.460339 ...

case 2

Apr 15 11:38:54 node2 Keepalived_vrrp[7989]: (VI_1) Entering MASTER STATE

but log of "/usr/bin/refresh-and-enable-arp" indicates it is called after about 3 seconds, what happened here?

2024/04/15 11:38:57.276773 ...

To Reproduce
Not easy to reproduce it. we don't know how to reproduce it. but this did happen several time on our online system. By the way, in most of cases, it is timely, so we're confused, what happened there in those bad cases?

Expected behavior
It should be called very timely.

Keepalived version
Keepalived v2.1.5 (07/13,2020)

Copyright(C) 2001-2020 Alexandre Cassen, acassen@gmail.com

Built with kernel headers for Linux 5.10.70
Running on Linux 5.10.0-28-amd64 #1 SMP Debian 5.10.209-2 (2024-01-31)

configure options: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-kernel-dir=debian/ --enable-snmp --enable-sha1 --enable-snmp-rfcv2 --enable-snmp-rfcv3 --enable-dbus --enable-json --enable-bfd --enable-regex build_alias=x86_64-linux-gnu CFLAGS=-g -O2 -ffile-prefix-map=/build/keepalived-jCUlld/keepalived-2.1.5=. -fstack-protector-strong -Wformat -Werror=format-security LDFLAGS=-Wl,-z,relro CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2

Config options: NFTABLES LVS REGEX VRRP VRRP_AUTH JSON BFD OLD_CHKSUM_COMPAT FIB_ROUTING SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3 DBUS

System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV4_DEVCONF IPV6_ADVANCED_API LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_OIFNAME FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS IP_MULTICAST_ALL LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS IPVS_TUN_TYPE IPVS_TUN_CSUM IPVS_TUN_GRE VRRP_VMAC VRRP_IPVLAN IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE VRF SO_MARK SCHED_RESET_ON_FORK

Distro (please complete the following information):

  • Name: Debian
  • Version: 11
  • Architecture: x86_64

Details of any containerisation or hosted service (e.g. AWS)
Run on VM

Configuration file:
! Configuration File for keepalived
global_defs {
vrrp_garp_master_refresh 60
vrrp_garp_master_refresh_repeat 3
}

vrrp_instance VI_1 {
state BACKUP
nopreempt
interface enp6s19.91
virtual_router_id 101
priority 99
advert_int 1
authentication {
auth_type PASS
auth_pass ********
}
virtual_ipaddress {
10.4.91.3/24 dev enp6s19.91
}
track_interface {
dummy0
enp6s19
}
notify_master "/usr/bin/refresh-and-enable-arp MASTER"
notify_master_rx_lower_pri "/usr/bin/refresh-and-enable-arp MASTER_RX_LOWER_PRI"
notify_backup "/usr/bin/pkill-and-disable-arp BACKUP"
notify_fault "/usr/bin/pkill-and-disable-arp FAULT"
}

Notify and track scripts
$ cat refresh-and-enable-arp
#!/bin/bash

/usr/bin/enable-arp
/usr/bin/refresh-arp >> /tmp/refresh-arp.log 2>&1
/usr/bin/keepalived-state-change $1

System Log entries
Apr 05 00:23:22 node2 Keepalived_vrrp[7989]: (VI_1) Entering BACKUP STATE
Apr 12 00:49:43 node2 Keepalived_vrrp[7989]: (VI_1) Entering MASTER STATE
Apr 12 00:49:43 node2 Keepalived_vrrp[7989]: (VI_1) Master received advert from 10.4.91.1 with higher priority 100, ours 99
Apr 12 00:49:43 node2 Keepalived_vrrp[7989]: (VI_1) Entering BACKUP STATE
Apr 15 08:54:36 node2 Keepalived_vrrp[7989]: (VI_1) Entering MASTER STATE
Apr 15 08:54:37 node2 Keepalived_vrrp[7989]: (VI_1) Master received advert from 10.4.91.1 with higher priority 100, ours 99
Apr 15 08:54:37 node2 Keepalived_vrrp[7989]: (VI_1) Entering BACKUP STATE
Apr 15 11:38:54 node2 Keepalived_vrrp[7989]: (VI_1) Entering MASTER STATE
Apr 15 11:38:54 node2 Keepalived_vrrp[7989]: (VI_1) Master received advert from 10.4.91.1 with higher priority 100, ours 99
Apr 15 11:38:54 node2 Keepalived_vrrp[7989]: (VI_1) Entering BACKUP STATE

Did keepalived coredump?

If so, can you please provide a stacktrace from the coredump, using gdb.

Additional context
vrrp advert packets lost, so BACKUP node becomes MASTER, then it becomes BACKUP because of receiving higher priority advert in one second.

@pqarmitage
Copy link
Collaborator

There is no delay in the keepalived code between logging that the VRRP instance is entering master state and the notify_master script being run (see vrrp_state_master_tx() in keepalived/vrrp/vrrp.c). It is of course possible that the kernel suspends keepalived to schedule other processes to run, but that is outside the control of keepalived. If you suspect this is happening you could set vrrp_rt_priority nn in global_defs where nn is a suitable number between 1 and 99, which will run the keepalived vrrp process at a higher priority.

It isn't clear where the time is logged in /usr/bin/refresh-and-enable-arp, but my guess is that it is /usr/bin/refresh-arp that does that. It is therefore quite possible that /usr/bin/enable-arp is causing the delay.

The bigger problem you have is that VRRP adverts appear to be being "lost". This could be a network problem, or it could be a scheduling problem; I suggest that this is the more important problem to resolve.

@yyang13
Copy link
Author

yyang13 commented Apr 29, 2024

There is no delay in the keepalived code between logging that the VRRP instance is entering master state and the notify_master script being run (see vrrp_state_master_tx() in keepalived/vrrp/vrrp.c). It is of course possible that the kernel suspends keepalived to schedule other processes to run, but that is outside the control of keepalived. If you suspect this is happening you could set vrrp_rt_priority nn in global_defs where nn is a suitable number between 1 and 99, which will run the keepalived vrrp process at a higher priority.

It isn't clear where the time is logged in /usr/bin/refresh-and-enable-arp, but my guess is that it is /usr/bin/refresh-arp that does that. It is therefore quite possible that /usr/bin/enable-arp is causing the delay.

The bigger problem you have is that VRRP adverts appear to be being "lost". This could be a network problem, or it could be a scheduling problem; I suggest that this is the more important problem to resolve.

refresh-arp logged timestamp at the beginning, so I guess enable-arp was scheduled very lately because enable-arp is very simple, it is impossible to have a so big delay.

Is it possible VM system clock is changed at those points because of some unkown reasons?

@yyang13
Copy link
Author

yyang13 commented Apr 29, 2024

There is no delay in the keepalived code between logging that the VRRP instance is entering master state and the notify_master script being run (see vrrp_state_master_tx() in keepalived/vrrp/vrrp.c). It is of course possible that the kernel suspends keepalived to schedule other processes to run, but that is outside the control of keepalived. If you suspect this is happening you could set vrrp_rt_priority nn in global_defs where nn is a suitable number between 1 and 99, which will run the keepalived vrrp process at a higher priority.

Does vrrp_rt_priority have any side effect? Is rt scheduler policy is FIFO or RR? Will notify_master script also inherit vrrp_rt_priority?

It isn't clear where the time is logged in /usr/bin/refresh-and-enable-arp, but my guess is that it is /usr/bin/refresh-arp that does that. It is therefore quite possible that /usr/bin/enable-arp is causing the delay.

The bigger problem you have is that VRRP adverts appear to be being "lost". This could be a network problem, or it could be a scheduling problem; I suggest that this is the more important problem to resolve.

@pqarmitage
Copy link
Collaborator

Is it possible VM system clock is changed at those points because of some unkown reasons?

This seems unlikely, but I don't know what is happening on your systems.

Does vrrp_rt_priority have any side effect? Is rt scheduler policy is FIFO or RR? Will notify_master script also inherit vrrp_rt_priority?

It uses SCHED_RR. It also sets flag SCHED_RESET_ON_FORK, so notify scripts will not inherit the realtime scheduling.

I really think you need to concentrate on discovering why you are getting the following messages:

Apr 12 00:49:43 node2 Keepalived_vrrp[7989]: (VI_1) Entering MASTER STATE
Apr 12 00:49:43 node2 Keepalived_vrrp[7989]: (VI_1) Master received advert from 10.4.91.1 with higher priority 100, ours 99
Apr 12 00:49:43 node2 Keepalived_vrrp[7989]: (VI_1) Entering BACKUP STATE

Fixing these happening could well resolve the issue of the scripts being delayed in their execution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants