Keepalived Master/Backup swapping frequently - V2.2.0 #1995

Closed
senthilnathann opened this issue Sep 13, 2021 · 23 comments

Comments

@senthilnathann

Describe the issue
We upgraded keepalived to v2.2.0. Since the upgrade we are getting a segmentation fault, the Master/Backup servers swap roles frequently, and we are also seeing split-brain.

To Reproduce
The issues above occur when we use a keepalived configuration based on sync groups.

Expected behavior

Keepalived version
Output of keepalived -v
Keepalived v2.2.0 (01/09,2021)

Copyright(C) 2001-2021 Alexandre Cassen, acassen@gmail.com

Built with kernel headers for Linux 3.10.0
Running on Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
Distro: CentOS Linux 7 (Core)

configure options: --prefix=/root/sen

Config options: LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV6_ADVANCED_API LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_PREF FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK FRA_OIFNAME IFA_FLAGS IP_MULTICAST_ALL NET_LINUX_IF_H_COLLISION NET_LINUX_IF_ETHER_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK VRRP_VMAC IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE SO_MARK SCHED_RESET_ON_FORK

Distro (please complete the following information):

  • Name : CentOS Linux release 7.6.1810 (Core)
  • Version [e.g. 29]
  • Architecture [e.g. x86_64] x86_64

Details of any containerisation or hosted service (e.g. AWS)
If keepalived is being run in a container or on a hosted service, provide full details

Configuration file:
A full copy of the configuration file, obfuscated if necessary to protect passwords and IP addresses

global_defs {
notification_email {
xxxx@gmail.com
}
notification_email_from xxx@gmail.com
smtp_server xxx
smtp_connect_timeout 30
router_id LVSID01
}
vrrp_sync_group VG1 {
group {
VI_1
VI_GATEWAY
}
}
vrrp_instance VI_GATEWAY {
interface bond0
virtual_router_id 77
priority 100
advert_int 1
higher_prio_send_advert true
smtp_alert
authentication {
auth_type PASS
auth_pass xxxxx
}
virtual_ipaddress {
172.20.1.1
}
}
vrrp_instance VI_1 {
interface bond0
virtual_router_id 76
priority 100
advert_int 1
higher_prio_send_advert true
smtp_alert
authentication {
auth_type PASS
auth_pass xxxxx
}
virtual_ipaddress {
172.20.131.127
172.20.131.130
}

Notify and track scripts
If any notify or track scripts are in use, please provide copies of them

System Log entries

Full keepalived system log entries from when keepalived started
Sep 13 16:23:25 172 Keepalived[31592]: Starting Keepalived v2.2.0 (01/09,2021)
Sep 13 16:23:25 172 Keepalived[31592]: Running on Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 (built for Linux 3.10.0)
Sep 13 16:23:25 172 Keepalived[31592]: Command line: '/sbin/keepalived' '-D'
Sep 13 16:23:25 172 Keepalived[31592]: Opening file '/etc/keepalived/keepalived.conf'.
Sep 13 16:23:25 172 Keepalived[31592]: Configuration file /etc/keepalived/keepalived.conf
Sep 13 16:23:25 172 Keepalived[31593]: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Sep 13 16:23:25 172 Keepalived[31593]: Starting Healthcheck child process, pid=31594
Sep 13 16:23:25 172 Keepalived[31593]: Starting VRRP child process, pid=31595
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: Registering Kernel netlink reflector
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: Registering Kernel netlink command channel
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 590) Extra '}' found
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 590) Unknown keyword '}'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 591) Unknown keyword 'real_server'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 591) Unexpected '{' - ignoring
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 592) Unknown keyword 'weight'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 593) Unknown keyword 'uthreshold'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 594) Unknown keyword 'lthreshold'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 596) Unknown keyword 'TCP_CHECK'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 596) Unexpected '{' - ignoring
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 597) Unknown keyword 'connect_timeout'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 598) Unknown keyword 'connect_port'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 599) Unknown keyword '}'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 600) Unknown keyword '}'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 601) Extra '}' found
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 601) Unknown keyword '}'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1781) Unterminated quote 'lthrn/https.sh 172.20.47.143 443"'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1781) Unmatched quote: 'lthrn/https.sh 172.20.47.143 443"'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1785) Unknown keyword 'real_server'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1785) Unexpected '{' - ignoring
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1786) Unknown keyword 'weight'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1787) Unknown keyword 'uthreshold'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1788) Unknown keyword 'lthreshold'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1790) Unknown keyword 'MISC_CHECK'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1790) Unexpected '{' - ignoring
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1791) Unknown keyword 'misc_path'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1792) Unknown keyword 'misc_timeout'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1793) Unknown keyword '}'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1794) Unknown keyword '}'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1795) Extra '}' found
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (/etc/keepalived/keepalived.conf: Line 1795) Unknown keyword '}'
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: Assigned address 172.20.133.246 for interface bond0
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: Assigned address fe80::ec4:7aff:feb0:112a for interface bond0
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: Registering gratuitous ARP shared channel
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (VI_GATEWAY) removing VIPs.
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (VI_1) removing VIPs.
Sep 13 16:23:25 172 Keepalived[31593]: pid 31594 exited due to segmentation fault (SIGSEGV).
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (VI_1) removing E-VIPs.
Sep 13 16:23:25 172 Keepalived[31593]: Please report a bug at https://github.com/acassen/keepalived/issues
Sep 13 16:23:25 172 Keepalived[31593]: and include this log from when keepalived started, a description
Sep 13 16:23:25 172 Keepalived[31593]: of what happened before the crash, your configuration file and the details below.
Sep 13 16:23:25 172 Keepalived[31593]: Also provide the output of keepalived -v, what Linux distro and version
Sep 13 16:23:25 172 Keepalived[31593]: you are running on, and whether keepalived is being run in a container or VM.
Sep 13 16:23:25 172 Keepalived[31593]: A failure to provide all this information may mean the crash cannot be investigated.
Sep 13 16:23:25 172 Keepalived[31593]: If you are able to provide a stack backtrace with gdb that would really help.
Sep 13 16:23:25 172 Keepalived[31593]: Source version 2.2.0
Sep 13 16:23:25 172 Keepalived[31593]: Built with kernel headers for Linux 3.10.0
Sep 13 16:23:25 172 Keepalived[31593]: Running on Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
Sep 13 16:23:25 172 Keepalived[31593]: Command line: '/sbin/keepalived' '-D'
Sep 13 16:23:25 172 Keepalived[31593]: configure options: --prefix=/root/sen
Sep 13 16:23:25 172 Keepalived[31593]: Config options: LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING
Sep 13 16:23:25 172 Keepalived[31593]: System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV6_ADVANCED_API LIBNL3 RTA_ENCAP
Sep 13 16:23:25 172 Keepalived[31593]: RTA_EXPIRES RTA_PREF FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK FRA_OIFNAME IFA_FLAGS
Sep 13 16:23:25 172 Keepalived[31593]: IP_MULTICAST_ALL NET_LINUX_IF_H_COLLISION NET_LINUX_IF_ETHER_H_COLLISION
Sep 13 16:23:25 172 Keepalived[31593]: LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK VRRP_VMAC IFLA_LINK_NETNSID CN_PROC
Sep 13 16:23:25 172 Keepalived[31593]: SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE SO_MARK
Sep 13 16:23:25 172 Keepalived[31593]: SCHED_RESET_ON_FORK
Sep 13 16:23:25 172 Keepalived[31593]: Healthcheck child process(31594) died: Respawning
Sep 13 16:23:25 172 Keepalived[31593]: Please log an issue at https://github.com/acassen/keepalived/issues/
Sep 13 16:23:25 172 Keepalived[31593]: and include a full copy of your keepalived configuration files, and
Sep 13 16:23:25 172 Keepalived[31593]: copies of the keepalived system log entries around the time this happened
Sep 13 16:23:25 172 Keepalived[31593]: Restart of Healthcheck process delayed 0 seconds to limit respawn rate
Sep 13 16:23:25 172 Keepalived[31593]: Starting Healthcheck child process, pid=31598
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (VI_GATEWAY) Entering BACKUP STATE (init)
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: (VI_1) Entering BACKUP STATE (init)
Sep 13 16:23:25 172 Keepalived_vrrp[31595]: VRRP sockpool: [ifindex( 4), family(IPv4), proto(112), fd(12,13)]
Sep 13 16:23:29 172 Keepalived_vrrp[31595]: (VI_GATEWAY) Receive advertisement timeout
Sep 13 16:23:29 172 Keepalived_vrrp[31595]: (VI_1) Receive advertisement timeout
Sep 13 16:23:29 172 Keepalived_vrrp[31595]: (VI_1) Entering MASTER STATE
Sep 13 16:23:29 172 Keepalived_vrrp[31595]: (VI_1) setting VIPs.
Sep 13 16:23:29 172 Keepalived_vrrp[31595]: (VI_1) setting E-VIPs.

The Master/Backup swap appears to be caused by the following:
Master log:
Sep 13 16:23:54 172 Keepalived_vrrp[4866]: (VI_1): send advert error 22 (Invalid argument)
Sep 13 16:23:54 172 Keepalived_vrrp[4866]: (VI_GATEWAY): send advert error 22 (Invalid argument).
Backup Log:
Sep 13 16:23:56 172 Keepalived_vrrp[31595]: (VI_GATEWAY) Receive advertisement timeout
Sep 13 16:23:56 172 Keepalived_vrrp[31595]: (VI_GATEWAY) Entering MASTER STATE
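
For context on the timing between the two log excerpts: in VRRPv2 (RFC 3768) a backup declares the master down after three advertisement intervals plus a skew time of (256 - priority)/256 seconds. The sketch below only works through that arithmetic for the advert_int 1 and priority 100 from the configuration in this issue. The function name is illustrative and not part of keepalived, and it assumes the instances run VRRP version 2.

```python
def master_down_interval(advert_int, priority):
    """Seconds a VRRPv2 backup waits without adverts before taking over (RFC 3768)."""
    if not 1 <= priority <= 255:
        raise ValueError("priority must be in 1..255")
    skew_time = (256 - priority) / 256  # seconds; higher priority means less skew
    return 3 * advert_int + skew_time


# Values from the configuration in this issue: advert_int 1, priority 100.
print(f"{master_down_interval(1, 100):.3f} s")
```

With these values the interval is about 3.6 seconds, so a few consecutive adverts that fail to leave the master are enough for the backup to time out and take over.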

Did keepalived coredump?
If so, can you please provide a stacktrace from the coredump, using gdb.

Additional context

@pqarmitage
Collaborator

@senthilnathann The log entries are showing that your configuration file is at least 1795 lines long, whereas the configuration you have provided is only 45 lines long.

Can you please provide a copy of your actual configuration file, unmodified other than passwords and email addresses, so that we can try and work out what is happening (please either add the configuration file as an attachment here, or email it to me directly). Also, if you are able to generate a stack backtrace with gdb from the coredump that would be very helpful, but your actual configuration is what we need first.

@senthilnathann
Author

senthilnathann commented Sep 15, 2021 via email

@pqarmitage
Collaborator

@senthilnathann Many thanks for your config file.

I have now identified that the issue is caused by a bug in the v2.2.0 code when reading configuration files larger than 4096 bytes. Commit ca03dc2 resolved the issue. You can either apply commit ca03dc2 directly to the v2.2.0 source or, better, update to v2.2.1, which is v2.2.0 with a small number of bug fixes and improvements.

@senthilnathann
Author

@pqarmitage Thank you so much for your quick action.

@senthilnathann
Author

senthilnathann commented Sep 24, 2021

@pqarmitage I have upgraded keepalived to v2.2.1, but I am again getting the error below, and Master/Backup still swap frequently.

Sep 25 00:41:24 172 Keepalived_vrrp[15447]: (VI_GATEWAY): send advert error 22 (Invalid argument)
Sep 25 00:41:24 172 Keepalived_vrrp[15447]: (VI_1): send advert error 22 (Invalid argument)
Sep 25 00:41:25 172 Keepalived_vrrp[15447]: (VI_GATEWAY): send advert error 22 (Invalid argument)
Sep 25 00:41:25 172 Keepalived_vrrp[15447]: (VI_1): send advert error 22 (Invalid argument)
Sep 25 00:41:26 172 Keepalived_vrrp[15447]: (VI_1): send advert error 22 (Invalid argument)

Note: Apart from the errors above, we are not getting any other errors (the errors we saw last time no longer occur). The same configuration file is in use.

Keepalived -v

Keepalived v2.2.1 (01/17,2021)

Copyright(C) 2001-2021 Alexandre Cassen, acassen@gmail.com

Built with kernel headers for Linux 3.10.0
Running on Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
Distro: CentOS Linux 7 (Core)

configure options: --prefix=/root/sen

Config options: LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options: PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV6_ADVANCED_API LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_PREF FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK FRA_OIFNAME IFA_FLAGS IP_MULTICAST_ALL NET_LINUX_IF_H_COLLISION NET_LINUX_IF_ETHER_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK VRRP_VMAC IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE SO_MARK SCHED_RESET_ON_FORK

@pqarmitage reopened this Sep 25, 2021

@pqarmitage
Collaborator

@senthilnathann When keepalived is running, could you please execute kill -USR1 $(cat /run/keepalived.pid), which will create /tmp/keepalived.data, /tmp/keepalived_parent.data, /tmp/keepalived_check.data (and /tmp/keepalived_bfd.data if you have any BFD configuration).

Could you please post those files on this issue (or email them to me) - you may want to remove any sensitive information before posting the files. Please also include the full system logs from the time keepalived starts to when the .data files were created. We will then be able to see what state keepalived is in.

@senthilnathann
Author

@pqarmitage I have sent the requested keepalived data files and the log file to your email address separately. Please check.

@pqarmitage
Collaborator

@senthilnathann Since keepalived.data shows that the only network interfaces on the system are lo, eth0, eth1 and bond0, I am assuming that eth0 and eth1 are both enslaved to bond0. I also note from keepalived.data that eth1 is showing as not running; e.g. its network cable is disconnected. I suspect that this is the cause of the problems.

The messages

Oct  4 15:33:55 172 Keepalived_vrrp[3201]: Remote SMTP server [192.168.5.125]:25 connected.
Oct  4 15:33:55 172 Keepalived_vrrp[3201]: Remote SMTP server [192.168.5.125]:25 connected.
Oct  4 15:34:25 172 Keepalived_vrrp[3201]: Timeout reading data to remote SMTP server [192.168.5.125]:25.
Oct  4 15:34:25 172 Keepalived_vrrp[3201]: Timeout reading data to remote SMTP server [192.168.5.125]:25.

also indicate that there is a network problem.

From 15:36:41 onwards keepalived was reporting send advert error 22 which I suspect is due to eth1 not being in running state. Seeing the full system logs around this time would help understand this. keepalived had only just started sending adverts, since it became master at 15:36:39, when the previous master either stopped running or went into fault state (causing it to send priority 0 adverts).

Can you please post/send the following information:

  1. The output of ip -d link show
  2. The full system logs (i.e. not just keepalived's logs) from 15:33:55 to 15:36:45
  3. Any log entries relating to eth1
  4. The full system logs from the other system from 15:33:55 to 15:36:45.

With the above information we should get a better idea about what is happening.

@senthilnathann
Author

senthilnathann commented Oct 7, 2021

@pqarmitage The eth1 interface was already down before we upgraded keepalived from 2.0.7 to 2.2.1, and this error (send advert error 22) did not occur with 2.0.7. I will explain the full details again.

Chronology of events:
We upgraded keepalived to 2.2.1 on both the Master and the Backup server. We first upgraded 172.20.133.246, which started as Master at 15:28:46, and then upgraded 172.20.133.247, which started as Backup at 15:31:23.
A few seconds later, 172.20.133.247 automatically became Master. Please find the logs below.

172.20.133.246 logs:
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: (VI_GATEWAY): send advert error 22 (Invalid argument)
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: (VI_1): send advert error 22 (Invalid argument)
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: (VI_1): send advert error 22 (Invalid argument)
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: (VI_1) Master received advert from 172.20.133.247 with same priority 100 but higher IP address than ours
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: (VI_1) Entering BACKUP STATE
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: (VI_1) removing VIPs.
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: (VI_1) removing E-VIPs.
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: VRRP_Group(VG1) Syncing instances to BACKUP state
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: (VI_GATEWAY) Entering BACKUP STATE
Oct 4 15:31:30 172 Keepalived_vrrp[26343]: (VI_GATEWAY) removing VIPs.

172.20.133.247 logs:
Oct 4 15:31:23 172 Keepalived[26235]: Starting Keepalived v2.2.1 (01/17,2021)
Oct 4 15:31:23 172 Keepalived[26235]: Running on Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 (built for Linux 3.10.0)
Oct 4 15:31:23 172 Keepalived[26235]: Command line: '/sbin/keepalived' '-D'
Oct 4 15:31:23 172 Keepalived[26235]: Opening file '/etc/keepalived/keepalived.conf'.
Oct 4 15:31:23 172 Keepalived[26235]: Configuration file /etc/keepalived/keepalived.conf
Oct 4 15:31:23 172 Keepalived[26236]: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Oct 4 15:31:23 172 Keepalived[26236]: Starting Healthcheck child process, pid=26237
Oct 4 15:31:23 172 Keepalived[26236]: Starting VRRP child process, pid=26238
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: Registering Kernel netlink reflector
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: Registering Kernel netlink command channel
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: Assigned address 172.20.133.247 for interface bond0
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: Assigned address fe80::ec4:7aff:feb0:1128 for interface bond0
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: Registering gratuitous ARP shared channel
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: (VI_GATEWAY) removing VIPs.
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: (VI_1) removing VIPs.
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: (VI_1) removing E-VIPs.
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: (VI_GATEWAY) Entering BACKUP STATE (init)
Oct 4 15:31:23 172 Keepalived_vrrp[26238]: (VI_1) Entering BACKUP STATE (init)
Oct 4 15:31:30 172 Keepalived_vrrp[26238]: (VI_GATEWAY) Receive advertisement timeout
Oct 4 15:31:30 172 Keepalived_vrrp[26238]: (VI_1) Receive advertisement timeout
Oct 4 15:31:30 172 Keepalived_vrrp[26238]: (VI_1) Entering MASTER STATE
Oct 4 15:31:30 172 Keepalived_vrrp[26238]: (VI_1) setting VIPs.
Oct 4 15:31:30 172 Keepalived_vrrp[26238]: (VI_1) setting E-VIPs.

We don't know why we got the send advert error 22 errors on 172.20.133.246.
To try to reproduce the issue, we manually stopped keepalived on 172.20.133.246 at 15:36:39 and shared those logs with you.

There were no network issues between the Master and Backup servers, and nothing was written to the system logs at that time.
Note: eth1 was already down, and eth0 is working fine.

ip -d link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 promiscuity 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:b0:11:2a brd ff:ff:ff:ff:ff:ff promiscuity 0
bond_slave state ACTIVE mii_status UP link_failure_count 1 perm_hwaddr 0c:c4:7a:b0:11:2a queue_id 0 addrgenmode eui64 numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state DOWN mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:b0:11:2a brd ff:ff:ff:ff:ff:ff promiscuity 0
bond_slave state BACKUP mii_status DOWN link_failure_count 0 perm_hwaddr 0c:c4:7a:b0:11:2b queue_id 0 addrgenmode eui64 numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 0c:c4:7a:b0:11:2a brd ff:ff:ff:ff:ff:ff promiscuity 0
bond mode active-backup active_slave eth0 miimon 1000 updelay 0 downdelay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary eth0 primary_reselect always fail_over_mac none xmit_hash_policy layer2 resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
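
The bond membership and carrier state discussed above can be read mechanically from this kind of output. The following is a minimal sketch, not a keepalived tool: the function name and the regular expression are illustrative, and it assumes the header-line layout shown in the `ip -d link show` output above.

```python
import re

# Header lines look like:
# "3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 ... master bond0 state DOWN ..."
HEADER = re.compile(r"^\d+:\s+([^:@\s]+)(?:@\S+)?:\s+<([^>]*)>(.*)$")


def bond_slaves(ip_link_output, bond):
    """Map each interface enslaved to `bond` to True if it has carrier (LOWER_UP)."""
    slaves = {}
    for line in ip_link_output.splitlines():
        match = HEADER.match(line.strip())
        if not match:
            continue  # detail lines such as "link/ether ..." or "bond_slave ..."
        name, flags, rest = match.group(1), match.group(2).split(","), match.group(3)
        if re.search(rf"\bmaster {re.escape(bond)}\b", rest):
            slaves[name] = "LOWER_UP" in flags
    return slaves


SAMPLE = """\
2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state DOWN mode DEFAULT group default qlen 1000
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
"""

print(bond_slaves(SAMPLE, "bond0"))
```

For the three header lines copied from the output above, eth0 is reported with carrier and eth1 without, which matches the active-backup bond running on eth0 alone.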

After executing the kill -USR1 $(cat /run/keepalived.pid) command, those errors were no longer printed in the logs, and Master/Backup did not swap either. Why are we getting the send advert error 22 errors? Please clarify.

@pqarmitage
Collaborator

The reason you are getting the send advert error 22 errors with v2.2.1, but not with v2.0.7, is that I added the code to print the error in commit 3f038ae, which was first included in v2.0.16. Prior to that any send errors were silently ignored, so presumably they were occurring, but were just not being logged.

I will investigate further to see if I can identify what can cause the error 22s.

@pqarmitage
Collaborator

@senthilnathann Do you ever get the send advert error 22 errors on 172.20.133.247 when it is master, or are they only occurring on 172.20.133.246?

@senthilnathann
Author

@pqarmitage On that day (October 4th), send advert error 22 occurred only on the 172.20.133.246 server and not on 172.20.133.247 (when it was master).
Note: send advert error 22 occurred on both servers on September 25 (when I first encountered the problem).

@pqarmitage
Collaborator

@senthilnathann Are you still getting the error 22s? If so, would it be possible to change your configuration so that the VRRP instances use the eth0 interface rather than bond0? I would like to understand whether the problem relates to using a bond interface.

@senthilnathann
Author

@pqarmitage I have changed the interface from bond0 to eth0, but I am getting the same error and Master/Backup still swap frequently.

Oct 27 20:00:46 172 Keepalived_vrrp[19087]: (VI_GATEWAY) Entering BACKUP STATE
Oct 27 20:00:46 172 Keepalived_vrrp[19087]: (VI_GATEWAY) removing VIPs.
Oct 27 20:00:46 172 Keepalived_vrrp[19087]: VRRP_Group(VG1) Syncing instances to BACKUP state
Oct 27 20:00:46 172 Keepalived_vrrp[19087]: (VI_1) Entering BACKUP STATE
Oct 27 20:00:46 172 Keepalived_vrrp[19087]: (VI_1) removing VIPs.
Oct 27 20:00:46 172 Keepalived_vrrp[19087]: (VI_1) removing E-VIPs.
Oct 27 20:00:46 172 Keepalived_vrrp[19087]: Remote SMTP server [192.168.5.125]:25 connected.
Oct 27 20:00:46 172 Keepalived_vrrp[19087]: Remote SMTP server [192.168.5.125]:25 connected.
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_1) Receive advertisement timeout
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_GATEWAY) Receive advertisement timeout
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_GATEWAY): send advert error 22 (Invalid argument)
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_GATEWAY) Entering MASTER STATE
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_GATEWAY) setting VIPs.
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_GATEWAY) Sending/queueing gratuitous ARPs on eth0 for 172.20.1.1
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: Sending gratuitous ARP on eth0 for 172.20.1.1
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: Sending gratuitous ARP on eth0 for 172.20.1.1
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: Sending gratuitous ARP on eth0 for 172.20.1.1
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: Sending gratuitous ARP on eth0 for 172.20.1.1
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: Sending gratuitous ARP on eth0 for 172.20.1.1
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: VRRP_Group(VG1) Syncing instances to MASTER state
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_1): send advert error 22 (Invalid argument)
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_1) Entering MASTER STATE
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_1) setting VIPs.
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_1) setting E-VIPs.
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: (VI_1) Sending/queueing gratuitous ARPs on eth0 for 172.20.131.127
Oct 27 20:00:51 172 Keepalived_vrrp[19087]: Sending gratuitous ARP on eth0 for 172.20.131.127

@pqarmitage
Collaborator

@senthilnathann I am going to produce a patch for you to apply that will log what is being sent to sendmsg(), which is what is returning the error 22, so that we can see if anything changes. Will you be happy to build with that patch for test purposes?

@senthilnathann
Author

@pqarmitage Thanks for your prompt reply. Please provide the patch, and I will build and test it.

@pqarmitage
Collaborator

@senthilnathann That was good timing; I have just completed the patch, which is attached.

The patch will write all the parameters and data of the sendmsg() call to /tmp/sendmsg.dmp (feel free to edit the patch if you want to change the location of the file).

The file might get quite large, but what we want to see is the details of successful messages, and if anything changes for the messages for which error 22 is returned.

010-debug-sendmsg.patch.txt

@senthilnathann
Author

@pqarmitage I have attached the sendmsg logs (both successful sends and sends that failed with Invalid argument). Please check.
sendmsg.txt

For a few minutes we got send advert error 22 (Invalid argument) and Master/Backup swaps happened. After 3-4 minutes the higher-priority server was acting as Master and the error did not occur again.

Note: I am now using the bond0 interface.

@pqarmitage
Collaborator

@senthilnathann Many thanks for this data. This shows that keepalived is not doing anything different between the successful send of packets at 00:20:43 and the failed messages at 00:20:44. The only difference between the successfully sent packets and the ones returning error 22 is the 6th byte of the iov block, which is the low-order byte of the id field of the IPv4 header; the id is incremented for each advert sent per VRRP instance.

This therefore looks as though something strange is happening in the kernel causing the error 22.
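
The kind of comparison described above can be sketched as follows. This is not the debug patch and does not read /tmp/sendmsg.dmp, whose format is not shown in this thread. The two headers are made-up 20-byte IPv4 headers (checksum left as zero) that differ only in the Identification field, and the helper names are illustrative.

```python
import struct


def differing_offsets(a, b):
    """Return the 0-based offsets at which two equal-length buffers differ."""
    if len(a) != len(b):
        raise ValueError("buffers differ in length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def ipv4_id(header):
    """Extract the 16-bit Identification field (offsets 4-5) of an IPv4 header."""
    (ident,) = struct.unpack_from("!H", header, 4)
    return ident


# Hypothetical headers: version/IHL, TOS, total length, id, flags/fragment,
# TTL 255, protocol 112 (VRRP), checksum, source 172.20.133.246, dest 224.0.0.18.
sent_ok = bytes.fromhex("45c00028" "0001" "0000" "ff70" "0000" "ac1485f6" "e0000012")
sent_failed = bytes.fromhex("45c00028" "0002" "0000" "ff70" "0000" "ac1485f6" "e0000012")

print(differing_offsets(sent_ok, sent_failed))  # offset 5 is the 6th byte
print(ipv4_id(sent_ok), ipv4_id(sent_failed))
```

A difference confined to offset 5 means only the packet id changed, which is expected from one advert to the next and gives the kernel no obvious reason to reject the call.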

Can you have a look at the output of ip -s -s link show bond0 and ip -s -s link show eth0? You can also cat /proc/net/dev. What we are looking for is any type of error. It would be interesting to do this while you are getting the error 22s, to see which counters, if any, are increasing.
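
One way to see which counters are increasing is to take two snapshots of /proc/net/dev a few seconds apart and compare them. The sketch below assumes the usual layout of that file (two header lines, then one line per interface with 8 receive and 8 transmit counters); the function names and the choice of which counters count as error-type are illustrative.

```python
RX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed")
ERROR_SUFFIXES = ("errs", "drop", "fifo", "frame", "colls", "carrier")


def parse_proc_net_dev(text):
    """Parse /proc/net/dev text into {interface: {"rx_errs": n, "tx_drop": n, ...}}."""
    stats = {}
    for line in text.splitlines():
        name, sep, values = line.partition(":")
        fields = values.split()
        # Header lines contain no ':'; data lines carry 16 numeric counters.
        if not sep or len(fields) != 16 or not all(f.isdigit() for f in fields):
            continue
        numbers = [int(f) for f in fields]
        counters = {f"rx_{k}": v for k, v in zip(RX_FIELDS, numbers[:8])}
        counters.update({f"tx_{k}": v for k, v in zip(TX_FIELDS, numbers[8:])})
        stats[name.strip()] = counters
    return stats


def growing_counters(before, after):
    """Return {(interface, counter): delta} for error-type counters that increased."""
    deltas = {}
    for iface, counters in after.items():
        for key, value in counters.items():
            old = before.get(iface, {}).get(key, value)
            if key.split("_", 1)[1] in ERROR_SUFFIXES and value > old:
                deltas[(iface, key)] = value - old
    return deltas
```

On a live host, read the file twice while the error 22s are being logged, for example with `open("/proc/net/dev").read()`, pass both snapshots through `parse_proc_net_dev`, and hand the results to `growing_counters`. An empty result suggests the interface counters are not where the failure shows up.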

@senthilnathann
Author

@pqarmitage Please find the output attached.

iplink.txt

@pqarmitage
Collaborator

@senthilnathann The interface information doesn't appear to show any significant errors.

I think we have now established a position that keepalived is not doing anything wrong in the sendmsg (all arguments being passed are the same), so it is difficult to see how we can go any further with investigating keepalived.

The only other thing I can think of to try: can you ping the master and backup from each other while you are getting the error 22s?

A search on the internet for sendmsg einval might give some further suggestions. For example https://access.redhat.com/solutions/2985371. I will have a look at that later, but it would be helpful if you could do a search and see if anything looks like it could be possible.

@senthilnathann
Author

@pqarmitage Thanks for your patience and great help on this issue.

Please find the details of the issue below.
Chronology of events:
1) During the keepalived upgrade we swapped Master/Backup. At the time of the swap, the ARP table on the LB Master filled up (the default threshold of 1024 was reached).
2) As a result, communication between Master and Backup was disrupted, send advert error 22 was logged, and swaps happened frequently and automatically.
3) Once garbage collection cleared the ARP table entries, the errors stopped.
More than 1000 hosts tried to connect to the keepalived Master that had just taken over, which is why the ARP table filled up.
After we increased the gc_thresh values, send advert error 22 no longer occurred and Master/Backup no longer swapped.
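
A rough way to watch for this condition is to compare the number of ARP entries with the hard limit gc_thresh3. The sketch below is an approximation under stated assumptions: it counts lines in /proc/net/arp, which may not list every neighbour entry the kernel counts against the limit, and the 0.8 warning ratio and all function names are illustrative choices.

```python
from pathlib import Path

ARP_TABLE = Path("/proc/net/arp")
GC_THRESH3 = Path("/proc/sys/net/ipv4/neigh/default/gc_thresh3")


def count_arp_entries(proc_net_arp):
    """Count entries in /proc/net/arp text; the first line is the column header."""
    return sum(1 for line in proc_net_arp.splitlines()[1:] if line.strip())


def near_limit(entries, gc_thresh3, warn_ratio=0.8):
    """True when the table holds at least warn_ratio of the hard limit."""
    return entries >= warn_ratio * gc_thresh3


def check_live_system():
    """Read the live values on a Linux host (not called automatically)."""
    entries = count_arp_entries(ARP_TABLE.read_text())
    limit = int(GC_THRESH3.read_text())
    return entries, limit, near_limit(entries, limit)
```

Calling `check_live_system()` on the load balancer shortly after a takeover would show how close the table is to the limit; with the default limit of 1024 mentioned above, 1000+ connecting hosts would already trip the warning.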

Once again, thanks for your great help and support.

@pqarmitage
Collaborator

@senthilnathann I am glad we have got to the bottom of this, and that the problem is resolved for you.

For those who don't have access to the RedHat solution, the problem occurs when using raw, udp (including multicast) or icmp sockets and there are a large number of neighbours on the network, causing the ARP table to fill up.

The recommendation, if there are 900 neighbours, is:

sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

This can be made permanent by setting net.ipv4.neigh.default.gc_thresh1=1024 etc. in, for example, /etc/sysctl.d/50-increase-ipv4-neigh.conf.

If there are more than 1000 neighbours, then the values above would need to be increased accordingly.
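
As a small illustration of the persistent form, the helper below renders the three settings as lines suitable for a sysctl.d file. It is only a sketch: the function name is hypothetical, it does not write any file or choose values for you, and the thresholds passed in are the ones quoted above.

```python
def neigh_sysctl_conf(thresh1, thresh2, thresh3):
    """Render the IPv4 neighbour-table GC thresholds as sysctl.d lines."""
    if not 0 < thresh1 <= thresh2 <= thresh3:
        raise ValueError("expected 0 < gc_thresh1 <= gc_thresh2 <= gc_thresh3")
    values = (thresh1, thresh2, thresh3)
    return "".join(
        f"net.ipv4.neigh.default.gc_thresh{i} = {value}\n"
        for i, value in enumerate(values, start=1)
    )


# The values recommended above for about 900 neighbours.
print(neigh_sysctl_conf(1024, 2048, 4096), end="")
```

The output can be saved as, for example, /etc/sysctl.d/50-increase-ipv4-neigh.conf and loaded with `sysctl --system`.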

EINVAL is a somewhat unexpected error for this issue, and really doesn't help identify the problem. The error is returned by kernel function ip_finish_output2(). RHEL7 uses a 3.10 kernel, and the function has been updated since then, but I am not clear whether the problem still exists. Why the ARP cache is involved when sending to a multicast address I am not sure.
