Keepalived v2.2.7 does not recover when the Ethernet cable is unplugged and plugged back in #2164

Closed
smithAchang opened this issue Jul 19, 2022 · 28 comments

@smithAchang

Describe the bug
I have tested keepalived v2.0.20 and v2.2.7 on CentOS 7 with kernel 3.10.957, using almost the same config except for the "use_vmac_addr" option.

keepalived v2.2.7 cannot recover from the fault state after the cable has been unplugged and plugged back in.

Comparison

v2.0.20 works correctly when I unplug and replug the network cable, even after repeating the operation several times.


To Reproduce

  • When the VIP is present on the VMAC interface and ARP announcements have been sent, I unplug the Ethernet cable.
  • After a while, once I have seen the priority '0' advert in the logs, I plug the Ethernet cable back in to restore the network.

Expected behavior
keepalived v2.2.x, including v2.2.7, should work correctly when the Ethernet cable is unplugged and plugged back in.

Keepalived version
keepalived 2.2.7

Distro (please complete the following information):

  • Name: CentOS7
  • Version: 7.6 with kernel version 3.10.957
  • Architecture: x86_64

Configuration file:

global_defs {
vrrp_version 3
}

vrrp_instance lbcp {

state BACKUP
advert_int 0.1
interface eno1
priority 160
virtual_router_id 181
nopreempt
use_vmac
use_vmac_addr

virtual_ipaddress {
176.16.167.17/24 dev eno1
}
}

Notify and track scripts
without any scripts

System Log entries
Tue Jul 19 02:50:31 2022: Starting Keepalived v2.2.7 (07/15,2022), git commit +
Tue Jul 19 02:50:31 2022: Running on Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 (built for Linux 3.10.0)
Tue Jul 19 02:50:31 2022: Command line: 'keepalived/keepalived' '-f' '246.conf' '--log-console' '--log-detail' '--no-syslog'
Tue Jul 19 02:50:31 2022: '--dont-fork' '--dump-conf'
Tue Jul 19 02:50:31 2022: Opening file '246.conf'.
Tue Jul 19 02:50:31 2022: Configuration file 246.conf
Tue Jul 19 02:50:31 2022: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Tue Jul 19 02:50:31 2022: Starting VRRP child process, pid=5368
Tue Jul 19 02:50:31 2022: Registering Kernel netlink reflector
Tue Jul 19 02:50:31 2022: Registering Kernel netlink command channel
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha begin,
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha addr_chg 0 !!!
Tue Jul 19 02:50:31 2022: haha2, VRRP_STATE_MAST:2
Tue Jul 19 02:50:31 2022: haha end !!!
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha begin,
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha addr_chg 0 !!!
Tue Jul 19 02:50:31 2022: haha2, VRRP_STATE_MAST:2
Tue Jul 19 02:50:31 2022: haha end !!!
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha begin,
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha addr_chg 0 !!!
Tue Jul 19 02:50:31 2022: haha2, VRRP_STATE_MAST:2
Tue Jul 19 02:50:31 2022: haha end !!!
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha begin,
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha addr_chg 0 !!!
Tue Jul 19 02:50:31 2022: haha2, VRRP_STATE_MAST:2
Tue Jul 19 02:50:31 2022: haha end !!!
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha begin,
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha addr_chg 0 !!!
Tue Jul 19 02:50:31 2022: haha2, VRRP_STATE_MAST:2
Tue Jul 19 02:50:31 2022: haha end !!!
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha begin,
Tue Jul 19 02:50:31 2022: haha2, VRRP_STATE_MAST:2
Tue Jul 19 02:50:31 2022: haha end !!!
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha begin,
Tue Jul 19 02:50:31 2022: netlink_if_address_filter haha addr_chg 0 !!!
Tue Jul 19 02:50:31 2022: haha2, VRRP_STATE_MAST:2
Tue Jul 19 02:50:31 2022: haha end !!!
Tue Jul 19 02:50:31 2022: Assigned address 10.194.28.246 for interface eno1
Tue Jul 19 02:50:31 2022: Assigned address fe80::4639:c4ff:fe94:4e37 for interface eno1
Tue Jul 19 02:50:31 2022: (lbcp): Success creating VMAC interface vrrp.181
Tue Jul 19 02:50:31 2022: netlink_link_filter nana ...
Tue Jul 19 02:50:31 2022: NOTICE: setting sysctl net.ipv4.conf.all.rp_filter from 1 to 0
Tue Jul 19 02:50:31 2022: netlink_link_filter nana ...
Tue Jul 19 02:50:31 2022: process_interface_flags_change Netlink reports ifp vrrp.181 status ...
Tue Jul 19 02:50:31 2022: Registering gratuitous ARP shared channel
Tue Jul 19 02:50:31 2022: ------< Global definitions >------
Tue Jul 19 02:50:31 2022: Network namespace = (default)
Tue Jul 19 02:50:31 2022: Network namespace ipvs = (main namespace)
Tue Jul 19 02:50:31 2022: Router ID = [unknown]
Tue Jul 19 02:50:31 2022: Default smtp_alert = unset
Tue Jul 19 02:50:31 2022: Default smtp_alert_vrrp = unset
Tue Jul 19 02:50:31 2022: No test config before reload
Tue Jul 19 02:50:31 2022: Startup complete
Tue Jul 19 02:50:31 2022: Dynamic interfaces = false
Tue Jul 19 02:50:31 2022: FIFO write vrrp states on reload = false
Tue Jul 19 02:50:31 2022: VRRP notify priority changes = false
Tue Jul 19 02:50:31 2022: VRRP IPv4 mcast group = 224.0.0.18
Tue Jul 19 02:50:31 2022: VRRP IPv6 mcast group = ff02::12
Tue Jul 19 02:50:31 2022: Gratuitous ARP delay = 5
Tue Jul 19 02:50:31 2022: Gratuitous ARP repeat = 5
Tue Jul 19 02:50:31 2022: Gratuitous ARP refresh timer = 0
Tue Jul 19 02:50:31 2022: Gratuitous ARP refresh repeat = 1
Tue Jul 19 02:50:31 2022: Gratuitous ARP lower priority delay = 5
Tue Jul 19 02:50:31 2022: Gratuitous ARP lower priority repeat = 5
Tue Jul 19 02:50:31 2022: Num adverts before down = 3
Tue Jul 19 02:50:31 2022: Gratuitous ARP for each secondary VMAC = 0s
Tue Jul 19 02:50:31 2022: Send advert after receive lower priority advert = true
Tue Jul 19 02:50:31 2022: Send advert after receive higher priority advert = false
Tue Jul 19 02:50:31 2022: Gratuitous ARP interval = 0.000000
Tue Jul 19 02:50:31 2022: Gratuitous NA interval = 0.000000
Tue Jul 19 02:50:31 2022: VRRP default protocol version = 3
Tue Jul 19 02:50:31 2022: VRRP check unicast_src = false
Tue Jul 19 02:50:31 2022: VRRP skip check advert addresses = false
Tue Jul 19 02:50:31 2022: VRRP strict mode = false
Tue Jul 19 02:50:31 2022: Max auto priority = 0
Tue Jul 19 02:50:31 2022: Min auto priority delay = 1000000 usecs
Tue Jul 19 02:50:31 2022: VRRP process priority = 0
Tue Jul 19 02:50:31 2022: VRRP don't swap = false
Tue Jul 19 02:50:31 2022: VRRP realtime priority = 0
Tue Jul 19 02:50:31 2022: VRRP realtime limit = 10000
Tue Jul 19 02:50:31 2022: Script security disabled
Tue Jul 19 02:50:31 2022: Script user 'keepalived_script' does not exist
Tue Jul 19 02:50:31 2022: Default script uid:gid 0:0
Tue Jul 19 02:50:31 2022: vrrp_netlink_cmd_rcv_bufs = 0
Tue Jul 19 02:50:31 2022: vrrp_netlink_cmd_rcv_bufs_force = 0
Tue Jul 19 02:50:31 2022: vrrp_netlink_monitor_rcv_bufs = 0
Tue Jul 19 02:50:31 2022: vrrp_netlink_monitor_rcv_bufs_force = 0
Tue Jul 19 02:50:31 2022: process_monitor_rcv_bufs = 0
Tue Jul 19 02:50:31 2022: process_monitor_rcv_bufs_force = 0
Tue Jul 19 02:50:31 2022: rx_bufs_multiples = 3
Tue Jul 19 02:50:31 2022: umask = 0177
Tue Jul 19 02:50:31 2022: ------< VRRP Topology >------
Tue Jul 19 02:50:31 2022: VRRP Instance = lbcp
Tue Jul 19 02:50:31 2022: VRRP Version = 3
Tue Jul 19 02:50:31 2022: Flags:
Tue Jul 19 02:50:31 2022: Wantstate = BACKUP
Tue Jul 19 02:50:31 2022: Number of config faults = 0
Tue Jul 19 02:50:31 2022: Use VMAC, i/f name vrrp.181, is_up = true, xmit_base = false
Tue Jul 19 02:50:31 2022: Use VMAC for VIPs on other interfaces
Tue Jul 19 02:50:31 2022: Interface = vrrp.181, vmac on eno1
Tue Jul 19 02:50:31 2022: Using src_ip = 10.194.28.246
Tue Jul 19 02:50:31 2022: Multicast address 224.0.0.18
Tue Jul 19 02:50:31 2022: Gratuitous ARP delay = 5
Tue Jul 19 02:50:31 2022: Gratuitous ARP repeat = 5
Tue Jul 19 02:50:31 2022: Gratuitous ARP refresh = 0
Tue Jul 19 02:50:31 2022: Gratuitous ARP refresh repeat = 1
Tue Jul 19 02:50:31 2022: Gratuitous ARP lower priority delay = 5
Tue Jul 19 02:50:31 2022: Gratuitous ARP lower priority repeat = 5
Tue Jul 19 02:50:31 2022: Down timer adverts = 3
Tue Jul 19 02:50:31 2022: Send advert after receive lower priority advert = true
Tue Jul 19 02:50:31 2022: Send advert after receive higher priority advert = false
Tue Jul 19 02:50:31 2022: Virtual Router ID = 181
Tue Jul 19 02:50:31 2022: Priority = 160
Tue Jul 19 02:50:31 2022: Advert interval = 100 milli-sec
Tue Jul 19 02:50:31 2022: Preempt = disabled
Tue Jul 19 02:50:31 2022: Promote_secondaries = disabled
Tue Jul 19 02:50:31 2022: last rx checksum = 0x0000, priority 0
Tue Jul 19 02:50:31 2022: last tx checksum = 0x0000, priority 0
Tue Jul 19 02:50:31 2022: Virtual IP (1):
Tue Jul 19 02:50:31 2022: 176.16.167.17/24 dev vrrp.181@eno1 scope global
Tue Jul 19 02:50:31 2022: No sockets allocated
Tue Jul 19 02:50:31 2022: Using smtp notification = no
Tue Jul 19 02:50:31 2022: Notify deleted = Fault
Tue Jul 19 02:50:31 2022: Notify priority changes = false
Tue Jul 19 02:50:31 2022: ------< Interfaces >------
Tue Jul 19 02:50:31 2022: Name = lo
Tue Jul 19 02:50:31 2022: index = 1
Tue Jul 19 02:50:31 2022: IPv4 address = 127.0.0.1
Tue Jul 19 02:50:31 2022: IPv6 address = (none)
Tue Jul 19 02:50:31 2022: State = UP, RUNNING, no broadcast, loopback, no multicast
Tue Jul 19 02:50:31 2022: Seen up = 1
Tue Jul 19 02:50:31 2022: Delayed state change running = false
Tue Jul 19 02:50:31 2022: Up debounce timer = 0us
Tue Jul 19 02:50:31 2022: Down debounce timer = 0us
Tue Jul 19 02:50:31 2022: MTU = 65536
Tue Jul 19 02:50:31 2022: HW Type = LOOPBACK
Tue Jul 19 02:50:31 2022: NIC netlink status update
Tue Jul 19 02:50:31 2022: Reset ARP config counter 0
Tue Jul 19 02:50:31 2022: Original arp_ignore 0
Tue Jul 19 02:50:31 2022: Original arp_filter 0
Tue Jul 19 02:50:31 2022: rp_filter 0
Tue Jul 19 02:50:31 2022: Original promote_secondaries 0
Tue Jul 19 02:50:31 2022: Reset promote_secondaries counter 0
Tue Jul 19 02:50:31 2022: Name = eno1
Tue Jul 19 02:50:31 2022: index = 2
Tue Jul 19 02:50:31 2022: IPv4 address = 10.194.28.246
Tue Jul 19 02:50:31 2022: IPv6 address = fe80::4639:c4ff:fe94:4e37
Tue Jul 19 02:50:31 2022: MAC = 44:39:c4:94:4e:37
Tue Jul 19 02:50:31 2022: MAC broadcast = ff:ff:ff:ff:ff:ff
Tue Jul 19 02:50:31 2022: State = UP, RUNNING
Tue Jul 19 02:50:31 2022: Seen up = 1
Tue Jul 19 02:50:31 2022: Delayed state change running = false
Tue Jul 19 02:50:31 2022: Up debounce timer = 0us
Tue Jul 19 02:50:31 2022: Down debounce timer = 0us
Tue Jul 19 02:50:31 2022: MTU = 1500
Tue Jul 19 02:50:31 2022: HW Type = ETHERNET
Tue Jul 19 02:50:31 2022: NIC netlink status update
Tue Jul 19 02:50:31 2022: Reset ARP config counter 1
Tue Jul 19 02:50:31 2022: Original arp_ignore 0
Tue Jul 19 02:50:31 2022: Original arp_filter 0
Tue Jul 19 02:50:31 2022: Original promote_secondaries 0
Tue Jul 19 02:50:31 2022: Reset promote_secondaries counter 0
Tue Jul 19 02:50:31 2022: Tracking VRRP instances :
Tue Jul 19 02:50:31 2022: lbcp, weight 0
Tue Jul 19 02:50:31 2022: Name = virbr0
Tue Jul 19 02:50:31 2022: index = 3
Tue Jul 19 02:50:31 2022: IPv4 address = 192.168.122.1
Tue Jul 19 02:50:31 2022: IPv6 address = (none)
Tue Jul 19 02:50:31 2022: MAC = 52:54:00:6d:a5:8d
Tue Jul 19 02:50:31 2022: MAC broadcast = ff:ff:ff:ff:ff:ff
Tue Jul 19 02:50:31 2022: State = UP, not RUNNING
Tue Jul 19 02:50:31 2022: Seen up = 0
Tue Jul 19 02:50:31 2022: Delayed state change running = false
Tue Jul 19 02:50:31 2022: Up debounce timer = 0us
Tue Jul 19 02:50:31 2022: Down debounce timer = 0us
Tue Jul 19 02:50:31 2022: MTU = 1500
Tue Jul 19 02:50:31 2022: HW Type = ETHERNET
Tue Jul 19 02:50:31 2022: NIC netlink status update
Tue Jul 19 02:50:31 2022: Reset ARP config counter 0
Tue Jul 19 02:50:31 2022: Original arp_ignore 0
Tue Jul 19 02:50:31 2022: Original arp_filter 0
Tue Jul 19 02:50:31 2022: Original promote_secondaries 0
Tue Jul 19 02:50:31 2022: Reset promote_secondaries counter 0
Tue Jul 19 02:50:31 2022: Name = virbr0-nic
Tue Jul 19 02:50:31 2022: index = 4
Tue Jul 19 02:50:31 2022: IPv4 address = (none)
Tue Jul 19 02:50:31 2022: IPv6 address = (none)
Tue Jul 19 02:50:31 2022: MAC = 52:54:00:6d:a5:8d
Tue Jul 19 02:50:31 2022: MAC broadcast = ff:ff:ff:ff:ff:ff
Tue Jul 19 02:50:31 2022: State = not UP, not RUNNING
Tue Jul 19 02:50:31 2022: Seen up = 0
Tue Jul 19 02:50:31 2022: Delayed state change running = false
Tue Jul 19 02:50:31 2022: Up debounce timer = 0us
Tue Jul 19 02:50:31 2022: Down debounce timer = 0us
Tue Jul 19 02:50:31 2022: MTU = 1500
Tue Jul 19 02:50:31 2022: HW Type = ETHERNET
Tue Jul 19 02:50:31 2022: NIC netlink status update
Tue Jul 19 02:50:31 2022: Reset ARP config counter 0
Tue Jul 19 02:50:31 2022: Original arp_ignore 0
Tue Jul 19 02:50:31 2022: Original arp_filter 0
Tue Jul 19 02:50:31 2022: Original promote_secondaries 0
Tue Jul 19 02:50:31 2022: Reset promote_secondaries counter 0
Tue Jul 19 02:50:31 2022: Name = br-ae3c4f6fe047
Tue Jul 19 02:50:31 2022: index = 5
Tue Jul 19 02:50:31 2022: IPv4 address = 172.18.0.1
Tue Jul 19 02:50:31 2022: IPv6 address = (none)
Tue Jul 19 02:50:31 2022: MAC = 02:42:22:ce:0c:2d
Tue Jul 19 02:50:31 2022: MAC broadcast = ff:ff:ff:ff:ff:ff
Tue Jul 19 02:50:31 2022: State = UP, not RUNNING
Tue Jul 19 02:50:31 2022: Seen up = 0
Tue Jul 19 02:50:31 2022: Delayed state change running = false
Tue Jul 19 02:50:31 2022: Up debounce timer = 0us
Tue Jul 19 02:50:31 2022: Down debounce timer = 0us
Tue Jul 19 02:50:31 2022: MTU = 1500
Tue Jul 19 02:50:31 2022: HW Type = ETHERNET
Tue Jul 19 02:50:31 2022: NIC netlink status update
Tue Jul 19 02:50:31 2022: Reset ARP config counter 0
Tue Jul 19 02:50:31 2022: Original arp_ignore 0
Tue Jul 19 02:50:31 2022: Original arp_filter 0
Tue Jul 19 02:50:31 2022: Original promote_secondaries 0
Tue Jul 19 02:50:31 2022: Reset promote_secondaries counter 0
Tue Jul 19 02:50:31 2022: Name = docker0
Tue Jul 19 02:50:31 2022: index = 6
Tue Jul 19 02:50:31 2022: IPv4 address = 172.17.0.1
Tue Jul 19 02:50:31 2022: IPv6 address = (none)
Tue Jul 19 02:50:31 2022: MAC = 02:42:bd:68:8e:ab
Tue Jul 19 02:50:31 2022: MAC broadcast = ff:ff:ff:ff:ff:ff
Tue Jul 19 02:50:31 2022: State = UP, not RUNNING
Tue Jul 19 02:50:31 2022: Seen up = 0
Tue Jul 19 02:50:31 2022: Delayed state change running = false
Tue Jul 19 02:50:31 2022: Up debounce timer = 0us
Tue Jul 19 02:50:31 2022: Down debounce timer = 0us
Tue Jul 19 02:50:31 2022: MTU = 1500
Tue Jul 19 02:50:31 2022: HW Type = ETHERNET
Tue Jul 19 02:50:31 2022: NIC netlink status update
Tue Jul 19 02:50:31 2022: Reset ARP config counter 0
Tue Jul 19 02:50:31 2022: Original arp_ignore 0
Tue Jul 19 02:50:31 2022: Original arp_filter 0
Tue Jul 19 02:50:31 2022: Original promote_secondaries 0
Tue Jul 19 02:50:31 2022: Reset promote_secondaries counter 0
Tue Jul 19 02:50:31 2022: Name = vrrp.181
Tue Jul 19 02:50:31 2022: index = 8
Tue Jul 19 02:50:31 2022: IPv4 address = (none)
Tue Jul 19 02:50:31 2022: IPv6 address = (none)
Tue Jul 19 02:50:31 2022: MAC = 00:00:5e:00:01:b5
Tue Jul 19 02:50:31 2022: MAC broadcast = ff:ff:ff:ff:ff:ff
Tue Jul 19 02:50:31 2022: State = UP, RUNNING
Tue Jul 19 02:50:31 2022: Seen up = 1
Tue Jul 19 02:50:31 2022: Delayed state change running = false
Tue Jul 19 02:50:31 2022: Up debounce timer = 0us
Tue Jul 19 02:50:31 2022: Down debounce timer = 0us
Tue Jul 19 02:50:31 2022: VMAC type private, underlying interface = eno1, state = UP, RUNNING
Tue Jul 19 02:50:31 2022: I/f created by keepalived
Tue Jul 19 02:50:31 2022: MTU = 1500
Tue Jul 19 02:50:31 2022: HW Type = ETHERNET
Tue Jul 19 02:50:31 2022: NIC netlink status update
Tue Jul 19 02:50:31 2022: Reset ARP config counter 0
Tue Jul 19 02:50:31 2022: Original arp_ignore 0
Tue Jul 19 02:50:31 2022: Original arp_filter 0
Tue Jul 19 02:50:31 2022: Original promote_secondaries 0
Tue Jul 19 02:50:31 2022: Reset promote_secondaries counter 0
Tue Jul 19 02:50:31 2022: Tracking VRRP instances :
Tue Jul 19 02:50:31 2022: lbcp, weight 0
Tue Jul 19 02:50:31 2022: (lbcp) Entering BACKUP STATE (init)
Tue Jul 19 02:50:31 2022: VRRP sockpool: [ifindex( 8), family(IPv4), proto(112), fd(11,12) multicast, address(224.0.0.18)]
Tue Jul 19 02:50:31 2022: (lbcp) Receive advertisement timeout
Tue Jul 19 02:50:31 2022: (lbcp) Entering MASTER STATE
Tue Jul 19 02:50:31 2022: (lbcp) setting VIPs.
Tue Jul 19 02:50:31 2022: (lbcp) Sending/queueing gratuitous ARPs on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:31 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:31 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:31 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:31 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:31 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:36 2022: (lbcp) Sending/queueing gratuitous ARPs on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:36 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:36 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:36 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:36 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17
Tue Jul 19 02:50:36 2022: Sending gratuitous ARP on vrrp.181 for 176.16.167.17

Did keepalived coredump?
without coredump

Additional context
When I unplug the cable, the VMAC interface is in 'lowerlayerdown' state with v2.0.20, but in 'down' state with v2.2.7.

@pqarmitage
Collaborator

Do you experience the problem with v2.2.7 if you do NOT specify use_vmac_addr? Since the VIP is configured on the same interface as the vrrp instance, I don't think there is any need to specify use_vmac_addr, since the VIP will be configured on the VMAC anyway.

If this needs to be investigated further can you please post the output of ip -d link show when eno1 is unplugged and after it is plugged in again, for both v2.0.20 and v2.2.7. Can you please also provide the log output from before eno1 is unplugged until after it is plugged in again, for both versions of keepalived.

@smithAchang
Author

I must configure the 'use_vmac_addr' option because I may have many VIPs on various interfaces.

But in my test case, I only configure one VIP :)

The test result is the same when I remove the 'use_vmac_addr' option for keepalived 2.2.7.

By the way, keepalived 2.0.20 does not support the 'use_vmac_addr' option.

keepalived.zip

@pqarmitage
Collaborator

I have tried your configuration using keepalived v2.2.7 and unfortunately I cannot reproduce the problem.

I see in 2.2.7_iplinkshow.txt that vrrp.181 remains in down state after the cable is plugged in again, hence keepalived does not log that Netlink reports vrrp.181 is up, and the VRRP instance remains in fault state.

The only way I can see to progress this is to capture the netlink messages and see what is happening.

Can you please do the following (as root):

ip link add  nlmon0 type nlmon
ip link set dev nlmon0 up

And then, before keepalived starts:
tcpdump -i nlmon0 -w netlink-{2.0.20|2.2.7}.pcap
and run that until you terminate keepalived.

If you can send the pcap files, and also the matching keepalived logs, I will have a look to see if there is any difference.

At the end, if you execute ip link del nlmon0 your system will be back in its original state.

@smithAchang
Author

The operations are the same as before, and so is the result.

netlink_reports.zip

@pqarmitage
Collaborator

@smithAchang I need the logs to see the times at which things happened, so that I can work out which are the associated netlink messages.

@smithAchang
Author

Sorry, I have been on vacation for the past few days.
PcapAndLog.zip

@pqarmitage
Collaborator

@smithAchang Apologies for the delay in looking at this.

The reason that lbcp remains in fault state in the v2.2.7 logs is that netlink reports at 02:29:54 both eno1 and vrrp.181 going down. At 02:30:21 eno1 is reported coming up but there is no report of vrrp.181 coming up, and so lbcp remains in fault state since vrrp.181 interface is still down.

With 2.0.20 there are reports that both eno1 and vrrp.181 go down, but at 02:27:06 both eno1 and vrrp.181 are reported coming up, and so lbcp can exit fault state.

In the netlink pcap files, for v2.0.20 frame 94 reports eno1 down and frame 96 reports vrrp.181 lower layer down. Frame 171 reports eno1 up and frame 173 reports vrrp.181 up.

With v2.2.7 frame 94 reports eno1 down and frame 96 reports vrrp.181 lower layer down. Then 5 seconds later frame 214 reports vrrp.181 (?administratively) down. Frame 236 reports eno1 up, but there is no message reporting vrrp.181 up.

So the question is, in the v2.2.7 case, why is vrrp.181 going into a down operstate (and the device flags in the netlink message also clear the UP bit)? With v2.2.7 if you unplug the cable and then plug it back in immediately, does lbcp go to backup and then master state (i.e. eno1 comes back up before the operstate Down message 5 seconds after vrrp.181 goes into lower layer down state)?

Since I do not experience the problem with v2.2.7 it seems unlikely that keepalived is sending the link down message for vrrp.181, so the question must be where is it coming from? I presume that, after the cable is plugged back in and vrrp.181 is still down, if you execute ip link set vrrp.181 up that lbcp then recovers to backup and master state.

The only thing I can think of doing is if keepalived receives a link down message for one of its VMAC interfaces, then it sends a link up message. I don't like this approach since it is working around a problem rather than fixing a problem. Please let me know what you think.
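
The distinction drawn above between a VMAC that is only in lower layer down state and one that has been set administratively down can be checked from the interface flags. The following is a minimal sketch, assuming it is given one line of `ip -o link show` output; the function name `link_admin_state` is hypothetical and is not part of keepalived or iproute2.

```shell
# Report whether an interface is administratively up, given one line of
# `ip -o link show dev <ifname>` output as the first argument.
# An interface that has only lost carrier keeps the UP flag; an interface
# that was set down (by an administrator or another daemon) loses it.
link_admin_state() {
    flags=${1#*<}        # drop everything up to the first '<'
    flags=${flags%%>*}   # drop everything from the first '>' onwards
    case ",$flags," in
        *,UP,*) echo up ;;    # exact UP flag, not LOWER_UP
        *)      echo down ;;
    esac
}

# Example (needs the interface to exist):
#   link_admin_state "$(ip -o link show dev vrrp.181)"
```

If this prints down after the cable has been plugged back in, something other than carrier loss has cleared the UP flag, which is the case where `ip link set vrrp.181 up` would be expected to let the instance leave fault state.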

@smithAchang
Author

yes

  1. When I plug the cable back in before the 5 s timeout, v2.2.7 recovers from the fault state; v2.0.20 recovers no matter how long the cable has been unplugged.
  2. When v2.2.7 cannot recover from the fault state and I manually set the link state to up, v2.2.7 then recovers from the fault state.
  3. Looking at /var/log/messages, I found the key phrase 'deferring action for 4 seconds'.

--------------------------------- v2.2.7
Aug 21 21:05:01 CalttaTrunk_TestKWGo systemd: Started Session 27 of user root.
Aug 21 21:05:01 CalttaTrunk_TestKWGo crond: sendmail: Cannot open mail:25
Aug 21 21:05:13 CalttaTrunk_TestKWGo kernel: e1000e: eno1 NIC Link is Down

Aug 21 21:05:13 CalttaTrunk_TestKWGo NetworkManager[941]: (eno1): link disconnected (deferring action for 4 seconds)
Aug 21 21:05:13 CalttaTrunk_TestKWGo NetworkManager[941]: (vrrp.181): link disconnected (deferring action for 4 seconds)
Aug 21 21:05:18 CalttaTrunk_TestKWGo NetworkManager[941]: (eno1): link disconnected (calling deferred action)
Aug 21 21:05:18 CalttaTrunk_TestKWGo NetworkManager[941]: (vrrp.181): link disconnected (calling deferred action)

Aug 21 21:05:18 CalttaTrunk_TestKWGo NetworkManager[941]: (eno1): device state change: activated -> unavailable (reason 'carrier-changed') [100 20 40]
Aug 21 21:05:18 CalttaTrunk_TestKWGo NetworkManager[941]: NetworkManager state is now CONNECTED_LOCAL
Aug 21 21:05:18 CalttaTrunk_TestKWGo dbus[951]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Aug 21 21:05:18 CalttaTrunk_TestKWGo NetworkManager[941]: (vrrp.181): device state change: activated -> unavailable (reason 'carrier-changed') [100 20 40]
Aug 21 21:05:18 CalttaTrunk_TestKWGo systemd: Starting Network Manager Script Dispatcher Service...
Aug 21 21:05:18 CalttaTrunk_TestKWGo dbus[951]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Aug 21 21:05:18 CalttaTrunk_TestKWGo systemd: Started Network Manager Script Dispatcher Service.
Aug 21 21:05:18 CalttaTrunk_TestKWGo nm-dispatcher: Dispatching action 'down' for eno1
Aug 21 21:05:18 CalttaTrunk_TestKWGo nm-dispatcher: Dispatching action 'down' for vrrp.181
Aug 21 21:05:18 CalttaTrunk_TestKWGo NetworkManager[941]: (vrrp.181): device state change: unavailable -> unmanaged (reason 'none') [20 10 0]
Aug 21 21:05:30 CalttaTrunk_TestKWGo PackageKit: refresh-cache transaction /1887_cdbccabe from uid 0 finished with failed after 120126ms

----------------------------- v2.0.20

Aug 21 21:09:10 CalttaTrunk_TestKWGo kernel: e1000e: eno1 NIC Link is Down

Aug 21 21:09:10 CalttaTrunk_TestKWGo NetworkManager[941]: (eno1): link disconnected (deferring action for 4 seconds)
Aug 21 21:09:10 CalttaTrunk_TestKWGo NetworkManager[941]: (vrrp.181): link disconnected
Aug 21 21:09:14 CalttaTrunk_TestKWGo NetworkManager[941]: (eno1): link disconnected (calling deferred action)

Aug 21 21:09:14 CalttaTrunk_TestKWGo NetworkManager[941]: (eno1): device state change: activated -> unavailable (reason 'carrier-changed') [100 20 40]
Aug 21 21:09:14 CalttaTrunk_TestKWGo NetworkManager[941]: NetworkManager state is now CONNECTED_LOCAL
Aug 21 21:09:14 CalttaTrunk_TestKWGo dbus[951]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Aug 21 21:09:14 CalttaTrunk_TestKWGo systemd: Starting Network Manager Script Dispatcher Service...
Aug 21 21:09:14 CalttaTrunk_TestKWGo dbus[951]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Aug 21 21:09:14 CalttaTrunk_TestKWGo systemd: Started Network Manager Script Dispatcher Service.
Aug 21 21:09:14 CalttaTrunk_TestKWGo nm-dispatcher: Dispatching action 'down' for eno1

The host, the environment and the test operations are all the same, so the input is identical, yet the output differs.

Are there any differences in how the VMAC interface is created, such as different flags or options?

@pqarmitage
Collaborator

OK, that's very helpful. We now know that it is NetworkManager that is causing the problem; as you say we now need to identify why, and what is different between keepalived v2.0.20 and v2.2.7.

Looking at the output of git log --oneline v2.0.20..v2.2.7 keepalived/vrrp/vrrp_vmac.c I cannot see any obvious commit that changes the VMAC creation code.

With v2.2.7, before you unplug the cable but after keepalived is running, can you execute nmcli device set vrrp.181 managed no and see if that stops the problem occurring.

What would be most helpful is if you could do a git bisect to build various versions of keepalived to identify which commit triggered the problem; it should take no more than 10 iterations to identify the commit.

@smithAchang
Author

good

I think some difference in keepalived is what leads to the error: the test environment is the same and the NetworkManager component is unchanged.

I have tested the same case with keepalived v2.1.5, and it recovers from the fault state after the network cable is unplugged and plugged back in.

But keepalived v2.2.0 cannot recover.

Also, I have tested keepalived v2.2.7 on my Ubuntu 20.04 host, where it also recovers from the fault state.

The bug may be caused by some fragile code.

bin/keepalived -v
Keepalived v2.2.0 (01/09,2021)

Copyright(C) 2001-2021 Alexandre Cassen, <acassen@gmail.com>

Built with kernel headers for Linux 3.10.0
Running on Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
Distro: CentOS Linux 7 (Core)

configure options:

Config options:  LVS VRRP VRRP_AUTH OLD_CHKSUM_COMPAT FIB_ROUTING

System options:  PIPE2 SIGNALFD INOTIFY_INIT1 VSYSLOG EPOLL_CREATE1 IPV6_ADVANCED_API RTAX_CC_ALGO RTAX_QUICKACK FRA_OIFNAME IFA_FLAGS IP_MULTICAST_ALL NET_LINUX_IF_H_COLLISION NET_LINUX_IF_ETHER_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION VRRP_VMAC IFLA_LINK_NETNSID CN_PROC SOCK_NONBLOCK SOCK_CLOEXEC O_PATH GLOB_BRACE INET6_ADDR_GEN_MODE SO_MARK SCHED_RESET_ON_FORK

On my host, I cannot execute nmcli device set vrrp.181 managed no:

 nmcli device help
Usage: nmcli device { COMMAND | help }

COMMAND := { status | show | connect | disconnect | delete | wifi }

  status

  show [<ifname>]

  connect <ifname>

  disconnect <ifname> ...

  delete <ifname> ...

  wifi [list [ifname <ifname>] [bssid <BSSID>]]

  wifi connect <(B)SSID> [password <password>] [wep-key-type key|phrase] [ifname <ifname>]
                         [bssid <BSSID>] [name <name>] [private yes|no] [hidden yes|no]

  wifi rescan [ifname <ifname>] [[ssid <SSID to scan>] ...]

@smithAchang
Author

By the way, I also added a macvlan device manually on the test host:

ip link add link eno1 name macvlan1 type macvlan
ip link set macvlan1 up

When I unplugged the cable, the macvlan1 link state was LOWERLAYERDOWN and the NetworkManager log had no 'deferring action for 4 seconds' entry for the macvlan1 link-disconnected event.

Maybe it really is the macvlan creation process in keepalived v2.2.7 that leads to the bug.

Please check in detail, thanks :)

@pqarmitage
Collaborator

@smithAchang I am quite happy to accept that it is something that keepalived is doing that is causing NetworkManager to down vrrp.181, but the problem is identifying what.

On my CentOS 7 VM, nmcli does support the nmcli device set IF managed no command; I therefore suspect that your version of NetworkManager is out of date (mine is NetworkManager-1.18.8-2.el7_9). Can you update your version of NetworkManager? Who knows, it might even stop NetworkManager downing vrrp.181.

As I understand it, you have now narrowed it down: the problem does not exist in keepalived v2.1.5 but does exist in keepalived v2.2.0. I have looked at the one-line summaries of all the commits between the two versions and nothing stands out as changing the way macvlans are created.

I think the only way we can track down what commit caused the problem, and hence find a solution, is if you do a git bisect of the keepalived code between versions 2.1.5 and 2.2.0, and then try each version of keepalived produced to see if it exhibits the problem. It should take no more than 8 iterations to find the relevant commit. Normally I would do this, but since I cannot reproduce the problem on this occasion I can't do the git bisect.

Once you have found the commit that breaks keepalived then we can work on a solution.
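
The bisect requested above could be driven roughly as follows. This is a sketch rather than a tested procedure for this repository: the helper names `bisect_steps` and `start_bisect` are hypothetical, and the build and the cable test at each step still have to be done by hand. `bisect_steps` only illustrates why the iteration count is small: each step halves the candidate range, so 8 steps cover about 256 commits.

```shell
# Number of halving steps needed to narrow n candidate commits down to one.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Begin a bisect; git expects the bad revision first, then the good one.
start_bisect() {
    good=$1
    bad=$2
    git bisect start "$bad" "$good"
}

# Typical session, run inside a keepalived checkout:
#   start_bisect v2.1.5 v2.2.0
#   ...build, run the unplug/replug test, then one of:
#   git bisect good      # instance recovered
#   git bisect bad       # instance stayed in fault state
#   git bisect skip      # revision does not compile
#   git bisect reset     # when finished
```

`git bisect skip` is worth knowing about here, since intermediate revisions of a project do not always build cleanly.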

@smithAchang
Author

OK :)

On my host, I have found the revision range:

cc6b6e5..3dcd13c

3dcd13c -- test fail
716661d -- compile error
9bd1897 -- compile error
cc6b6e5 -- test ok
630f813 -- test ok

@smithAchang
Author

If I fix the compile error, the revision range narrows to:
3dcd13c -- test fail
716661d -- compile error --> test ok

@smithAchang
Author

If I reset vrrp.c to 716661d and fix the simple compile error, the test passes.

If I merge all of the vrrp.c code from commit 3dcd13c but exclude lines 3490~3543, the test also passes.

@smithAchang
Author

I upgraded the NetworkManager component to v1.18 and repeated the operations; the test still failed.

If I execute nmcli device set vrrp.181 managed no before unplugging the cable, the test passes.

The keepalived build is based on commit 3dcd13c.

@pqarmitage
Collaborator

pqarmitage commented Aug 25, 2022

@smithAchang You say above that if you exclude lines 3490~3543, it is OK. Can you please clarify what these lines are? It would be most helpful if you could submit a patch with the change you have made, and also state which commit it is based on.

@pqarmitage
Collaborator

@smithAchang I see now that lines 3490~3543 are in vrrp.c (I hadn't quite read your comment correctly).

With lines 3490~3543 still in the code (i.e. a non working version of keepalived), can you please try 2 configuration changes, one at a time, to see if they make any difference:

  1. Remove use_vmac_addr
  2. Don't specify dev eno1 after the VIP.

Both of the above are unnecessary in the configuration you have above.
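
For reference, the configuration from the original report with those two items removed would look like the following. It is written as a small shell helper that prints the file (the name `print_minimal_conf` is hypothetical) so that it can be redirected to a test configuration. It shows both changes applied together; to test them one at a time as requested, put one of the two items back. Everything else is copied unchanged from the report.

```shell
# Print the reporter's configuration without use_vmac_addr and without
# an explicit "dev" on the VIP.
print_minimal_conf() {
    cat <<'EOF'
global_defs {
    vrrp_version 3
}

vrrp_instance lbcp {
    state BACKUP
    advert_int 0.1
    interface eno1
    priority 160
    virtual_router_id 181
    nopreempt
    use_vmac

    virtual_ipaddress {
        176.16.167.17/24
    }
}
EOF
}

# Example: print_minimal_conf > test.conf
```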

I have run commit #3dcd13c with your configuration and in vrrp.c at line 3506 added
log_message(LOG_INFO, "if_sorted %d if %s", if_sorted, ip_addr->ifp->ifname);
This logs that if_sorted is 1, and so I don't see how the code from lines 3490~3543 makes any difference.

@pqarmitage
Collaborator

From my tests, without lines 3490~3543, 176.16.167.17/24 is configured on eno1; with lines 3490~3543, the address is configured on vrrp.181.

With keepalived v2.0.20 the VIP is configured on eno1, from commit #3dcd13c onwards the VIP is configured on vrrp.181. If I remove the dev eno1 from VIP, then all versions configure the VIP on vrrp.181.

The question is, which interface do you actually want the VIP to be configured on? If it is configured on eno1, then there is no point in specifying use_vmac. So although commit #3dcd13c changes keepalived to effectively ignore dev IF where IF is the interface of the vrrp instance, if use_vmac is also specified, then which interface should be used for the VIP? I see a lot of configurations where people completely unnecessarily specify dev IF where IF is the interface of the vrrp instance, and if use_vmac is specified, I am not convinced that the VIP should not be configured on the VMAC. On the other hand, there should be a way to configure the VIP to be on the configured interface rather than the VMAC.

So the main thing is that we have found the difference. The next issue is why NetworkManager downs the macvlan interface if an IP address is configured on it, but doesn't down the interface if there is no IP address.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/configuring-networkmanager-to-ignore-certain-devices_configuring-and-managing-networking gives a way of making an interface unmanaged. Essentially you want a file /etc/NetworkManager/conf.d/99-unmanaged-devices.conf with the following content:

[keyfile]
unmanaged-devices=interface-name:vrrp.181

Whether this works if the interface doesn't exist when NetworkManager starts up I don't know.

An alternative is

[keyfile]
unmanaged-devices=type:macvlan

which will make all macvlans unmanaged.
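For anyone who wants to script the first variant, here is a minimal sketch that writes the drop-in file. write_unmanaged_conf is a hypothetical helper, and the optional second argument (the target directory) exists only so the function can be tried against a scratch directory first. NetworkManager still has to reload its configuration afterwards, which is not done here.

```shell
# Hypothetical helper: write a NetworkManager drop-in that marks one
# interface as unmanaged. Not part of keepalived or NetworkManager.
#   $1  interface name, e.g. vrrp.181
#   $2  config directory (assumed default: /etc/NetworkManager/conf.d)
write_unmanaged_conf() {
    local ifname="$1"
    local confdir="${2:-/etc/NetworkManager/conf.d}"
    if [ -z "$ifname" ]; then
        echo "usage: write_unmanaged_conf IFNAME [CONFDIR]" >&2
        return 2
    fi
    # Same two lines as the keyfile snippet quoted in this comment.
    printf '[keyfile]\nunmanaged-devices=interface-name:%s\n' "$ifname" \
        > "$confdir/99-unmanaged-devices.conf"
}
```

On a real host this would be run as root, for example write_unmanaged_conf vrrp.181, followed by a reload of the NetworkManager service.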

What I will look at doing is modifying the keepalived code so that it calls NetworkManager to set the macvlans it creates to be unmanaged by NetworkManager. I think this is the right thing to do regardless of any of the other points above.

@smithAchang
Author

My test config is simplified.

In production I have multiple NICs and may have multiple VIPs with different VMACs,
so I need to configure 'use_vmac_addr' globally and set 'vip/prefix dev physical_NIC_name' for each VIP.

The config file is generated by initialization code in our release, and that code does not treat VIPs that may end up on the vrrp interface specially.

@pqarmitage
Collaborator

Could you try your configuration without the dev eno1 against the VIP with keepalived v2.0.20? The VIP should then be put on vrrp.181 and I expect that you will see the problem that you have been seeing with v2.2.7.

@smithAchang
Author

OK, I will test your config next week!
Thanks :)

@pqarmitage
Collaborator

@smithAchang Attached is a patch which is prepared against the current HEAD of the master branch, but also applies to v2.2.7.

You will need to install package NetworkManager-libnm-devel, since the patch uses the NetworkManager libnm library.

If you use your original configuration, and without any updates to the NetworkManager configuration, then I believe with the patched code your problem should be resolved. What the patch does is make keepalived tell NetworkManager that the macvlan interfaces it creates are to be unmanaged by NetworkManager (the same as executing nmcli device set vrrp.181 managed off).
010b-set-vmac-unmanaged.patch.txt

@smithAchang
Author

very sorry :(

I can not reproduce the bug ...

Maybe it is because the host had its NetworkManager component upgraded and was powered off last week; today I powered the machine on and some factor has changed!

removing the 'dev eno1' option

  • both v2.0.20 and v2.1.5 can recover from the fault state; the VIP is placed on the vrrp.181 macvlan, not on the eno1 interface as before.
  • with code built from commit v2.1.5-130-g3dcd13c, keepalived can recover from the fault state; the VIP is placed on the vrrp.181 macvlan as before
  • v2.2.7 is ok too

adding the 'dev eno1' option again

  • v2.1.5-130-g3dcd13c is ok
  • v2.2.7 is ok too

additions

I use the 'dev xxx' option to tell keepalived to place the VIP on a specific physical NIC, even when I use the VMAC option.

I can see multiple vrrp.[id]@[interface] macvlans on a host that has multiple NICs.

So my question is: how can I configure VIPs to be placed on different physical NICs? Thanks :)

Is my configuration approach OK?

@pqarmitage
Collaborator

@smithAchang You have previously said that you upgraded NetworkManager in order to get nmcli device set IF managed off, so it sounds as though the problem was with your old NetworkManager not setting macvlans as unmanaged, whereas it now does so. I think if you run nmcli device status you will now see that the macvlans are set as unmanaged. You could try executing nmcli device set vrrp.181 managed on and then see if the problem comes back when you unplug the cable.
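To check the managed state from a script rather than by reading the table, something like the following sketch could be used. It assumes nmcli's terse output (-t) restricted to the DEVICE and STATE fields, which prints one DEVICE:STATE pair per line. device_state is a hypothetical helper that only parses that text, so it can be tried without NetworkManager present.

```shell
# Hypothetical helper: read "DEVICE:STATE" lines on stdin and print the
# state of the device named in $1. Returns non-zero if the device is absent.
device_state() {
    awk -F: -v dev="$1" '
        $1 == dev { print $2; found = 1 }
        END       { exit !found }
    '
}

# Intended use on the keepalived host (needs nmcli, so left commented out):
#   nmcli -t -f DEVICE,STATE device status | device_state vrrp.181
```

Comparing the result before and after unplugging the cable, with the interface set to managed and then to unmanaged, would show whether NetworkManager's handling of the macvlan is what makes the difference.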

Do you know what version of NetworkManager you were running before you upgraded it last week? If you don't know, yum history followed by yum history info ID, where ID is the relevant one from the yum history output, will tell you.

@pqarmitage
Collaborator

If you want the VIPs to be configured on the physical interfaces rather than macvlan interfaces, then remove the keyword use_vmac_addr. The disadvantage of this is that when a backup takes over as master, the MAC addresses associated with the VIPs will change; the whole purpose of using macvlans is so that the MAC addresses do not change. Unless you have good reason not to, I would recommend keeping use_vmac_addr so that the VIPs configured on other interfaces do not change the MAC addresses when a backup takes over as master.

At the moment, following commit 3dcd13c, even if you specify dev IF for a VIP and the VRRP instance is configured to use a VMAC (either on the interface of the VRRP instance, or using use_vmac_addr), keepalived will configure the VIP on the VMAC rather than the base interface. I think this is probably the right way to do it since if you specify use_vmac_addr and want the VIP on a different interface, you can't specify the VMAC name since keepalived will create that itself. What you can do is not specify use_vmac_addr and, if you do want some of the VIPs on VMACs, then add use_vmac against the VIP itself.

@pqarmitage
Collaborator

I have now merged commit 6211ead which adds the code for setting macvlans as unmanaged by NetworkManager, but it is disabled by default and only enabled by the configure option --enable-nm. The logic for not enabling it by default is that the problem only occurs with quite old versions of NetworkManager.

@smithAchang Are you happy for this issue to be closed now?

@smithAchang
Author

smithAchang commented Aug 30, 2022

It is OK to close the issue! Thank you very much :)

Before the upgrade, the NetworkManager version was NetworkManager-1:1.0.6
'
yum history info 33
Loaded plugins: auto-update-debuginfo
Repository epel is listed more than once in the configuration
Repository epel-debuginfo is listed more than once in the configuration
Repository epel-source is listed more than once in the configuration
Transaction ID : 33
Begin time : Wed Aug 24 23:43:55 2022
Begin rpmdb : 1764:74f95ea1aeb001876dbdb135beb5398fe209eab7
End time : 23:44:17 2022 (22 seconds)
End rpmdb : 1765:a70c479fa53b78f08331d6c3bc2069379e07f023
User : root
Return-Code : Success
Command Line : --disablerepo=* --enablerepo=c7-media upgrade NetworkManager
Transaction performed with:
Installed rpm-4.11.3-43.el7.x86_64 @base
Installed yum-3.4.3-167.el7.centos.noarch @base
Installed yum-plugin-auto-update-debug-info-1.1.31-54.el7_8.noarch @updates
Packages Altered:
Updated NetworkManager-1:1.0.6-27.el7.x86_64 @anaconda
Obsoleted NetworkManager-1:1.0.6-27.el7.x86_64 @anaconda
Obsoleting NetworkManager-1:1.18.8-1.el7.x86_64 @c7-media
Updated NetworkManager-adsl-1:1.0.6-27.el7.x86_64 @anaconda
Update 1:1.18.8-1.el7.x86_64 @c7-media
Updated NetworkManager-glib-1:1.0.6-27.el7.x86_64 @anaconda
Update 1:1.18.8-1.el7.x86_64 @c7-media
Updated NetworkManager-libnm-1:1.0.6-27.el7.x86_64 @anaconda
Update 1:1.18.8-1.el7.x86_64 @c7-media
Obsoleting NetworkManager-ppp-1:1.18.8-1.el7.x86_64 @c7-media
Updated NetworkManager-team-1:1.0.6-27.el7.x86_64 @anaconda
Update 1:1.18.8-1.el7.x86_64 @c7-media
Updated NetworkManager-tui-1:1.0.6-27.el7.x86_64 @anaconda
Update 1:1.18.8-1.el7.x86_64 @c7-media
Scriptlet output:
1 Created symlink from /etc/systemd/system/multi-user.target.wants/NetworkManager-wait-online.service to /usr/lib/systemd/system/NetworkManager-wait-online.service.
history info
'

Before I run keepalived, 'nmcli device status' shows no macvlan.
After I start keepalived v2.2.7 but before unplugging the cable, 'nmcli device status' shows the vrrp.181 macvlan as connected.

When the cable is unplugged, 'nmcli device status' shows the vrrp.181 macvlan as unmanaged.

The test is OK too.
