VRRP child process() died: Respawning #390

Closed
jslocinski opened this issue Aug 5, 2016 · 3 comments

jslocinski commented Aug 5, 2016

Hey,
I've hit a problem with keepalived when running many instances (each instance on a separate VLAN subinterface).
I am trying to run about 50-200 instances, which results in SIGABRT in the VRRP child process; the threshold at which it crashes depends on the configuration.

Version: v1.2.22 commit gbde660e

  1. running 100 instances (no vmac) - all fine
  2. running 150 instances (no-vmac) - crash
  3. running 100 instances (vmac) - crash

In logs I see:
Keepalived_vrrp[]: Netlink: Received message overrun (No buffer space available)

This looks like a memory issue, so I tried increasing the buffers system-wide via sysctl, but that didn't help. Logs and backtraces are below. It happens whether advert_int is 1s, 30s, or more; the crash just comes later (normally at the time the GARPs are sent).

150 instances, no-vmac:
Aug 5 12:47:35 test1 Keepalived[89733]: Starting Keepalived v1.2.22 (06/28,2016), git commit v1.2.22-15-gbde660e
Aug 5 12:47:35 test1 Keepalived[89734]: Starting Healthcheck child process, pid=89736
Aug 5 12:47:35 test1 Keepalived[89734]: Starting VRRP child process, pid=89737
Aug 5 12:47:35 test1 Keepalived_healthcheckers[89736]: Registering Kernel netlink reflector
Aug 5 12:47:35 test1 Keepalived_healthcheckers[89736]: Registering Kernel netlink command channel
Aug 5 12:47:35 test1 Keepalived_healthcheckers[89736]: Opening file '/run/conf/keepalived.conf'.
Aug 5 12:47:35 test1 Keepalived_vrrp[89737]: Registering Kernel netlink reflector
Aug 5 12:47:35 test1 Keepalived_vrrp[89737]: Registering Kernel netlink command channel
Aug 5 12:47:35 test1 Keepalived_vrrp[89737]: Registering gratuitous ARP shared channel
Aug 5 12:47:35 test1 Keepalived_vrrp[89737]: Opening file '/run/conf/keepalived.conf'.
Aug 5 12:47:35 test1 Keepalived_healthcheckers[89736]: Using LinkWatch kernel netlink reflector...
Aug 5 12:47:35 test1 Keepalived_vrrp[89737]: Using LinkWatch kernel netlink reflector...
Aug 5 12:47:35 test1 Keepalived[89734]: pid 89737 exited due to signal 6
Aug 5 12:47:35 test1 Keepalived[89734]: VRRP child process(89737) died: Respawning
Aug 5 12:47:35 test1 Keepalived[89734]: Starting VRRP child process, pid=89761
Aug 5 12:47:35 test1 Keepalived_vrrp[89761]: Registering Kernel netlink reflector
Aug 5 12:47:35 test1 Keepalived_vrrp[89761]: Registering Kernel netlink command channel
Aug 5 12:47:35 test1 Keepalived_vrrp[89761]: Registering gratuitous ARP shared channel
Aug 5 12:47:35 test1 Keepalived_vrrp[89761]: Opening file '/run/conf/keepalived.conf'.
Aug 5 12:47:35 test1 Keepalived_vrrp[89761]: Using LinkWatch kernel netlink reflector...
Aug 5 12:47:35 test1 Keepalived[89734]: pid 89761 exited due to signal 6
Aug 5 12:47:35 test1 Keepalived[89734]: VRRP child process(89761) died: Respawning
Aug 5 12:47:35 test1 Keepalived[89734]: Starting VRRP child process, pid=89770
(...)

Core was generated by `/usr/sbin/keepalived -D --core-dump -f /run/conf/keepalived.conf'.
Program terminated with signal 6, Aborted.
#0 0x00007f122823a125 in raise () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) bt
#0 0x00007f122823a125 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f122823d325 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f1228233311 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x0000000000430b20 in keepalived_malloc ()
#4 0x00000000004336ad in list_add ()
#5 0x0000000000423861 in vrrp_dispatcher_init ()
#6 0x000000000043360d in launch_scheduler ()
#7 0x000000000041aee9 in start_vrrp_child ()
#8 0x000000000041afe6 in ?? ()
#9 0x000000000043360d in launch_scheduler ()
#10 0x000000000040ab23 in main ()

100 instances, vmac:
Aug 5 12:58:53 test1 Keepalived[105533]: Starting Keepalived v1.2.22 (06/28,2016), git commit v1.2.22-15-gbde660e
Aug 5 12:58:53 test1 Keepalived[105534]: Starting Healthcheck child process, pid=105535
Aug 5 12:58:53 test1 Keepalived[105534]: Starting VRRP child process, pid=105536
...
Aug 5 12:58:56 test1 Keepalived_vrrp[105536]: VRRP_Instance(41) Entering BACKUP STATE
..
Aug 5 12:58:56 test1 Keepalived_vrrp[105536]: VRRP_Instance(4100) Entering BACKUP STATE
Aug 5 12:58:56 test1 Keepalived_vrrp[105536]: VRRP sockpool: [ifindex(3112), proto(112), unicast(0), fd(11,12)]
Aug 5 12:58:56 test1 Keepalived_vrrp[105536]: VRRP sockpool: [ifindex(3113), proto(112), unicast(0), fd(13,14)]
Aug 5 12:58:56 test1 Keepalived_vrrp[105536]: VRRP sockpool: [ifindex(3114), proto(112), unicast(0), fd(15,16)]
Aug 5 12:58:56 test1 Keepalived_vrrp[105536]: VRRP sockpool: [ifindex(3115), proto(112), unicast(0), fd(17,18)]
...
Aug 5 12:58:56 test1 Keepalived_vrrp[105536]: Netlink: Received message overrun (No buffer space available)
...
Aug 5 12:59:00 test1 Keepalived_vrrp[105536]: VRRP_Instance(414) Transition to MASTER STATE
Aug 5 12:59:00 test1 Keepalived_vrrp[105536]: VRRP_Instance(49) Transition to MASTER STATE
...
Aug 5 12:59:01 test1 Keepalived_vrrp[105536]: VRRP_Instance(414) Entering MASTER STATE
Aug 5 12:59:01 test1 Keepalived_healthcheckers[105535]: Netlink reflector reports IP 169.254.14.254 added
...
Aug 5 12:59:01 test1 Keepalived_vrrp[105536]: Sending gratuitous ARP on vrrp.64.1 for 169.254.64.254
Aug 5 12:59:01 test1 Keepalived_vrrp[105536]: Sending gratuitous ARP on vrrp.64.1 for 169.254.64.254
Aug 5 12:59:01 test1 Keepalived_vrrp[105536]: Sending gratuitous ARP on vrrp.64.1 for 169.254.64.254
Aug 5 12:59:01 test1 Keepalived[105534]: pid 10553 exited due to signal 6
Aug 5 12:59:01 test1 Keepalived[105534]: VRRP child process(105536) died: Respawning
Aug 5 12:59:01 test1 Keepalived[105534]: Starting VRRP child process, pid=106399
...
Aug 5 12:59:01 test1 Keepalived_vrrp[106399]: Netlink reflector reports IP 91.121.124.15 added
Aug 5 12:59:01 test1 Keepalived_vrrp[106399]: Netlink reflector reports IP 37.187.231.159 added
Aug 5 12:59:01 test1 Keepalived_vrrp[106399]: Netlink reflector reports IP 169.254.1.254 added
Aug 5 12:59:01 test1 Keepalived_vrrp[106399]: Netlink reflector reports IP 169.254.2.254 added
...
Aug 5 12:59:06 test1 Keepalived_vrrp[106399]: Sending gratuitous ARP on vrrp.92.1 for 169.254.92.254
Aug 5 12:59:06 test1 Keepalived_vrrp[106399]: Sending gratuitous ARP on vrrp.92.1 for 169.254.92.254
Aug 5 12:59:06 test1 Keepalived[105534]: pid 10639 exited due to signal 6
Aug 5 12:59:06 test1 Keepalived[105534]: VRRP child process(106399) died: Respawning
Aug 5 12:59:06 test1 Keepalived[105534]: Starting VRRP child process, pid=106511
...

Core was generated by `/usr/sbin/keepalived -D --core-dump -f /run/conf/keepalived.conf'.
Program terminated with signal 6, Aborted.
#0 0x00007fcc27571125 in raise () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) bt
#0 0x00007fcc27571125 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fcc27574325 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fcc2756a311 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x0000000000430b20 in keepalived_malloc ()
#4 0x0000000000431aa1 in ?? ()
#5 0x000000000043206d in thread_add_read ()
#6 0x00000000004235c6 in ?? ()
#7 0x000000000043360d in launch_scheduler ()
#8 0x000000000041aee9 in start_vrrp_child ()
#9 0x000000000041afe6 in ?? ()
#10 0x000000000043360d in launch_scheduler ()
#11 0x000000000040ab23 in main ()

(gdb) quit

[root@test1 ~] stat /core
Modify: 2016-08-05 12:59:11.289464114 +0200

Increasing OS rmem via sysctl didn't help.

Keepalived template used to generate the test instances:

vrrp_instance 4__ID__ {
    state BACKUP
    interface bondX.__ID__
    use_vmac vrrp.__ID__.1
    virtual_router_id 1
    priority 120
    advert_int 5
    preempt_delay 60

    virtual_ipaddress {
        169.254.__ID__.254/24
    }
    mcast_src_ip 169.254.__ID__.254
    unicast_src_ip 169.254.__ID__.254
}

Interface test setup (the type of interface doesn't matter; I tested on a Linux bond and on eth with the same results):

ip link add bondX type dummy
ip link set up dev bondX

and in the loop:

ip link add link bondX name bondX.$id type vlan id $id
ip link set up dev bondX.$id

and generate the config from the template.
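
The setup described above can be sketched as a few shell functions. This is only a minimal sketch, not the reporter's actual tooling: the function names `render_instance`, `render_all`, and `create_vlan` are made up for illustration, and the template body simply mirrors the one posted above with `__ID__` replaced by the VLAN id. The `ip` commands need root, so they are kept in their own function.

```shell
# Sketch only: render_instance, render_all and create_vlan are illustrative names.

# Print one vrrp_instance block for VLAN id $1, following the posted template.
render_instance() {
    id=$1
    cat <<EOF
vrrp_instance 4${id} {
    state BACKUP
    interface bondX.${id}
    use_vmac vrrp.${id}.1
    virtual_router_id 1
    priority 120
    advert_int 5
    preempt_delay 60

    virtual_ipaddress {
        169.254.${id}.254/24
    }
    mcast_src_ip 169.254.${id}.254
    unicast_src_ip 169.254.${id}.254
}
EOF
}

# Print the configuration for VLAN ids 1..$1 to stdout.
render_all() {
    n=1
    while [ "$n" -le "$1" ]; do
        render_instance "$n"
        n=$((n + 1))
    done
}

# Create the VLAN subinterface for id $1 on top of bondX (needs root).
create_vlan() {
    ip link add link bondX name "bondX.$1" type vlan id "$1"
    ip link set up dev "bondX.$1"
}
```

For example, `render_all 100 > keepalived.conf` would write 100 instances; since the id is also used as the third octet of the address, it has to stay within 1-254.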

PS. With about 200 instances with vmac, the respawning was so frequent that the number of keepalived child processes went up to a few thousand (oO).

Do you think playing with SO_RCVBUF would help?

Regards!

@pqarmitage
Collaborator

Leaving aside the "Received message overrun" message for now, since I presume that is not what is causing keepalived to terminate, the issue with keepalived is an assertion failure in keepalived_malloc(). Since you have the keepalived_malloc() function, it would appear that you have run configure with the --enable-debug option, which uses more memory than building without that option (and also makes keepalived run slightly slower).
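
The inference above (a keepalived_malloc frame in the backtrace points to a build with memory debugging) can be checked mechanically. A minimal sketch, assuming the gdb `bt` output has been saved to a text file; the function name `bt_uses_debug_alloc` is illustrative and not part of keepalived or gdb.

```shell
# Succeeds if the saved gdb backtrace in file $1 contains a keepalived_malloc frame.
# Frames are expected in the usual "#N 0xADDR in function ()" form.
bt_uses_debug_alloc() {
    grep -Eq '^#[0-9]+ .* in keepalived_malloc \(' "$1"
}
```

It could be used as `bt_uses_debug_alloc bt.txt && echo "memory debugging is compiled in"`, where `bt.txt` is a hypothetical file holding the backtrace.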

The assert failure in keepalived is assert(number_alloc_list < MAX_ALLOC_LIST), where MAX_ALLOC_LIST is defined to be 2048. This is the maximum number of concurrent un-free()'d malloc()s when keepalived is built with --enable-debug. Because of the number of vrrp instances, and because configuring vmacs increases the number of malloc()'d blocks, you are hitting the MAX_ALLOC_LIST limit.

If you want to continue building with --enable-debug, then increase MAX_ALLOC_LIST in lib/memory.h. Otherwise build without --enable-debug. The assert should then no longer occur.
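
If the debug build is kept, raising the limit is a one-line change to the header. A hedged sketch: it assumes the limit appears in lib/memory.h as a plain `#define MAX_ALLOC_LIST <number>` line, which should be checked against the actual source tree first; the function name `set_max_alloc_list` is illustrative only.

```shell
# Rewrite the numeric value of MAX_ALLOC_LIST in header $1 to $2.
# Assumes a line of the form: #define MAX_ALLOC_LIST 2048
set_max_alloc_list() {
    sed "s/^\(#define[[:space:]][[:space:]]*MAX_ALLOC_LIST[[:space:]][[:space:]]*\)[0-9][0-9]*/\1$2/" "$1" > "$1.new" &&
        mv "$1.new" "$1"
}
```

For instance, `set_max_alloc_list lib/memory.h 8192` followed by a rebuild; the value 8192 is arbitrary and only needs to exceed the number of live allocations the configuration produces.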

Included in keepalived v1.2.23 is a patch that separates out the memory checking from the --enable-debug option, by introducing an --enable-mem-check option, so you could move to 1.2.23 and still build with --enable-debug, and without --enable-mem-check, and then you won't have the MAX_ALLOC_LIST limit. As ever, moving to the latest release makes it easier to support any issues you may have.

@acassen
Owner

acassen commented Aug 7, 2016

100% agreed. Please give it a try without the debug stuff. The memory debugging was only added while debugging a leak; it is there purely for coding purposes. I am closing this issue since this is the root cause IMHO. If not, please reopen it and report back.

Regs,

@acassen acassen closed this as completed Aug 7, 2016
@jslocinski
Author

Yep, you're right. Removing --enable-debug solved the issue we had while testing many instances.

Regards!
