keepalived crash when reloading service #365
Comments
Unfortunately the coredump itself doesn't help a lot without the executable built in debug mode. Could you please build a debug version of keepalived - After you get the coredump, run gdb as The output that that produces would be most helpful, especially since this appears to be an intermittent problem. |
Hi I tried with
I should add that in that lab I'm using 100 interfaces with 200 IPs on them. After starting, keepalived logs ~300 "Netlink reflector reports IP" messages:
|
Please just build it normally without --enable-debug: a regular ./configure, then make debug. Run keepalived with -D as Quentin explained and try to reproduce the coredump. Then follow Quentin's gdb steps and report the output. |
OK I did as you suggested. Keepalived crashed after two reloads:
gdb shows:
some logs:
|
in if_setsockopt_priority(...)... WTF?... hmm, I need to set up a docker test env... this seems to be related to docker... (1) In vrrp_if.c:if_setsockopt_priority(...) put a printf("SD VALUE : %d\n", *sd); right at the beginning, and report the result. (2) After that, try adding return *sd; right at the beginning. (3) Stracing the VRRP pid would help a lot too: strace -p <pid_vrrp_child> Regards |
(1) - added to the source code, but I don't quite know where you want me to check the results. Segfault in logs:
(3) strace
|
Re: does bt in gdb still point at if_setsockopt_priority(...)? To get more information in the gdb backtrace you might build keepalived with "make debug" (just type "make debug" after ./configure); that way -ggdb will be used, and gdb will be able to produce debug output with symbols. |
I have been able to get keepalived to segfault using the configs above, and making keepalived reload the alternate configs every 10 seconds, although the problem is highly intermittent. The stack trace I get (generally) is: although once I have had the first line as: Clearly in the first example base_ifp = 0x0 is the problem. I suspect in the second case base_ifp is invalid. The issue is why is reset_interface_parameters being called with an invalid pointer. |
acassen: I'm using "make debug" version already.
|
The issue can be reproduced on a physical machine, so it's not a docker-related bug. |
Issue acassen#365 identified that the vrrp child process intermittently segfaults when reloading configuration. Investigation has shown that vrrp->ifp->ifindex (and vrrp->ifp->ifname) are being overwritten, and hence in netlink_link_del_vmac() function if_get_by_ifindex() is returning null (since it is passed an invalid ifindex), and this return value is passed unchecked to reset_interface_parameters(), which then causes the segfault. This commit is a temporary workaround to test the return value of if_get_by_ifindex() and if it is NULL, then reset_interface_parameters() isn't called, but an error message is logged. Further investigation is needed to ascertain why the fields in the interface struct are being overwritten. Signed-off-by: Quentin Armitage <quentin@armitage.org.uk>
Update: vrrp->ifp->ifindex (and vrrp->ifp->ifname) are being overwritten with other data (vrrp->ifp looks still to be valid). The consequence of this is that in netlink_link_del_vmac() if_get_by_ifindex() is returning NULL. This is then passed (unchecked) to reset_interface_parameters(), which is where the segfault is occurring (or at least the segfault that I am seeing). I have submitted a pull request which should stop the segfault and log an error message "No interface found for ifindex nn (NAME), probably due to corruption". It would be very helpful if you could try this and confirm whether it stops the segfault occurring for you, and if you get the error message. Further work is needed to identify the cause of the memory corruption, but this may take some time. |
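The workaround described above amounts to checking the lookup result before using it. A minimal sketch of that pattern follows; every type and function name in it (`sketch_if_t`, `sketch_get_by_ifindex`, `sketch_reset_parameters`, `sketch_link_del_vmac`) is a hypothetical stand-in rather than keepalived's actual code, and only the wording of the log message is taken from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, cut-down stand-in for an interface record. */
typedef struct {
    unsigned ifindex;
    const char *ifname;
    int params_reset;
} sketch_if_t;

/* Hypothetical interface table with two known interfaces. */
sketch_if_t sketch_if_table[] = {
    { 2, "eth0", 0 },
    { 3, "eth1", 0 },
};

/* Stand-in lookup: returns NULL when no interface has this ifindex,
 * which is what happens when a corrupted ifindex is passed in. */
sketch_if_t *sketch_get_by_ifindex(unsigned ifindex)
{
    size_t n = sizeof(sketch_if_table) / sizeof(sketch_if_table[0]);
    for (size_t i = 0; i < n; i++)
        if (sketch_if_table[i].ifindex == ifindex)
            return &sketch_if_table[i];
    return NULL;
}

/* Stand-in for the function that dereferences the base interface.
 * The assert documents the precondition the unchecked caller violated. */
void sketch_reset_parameters(sketch_if_t *base_ifp)
{
    assert(base_ifp != NULL);
    base_ifp->params_reset = 1;
}

/* Checked caller: log and skip the reset instead of passing NULL on.
 * Returns 0 on success, -1 when the lookup failed. */
int sketch_link_del_vmac(unsigned base_ifindex, const char *vmac_name)
{
    sketch_if_t *base_ifp = sketch_get_by_ifindex(base_ifindex);
    if (base_ifp == NULL) {
        fprintf(stderr,
                "No interface found for ifindex %u (%s), probably due to corruption\n",
                base_ifindex, vmac_name);
        return -1;
    }
    sketch_reset_parameters(base_ifp);
    return 0;
}
```

The check only stops the crash at this call site; it does nothing about whatever is overwriting the ifindex in the first place.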
Pull request #370 has a resolution for one cause of the segfault when reloading, and it is this issue that is causing the vast majority of segfaults that I am seeing. I am now also seeing a segfault in thread_child_handler() in the line I suspect that the m->child list is getting corrupted. I will check the code for any similar issues of memory being freed during a reload, but pointers to that memory still existing. |
Hey, after running the while-loop that reloads the proper and empty configs for some time, I see that at some point the vmac interface is not deleted anymore. No error, no segfault. But from then on I also can't do a clean service stop (one process with the vmac interface always stays in the system). I couldn't reproduce this on previous versions (1.2.22 stable segfaulted instead). PS. I don't have a second peer. I'm testing on a veth link, but using only one side of it. Logs from a few cycles of proper-empty config are below. At the end I added logs + strace/lsof from when I tried to stop the broken service. Comments are in the logs.
|
Following on from my last comment, I am occasionally seeing indefinite looping in the vrrp child process in function thread_child_handler(), as well as segfaults there. I think this is another instance of malloc'd memory being either corrupted in some way, or being used after it has been freed. When the indefinite looping occurs, the vrrp child process no longer reports anything, and I find it is consuming nearly 100% CPU time. If you kill that process with a SEGV, it should dump core, and you can look at it with gdb to see where it was executing (or you can attach to the process with gdb and see where it is without dumping core). I find it is somewhere around line 778 of scheduler.c when it is looping, and likewise when it segfaults. After killing the process, the parent will restart it, and it should carry on as before, until the vrrp process gets stuck in a loop again. I'm looking into this problem at the moment, but again, this sort of problem is not easy to track down. If/when I find out more I'll update this issue report. |
The problem is that launch_scheduler() was setting the value of the parameter to be passed to thread_child_handler() to the original value of the thread_master_t *master. When reloading, that memory was free()'d and a new block of malloc()'d memory was assigned to master; however that value being passed to thread_child_handler() wasn't updated. If the old memory was subsequently returned to a different malloc() call, then the memory would be overwritten, hence causing thread_child_handler() to segfault, or to enter an infinite loop. Usually, since the malloc() for the new allocation was called immediately after the free() of the original allocation, the same address was returned, and so there wasn't a problem, but occasionally a different address would be returned, and then things would start to go wrong. Once a different address was returned, any new child processes would be added to the new allocated memory, but when a child terminated thread_child_handler() would search the old memory, and so not find the child. @jslocinski mentions "some other weird behaviour is observed (but only when used with notify_master/backup/fault scripts as in the example to reproduce the issue)". I haven't been able to observe that, but it would be really helpful if you could try the tests again and confirm if commit #371 resolves the weird behaviour as well as the segfaults/infinite loops. If all is resolved, it would be helpful if you could close this issue. |
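The stale-pointer problem described above can be avoided by never handing the handler a copy of the pointer's value. The sketch below shows one way to do that, by passing the address of the global pointer so the handler always sees the current allocation. It is an illustration of the general technique under that assumption, not necessarily how the actual commit fixed it, and the names `sketch_master_t`, `sketch_master`, `sketch_reload_master` and `sketch_child_handler` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for the scheduler's master struct. */
typedef struct {
    int child_count;
} sketch_master_t;

/* Global pointer that is freed and reallocated on every reload. */
sketch_master_t *sketch_master;

/* Reload: free the old block and allocate a fresh, zeroed one.
 * malloc is free to return a different address than before. */
void sketch_reload_master(void)
{
    free(sketch_master);
    sketch_master = calloc(1, sizeof(*sketch_master));
    if (sketch_master == NULL)
        abort();
}

/* Handler registered once at startup. Its argument is the ADDRESS of
 * the global pointer, so it is dereferenced at call time and follows
 * every reallocation. Registering the pointer's value instead would
 * leave the handler reading freed memory after a reload. */
int sketch_child_handler(void *arg)
{
    sketch_master_t **mp = arg;
    sketch_master_t *m = *mp;
    assert(m != NULL);
    return m->child_count;
}
```

In use, the handler argument is set once to `&sketch_master` and stays valid across any number of `sketch_reload_master()` calls, whether or not the allocator reuses the old address.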
@acassen I'm still performing some tests, but it seems that your latest patches resolve everything: the segfault issue, the loop with 100% CPU (and thus the service reload problem), and the notify scripts. Good work! Thank you! |
It's all fine now. Thanks! |
Hi
After upgrading to keepalived 1.2.22 I discovered that keepalived crashes from time to time when I remove the configuration and reload the service.
version affected: 1.2.22
version not affected: 1.2.19
Steps to reproduce:
create two config files:
add/remove vrrp and reload config:
coredump file attached
core.zip