Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ftl crash #949

Closed
minsikau opened this issue Nov 30, 2020 · 32 comments
Closed

ftl crash #949

minsikau opened this issue Nov 30, 2020 · 32 comments

Comments

@minsikau
Copy link

Looks like the Raspberry ip address failed on the ethernet port and then the
FTL failed. We woke in the morning and had access to the local network but not the internet. the router had a good connectetion to the wan, Pihole web was also operating but had no DNS.

On initial install i had to select "other ports" in monitoring settings, system listen on all interfaces 1 hop as the only way to make it work and actually listen, then it behaved as expected.

This is begining from the debug file,
and then the FTL crash section again from the debug file.

*** [ INITIALIZING ] Sourcing setup variables
[i] Sourcing /etc/pihole/setupVars.conf...

*** [ DIAGNOSING ]: Core version
[i] Core: v5.2 (https://discourse.pi-hole.net/t/how-do-i-update-pi-hole/249)
[i] Remotes: origin https://github.com/pi-hole/pi-hole.git (fetch)
origin https://github.com/pi-hole/pi-hole.git (push)
[i] Branch: master
[i] Commit: v5.2-0-gfee1b8b

*** [ DIAGNOSING ]: Web version
[i] Web: v5.2 (https://discourse.pi-hole.net/t/how-do-i-update-pi-hole/249)
[i] Remotes: origin https://github.com/pi-hole/AdminLTE.git (fetch)
origin https://github.com/pi-hole/AdminLTE.git (push)
[i] Branch: master
[i] Commit: v5.2-0-g2c2d9f5

*** [ DIAGNOSING ]: FTL version
[✓] FTL: v5.3.1

*** [ DIAGNOSING ]: lighttpd version
[i] 1.4.53

*** [ DIAGNOSING ]: php version
[i] 7.3.19

*** [ DIAGNOSING ]: Operating system
[i] dig return code: 0
[i] dig response: "Raspbian=9,10 Ubuntu=16,18,20 Debian=9,10 Fedora=31,32 CentOS=7,8"
[✓] Distro: Raspbian
[✓] Version: 10

*** [ DIAGNOSING ]: SELinux
[i] SELinux not detected

*** [ DIAGNOSING ]: FirewallD
[i] Firewalld service inactive

*** [ DIAGNOSING ]: Processor
[✓] armv6l

*** [ DIAGNOSING ]: Networking
[✗] No IPv4 address(es) found on the enxb827ebbcc05f interface.

[✗] No IPv6 address(es) found on the enxb827ebbcc05f interface.

[i] Default IPv4 gateway: 192.168.1.1

*** [ DIAGNOSING ]: Ports in use
*:514 rsyslogd (IPv4)
*:514 rsyslogd (IPv6)
*:22 sshd (IPv4)
*:22 sshd (IPv6)
[80] is in use by lighttpd
[80] is in use by lighttpd
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL

*** [ DIAGNOSING ]: Name resolution (IPv4) using a random blocked domain and a known ad-serving domain
[✓] hardyskills.com is 0.0.0.0 via localhost (127.0.0.1)
[✓] hardyskills.com is 0.0.0.0 via Pi-hole (192.168.1.2)
[✓] doubleclick.com is 142.250.66.206 via a remote, public DNS server (8.8.8.8)

*** [ DIAGNOSING ]: Discovering active DHCP servers (takes 10 seconds)
Scanning all your interfaces for DHCP servers
Timeout: 10 seconds

  • Received 300 bytes from eth0:192.168.1.1
    Offered IP address: 192.168.1.109
    Server IP address: 192.168.1.1
    Relay-agent IP address: N/A
    DHCP options:
    Message type: DHCPOFFER (2)
    server-identifier: 192.168.1.1
    lease-time: 120 ( 2m )
    renewal-time: 60 ( 1m )
    rebinding-time: 105 ( 1m 45s )
    netmask: 255.255.255.0
    broadcast: 192.168.1.255
    dns-server: 192.168.1.2
    wpad-server: (length 1)
    router: 192.168.1.1
    --- end of options ---

DHCP packets received on interface eth0: 1
DHCP packets received on interface lo: 0

*** [ DIAGNOSING ]: Pi-hole processes
[✓] lighttpd daemon is active
[✓] pihole-FTL daemon is active

*** [ DIAGNOSING ]: Pi-hole-FTL full status
● pihole-FTL.service - LSB: pihole-FTL daemon
Loaded: loaded (/etc/init.d/pihole-FTL; generated)
Active: active (exited) since Tue 2020-12-01 08:04:05 ACDT; 32min ago
Docs: man:systemd-sysv-generator(8)
Process: 402 ExecStart=/etc/init.d/pihole-FTL start (code=exited, status=0/SUCCESS)

Dec 01 08:03:18 raspberrypi systemd[1]: Starting LSB: pihole-FTL daemon...
Dec 01 08:03:19 raspberrypi pihole-FTL[402]: Not running
Dec 01 08:03:21 raspberrypi su[431]: (to pihole) root on none
Dec 01 08:03:21 raspberrypi su[431]: pam_unix(su:session): session opened for user pihole by (uid=0)
Dec 01 08:04:05 raspberrypi pihole-FTL[402]: FTL started!
Dec 01 08:04:05 raspberrypi systemd[1]: Started LSB: pihole-FTL daemon.

-rw-r--r-- 1 pihole pihole 89021 Dec 1 08:34 /var/log/pihole-FTL.log
-----head of pihole-FTL.log------
[2020-12-01 03:45:40.275 3712M] Resizing "FTL-dns-cache" from 86016 to (5632 * 16) == 90112 (/dev/shm: 2.5MB used, 93.4MB total)
[2020-12-01 04:01:22.997 3712M] Reloading DNS cache
[2020-12-01 04:01:23.169 3712M] Blocking status is enabled
[2020-12-01 04:01:23.185 3712M] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-12-01 04:01:23.185 3712M] ----------------------------> FTL crashed! <----------------------------
[2020-12-01 04:01:23.186 3712M] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-12-01 04:01:23.186 3712M] Please report a bug at https://github.com/pi-hole/FTL/issues
[2020-12-01 04:01:23.186 3712M] and include in your report already the following details:
[2020-12-01 04:01:23.187 3712M] FTL has been running for 137192 seconds
[2020-12-01 04:01:23.187 3712M] FTL branch: master
[2020-12-01 04:01:23.188 3712M] FTL version: v5.3.1
[2020-12-01 04:01:23.188 3712M] FTL commit: e1db31d
[2020-12-01 04:01:23.188 3712M] FTL date: 2020-11-28 21:59:12 +0100
[2020-12-01 04:01:23.189 3712M] FTL user: started as pihole, ended as pihole
[2020-12-01 04:01:23.190 3712M] Compiled for armv6hf (compiled on CI) using arm-linux-gnueabihf-gcc (crosstool-NG crosstool-ng-1.22.0-88-g8460611) 4.9.3
[2020-12-01 04:01:23.190 3712M] Process details: MID: 3712
[2020-12-01 04:01:23.191 3712M] PID: 3712
[2020-12-01 04:01:23.191 3712M] TID: 3712
[2020-12-01 04:01:23.191 3712M] Name: pihole-FTL
[2020-12-01 04:01:23.192 3712M] Received signal: Segmentation fault
[2020-12-01 04:01:23.192 3712M] at address: (nil)
[2020-12-01 04:01:23.192 3712M] with code: SEGV_MAPERR (Address not mapped to object)
[2020-12-01 04:01:23.226 3712M] Backtrace:
[2020-12-01 04:01:23.228 3712M] B[0000]: /usr/bin/pihole-FTL(+0x4a2f8) [0x49d2f8]
[2020-12-01 04:01:23.818 3712M] L[0000]: /root/project/src/signals.c:194
[2020-12-01 04:01:23.850 3712M] B[0001]: /lib/arm-linux-gnueabihf/libc.so.6(__default_rt_sa_restorer+0) [0xb6d95130]
[2020-12-01 04:01:23.850 3712M] ------ Listing content of directory /dev/shm ------
[2020-12-01 04:01:23.851 3712M] File Mode User:Group Filesize Filename
[2020-12-01 04:01:23.854 3712M] rwxrwxrwx root:root 260 .
[2020-12-01 04:01:23.855 3712M] rwxr-xr-x root:root 4K ..
[2020-12-01 04:01:23.856 3712M] rw------- pihole:pihole 4K FTL-per-client-regex
[2020-12-01 04:01:23.868 3712M] rw------- pihole:pihole 90K FTL-dns-cache
[2020-12-01 04:01:23.869 3712M] rw------- pihole:pihole 20K FTL-overTime
[2020-12-01 04:01:23.870 3712M] rw------- pihole:pihole 2M FTL-queries
[2020-12-01 04:01:23.872 3712M] rw------- pihole:pihole 29K FTL-upstreams

-----tail of pihole-FTL.log------
[2020-12-01 08:04:05.169 475M] Error: Found unknown status 12 in long term database!
[2020-12-01 08:04:05.170 475M] Timestamp: 1606757184
[2020-12-01 08:04:05.170 475M] Continuing anyway...
[2020-12-01 08:04:05.180 475M] Error: Found unknown status 12 in long term database!
[2020-12-01 08:04:05.181 475M] Timestamp: 1606757212
[2020-12-01 08:04:05.181 475M] Continuing anyway...
[2020-12-01 08:04:05.204 475M] Error: Found unknown status 12 in long term database!
[2020-12-01 08:04:05.204 475M] Timestamp: 1606757396
[2020-12-01 08:04:05.205 475M] Continuing anyway...
[2020-12-01 08:04:05.206 475M] Error: Found unknown status 12 in long term database!
[2020-12-01 08:04:05.206 475M] Timestamp: 1606757405
[2020-12-01 08:04:05.207 475M] Continuing anyway...
[2020-12-01 08:04:05.238 475M] Imported 26301 queries from the long-term database
[2020-12-01 08:04:05.250 475M] -> Total DNS queries: 26301
[2020-12-01 08:04:05.251 475M] -> Cached DNS queries: 4875
[2020-12-01 08:04:05.251 475M] -> Forwarded DNS queries: 10544
[2020-12-01 08:04:05.251 475M] -> Blocked DNS queries: 10496
[2020-12-01 08:04:05.252 475M] -> Unknown DNS queries: 0
[2020-12-01 08:04:05.262 475M] -> Unique domains: 1586
[2020-12-01 08:04:05.262 475M] -> Unique clients: 9
[2020-12-01 08:04:05.273 475M] -> Known forward destinations: 2
[2020-12-01 08:04:05.274 475M] Successfully accessed setupVars.conf
[2020-12-01 08:04:05.394 519M] PID of FTL process: 519
[2020-12-01 08:04:05.398 519M] INFO: FTL is running as user pihole (UID 999)
[2020-12-01 08:04:05.433 519/T520] Listening on port 4711 for incoming IPv4 telnet connections
[2020-12-01 08:04:05.447 519/T522] Listening on Unix socket
[2020-12-01 08:04:05.457 519/T521] Listening on port 4711 for incoming IPv6 telnet connections
[2020-12-01 08:04:05.466 519M] Reloading DNS cache
[2020-12-01 08:04:05.484 519M] Blocking status is enabled
[2020-12-01 08:04:05.889 519/T523] Compiled 0 whitelist and 0 blacklist regex filters for 9 clients in 24.7 msec
[2020-12-01 08:05:29.189 519M] Resizing "FTL-strings" from 40960 to (45056 * 1) == 45056 (/dev/shm: 2.1MB used, 93.4MB total)
[2020-12-01 08:06:30.888 519M] Resizing "FTL-dns-cache" from 4096 to (512 * 16) == 8192 (/dev/shm: 2.1MB used, 93.4MB total)
[2020-12-01 08:15:34.185 519M] Resizing "FTL-dns-cache" from 8192 to (768 * 16) == 12288 (/dev/shm: 2.1MB used, 93.4MB total)
[2020-12-01 08:34:11.560 519M] Resizing "FTL-strings" from 45056 to (49152 * 1) == 49152 (/dev/shm: 2.1MB used, 93.4MB total)
[2020-12-01 08:34:51.008 519M] Resizing "FTL-dns-cache" from 12288 to (1024 * 16) == 16384 (/dev/shm: 2.1MB used, 93.4MB total)

@BlueSpaceCanary
Copy link

BlueSpaceCanary commented Dec 1, 2020

I've had the same crash twice today; FTL log gist is here

Edit: Also, the pi's networking was fine throughout this. I'd be surprised if it's related, but I have also been having ISP issues today so my modem's been up and down and the switch the pi is on was restarted a few times earlier today (but neither crash lined up with the switch restart)

@DL6ER
Copy link
Member

DL6ER commented Dec 1, 2020

@minsikau

This part:

[i] Default IPv4 gateway: 192.168.1.1

Pinging 192.168.1.1...
[✗] Gateway did not respond. (https://discourse.pi-hole.net/t/why-is-a-default-gateway-important-for-pi-hole/3546)

looks suspicious. However, the following part worked:

[✓] doubleclick.com is 142.250.66.206 via a remote, public DNS server (8.8.8.8)

So I suspect your router simply does not reply to internal pings and this is not really suggesting a network issue.

Unfortunately, your bug trace does not contain the necessary details for me to know what was the cause of the issue. Can you somehow reproduce/force this crash to happen? I just tried restarting my Pi-hole with no network cable attached but I don't see a crash myself.


@BlueSpaceCanary This may still have to do with intermittend Internet failures, however, I haven't seen something like this even when coding for Pi-hole while in the train (which is, at least in Germany, the definition of unreliable Internet).

[2020-12-01 04:01:23.226 3712M] Backtrace:
[2020-12-01 04:01:23.228 3712M] B[0000]: /usr/bin/pihole-FTL(+0x4a2f8) [0x49d2f8]
[2020-12-01 04:01:23.818 3712M] L[0000]: /root/project/src/signals.c:194
[2020-12-01 04:01:23.850 3712M] B[0001]: /lib/arm-linux-gnueabihf/libc.so.6(__default_rt_sa_restorer+0) [0xb6d95130]

doesn't tell us much, did both crashed only show these lines as backtrace?

@BlueSpaceCanary
Copy link

Yep, unfortunately that's all there was for both crashes

@BlueSpaceCanary
Copy link

BlueSpaceCanary commented Dec 1, 2020

@DL6ER just had a third, and given that they seem like they're going to be pretty regular so it shouldn't take too long to catch one, I'm wondering whether there's an easy way for me to have FTL run in gdb or lldb so next time I can maybe grab a bit more useful information for you?

@yubiuser
Copy link
Member

yubiuser commented Dec 2, 2020

There is a way
https://docs.pi-hole.net/ftldns/debugging/

@BlueSpaceCanary
Copy link

BlueSpaceCanary commented Dec 2, 2020

@DL6ER caught a crash for you. Just the crash itself + backtrace:

Thread 1 "pihole-FTL" received signal SIG34, Real-time event 34.

Thread 4 "database" received signal SIGABRT, Aborted.
[Switching to Thread 0x750b5460 (LWP 28349)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x76d3e230 in __GI_abort () at abort.c:79
#2  0x76d8e51c in __libc_message (action=action@entry=do_abort, fmt=<optimized out>) at ../sysdeps/posix/libc_fatal.c:181
#3  0x76d95044 in malloc_printerr (str=<optimized out>) at malloc.c:5341
#4  0x76d97098 in _int_free (av=0x73600010, p=0x73622ca8, have_lock=0) at malloc.c:4193
#5  0x00451a12 in free_regex () at /root/project/src/regex.c:334
#6  read_regex_from_database () at /root/project/src/regex.c:451
#7  0x0044ae14 in FTL_reload_all_domainlists () at /root/project/src/datastructure.c:484
#8  0x0045c00a in DB_thread (val=<optimized out>) at /root/project/src/database/database-thread.c:81
#9  0x76e7b494 in start_thread (arg=0x750b5460) at pthread_create.c:486
#10 0x76dfe578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

@BlueSpaceCanary
Copy link

I'll leave the gdb session open, so let me know if there's any vars whose values I should grab for you :) I'm not sure, but I think this one happened when I toggled a group state (i.e. re-enabling a few regex domain blocks via a group). I definitely hadn't noticed any DNS issues before that, and immediately after doing it things were broken.

@BlueSpaceCanary
Copy link

BlueSpaceCanary commented Dec 3, 2020

Guess I might need to grab a debug build for that 😅

(gdb) up 5
#5  0x00451a12 in free_regex () at /root/project/src/regex.c:334
334     /root/project/src/regex.c: No such file or directory.
(gdb) info locals
regex = 0x44ae15 <FTL_reload_all_domainlists+68>
regexid = <optimized out>
clientID = <optimized out>
regexid = <optimized out>
regex = <optimized out>
index = <optimized out>

EDIT: Oh, just saw a8266a7 which looks like it might be relevant given the Thread 1 "pihole-FTL" received signal SIG34, Real-time event 34. right before the crash. I'll update and see if the crashes continue

@DL6ER
Copy link
Member

DL6ER commented Dec 3, 2020

Real-time Signals are expected and not a crash. They are used to tell FTL to reload domains from the database and similar things. Just type continue.

See no. 4 of https://docs.pi-hole.net/ftldns/debugging/

To avoid the <optimized out> you'd need to compile without optimizations (-O0).

@BlueSpaceCanary
Copy link

@DL6ER right after the SIG34 in the pihole-FTL thread, the database thread received a SIGABRT :)

I did a debug build and am running that now, but I did run into a couple errors. The first was just needing to grab one more library that wasn't in https://docs.pi-hole.net/ftldns/compile/#debian-ubuntu-raspbian (libreadline-dev). The second was

/home/bsc/FTL/src/dhcp-discover.c: In function ‘send_dhcp_discover’:
/home/bsc/FTL/src/dhcp-discover.c:184:2: error: statement with no effect [-Werror=unused-value]

so I went ahead and stuck -Wno-unused-value in my debug flags. I assume these are things y'all are aware of, just want to make sure I document anything I'm changing in the build process.

If I catch another crash I'll drop info in here

@BlueSpaceCanary
Copy link

Caught another, also immediately after toggling group state:

Thread 1 "pihole-FTL" received signal SIG34, Real-time event 34.

Thread 4 "database" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7507a460 (LWP 9834)]
0x00610068 in sqlite3DeleteTriggerStep (db=0x0, pTriggerStep=0x1de7b) at /home/bsc/FTL/src/database/sqlite3.c:134876
134876      pTriggerStep = pTriggerStep->pNext;
(gdb) bt
#0  0x00610068 in sqlite3DeleteTriggerStep (db=0x0, pTriggerStep=0x1de7b) at /home/bsc/FTL/src/database/sqlite3.c:134876
#1  0x006112a8 in sqlite3DeleteTrigger (db=0x0, pTrigger=0x73301600) at /home/bsc/FTL/src/database/sqlite3.c:135403
#2  0x005ed094 in sqlite3SchemaClear (p=0x7334d810) at /home/bsc/FTL/src/database/sqlite3.c:114635
#3  0x00598ba8 in sqlite3BtreeClose (p=0x7334e1b8) at /home/bsc/FTL/src/database/sqlite3.c:66842
#4  0x0063b3d8 in sqlite3LeaveMutexAndCloseZombie (db=0x73323d60) at /home/bsc/FTL/src/database/sqlite3.c:160287
#5  0x0063b2c0 in sqlite3Close (db=0x73323d60, forceZombie=0) at /home/bsc/FTL/src/database/sqlite3.c:160230
#6  0x0063b2f8 in sqlite3_close (db=0x73323d60) at /home/bsc/FTL/src/database/sqlite3.c:160243
#7  0x004f57d4 in gravityDB_close () at /home/bsc/FTL/src/database/gravity-db.c:919
#8  0x004f2e80 in gravityDB_reopen () at /home/bsc/FTL/src/database/gravity-db.c:70
#9  0x004d5814 in FTL_reload_all_domainlists () at /home/bsc/FTL/src/datastructure.c:477
#10 0x004f2cd0 in DB_thread (val=0x0) at /home/bsc/FTL/src/database/database-thread.c:81
#11 0x76e3f494 in start_thread (arg=0x7507a460) at pthread_create.c:486
#12 0x76dc2578 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:73 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Since it seems to be somewhat reproduceable on my machine and I'm seeing segfaults and sigaborts in different spots, I'll try running FTL in valgrind this evening and poking at it

@DL6ER
Copy link
Member

DL6ER commented Dec 3, 2020

/home/bsc/FTL/src/dhcp-discover.c: In function ‘send_dhcp_discover’:
/home/bsc/FTL/src/dhcp-discover.c:184:2: error: statement with no effect [-Werror=unused-value]

Which version of the compiler are you using? There was a bug in the optimizations strategies of some older gcc where this (admitedly useless) feature was necessary for the entire function to work at all. That's the reason of this for being there.


Thanks for the SIGSEGV backtrace. This is happening deeeeply down in the SQLite3 engine so there isn't much we can do about it. I fully agree with your idea to run through valgrind if crashes take place in different location all of the time. Smells like memory corruption (even when I'm surprised you're the only one affected by it).

@BlueSpaceCanary
Copy link

BlueSpaceCanary commented Dec 3, 2020

% gcc --version
gcc (Raspbian 8.3.0-6+rpi1) 8.3.0

I won't have time til this evening, but I'll see if I can tickle another crash out of it. After just starting it up, the only thing I see outside sqlite or libarmmem is a bunch of Conditional jump or move depends on uninitialised value(s)s in different spots in tre_parse(), all of them from a stack allocation in compile_regex()

EDIT: Hm, I made a couple hacks to get the admin console to believe that FTL is running, but even so, when running under valgrind I don't seem to be able to change the configured state of FTL. Toggling groups or adlists, using Disable, etc has no effect on the server when running in valgrind, but the underlying config does change and killing & restarting FTL causes it to pick up the new state. I'll see if I can figure out why, since I need to toggle those in FTL to try to produce a crash

@DL6ER
Copy link
Member

DL6ER commented Dec 4, 2020

After just starting it up, the only thing I see outside sqlite or libarmmem is a bunch of Conditional jump or move depends on uninitialised value(s)s in different spots in tre_parse(), all of them from a stack allocation in compile_regex()

This shouldn't be something for us to worry about.


when running under valgrind I don't seem to be able to change the configured state of FTL

Oh yes, that's expected. valgrind impersonates the process it runs. You can force FTL to update/reload by sending the same signals to valgrind's inner machinery. In your case, sending the signal to memcheck should be the right thing to do:

sudo pkill -HUP memcheck

@BlueSpaceCanary
Copy link

BlueSpaceCanary commented Dec 4, 2020

Ok, just repeatedly toggling a group off, reloading, and then toggling it back on and reloading again does seem to reliably produce a crash eventually, though it sometimes takes a bit.

Valgrind log: https://gist.github.com/BlueSpaceCanary/a3bf6180d1caed088d1afa5e1ac6833c

Valgrind suppressions just in case: https://gist.github.com/BlueSpaceCanary/0e0d33784335d0f9416c56f5d99b17e7

pihole-FTL.log from startup to crash: https://gist.github.com/BlueSpaceCanary/d192252938b6ba2d92f2e0c805c8e8a9

The crash happened after re-enabling a group with a few regexes, which is missing at the very end of the log before the crash

@Bernie
Copy link

Bernie commented Dec 5, 2020

I am getting a segmentation fault continuously for every 255 seconds. I had initially thought it was a configuration error, and then cleared all config and let it start up as a fresh installation and am still receiving this crash:

[2020-12-05 16:21:05.724 367M] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-12-05 16:21:05.725 367M] ---------------------------->  FTL crashed!  <----------------------------
[2020-12-05 16:21:05.725 367M] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2020-12-05 16:21:05.725 367M] Please report a bug at https://github.com/pi-hole/FTL/issues
[2020-12-05 16:21:05.725 367M] and include in your report already the following details:
[2020-12-05 16:21:05.725 367M] FTL has been running for 255 seconds
[2020-12-05 16:21:05.725 367M] FTL branch: master
[2020-12-05 16:21:05.725 367M] FTL version: v5.3.2
[2020-12-05 16:21:05.726 367M] FTL commit: 0790cf7
[2020-12-05 16:21:05.726 367M] FTL date: 2020-12-02 21:28:39 +0000
[2020-12-05 16:21:05.726 367M] FTL user: started as root, ended as root
[2020-12-05 16:21:05.726 367M] Compiled for armv7hf (compiled on CI) using arm-linux-gnueabihf-gcc (Debian 6.3.0-18) 6.3.0 20170516
[2020-12-05 16:21:05.726 367M] Process details: MID: 367
[2020-12-05 16:21:05.726 367M]                  PID: 367
[2020-12-05 16:21:05.727 367M]                  TID: 367
[2020-12-05 16:21:05.727 367M]                  Name: pihole-FTL
[2020-12-05 16:21:05.727 367M] Received signal: Segmentation fault
[2020-12-05 16:21:05.727 367M]      at address: 0xb6b5e4c8
[2020-12-05 16:21:05.727 367M]      with code:  SEGV_MAPERR (Address not mapped to object)
[2020-12-05 16:21:05.728 367M] Backtrace:
[2020-12-05 16:21:05.728 367M] B[0000]: pihole-FTL(+0x3a460) [0x4b0460]
[2020-12-05 16:21:05.735 367M] L[0000]: N/A (0x3a460)
[2020-12-05 16:21:05.735 367M] B[0001]: /lib/arm-linux-gnueabihf/libc.so.6(+0x25cb0) [0xb6d2bcb0]
[2020-12-05 16:21:05.735 367M] B[0002]: /lib/arm-linux-gnueabihf/libpthread.so.0(pthread_cancel+0x15) [0xb6e0ba1e]
[2020-12-05 16:21:05.736 367M] B[0003]: pihole-FTL(main+0x1af) [0x4a13b8]
[2020-12-05 16:21:05.742 367M] L[0003]: N/A (0x2b3b8)
[2020-12-05 16:21:05.742 367M] B[0004]: /lib/arm-linux-gnueabihf/libc.so.6(__libc_start_main+0x97) [0xb6d1d524]
[2020-12-05 16:21:05.742 367M] ------ Listing content of directory /dev/shm ------
[2020-12-05 16:21:05.742 367M] File Mode User:Group  Filesize Filename
[2020-12-05 16:21:05.743 367M] rwxrwxrwx root:root 260 .
[2020-12-05 16:21:05.743 367M] rwxr-xr-x root:root 320 ..
[2020-12-05 16:21:05.743 367M] rw------- root:root 4K FTL-per-client-regex
[2020-12-05 16:21:05.744 367M] rw------- root:root 4K FTL-dns-cache
[2020-12-05 16:21:05.744 367M] rw------- root:root 20K FTL-overTime
[2020-12-05 16:21:05.744 367M] rw------- root:root 229K FTL-queries
[2020-12-05 16:21:05.744 367M] rw------- root:root 29K FTL-upstreams
[2020-12-05 16:21:05.745 367M] rw------- root:root 340K FTL-clients
[2020-12-05 16:21:05.745 367M] rw------- root:root 66K FTL-domains
[2020-12-05 16:21:05.745 367M] rw------- root:root 4K FTL-strings
[2020-12-05 16:21:05.745 367M] rw------- root:root 12 FTL-settings
[2020-12-05 16:21:05.746 367M] rw------- root:root 152 FTL-counters
[2020-12-05 16:21:05.746 367M] rw------- root:root 28 FTL-lock
[2020-12-05 16:21:05.746 367M] ---------------------------------------------------
[2020-12-05 16:21:05.746 367M] Please also include some lines from above the !!!!!!!!! header.
[2020-12-05 16:21:05.746 367M] Thank you for helping us to improve our FTL engine!
[2020-12-05 16:21:05.747 367M] FTL terminated!
[2020-12-05 16:21:33.347 364M] Using log file /var/log/pihole-FTL.log
[2020-12-05 16:21:33.347 364M] ########## FTL started! ##########

I am running this in docker.

EDIT: Actually, I think this crash was related to a different issue I was encountering. My docker container was continuously terminating and restarting the instance due to a dns configuraiton issue (pi-hole/docker-pi-hole#342). Solving this issue appears to have solved the frequent crashing for me.

@DL6ER
Copy link
Member

DL6ER commented Dec 7, 2020

@BlueSpaceCanary Thanks, I just had time to look at your gists but nothing strikes out to me right now. Can you run with --track-origins=yes ? I'll have some more time this evening so I'll also try to reproduce myself, again.

@bjaudon
Copy link

bjaudon commented Dec 7, 2020

I’m also experiencing a FTL segfault with my pi and GCP instances.

https://tricorder.pi-hole.net/x2iunvjii5

https://tricorder.pi-hole.net/lg7kg5blwl

@BlueSpaceCanary
Copy link

@DL6ER sure thing, here's one with --track-origins=yes, and just in case also a huge one both with that enabled & the suppression lists disabled

@DL6ER
Copy link
Member

DL6ER commented Dec 7, 2020

@bjaudon Your two crash reports (one on x86_64, one on ARMv6) show two crashes at two different locations, however, both happening when trying to access memory on disk. I'm putting this on my "look for this as well" list.

@BlueSpaceCanary Unfortunately, this still doesn't show anything revealing of significance. I'm a bit confused why there are so many incorrect Use of uninitialised value of size 4 and Conditional jump or move depends on uninitialised value(s). In all these locations (I checked all of them) we have safe-guard in place that do not allow execution of the corresponding code regions if memory is uninitialized.
I went through each file in your first link, I think the suppression you chose was wisely chosen so I've not even tried to carefully go through the second. It'd likely keep me busy until the day after tomorrow (or longer).
This is a bit unfortunate because this seems to suggest a memory error in terms of maybe a dangling pointer is not the issue here. As we're nonethless seeing a few crashes from users on various systems and all bug traces seem to suggest locations where crashes either cannot happen, like

if((strstr(conflinebuffer, keystr)) == NULL)

(both conflinebuffer and keystr are checked against NULL before, guaranteed null-terminated and locally heap-allocated)

Still trying to reproduce locally, it seems like nothing else really works, unfortunately.

@BlueSpaceCanary
Copy link

Worst come to worst, if you're not able to find anything this week I can brush off my C and try to track it down when I'm off work over the holidays since I at least have an easy reproduction to test against.

(I'm also memtesting my pi to definitively rule out a hardware issue on the off chance there's a sticking bit which somehow predictably ends up in the physical addresses backing FTL's address space)

@DL6ER
Copy link
Member

DL6ER commented Dec 8, 2020

I'm still really hoping to find this quickly. Chances are good as long as I find anything giving me something crashing. I tried various regular expressions, sending signals in quick succession and a few more. I found something, I'm still checking it, though.

Could you run

pihole checkout ftl fix/tre-regex-forks

and see if the crashes are resolved? I do not really have high hopes on that this fixing of a real edge-case in the regex process will also resolve what you're seeing, but we will only know once we try.

@BlueSpaceCanary
Copy link

@DL6ER
Copy link
Member

DL6ER commented Dec 8, 2020

Okay, next try: The crash you're observing happens in the SQLite3 engine, so let's try to rule out a hidden database corruption on disk. You may loose history data, however, we'll first simply put the database somewhere else so the data is still there. If this turns out to be the solution and you want the data back, we can reimport if from the old into the new database later.

Could you try the following?

sudo service pihole-FTL stop
sudo mv /etc/pihole/pihole-FTL.db /etc/pihole/pihole-FTL.db_backup
sudo service pihole-FTL start

@BlueSpaceCanary
Copy link

Still gives the same sqlite crash. I also tried disabling writing to the db with DBFILE= or giving a nonexistent path where the pihole user can't write (so it couldn't create one), but both cases still crashed (in sqlite code, even). Same thing with both the FTL and gravity DBs moved out of the way.

I've noticed when I run in valgrind I sometimes get this double free and sometimes don't. I haven't caught it in gdb, but I wonder if something is sometimes overwriting one of the list addresses so they end up pointing to the same memory?

Also noticed some Invalid write of size 4 [...] Address 0x0 is not stack'd, malloc'd or (recently) free'ds in getline from parse_FTLconf, which I think is because when conflinebuffer is freed and set to null, size isn't set to 0. From man getline:

If *lineptr is set to NULL and *n is set 0 before the call, then
getline() will allocate a buffer for storing the line. This buffer
should be freed by the user program even if getline() failed.

Setting size to zero & building locally didn't solve the crashes, but it has made it really hard to reproduce the crash specifically while running under valgrind 🤷 (it also did seem to eliminate the warnings about writing to null)

@DL6ER
Copy link
Member

DL6ER commented Dec 8, 2020

Ah, thanks, very good catch! I see I added zeroing the number in the getline routines in setupVars.conf but forgot to do so in config.c - interestingly enough, git blame reveals that it is like that for at least 3 years. Nevertheless, it is wrong. One could also argue that we should just keep the memory forever and never free it because (a) it is only a few bytes and (b) frequently used, but, for not, we'll just add the explicit zeroing of the size.

Concerning the double-free: This is strange as well, however, FTL has its own mechanism to detect (and avoid!) this in-vivo so it shouldn't be the cause for any crashes. This region of the code also doesn't have to do with the database and it's not really looking like a dangling pointer to me: There is no offset into the block so it is likely "just" some code freeing something twice:

Address 0x4df2da0 is 0 bytes inside a block of size 24 free'd

FTL/src/regex.c

Line 334 in 0790cf7

free(regex);

Looking closely at this code, the only way this can be triggered seems to be you had some regex list was empty and you add an entry. Then FTL would try to free the regex list even when it didn't exist before. Very good catch, again, however, still not the bug we're trying to squash here. :-)

PRs to fix these two are on their way. Thanks! I'm curious to see what you find next 🙂


Edit Wait, there is an if inplace that should prevent exactly what you're seeing:

FTL/src/regex.c

Lines 332 to 336 in 0790cf7

if(regex != NULL)
{
free(regex);
regex = NULL;
}

Does this mean we're freeing this somewhere else where it is not explicitly set to NULL? I don't think so, especially because your valgrind experiment showed that it was the same free before (which did explicitly set to NULL).

@BlueSpaceCanary
Copy link

BlueSpaceCanary commented Dec 10, 2020

After some trial and error checking different things with vgdb, I've at least got a partial answer. In read_regex_table(), gravityDB_count(tableID) is occasionally incorrectly returning 0 instead of 1. That leads to read_regex_table() returning early, which means black_regex or white_regex stays a freed pointer (since free_regex() only sets its local regex_data* to null, not the globals) and then the next time free_regex() gets called, that gets freed again.

I haven't spent any time in gravity-db.c at all yet, but I'll try to catch a few problem calls to gravityDB_count() with vgdb tonight or tomorrow and see what I can find

@DL6ER
Copy link
Member

DL6ER commented Dec 10, 2020

Thanks, yeah, we definitely want to set the global regex pointer to NULL here. Still, not explaining the SQLite3 issue. Even when it should bail out, a double-free is probably "harmless" if it doesn't crash, however, it will be undefined what is happening, so good to fix this as well #971


In read_regex_table(), gravityDB_count(tableID) is occasionally incorrectly returning 0 instead of 1.

Have you checked that tableID is really what you expect it to be? If there are any optimizations left enabled, the code may insert short-cuts to skip over some of the tableIDs which is opaque to the debugger (at least to possibly defined breakpoints).

To check if this is nevertheless the case, you could check if FTL also report an incorrect number of regexes for the respective types. Watch out for a line like

Compiled 2 whitelist and 1 blacklist regex filters for 7 clients in 6.4 msec

in your /var/log/pihole-FTL.log

@BlueSpaceCanary
Copy link

BlueSpaceCanary commented Dec 10, 2020

Yep, things lined up with what I'd expect. Here's an example segment of a gdb session (with some stepping through gravityDB's calls into sqlite snipped out). The next call into free_regex() triggers the double free warning.

piholeFTL.log does show 0 whitelist & 0 blacklist regex filters:

[2020-12-09 20:03:45.732 23888/T23918] Compiled 0 whitelist and 0 blacklist regex filters for 16 clients in 84120.2 msec

which seems arguably right because while the regexes are in gravityDB, this was toggling the group to disable them. But if that log line is supposed to report present-but-disabled expressions as well it's definitely wrong since my test setup has 1 of each, and toggling the group back on shows them:

[2020-12-09 20:07:51.353 23888/T23918] Compiled 1 whitelist and 1 blacklist regex filters for 16 clients in 224149.0 msec

The binary I'm running is built with -DCMAKE_BUILD_TYPE=DEBUG, and I double checked & all the flags.make files are indeed using -O0 -g3 so it shouldn't be optimization issues

EDIT: Well, I tested a quick hack to just preempt the double free by nulling the appropriate global in free_regex() and I don't seem to be able to reproduce my crash anymore in either debug builds or optimized builds

@BlueSpaceCanary
Copy link

BlueSpaceCanary commented Dec 10, 2020

Actually, rereading what you said, gravityDB_count() seems like it's doing the right thing? So maybe the double free really was the full root issue I was hitting

@BlueSpaceCanary
Copy link

@DL6ER still no crashes for me on the dev branch. Thanks so much for your time and for getting a fix checked in quickly once we'd found the cause!

@DL6ER
Copy link
Member

DL6ER commented Dec 24, 2020

The next version of FTL has been released. Please update and run

pihole checkout master

to get back on-track. Thanks for helping us to make Pi-hole better for us all!

If you have any issues, please either reopen this ticket or (preferably) create a new ticket describing the issues in further detail and only reference this ticket. This will help us to help you best.

@DL6ER DL6ER closed this as completed Dec 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants