Hang in traverse - how to debug? #555

james-cook · 2022-02-08T21:40:27Z

Running master 2.10.1 and develop plus a few patches and I am getting a hang during initial traverse with both versions.

Traversing (316260 usable files / 0 + 0 ignored files / folders)
Always at the same file count.
After this, the drives spin down and nothing appears to be happening. The process continues to run, waiting for something. CPU usage is low.

I checked RAM use - no problems there. 4GB RAM, ca. 2GB used. (Raspberry Pi 4B)

With -vvv set, the last message I (always) see is:
WARNING: Added big file /srv/path/Scans BU Dec 2014/Africa_Turmoil_WEB.pdf
Would it be this file causing the problem? (probably more likely something afterwards)

Running gdb (develop version compiled with DEBUG=1) when I get to the "same place" I do a CTRL-C (not sure if this is useful)

INFO: Added big file /srv/path/Scans BU Dec 2014/Africa_Turmoil_WEB.pdf
^C
Thread 1 "rmlint" received signal SIGINT, Interrupt.
syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
37      ../sysdeps/unix/sysv/linux/arm/syscall.S: No such file or directory.
(gdb) bt
#0  0xb6af5aec in syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1  0xb6ce2908 in g_cond_wait () at /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0
#2  0x0001f6b8 in rm_mds_finish ()
#3  0x0002a4f0 in rm_traverse_tree ()
#4  0x0001adf4 in rm_cmd_main ()
#5  0x00015b18 in main ()
(gdb) thread apply all bt

Thread 3 (Thread 0xb5c513a0 (LWP 19170)):
#0  0xb6af5aec in syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1  0xb6ce2a80 in g_cond_wait_until () at /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0
#2  0x0001f81c in rm_mds_factory ()

Thread 2 (Thread 0xb64523a0 (LWP 19169)):
#0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44
#1  0x0002c61c in rm_sys_open ()
#2  0x0002dba4 in rm_util_is_nonstripped ()
#3  0x00028834 in rm_traverse_file ()
#4  0x00029900 in rm_traverse_directory ()
#5  0x0001f868 in rm_mds_factory ()

Thread 1 (Thread 0xb6829010 (LWP 19164)):
#0  0xb6af5aec in syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1  0xb6ce2908 in g_cond_wait () at /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0
#2  0x0001f6b8 in rm_mds_finish ()
#3  0x0002a4f0 in rm_traverse_tree ()
#4  0x0001adf4 in rm_cmd_main ()
#5  0x00015b18 in main ()
(gdb)

I am not sure if these backtraces help - I am VERY rusty at debugging.
Any hints on how to hunt this problem down appreciated :)

The text was updated successfully, but these errors were encountered:

cebtenzzre · 2022-02-08T22:07:57Z

For that backtrace, only thread 2 is doing anything (the others are blocked in g_cond_wait), so the next steps would be thread 2, fin and see if the open ever finishes. If it doesn't, the kernel is stuck on something (dmesg would help). If it does finish, rmlint may be stuck in an infinite loop in traverse.c.
The filename "\230\305\002" is odd, it may indicate memory corruption - ASAN or valgrind might find a root cause.

james-cook · 2022-02-08T22:16:51Z

Just did...

(gdb) thread 2
[Switching to thread 2 (Thread 0xb64523a0 (LWP 19169))]
#0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44
44      ../sysdeps/unix/sysv/linux/open.c: No such file or directory.
(gdb) fin
Run till exit from #0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44

and now waiting(!)

cebtenzzre · 2022-02-08T22:45:27Z

info proc gives you the PID, which you can use like ls -l /proc/45834/cwd to get the current directory of rmlint. If there is actually a FIFO called \230\305\002 in that directory (stat /proc/PID/cwd/$'\230\305\002'), then rmlint needs to better avoid FIFOs with stat() or O_NONBLOCK. Otherwise there's an issue in the kernel and dmesg can help.

james-cook · 2022-02-08T23:06:38Z

I pressed CTRL-C in the fin (above).

info proc shows:

(gdb) info proc
process 19164
cmdline = '/home/pi/source/rmlint/rmlint/rmlint -T all -emptyfiles -emptydirs -vvv -S mda -s -1TB /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/'
cwd = '/srv/dev-disk-by-label-OMV2/shd2/rmlints/redo-ALL.TOSH/with.new.rmlint/with.gdb'
exe = '/home/pi/source/rmlint/rmlint/rmlint'

 ls -l /proc/19164/cwd
lrwxrwxrwx 1 pi pi 0 Feb  8 22:59 /proc/19164/cwd -> /srv/dev-disk-by-label-OMV2/shd2/rmlints/redo-ALL.TOSH/with.new.rmlint/with.gdb

That's "just" the directory where I call rmlint from (cwd of the call to rmlint), not the path of the file in question (?)

stat /proc/19164/cwd/$'\230\305\002'
stat: cannot stat '/proc/19164/cwd/'$'\230\305\002': No such file or directory

cebtenzzre · 2022-02-08T23:16:44Z

Then either GDB is confused or the path is corrupt. If it is definitely blocked in the kernel (something like swi 0x0 just above the pointed to line in disas) then dmesg is the best way to know what could be wrong. A filesystem with a missing source (unplugged disk or NFS timeout) is usually the only reason that would happen.

james-cook · 2022-02-08T23:43:00Z

I have to humbly ask for a translation if possible ;)

(something like swi 0x0 just above the pointed to line in disas)

cwd - gdb does report where I call the program from. Would cwd be expected to change to each directory as rmlint traverses?

The original files were copied from a failing disk. This could be(?) a root of the problem perhaps(?)

cebtenzzre · 2022-02-08T23:45:46Z

In GDB, run disas and scroll up to the pointed to location. The surrounding assembly instructions will tell you if the process has switched to kernel mode.

james-cook · 2022-02-09T00:00:27Z

Is this the relevant pointer?:

(gdb) disas
Dump of assembler code for function __libc_open:
   0xb6b81b8c <+0>:     push    {r1, r2, r3}
   0xb6b81b90 <+4>:     mov     r1, r0
   0xb6b81b94 <+8>:     ldr     r3, [pc, #276]  ; 0xb6b81cb0 <__libc_open+292>
   0xb6b81b98 <+12>:    ldr     r12, [pc, #276] ; 0xb6b81cb4 <__libc_open+296>
   0xb6b81b9c <+16>:    add     r3, pc, r3
   0xb6b81ba0 <+20>:    push    {r4, r5, r6, r7, lr}
   0xb6b81ba4 <+24>:    sub     sp, sp, #16
   0xb6b81ba8 <+28>:    ldr     r5, [r3, r12]
   0xb6b81bac <+32>:    ldr     r2, [sp, #36]   ; 0x24
   0xb6b81bb0 <+36>:    mov     r0, r3
   0xb6b81bb4 <+40>:    ldr     r0, [r5]
   0xb6b81bb8 <+44>:    ands    r3, r2, #64     ; 0x40
   0xb6b81bbc <+48>:    str     r0, [sp, #12]
   0xb6b81bc0 <+52>:    bne     0xb6b81c20 <__libc_open+148>
   0xb6b81bc4 <+56>:    ldr     r0, [pc, #236]  ; 0xb6b81cb8 <__libc_open+300>
   0xb6b81bc8 <+60>:    bics    r0, r0, r2
   0xb6b81bcc <+64>:    beq     0xb6b81c20 <__libc_open+148>
   0xb6b81bd0 <+68>:    ldr     r0, [pc, #228]  ; 0xb6b81cbc <__libc_open+304>
   0xb6b81bd4 <+72>:    add     r0, pc, r0
   0xb6b81bd8 <+76>:    ldr     r0, [r0]
   0xb6b81bdc <+80>:    cmp     r0, #0
   0xb6b81be0 <+84>:    bne     0xb6b81c30 <__libc_open+164>
   0xb6b81be4 <+88>:    mvn     r0, #99 ; 0x63
   0xb6b81be8 <+92>:    ldr     r7, [pc, #208]  ; 0xb6b81cc0 <__libc_open+308>
   0xb6b81bec <+96>:    svc     0x00000000
   0xb6b81bf0 <+100>:   cmn     r0, #4096       ; 0x1000
   0xb6b81bf4 <+104>:   mov     r4, r0
   0xb6b81bf8 <+108>:   bhi     0xb6b81c74 <__libc_open+232>
   0xb6b81bfc <+112>:   ldr     r2, [sp, #12]
   0xb6b81c00 <+116>:   ldr     r3, [r5]
   0xb6b81c04 <+120>:   mov     r0, r4
   0xb6b81c08 <+124>:   cmp     r2, r3
   0xb6b81c0c <+128>:   bne     0xb6b81cac <__libc_open+288>
   0xb6b81c10 <+132>:   add     sp, sp, #16
--Type <RET> for more, q to quit, c to continue without paging--
   0xb6b81c14 <+136>:   pop     {r4, r5, r6, r7, lr}
   0xb6b81c18 <+140>:   add     sp, sp, #12
   0xb6b81c1c <+144>:   bx      lr
   0xb6b81c20 <+148>:   add     r0, sp, #40     ; 0x28
   0xb6b81c24 <+152>:   ldr     r3, [sp, #40]   ; 0x28
   0xb6b81c28 <+156>:   str     r0, [sp, #8]
   0xb6b81c2c <+160>:   b       0xb6b81bd0 <__libc_open+68>
   0xb6b81c30 <+164>:   str     r2, [sp, #36]   ; 0x24
   0xb6b81c34 <+168>:   str     r1, [sp, #4]
   0xb6b81c38 <+172>:   str     r3, [sp]
   0xb6b81c3c <+176>:   bl      0xb6b80c8c <__pthread_enable_asynccancel>
   0xb6b81c40 <+180>:   ldr     r7, [pc, #120]  ; 0xb6b81cc0 <__libc_open+308>
   0xb6b81c44 <+184>:   ldr     r1, [sp, #4]
   0xb6b81c48 <+188>:   ldr     r2, [sp, #36]   ; 0x24
   0xb6b81c4c <+192>:   ldr     r3, [sp]
   0xb6b81c50 <+196>:   mov     r6, r0
   0xb6b81c54 <+200>:   mvn     r0, #99 ; 0x63
=> 0xb6b81c58 <+204>:   svc     0x00000000
   0xb6b81c5c <+208>:   cmn     r0, #4096       ; 0x1000
   0xb6b81c60 <+212>:   mov     r4, r0
   0xb6b81c64 <+216>:   bhi     0xb6b81c90 <__libc_open+260>
   0xb6b81c68 <+220>:   mov     r0, r6
   0xb6b81c6c <+224>:   bl      0xb6b80d2c <__pthread_disable_asynccancel>
   0xb6b81c70 <+228>:   b       0xb6b81bfc <__libc_open+112>
   0xb6b81c74 <+232>:   ldr     r3, [pc, #72]   ; 0xb6b81cc4 <__libc_open+312>
   0xb6b81c78 <+236>:   rsb     r2, r0, #0
   0xb6b81c7c <+240>:   mvn     r4, #0
   0xb6b81c80 <+244>:   ldr     r3, [pc, r3]
   0xb6b81c84 <+248>:   bl      0xb6b844e0 <__aeabi_read_tp>
   0xb6b81c88 <+252>:   str     r2, [r0, r3]
   0xb6b81c8c <+256>:   b       0xb6b81bfc <__libc_open+112>
   0xb6b81c90 <+260>:   ldr     r3, [pc, #48]   ; 0xb6b81cc8 <__libc_open+316>
   0xb6b81c94 <+264>:   rsb     r2, r0, #0
   0xb6b81c98 <+268>:   mvn     r4, #0
   0xb6b81c9c <+272>:   ldr     r3, [pc, r3]
--Type <RET> for more, q to quit, c to continue without paging--
   0xb6b81ca0 <+276>:   bl      0xb6b844e0 <__aeabi_read_tp>
   0xb6b81ca4 <+280>:   str     r2, [r0, r3]
   0xb6b81ca8 <+284>:   b       0xb6b81c68 <__libc_open+220>
   0xb6b81cac <+288>:   bl      0xb6b74490 <__stack_chk_fail@plt>
   0xb6b81cb0 <+292>:   andeq   r5, r1, r12, asr r4
   0xb6b81cb4 <+296>:   andeq   r0, r0, r4, ror r1
   0xb6b81cb8 <+300>:   subeq   r4, r0, r0
   0xb6b81cbc <+304>:   andeq   r7, r1, r4, asr r6
   0xb6b81cc0 <+308>:   andeq   r0, r0, r2, asr #2
   0xb6b81cc4 <+312>:   andeq   r5, r1, r8, ror #9
   0xb6b81cc8 <+316>:   andeq   r5, r1, r12, asr #9
End of assembler dump.
(gdb)

cebtenzzre · 2022-02-09T00:47:55Z

Yep, it's definitely in a kernel syscall. Running ni in GDB probably hangs too. You can get the kernel stacks with find /proc/19164/task -name "stack" | sudo xargs -I{} sh -c 'echo {}; cat {}'. I also have mentioned dmesg several times because that's where the kernel logs if it something gets stuck.

james-cook · 2022-02-09T09:41:37Z

Thanks for all of your help :)

Stack

find /proc/19164/task -name "stack" | sudo xargs -I{} sh -c 'echo {}; cat {}'

/proc/19164/task/19164/stack
/proc/19164/task/19169/stack
/proc/19164/task/19170/stack
$ sudo cat /proc/19164/task/19164/stack
$ sudo cat /proc/19164/task/19169/stack
$ sudo cat /proc/19164/task/19170/stack

i.e. the files are empty...(?)

dmesg

dmesg -Tk shows:

00 00
[Tue Feb  8 22:21:36 2022] sd 3:0:0:0: [sdd] tag#12 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
[Tue Feb  8 22:21:36 2022] sd 3:0:0:0: [sdd] tag#12 CDB: opcode=0x88 88 00 00 00 00 01 ec e8 01 00 00 00 01 00
00 00
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#19 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#19 CDB: opcode=0x8a 8a 00 00 00 00 02 51 40 09 78 00 00 00 08
00 00
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#18 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#18 CDB: opcode=0x8a 8a 00 00 00 00 02 6d c0 09 d8 00 00 00 20
00 00
[T

Does this look like issue? I do see the same message sporadically in the past week.
Looking at these entries last night I just thought a uas error would be "more general* and not just be happening "on the same file" - (same rmlint command).

I have had uas problems (even leading to device offlining) with other raspberrypis. It is a known problem area. Downgrading to usb2(!) can help, as can simply rebooting and unplugging/replugging the drives(!) and kernel updates help too.

I checked the kernel version on this machine and see it is very old.
So, today I'll back-up the sdcard(os) and update the kernel.

gdb

ni does hang as you mentioned

cebtenzzre · 2022-02-09T18:18:59Z

Dunno why there are no kernel stacks in /proc (maybe it's an optional feature or your kernel has limited stack unwinding?) but I would definitely attribute the hang to those kernel messages. Not an rmlint issue.

james-cook · 2022-02-09T19:38:15Z

Will close this as it really appears to be my system - thanks 😊 for the great input and help

james-cook · 2022-02-13T15:52:53Z

Investigating this:

I switched the harddrive to be directly connected to the raspberrypi and no longer via a hub
I then used a find <the directory from the rmlint command> -exec dd -if <the file> -of </dev/null> kind of command overnight.
This command completed and there were no kernel errors/warnings.
I ran rmlint (develop + patches) to run and it again failed just after 300000 files in traverse
I ran the same command again in gdb but could not get a useful stack at the point of failure (as before)

So I went back to "old school" debugging and placed a logging output in traverse.c:

386d385
<         rm_log_info_line(_("Current file is %s"), p->fts_path);

And now I see that rmlint chokes on AppContainerUserCertRead:

INFO: Current file is /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My/CRLs
INFO: Current file is /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My/AppContainerUserCertRead

This is a pipe/FIFO:

from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My $ ls -al
total 16
drwxrwxrwx 4 root root 4096 Sep 13 17:11 .
drwxrwxrwx 4 root root 4096 Jan  2  2020 ..
prwxrwxrwx 1 root root    0 Nov 28  2019 AppContainerUserCertRead
drwxrwxrwx 2 root root 4096 Nov 28  2019 CRLs
drwxrwxrwx 2 root root 4096 Nov 28  2019 CTLs

You mentioned excluding such files from rmlint:
info proc gives you the PID, which you can use like ls -l /proc/45834/cwd to get the current directory of rmlint. If there is actually a FIFO called \230\305\002 in that directory (stat /proc/PID/cwd/$'\230\305\002'), then rmlint needs to better avoid FIFOs with stat() or O_NONBLOCK. Otherwise there's an issue in the kernel and dmesg can help.

Originally posted by @cebtenzzre in #555 (comment)

cebtenzzre · 2022-02-13T18:15:17Z

I am able to reproduce the issue like this:

$ mkfifo foo
$ chmod +x foo
$ rmlint -T ns .
(rmlint hangs)

I am not surprised that you are one of the first people to hit this, because it requires that the non-default unstripped binary search is enabled, and that it encounters a FIFO, and that the FIFO is marked executable (apparently because it came from a Windows system). Here's the fix:

 lib/utilities.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/utilities.c b/lib/utilities.c
index 30d82607..00299728 100644
--- a/lib/utilities.c
+++ b/lib/utilities.c
@@ -467,7 +467,7 @@ bool rm_util_is_nonstripped(_UNUSED const char *path, _UNUSED RmStat *statp) {
 #if HAVE_LIBELF
     g_return_val_if_fail(path, false);
 
-    if(statp && (statp->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0) {
+    if(!S_ISREG(statp->st_mode) || (statp->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0) {
         return false;
     }

Only attempt to open regular files. It is entirely possible that we encounter an executable FIFO, which causes open() to block indefinitely if it is not opened for writing by another process. Fixes sahib#555

james-cook · 2022-02-14T23:51:42Z

Your very fast patch fixed the issue for me :)
So I'll close again.
Thanks

Only attempt to open regular files. It is entirely possible that we encounter an executable FIFO, which causes open() to block indefinitely if it is not opened for writing by another process. Fixes sahib#555

james-cook closed this as completed Feb 9, 2022

james-cook reopened this Feb 13, 2022

cebtenzzre mentioned this issue Feb 13, 2022

utilities: Prevent deadlock in rm_util_is_nonstripped #557

Closed

james-cook closed this as completed Feb 14, 2022

cebtenzzre added bug has-pr labels Aug 8, 2022

cebtenzzre reopened this Aug 8, 2022

cebtenzzre linked a pull request Aug 9, 2022 that will close this issue

utilities: Prevent deadlock in rm_util_is_nonstripped #557

Closed

cebtenzzre removed the has-pr label Aug 9, 2022

cebtenzzre closed this as completed in 907c67c Jun 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hang in traverse - how to debug? #555

Hang in traverse - how to debug? #555

james-cook commented Feb 8, 2022 •

edited

Loading

cebtenzzre commented Feb 8, 2022

james-cook commented Feb 8, 2022

cebtenzzre commented Feb 8, 2022

james-cook commented Feb 8, 2022 •

edited

Loading

cebtenzzre commented Feb 8, 2022

james-cook commented Feb 8, 2022

cebtenzzre commented Feb 8, 2022

james-cook commented Feb 9, 2022 •

edited

Loading

cebtenzzre commented Feb 9, 2022

james-cook commented Feb 9, 2022 •

edited

Loading

cebtenzzre commented Feb 9, 2022

james-cook commented Feb 9, 2022

james-cook commented Feb 13, 2022 •

edited

Loading

cebtenzzre commented Feb 13, 2022

james-cook commented Feb 14, 2022

Hang in traverse - how to debug? #555

Hang in traverse - how to debug? #555

Comments

james-cook commented Feb 8, 2022 • edited Loading

cebtenzzre commented Feb 8, 2022

james-cook commented Feb 8, 2022

cebtenzzre commented Feb 8, 2022

james-cook commented Feb 8, 2022 • edited Loading

cebtenzzre commented Feb 8, 2022

james-cook commented Feb 8, 2022

cebtenzzre commented Feb 8, 2022

james-cook commented Feb 9, 2022 • edited Loading

cebtenzzre commented Feb 9, 2022

james-cook commented Feb 9, 2022 • edited Loading

Stack

dmesg

gdb

cebtenzzre commented Feb 9, 2022

james-cook commented Feb 9, 2022

james-cook commented Feb 13, 2022 • edited Loading

cebtenzzre commented Feb 13, 2022

james-cook commented Feb 14, 2022

james-cook commented Feb 8, 2022 •

edited

Loading

james-cook commented Feb 8, 2022 •

edited

Loading

james-cook commented Feb 9, 2022 •

edited

Loading

james-cook commented Feb 9, 2022 •

edited

Loading

james-cook commented Feb 13, 2022 •

edited

Loading