Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang in traverse - how to debug? #555

Closed
james-cook opened this issue Feb 8, 2022 · 15 comments
Closed

Hang in traverse - how to debug? #555

james-cook opened this issue Feb 8, 2022 · 15 comments
Labels

Comments

@james-cook
Copy link

james-cook commented Feb 8, 2022

Running master 2.10.1 and develop plus a few patches and I am getting a hang during initial traverse with both versions.

Traversing (316260 usable files / 0 + 0 ignored files / folders)
Always at the same file count.
After this, the drives spin down and nothing appears to be happening. The process continues to run, waiting for something. CPU usage is low.

I checked RAM use - no problems there. 4GB RAM, ca. 2GB used. (Raspberry Pi 4B)

With -vvv set, the last message I (always) see is:
WARNING: Added big file /srv/path/Scans BU Dec 2014/Africa_Turmoil_WEB.pdf
Would it be this file causing the problem? (probably more likely something afterwards)

Running gdb (develop version compiled with DEBUG=1) when I get to the "same place" I do a CTRL-C (not sure if this is useful)

INFO: Added big file /srv/path/Scans BU Dec 2014/Africa_Turmoil_WEB.pdf
^C
Thread 1 "rmlint" received signal SIGINT, Interrupt.
syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
37      ../sysdeps/unix/sysv/linux/arm/syscall.S: No such file or directory.
(gdb) bt
#0  0xb6af5aec in syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1  0xb6ce2908 in g_cond_wait () at /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0
#2  0x0001f6b8 in rm_mds_finish ()
#3  0x0002a4f0 in rm_traverse_tree ()
#4  0x0001adf4 in rm_cmd_main ()
#5  0x00015b18 in main ()
(gdb) thread apply all bt

Thread 3 (Thread 0xb5c513a0 (LWP 19170)):
#0  0xb6af5aec in syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1  0xb6ce2a80 in g_cond_wait_until () at /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0
#2  0x0001f81c in rm_mds_factory ()

Thread 2 (Thread 0xb64523a0 (LWP 19169)):
#0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44
#1  0x0002c61c in rm_sys_open ()
#2  0x0002dba4 in rm_util_is_nonstripped ()
#3  0x00028834 in rm_traverse_file ()
#4  0x00029900 in rm_traverse_directory ()
#5  0x0001f868 in rm_mds_factory ()

Thread 1 (Thread 0xb6829010 (LWP 19164)):
#0  0xb6af5aec in syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1  0xb6ce2908 in g_cond_wait () at /usr/lib/arm-linux-gnueabihf/libglib-2.0.so.0
#2  0x0001f6b8 in rm_mds_finish ()
#3  0x0002a4f0 in rm_traverse_tree ()
#4  0x0001adf4 in rm_cmd_main ()
#5  0x00015b18 in main ()
(gdb)

I am not sure if these backtraces help - I am VERY rusty at debugging.
Any hints on how to hunt this problem down appreciated :)

@cebtenzzre
Copy link
Collaborator

For that backtrace, only thread 2 is doing anything (the others are blocked in g_cond_wait), so the next steps would be thread 2, fin and see if the open ever finishes. If it doesn't, the kernel is stuck on something (dmesg would help). If it does finish, rmlint may be stuck in an infinite loop in traverse.c.
The filename "\230\305\002" is odd, it may indicate memory corruption - ASAN or valgrind might find a root cause.

@james-cook
Copy link
Author

Just did...

(gdb) thread 2
[Switching to thread 2 (Thread 0xb64523a0 (LWP 19169))]
#0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44
44      ../sysdeps/unix/sysv/linux/open.c: No such file or directory.
(gdb) fin
Run till exit from #0  0xb6b81c58 in __libc_open (file=0xb64503b8 "\230\305\002", oflag=131072)
    at ../sysdeps/unix/sysv/linux/open.c:44

and now waiting(!)

@cebtenzzre
Copy link
Collaborator

info proc gives you the PID, which you can use like ls -l /proc/45834/cwd to get the current directory of rmlint. If there is actually a FIFO called \230\305\002 in that directory (stat /proc/PID/cwd/$'\230\305\002'), then rmlint needs to better avoid FIFOs with stat() or O_NONBLOCK. Otherwise there's an issue in the kernel and dmesg can help.

@james-cook
Copy link
Author

james-cook commented Feb 8, 2022

I pressed CTRL-C in the fin (above).

info proc shows:

(gdb) info proc
process 19164
cmdline = '/home/pi/source/rmlint/rmlint/rmlint -T all -emptyfiles -emptydirs -vvv -S mda -s -1TB /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/'
cwd = '/srv/dev-disk-by-label-OMV2/shd2/rmlints/redo-ALL.TOSH/with.new.rmlint/with.gdb'
exe = '/home/pi/source/rmlint/rmlint/rmlint'
 ls -l /proc/19164/cwd
lrwxrwxrwx 1 pi pi 0 Feb  8 22:59 /proc/19164/cwd -> /srv/dev-disk-by-label-OMV2/shd2/rmlints/redo-ALL.TOSH/with.new.rmlint/with.gdb

That's "just" the directory where I call rmlint from (cwd of the call to rmlint), not the path of the file in question (?)

stat /proc/19164/cwd/$'\230\305\002'
stat: cannot stat '/proc/19164/cwd/'$'\230\305\002': No such file or directory

@cebtenzzre
Copy link
Collaborator

Then either GDB is confused or the path is corrupt. If it is definitely blocked in the kernel (something like swi 0x0 just above the pointed to line in disas) then dmesg is the best way to know what could be wrong. A filesystem with a missing source (unplugged disk or NFS timeout) is usually the only reason that would happen.

@james-cook
Copy link
Author

I have to humbly ask for a translation if possible ;)

(something like swi 0x0 just above the pointed to line in disas)

cwd - gdb does report where I call the program from. Would cwd be expected to change to each directory as rmlint traverses?

The original files were copied from a failing disk. This could be(?) a root of the problem perhaps(?)

@cebtenzzre
Copy link
Collaborator

In GDB, run disas and scroll up to the pointed to location. The surrounding assembly instructions will tell you if the process has switched to kernel mode.

@james-cook
Copy link
Author

james-cook commented Feb 9, 2022

Is this the relevant pointer?:

(gdb) disas
Dump of assembler code for function __libc_open:
   0xb6b81b8c <+0>:     push    {r1, r2, r3}
   0xb6b81b90 <+4>:     mov     r1, r0
   0xb6b81b94 <+8>:     ldr     r3, [pc, #276]  ; 0xb6b81cb0 <__libc_open+292>
   0xb6b81b98 <+12>:    ldr     r12, [pc, #276] ; 0xb6b81cb4 <__libc_open+296>
   0xb6b81b9c <+16>:    add     r3, pc, r3
   0xb6b81ba0 <+20>:    push    {r4, r5, r6, r7, lr}
   0xb6b81ba4 <+24>:    sub     sp, sp, #16
   0xb6b81ba8 <+28>:    ldr     r5, [r3, r12]
   0xb6b81bac <+32>:    ldr     r2, [sp, #36]   ; 0x24
   0xb6b81bb0 <+36>:    mov     r0, r3
   0xb6b81bb4 <+40>:    ldr     r0, [r5]
   0xb6b81bb8 <+44>:    ands    r3, r2, #64     ; 0x40
   0xb6b81bbc <+48>:    str     r0, [sp, #12]
   0xb6b81bc0 <+52>:    bne     0xb6b81c20 <__libc_open+148>
   0xb6b81bc4 <+56>:    ldr     r0, [pc, #236]  ; 0xb6b81cb8 <__libc_open+300>
   0xb6b81bc8 <+60>:    bics    r0, r0, r2
   0xb6b81bcc <+64>:    beq     0xb6b81c20 <__libc_open+148>
   0xb6b81bd0 <+68>:    ldr     r0, [pc, #228]  ; 0xb6b81cbc <__libc_open+304>
   0xb6b81bd4 <+72>:    add     r0, pc, r0
   0xb6b81bd8 <+76>:    ldr     r0, [r0]
   0xb6b81bdc <+80>:    cmp     r0, #0
   0xb6b81be0 <+84>:    bne     0xb6b81c30 <__libc_open+164>
   0xb6b81be4 <+88>:    mvn     r0, #99 ; 0x63
   0xb6b81be8 <+92>:    ldr     r7, [pc, #208]  ; 0xb6b81cc0 <__libc_open+308>
   0xb6b81bec <+96>:    svc     0x00000000
   0xb6b81bf0 <+100>:   cmn     r0, #4096       ; 0x1000
   0xb6b81bf4 <+104>:   mov     r4, r0
   0xb6b81bf8 <+108>:   bhi     0xb6b81c74 <__libc_open+232>
   0xb6b81bfc <+112>:   ldr     r2, [sp, #12]
   0xb6b81c00 <+116>:   ldr     r3, [r5]
   0xb6b81c04 <+120>:   mov     r0, r4
   0xb6b81c08 <+124>:   cmp     r2, r3
   0xb6b81c0c <+128>:   bne     0xb6b81cac <__libc_open+288>
   0xb6b81c10 <+132>:   add     sp, sp, #16
--Type <RET> for more, q to quit, c to continue without paging--
   0xb6b81c14 <+136>:   pop     {r4, r5, r6, r7, lr}
   0xb6b81c18 <+140>:   add     sp, sp, #12
   0xb6b81c1c <+144>:   bx      lr
   0xb6b81c20 <+148>:   add     r0, sp, #40     ; 0x28
   0xb6b81c24 <+152>:   ldr     r3, [sp, #40]   ; 0x28
   0xb6b81c28 <+156>:   str     r0, [sp, #8]
   0xb6b81c2c <+160>:   b       0xb6b81bd0 <__libc_open+68>
   0xb6b81c30 <+164>:   str     r2, [sp, #36]   ; 0x24
   0xb6b81c34 <+168>:   str     r1, [sp, #4]
   0xb6b81c38 <+172>:   str     r3, [sp]
   0xb6b81c3c <+176>:   bl      0xb6b80c8c <__pthread_enable_asynccancel>
   0xb6b81c40 <+180>:   ldr     r7, [pc, #120]  ; 0xb6b81cc0 <__libc_open+308>
   0xb6b81c44 <+184>:   ldr     r1, [sp, #4]
   0xb6b81c48 <+188>:   ldr     r2, [sp, #36]   ; 0x24
   0xb6b81c4c <+192>:   ldr     r3, [sp]
   0xb6b81c50 <+196>:   mov     r6, r0
   0xb6b81c54 <+200>:   mvn     r0, #99 ; 0x63
=> 0xb6b81c58 <+204>:   svc     0x00000000
   0xb6b81c5c <+208>:   cmn     r0, #4096       ; 0x1000
   0xb6b81c60 <+212>:   mov     r4, r0
   0xb6b81c64 <+216>:   bhi     0xb6b81c90 <__libc_open+260>
   0xb6b81c68 <+220>:   mov     r0, r6
   0xb6b81c6c <+224>:   bl      0xb6b80d2c <__pthread_disable_asynccancel>
   0xb6b81c70 <+228>:   b       0xb6b81bfc <__libc_open+112>
   0xb6b81c74 <+232>:   ldr     r3, [pc, #72]   ; 0xb6b81cc4 <__libc_open+312>
   0xb6b81c78 <+236>:   rsb     r2, r0, #0
   0xb6b81c7c <+240>:   mvn     r4, #0
   0xb6b81c80 <+244>:   ldr     r3, [pc, r3]
   0xb6b81c84 <+248>:   bl      0xb6b844e0 <__aeabi_read_tp>
   0xb6b81c88 <+252>:   str     r2, [r0, r3]
   0xb6b81c8c <+256>:   b       0xb6b81bfc <__libc_open+112>
   0xb6b81c90 <+260>:   ldr     r3, [pc, #48]   ; 0xb6b81cc8 <__libc_open+316>
   0xb6b81c94 <+264>:   rsb     r2, r0, #0
   0xb6b81c98 <+268>:   mvn     r4, #0
   0xb6b81c9c <+272>:   ldr     r3, [pc, r3]
--Type <RET> for more, q to quit, c to continue without paging--
   0xb6b81ca0 <+276>:   bl      0xb6b844e0 <__aeabi_read_tp>
   0xb6b81ca4 <+280>:   str     r2, [r0, r3]
   0xb6b81ca8 <+284>:   b       0xb6b81c68 <__libc_open+220>
   0xb6b81cac <+288>:   bl      0xb6b74490 <__stack_chk_fail@plt>
   0xb6b81cb0 <+292>:   andeq   r5, r1, r12, asr r4
   0xb6b81cb4 <+296>:   andeq   r0, r0, r4, ror r1
   0xb6b81cb8 <+300>:   subeq   r4, r0, r0
   0xb6b81cbc <+304>:   andeq   r7, r1, r4, asr r6
   0xb6b81cc0 <+308>:   andeq   r0, r0, r2, asr #2
   0xb6b81cc4 <+312>:   andeq   r5, r1, r8, ror #9
   0xb6b81cc8 <+316>:   andeq   r5, r1, r12, asr #9
End of assembler dump.
(gdb)

@cebtenzzre
Copy link
Collaborator

Yep, it's definitely in a kernel syscall. Running ni in GDB probably hangs too. You can get the kernel stacks with find /proc/19164/task -name "stack" | sudo xargs -I{} sh -c 'echo {}; cat {}'. I also have mentioned dmesg several times because that's where the kernel logs if it something gets stuck.

@james-cook
Copy link
Author

james-cook commented Feb 9, 2022

Thanks for all of your help :)

Stack

find /proc/19164/task -name "stack" | sudo xargs -I{} sh -c 'echo {}; cat {}'

/proc/19164/task/19164/stack
/proc/19164/task/19169/stack
/proc/19164/task/19170/stack
$ sudo cat /proc/19164/task/19164/stack
$ sudo cat /proc/19164/task/19169/stack
$ sudo cat /proc/19164/task/19170/stack

i.e. the files are empty...(?)

dmesg

dmesg -Tk shows:

00 00
[Tue Feb  8 22:21:36 2022] sd 3:0:0:0: [sdd] tag#12 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD
[Tue Feb  8 22:21:36 2022] sd 3:0:0:0: [sdd] tag#12 CDB: opcode=0x88 88 00 00 00 00 01 ec e8 01 00 00 00 01 00
00 00
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#19 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#19 CDB: opcode=0x8a 8a 00 00 00 00 02 51 40 09 78 00 00 00 08
00 00
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#18 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT
[Tue Feb  8 22:21:41 2022] sd 3:0:0:0: [sdd] tag#18 CDB: opcode=0x8a 8a 00 00 00 00 02 6d c0 09 d8 00 00 00 20
00 00
[T

Does this look like issue? I do see the same message sporadically in the past week.
Looking at these entries last night I just thought a uas error would be "more general* and not just be happening "on the same file" - (same rmlint command).

I have had uas problems (even leading to device offlining) with other raspberrypis. It is a known problem area. Downgrading to usb2(!) can help, as can simply rebooting and unplugging/replugging the drives(!) and kernel updates help too.

I checked the kernel version on this machine and see it is very old.
So, today I'll back-up the sdcard(os) and update the kernel.

gdb

ni does hang as you mentioned

@cebtenzzre
Copy link
Collaborator

Dunno why there are no kernel stacks in /proc (maybe it's an optional feature or your kernel has limited stack unwinding?) but I would definitely attribute the hang to those kernel messages. Not an rmlint issue.

@james-cook
Copy link
Author

Will close this as it really appears to be my system - thanks 😊 for the great input and help

@james-cook
Copy link
Author

james-cook commented Feb 13, 2022

Investigating this:

  • I switched the harddrive to be directly connected to the raspberrypi and no longer via a hub
  • I then used a find <the directory from the rmlint command> -exec dd -if <the file> -of </dev/null> kind of command overnight.
  • This command completed and there were no kernel errors/warnings.
  • I ran rmlint (develop + patches) to run and it again failed just after 300000 files in traverse
  • I ran the same command again in gdb but could not get a useful stack at the point of failure (as before)

So I went back to "old school" debugging and placed a logging output in traverse.c:

386d385
<         rm_log_info_line(_("Current file is %s"), p->fts_path);

And now I see that rmlint chokes on AppContainerUserCertRead:

INFO: Current file is /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My/CRLs
INFO: Current file is /srv/dev-disk-by-label-OMV2/shd2/from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My/AppContainerUserCertRead

This is a pipe/FIFO:

from.ext.hdd.M-family.TOSH/2020-01-05-p13/Users/XXXXX/AppData/Roaming/Microsoft/SystemCertificates/My $ ls -al
total 16
drwxrwxrwx 4 root root 4096 Sep 13 17:11 .
drwxrwxrwx 4 root root 4096 Jan  2  2020 ..
prwxrwxrwx 1 root root    0 Nov 28  2019 AppContainerUserCertRead
drwxrwxrwx 2 root root 4096 Nov 28  2019 CRLs
drwxrwxrwx 2 root root 4096 Nov 28  2019 CTLs

You mentioned excluding such files from rmlint:
info proc gives you the PID, which you can use like ls -l /proc/45834/cwd to get the current directory of rmlint. If there is actually a FIFO called \230\305\002 in that directory (stat /proc/PID/cwd/$'\230\305\002'), then rmlint needs to better avoid FIFOs with stat() or O_NONBLOCK. Otherwise there's an issue in the kernel and dmesg can help.

Originally posted by @cebtenzzre in #555 (comment)

@james-cook james-cook reopened this Feb 13, 2022
@cebtenzzre
Copy link
Collaborator

I am able to reproduce the issue like this:

$ mkfifo foo
$ chmod +x foo
$ rmlint -T ns .
(rmlint hangs)

I am not surprised that you are one of the first people to hit this, because it requires that the non-default unstripped binary search is enabled, and that it encounters a FIFO, and that the FIFO is marked executable (apparently because it came from a Windows system). Here's the fix:

 lib/utilities.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/utilities.c b/lib/utilities.c
index 30d82607..00299728 100644
--- a/lib/utilities.c
+++ b/lib/utilities.c
@@ -467,7 +467,7 @@ bool rm_util_is_nonstripped(_UNUSED const char *path, _UNUSED RmStat *statp) {
 #if HAVE_LIBELF
     g_return_val_if_fail(path, false);
 
-    if(statp && (statp->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0) {
+    if(!S_ISREG(statp->st_mode) || (statp->st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0) {
         return false;
     }
 

cebtenzzre added a commit to cebtenzzre/rmlint that referenced this issue Feb 13, 2022
Only attempt to open regular files. It is entirely possible that we
encounter an executable FIFO, which causes open() to block indefinitely
if it is not opened for writing by another process.

Fixes sahib#555
@james-cook
Copy link
Author

Your very fast patch fixed the issue for me :)
So I'll close again.
Thanks

cebtenzzre added a commit to cebtenzzre/rmlint that referenced this issue Aug 7, 2022
Only attempt to open regular files. It is entirely possible that we
encounter an executable FIFO, which causes open() to block indefinitely
if it is not opened for writing by another process.

Fixes sahib#555
@cebtenzzre cebtenzzre reopened this Aug 8, 2022
@cebtenzzre cebtenzzre linked a pull request Aug 9, 2022 that will close this issue
@cebtenzzre cebtenzzre removed the has-pr label Aug 9, 2022
cebtenzzre added a commit to cebtenzzre/rmlint that referenced this issue Aug 12, 2022
Only attempt to open regular files. It is entirely possible that we
encounter an executable FIFO, which causes open() to block indefinitely
if it is not opened for writing by another process.

Fixes sahib#555
cebtenzzre added a commit to cebtenzzre/rmlint that referenced this issue Sep 18, 2022
Only attempt to open regular files. It is entirely possible that we
encounter an executable FIFO, which causes open() to block indefinitely
if it is not opened for writing by another process.

Fixes sahib#555
intelfx pushed a commit to intelfx/rmlint that referenced this issue Mar 9, 2023
Only attempt to open regular files. It is entirely possible that we
encounter an executable FIFO, which causes open() to block indefinitely
if it is not opened for writing by another process.

Fixes sahib#555
intelfx pushed a commit to intelfx/rmlint that referenced this issue Mar 9, 2023
Only attempt to open regular files. It is entirely possible that we
encounter an executable FIFO, which causes open() to block indefinitely
if it is not opened for writing by another process.

Fixes sahib#555
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants