lightning_hsmd at 100% CPU on cln 24.08 mainnet #7655

Closed
daywalker90 opened this issue Sep 10, 2024 · 9 comments · Fixed by #7661

@daywalker90
Contributor

I've seen this multiple times now. lightning_hsmd sits at 100% CPU and I have to restart CLN to get it back to normal. Strangely, this is not happening with my testnet nodes running on the same machine with the same binaries.

@Impa10r

Impa10r commented Sep 10, 2024

Concur. But I also see both lightning_hsmd and lightning_gossipd keep running at 100% after I shut down lightningd:

$ sudo systemctl stop lightningd
$ sudo systemctl status lightningd
○ lightningd.service - Core Lightning daemon
     Loaded: loaded (/etc/systemd/system/lightningd.service; disabled; vendor preset: enabled)
     Active: inactive (dead) since Tue 2024-09-10 15:14:36 CEST; 5min ago
    Process: 17183 ExecStart=/usr/bin/lightningd (code=killed, signal=TERM)
   Main PID: 17183 (code=killed, signal=TERM)
      Tasks: 2 (limit: 9438)
     Memory: 101.4M
        CPU: 19min 34.980s
     CGroup: /system.slice/lightningd.service
             ├─17252 /usr/bin/../libexec/c-lightning/lightning_hsmd
             └─17278 /usr/bin/../libexec/c-lightning/lightning_gossipd

sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17224 (sql) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17226 (bookkeeper) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17230 (peerswap) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17252 (lightning_hsmd) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 17278 (lightning_gossi) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Unit process 73560 (circular) remains running after unit stopped.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: Stopped Core Lightning daemon.
sep 10 15:14:36 vlad-VirtualBox systemd[1]: lightningd.service: Consumed 8min 33.185s CPU time.
sep 10 15:14:36 vlad-VirtualBox lightningd[17215]: Lost connection to the RPC socket.
sep 10 15:14:36 vlad-VirtualBox lightningd[17220]: Lost connection to the RPC socket.

$ top
top - 15:18:08 up  3:06,  2 users,  load average: 3,62, 2,47, 1,64
Tasks: 295 total,   5 running, 288 sleeping,   0 stopped,   2 zombie
%Cpu(s):  0,4 us,  0,5 sy,  0,0 ni, 99,1 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem :   7941,8 total,   3416,4 free,    827,8 used,   3697,6 buff/cache
MiB Swap:   2048,0 total,   1215,6 free,    832,4 used.   6816,5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  17252 vlad      20   0    5620   3328   3328 R 100,0   0,0   3:41.55 lightning_hsmd
  17278 vlad      20   0   20148   8072   3592 R 100,0   0,1   3:33.24 lightning_gossi
      1 root      20   0  167692  12244   8148 S   0,0   0,2   0:13.00 systemd
      2 root      20   0       0      0      0 S   0,0   0,0   0:00.05 kthreadd
      3 root      20   0       0      0      0 S   0,0   0,0   0:00.00 pool_workqueue_release
      4 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-rcu_g
      5 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-rcu_p
      6 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-slub_
      7 root       0 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/R-netns

They prevent my Ubuntu 22 from shutting down!

@rustyrussell
Contributor

Hmmmm, the best way to debug this is:

sudo perf record --call-graph=dwarf -p `pidof lightning_hsmd`

Then sudo perf report will let you explore what is taking all the time.

Obviously this only works when hsmd is consuming lots of CPU.

Without knowing what it's doing, this is hard to debug! Presumably something is hammering it with requests: if we can figure out what the request is, that might give us a clue!

@junity42

Same problem here: one CPU at 100% load from lightning_hsmd, and sometimes gossipd too. I have to manually kill the stuck processes; they won't stop when I simply restart the c-lightning process with systemctl.
I've run the perf commands; here is the result, I hope it helps:
[Screenshot attached: perf report, 2024-09-11]

@daywalker90
Contributor Author

[Image attached]

@rustyrussell
Contributor

Thanks! Ok, one more request, what does sudo strace -p $(pidof lightning_hsmd) -o /tmp/trace give? It will run forever, so just let it run for a few seconds then Control-C.

From this report, and another one, it's slamming the read/poll loop. I might have to add some debug for this. It doesn't happen for me :(

@junity42

My instance is running on an RPi 4; the 100% CPU bug happens after approximately half a day of normal running, every day it seems.
I ran the strace; results are attached.
trace.txt

@daywalker90
Contributor Author

daywalker90 commented Sep 12, 2024

Another one with both commands:

[Image attached]

trace.zip

@grubles
Contributor

grubles commented Sep 12, 2024

Noticed my lightning_hsmd was running at 100% today.

Samples: 204K of event 'cycles:P', Event count (approx.): 192624081792                                 
  Children      Self  Command         Shared Object      Symbol                                        
+   99.80%     0.00%  lightning_hsmd  libc.so.6          [.] __libc_start_main@@GLIBC_2.34             
+   99.80%     0.00%  lightning_hsmd  libc.so.6          [.] __libc_start_call_main                    
+   99.80%     0.00%  lightning_hsmd  lightning_hsmd     [.] main                                      
+   99.80%     1.03%  lightning_hsmd  lightning_hsmd     [.] io_loop                                   
+   95.34%     1.38%  lightning_hsmd  [kernel.kallsyms]  [k] system_call_vectored_common               
+   93.92%     1.60%  lightning_hsmd  [kernel.kallsyms]  [k] system_call_exception                     
+   89.70%     0.29%  lightning_hsmd  lightning_hsmd     [.] daemon_poll                               
+   88.26%     0.50%  lightning_hsmd  libc.so.6          [.] __poll                                    
+   85.50%     0.12%  lightning_hsmd  [kernel.kallsyms]  [k] sys_poll                                  
+   85.36%     6.22%  lightning_hsmd  [kernel.kallsyms]  [k] do_sys_poll                               
+   42.76%    13.25%  lightning_hsmd  [kernel.kallsyms]  [k] sock_poll                                 
+   32.89%     0.00%  lightning_hsmd  [unknown]          [k] 0000000000000000                          
+   31.10%    14.36%  lightning_hsmd  [kernel.kallsyms]  [k] unix_poll                                 
+   24.14%     0.36%  lightning_hsmd  [kernel.kallsyms]  [k] poll_freewait                             
+   16.69%     6.36%  lightning_hsmd  [kernel.kallsyms]  [k] __pollwait                                
+   15.77%    15.76%  lightning_hsmd  [kernel.kallsyms]  [k] fput                                      
+   10.78%    10.78%  lightning_hsmd  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave                    
+    8.86%     0.12%  lightning_hsmd  lightning_hsmd     [.] io_ready                                  
+    8.86%     0.18%  lightning_hsmd  lightning_hsmd     [.] do_plan                                   
+    8.33%     0.25%  lightning_hsmd  lightning_hsmd     [.] do_read_wire                              
+    8.13%     0.21%  lightning_hsmd  lightning_hsmd     [.] do_read_wire_header                       
+    7.87%     0.46%  lightning_hsmd  libc.so.6          [.] read                                      
+    7.33%     0.39%  lightning_hsmd  [kernel.kallsyms]  [k] remove_wait_queue                         
+    6.51%     0.63%  lightning_hsmd  [kernel.kallsyms]  [k] add_wait_queue                            
+    5.33%     5.33%  lightning_hsmd  [kernel.kallsyms]  [k] __fget_light                              
+    5.20%     0.22%  lightning_hsmd  [kernel.kallsyms]  [k] ksys_read                                 
+    5.16%     0.07%  lightning_hsmd  [kernel.kallsyms]  [k] get_free_pages_noprof                     
+    4.92%     0.05%  lightning_hsmd  [kernel.kallsyms]  [k] alloc_pages_mpol_noprof                   
+    4.73%     0.22%  lightning_hsmd  [kernel.kallsyms]  [k] vfs_read                                  
+    4.39%     0.15%  lightning_hsmd  [kernel.kallsyms]  [k] __alloc_pages_noprof                      
+    4.13%     0.64%  lightning_hsmd  [kernel.kallsyms]  [k] get_page_from_freelist                    
+    3.15%     0.09%  lightning_hsmd  [kernel.kallsyms]  [k] post_alloc_hook                           
+    3.07%     3.06%  lightning_hsmd  [kernel.kallsyms]  [k] kernel_init_pages                         
+    2.83%     0.54%  lightning_hsmd  [kernel.kallsyms]  [k] arch_local_irq_restore                    
+    2.64%     0.22%  lightning_hsmd  [kernel.kallsyms]  [k] sock_read_iter                            

rustyrussell added a commit to rustyrussell/lightning that referenced this issue Sep 13, 2024
Our low-level ccan/io IO routines return three values:
-1: error.
0: call me again, I'm not finished.
1: I'm done, go onto the next thing.

In the last release, we tweaked the semantics of "-1": we now opportunistically
call a routine which returns 0 once more, in case there's more data.  We use errno to
distinguish between "EAGAIN" which means there wasn't any data, and real errors.

However, if the underlying read() returns 0 (which it does when the peer has closed
the other end) the value of errno is UNDEFINED.  If it happens to be EAGAIN, we will
call it again, rather than closing.  This causes us to spin: in particular people reported
hsmd consuming 100% of CPU.

The ccan/io read code handled this by setting errno to 0 in this case, but our own
wire low-level routines *did not*.

Fixes: ElementsProject#7655
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
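
To make the failure mode concrete, here is a minimal sketch of the read-helper pattern the commit describes. The struct and function names are illustrative, not the actual CLN/ccan code: the helper follows the -1/0/1 return convention, and because a 0-byte read() (EOF) leaves errno untouched, it must clear errno before returning -1 or a caller that retries on EAGAIN will loop forever.

/* Minimal sketch of the pattern described above -- illustrative names,
 * not the actual CLN/ccan code.  Convention: -1 = error, 0 = call me
 * again, 1 = done.  The caller treats (-1, errno == EAGAIN) as "no data
 * yet, retry", so errno must be unambiguous on every -1 return. */
#include <errno.h>
#include <unistd.h>

struct read_state {
        char *buf;
        size_t remaining;       /* bytes still wanted */
};

static int do_read_part(int fd, struct read_state *s)
{
        ssize_t ret = read(fd, s->buf, s->remaining);

        if (ret < 0)
                return -1;      /* real error, or EAGAIN meaning "no data yet" */

        if (ret == 0) {
                /* EOF: the peer closed its end.  read() does not touch errno
                 * here, so a stale EAGAIN may still be sitting in it.  Without
                 * this reset, a caller that retries on EAGAIN calls us again
                 * immediately, forever -- the 100% CPU spin seen in hsmd. */
                errno = 0;
                return -1;
        }

        s->buf += ret;
        s->remaining -= ret;
        return s->remaining == 0 ? 1 : 0;       /* 1 = done, 0 = more needed */
}

With errno zeroed at EOF, the caller can distinguish "peer hung up" from "try again later" and close the connection instead of spinning.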
@rustyrussell
Contributor

OK, this did it! I found the case where we get a 0 read and ignore EOF. Thank you, everyone, for your patience: I've added this to the v24.08.1 milestone, which will obviously need to be released soon.

rustyrussell added a commit to rustyrussell/lightning that referenced this issue Sep 13, 2024
Our low-level ccan/io IO routines return three values:
-1: error.
0: call me again, I'm not finished.
1: I'm done, go onto the next thing.

In the last release, we tweaked the semantics of "-1": we now opportunistically
call a routine which returns 0 once more, in case there's more data.  We use errno to
distinguish between "EAGAIN" which means there wasn't any data, and real errors.

However, if the underlying read() returns 0 (which it does when the peer has closed
the other end) the value of errno is UNDEFINED.  If it happens to be EAGAIN, we will
call it again, rather than closing.  This causes us to spin: in particular people reported
hsmd consuming 100% of CPU.

The ccan/io read code handled this by setting errno to 0 in this case, but our own
wire low-level routines *did not*.

Fixes: ElementsProject#7655
Changelog-Fixed: Fixed intermittent bug where hsmd (particularly, but also lightningd) could use 100% CPU.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
ShahanaFarooqui pushed a commit to ShahanaFarooqui/lightning that referenced this issue Sep 17, 2024